Char splitting should split characters from the charset

A multi-character token in the charset, such as <language>, should not be split into ["<", "l", "a", ...]; it should be kept together as a single unit.

Instead, we should build a regex from the tokens in the charset and use it the same way we use the word and line separators.
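
A minimal sketch of the idea in Python, not the project's actual code: the function names `build_char_splitter` and `split_chars` are hypothetical, and the charset is assumed to be a plain list of token strings. Tokens are sorted longest first so that a token like <language> is matched before the single character "<", and any character outside the charset falls back to a one-character match.

```python
import re


def build_char_splitter(charset):
    """Compile a regex that matches one charset token at a time.

    `charset` is assumed to be an iterable of token strings (hypothetical
    shape; the real charset representation may differ).
    """
    # Longest tokens first: regex alternation takes the first alternative
    # that matches, so "<language>" must come before "<".
    tokens = sorted((t for t in charset if t), key=len, reverse=True)
    alternatives = [re.escape(t) for t in tokens]
    # Fallback for characters not in the charset: match any single character.
    alternatives.append(".")
    return re.compile("|".join(alternatives), re.DOTALL)


def split_chars(text, charset):
    """Split `text` into charset tokens instead of raw characters."""
    return build_char_splitter(charset).findall(text)


if __name__ == "__main__":
    charset = ["a", "b", "<", "<language>"]
    print(split_chars("a<language>b", charset))
```

With this ordering, a partial occurrence such as "<lang" still splits into single characters, because the full token does not match there and the shorter alternatives take over.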