Char splitting should split into characters from the charset
A multi-character entry in the charset, such as <language>, should not be split into ["<", "l", "a", ...]; it should be kept together as a single token.
To do that, we should build a regex from the tokens in the charset and use it the same way as the word/line separators.
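
A minimal sketch of the idea, not the project's actual implementation. The names build_char_pattern and split_chars are hypothetical, and it assumes the charset is available as an iterable of token strings. Tokens are escaped and tried longest first so that <language> wins over "<", and any character not covered by the charset falls back to a single-character match.

```python
import re


def build_char_pattern(charset):
    """Build a regex that matches one charset token at a time (hypothetical helper)."""
    # Longest tokens first, so multi-character tokens win over their prefixes.
    tokens = sorted((t for t in charset if t), key=len, reverse=True)
    alternatives = [re.escape(t) for t in tokens]
    # Fallback: any single character that is not part of a longer token.
    alternatives.append(".")
    # DOTALL lets the fallback match newlines too, so no input is dropped.
    return re.compile("|".join(alternatives), re.DOTALL)


def split_chars(text, charset):
    """Split text into charset tokens, keeping multi-character tokens whole."""
    return build_char_pattern(charset).findall(text)


print(split_chars("a<language>b", ["a", "b", "<language>"]))
# ['a', '<language>', 'b']
```

Sorting by length matters because regex alternation takes the first alternative that matches, not the longest one. Without the sort, a charset containing both "<" and <language> could still split the longer token.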