Char splitting should split text into the entries of the charset, not blindly into single characters.
A multi-character charset entry like <language>
should not be split into ["<", "l", "a", ...]
but kept together as a single token.
To do that, we should build a regex from the tokens in the charset and use it the same way as the word/line separators.
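
A minimal sketch of the idea, not the project's actual code. The names build_char_pattern and split_chars and the sample charset are hypothetical, and falling back to single characters for text not covered by the charset is an assumption.

```python
import re


def build_char_pattern(charset):
    """Compile a regex that matches any charset token (hypothetical helper)."""
    # Try longer tokens first so "<language>" wins over "<".
    tokens = sorted({t for t in charset if t}, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in tokens)
    # Assumption: anything not in the charset falls back to one character.
    pattern = f"{alternation}|." if alternation else "."
    return re.compile(pattern, re.DOTALL)


def split_chars(text, charset):
    """Split text into charset tokens, keeping multi-character tokens whole."""
    return build_char_pattern(charset).findall(text)


charset = ["a", "b", "<", "l", "<language>"]
print(split_chars("a<language>b", charset))
```

Sorting the tokens longest-first matters because regex alternation picks the first alternative that matches, so a shorter token listed earlier would otherwise cut a longer one apart.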