Charset should only include training characters
The charset is currently built from the train, validation, and test labels (see here).
- It should only include characters that appear in the training set.
- Any character in the validation/test sets that is not included in the charset should be mapped to a special unknown token (for example, ⁇).
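
A minimal sketch of the proposed behavior, not the project's actual code: the names `UNK`, `build_charset`, and `encode` are hypothetical, and labels are assumed to be plain Python strings. The charset is built from the training labels only, and any character outside it falls back to the unknown token's index at encoding time.

```python
# Hypothetical unknown token; the issue suggests "⁇" as one possible choice.
UNK = "⁇"


def build_charset(train_labels):
    """Return the charset from training labels only, with UNK at index 0."""
    chars = set()
    for label in train_labels:
        chars.update(label)  # add every character of the label
    chars.discard(UNK)  # avoid a duplicate if UNK happens to occur in the data
    return [UNK] + sorted(chars)


def encode(label, char_to_idx):
    """Map each character to its index; unseen characters map to UNK."""
    unk_idx = char_to_idx[UNK]
    return [char_to_idx.get(ch, unk_idx) for ch in label]


# Toy data (made up for illustration).
train_labels = ["cat", "act"]
charset = build_charset(train_labels)
char_to_idx = {ch: i for i, ch in enumerate(charset)}

# "b" never occurs in the training labels, so it is encoded as UNK.
encoded = encode("cab", char_to_idx)
```

With this approach the validation and test labels never influence the model's output dimension, and out-of-charset characters are counted as errors rather than silently becoming learnable classes.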