Charset should only include training characters

The charset is currently built from the train, validation, and test labels combined (see here).

  • It should only include characters that appear in the training set.
  • Any character in the validation/test sets that is not in the charset should be mapped to a special unknown token (for example ).
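
A minimal sketch of the proposed behaviour, not the project's actual code. The names `UNK`, `build_charset`, and `encode` are hypothetical, and the `"<unk>"` string is only an assumed placeholder for the unknown token. The charset is derived from the training labels alone, and any unseen character falls back to the unknown index at encoding time.

```python
# Assumed placeholder for the unknown token; the issue does not fix its spelling.
UNK = "<unk>"


def build_charset(train_labels):
    """Build the charset from training labels only, with UNK at index 0."""
    chars = sorted(set("".join(train_labels)))
    return [UNK] + chars


def encode(label, char_to_idx):
    """Map each character to its index; unseen characters map to UNK."""
    unk_idx = char_to_idx[UNK]
    return [char_to_idx.get(c, unk_idx) for c in label]


# Only the training labels contribute to the charset.
train_labels = ["cat", "act"]
charset = build_charset(train_labels)
char_to_idx = {c: i for i, c in enumerate(charset)}

# "b" never appears in training, so it is encoded as the UNK index.
encoded = encode("cab", char_to_idx)
```

Reserving a fixed index for the unknown token keeps the model's output dimension independent of whatever characters happen to occur in the validation or test labels.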