Charset should only include frequent training characters

Refs #190 (closed)

Any character appearing fewer than N times in the training set should be excluded from the charset and mapped to a special unknown token in the validation/test sets.
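
A minimal sketch of the intended behavior, not the project's actual implementation. The names `build_charset`, `encode`, `min_count` (the N above) and the `<unk>` token string are all hypothetical. It only covers encoding with a frequency-filtered charset; whether the training set itself should also be rewritten is left open.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the special unknown token


def build_charset(train_texts, min_count):
    """Return the characters seen at least `min_count` times in the training texts."""
    counts = Counter(ch for text in train_texts for ch in text)
    return {ch for ch, n in counts.items() if n >= min_count}


def encode(text, charset, unk=UNK):
    """Split `text` into characters, replacing any character outside `charset` with `unk`."""
    return [ch if ch in charset else unk for ch in text]
```

For example, with training texts `["aab", "abz"]` and `min_count=2`, the charset keeps `a` and `b`, and `z` is encoded as the unknown token in a validation or test line.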

@starride If a token appears rarely, do we also need to update the training set (to remove/replace this token)?