Charset should only include frequent training characters
Refs #190 (closed)
Any character appearing fewer than N times in the training set should be mapped to a special unknown token in the validation/test sets.
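
A minimal sketch of what this could look like, assuming character-level text samples held as plain strings. The names `build_charset`, `encode`, `min_count` (standing in for N), and the `<unk>` symbol are hypothetical and not taken from the project's code:

```python
from collections import Counter

UNK = "<unk>"  # hypothetical unknown-token symbol, not the project's actual one


def build_charset(train_texts, min_count):
    """Keep only characters seen at least `min_count` times in training."""
    counts = Counter(ch for text in train_texts for ch in text)
    return {ch for ch, n in counts.items() if n >= min_count}


def encode(text, charset, unk=UNK):
    """Map every character outside the charset to the unknown token."""
    return [ch if ch in charset else unk for ch in text]


train = ["hello", "help", "hold"]
charset = build_charset(train, min_count=2)
print(sorted(charset))           # ['e', 'h', 'l', 'o']
print(encode("helix", charset))  # ['h', 'e', 'l', '<unk>', '<unk>']
```

This sketch only applies the mapping to validation/test text, as described above. Whether rare characters in the training set itself should also be replaced is left open here.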
@starride If a token appears rarely, do we also need to update the training set (to remove/replace this token)?