[Extraction] The "train" set must be extracted first
During extraction, if the train set is not first in dataset.sets (for example, val,test,train instead of train,val,test), the self.charset variable is still empty when the earlier sets are processed, so every character in the sets that come before the train set is replaced by the unknown token.
To reproduce, use sets=",".join([VAL_NAME,TRAIN_NAME,TEST_NAME]) and the tests will fail.
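
A minimal sketch of the order dependence, and one possible workaround, is shown here. It is not the project's actual extraction code: the functions extract and extract_train_first, the "<unk>" token, and the shape of the data dictionary are all hypothetical, chosen only to illustrate how a charset built from the train set stays empty for any set processed before it.

```python
UNKNOWN = "<unk>"  # hypothetical unknown token, not the project's real one


def extract(sets, data):
    """Hypothetical extractor: the charset is filled only when 'train' is reached."""
    charset = set()
    out = {}
    for name in sets.split(","):
        if name == "train":
            # The charset is learned from the train set only.
            for text in data[name]:
                charset.update(text)
        # Any character not yet in the charset becomes the unknown token.
        out[name] = [
            "".join(c if c in charset else UNKNOWN for c in text)
            for text in data[name]
        ]
    return out


def extract_train_first(sets, data):
    """Possible workaround: process 'train' first, then restore the requested order."""
    names = sets.split(",")
    # sorted() is stable, so the other sets keep their relative order.
    ordered = sorted(names, key=lambda n: n != "train")
    result = extract(",".join(ordered), data)
    return {name: result[name] for name in names}
```

With val,test,train as input, extract replaces every character of the val and test sets, while extract_train_first keeps them intact and still returns the sets in the order requested. Raising an explicit error when the train set is not first would be another reasonable choice.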