Move the unknown token replacement step to download
After #239 (closed) is done, we may have an issue with the unknown token.
Example:
- dataset A has no `P` occurrence in its train/val sets but has some in its `test` set.
- all occurrences of `P` in the `test` set are replaced during dataset extraction
- dataset B is being merged with dataset A
- dataset B has occurrences of `P` in its train/val sets. The character `P` should no longer be considered unknown.
- we have no way of reverting that change
That's why the change must be done during `download`, where the dataset is finite.
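
To make the scenario concrete, here is a minimal sketch of the problem. It is not the project's actual code: the `UNKNOWN_TOKEN` value, the helper names `known_charset` and `replace_unknown`, and the dict-of-splits layout are all assumptions made for illustration.

```python
# Hypothetical placeholder for the unknown token; the real value may differ.
UNKNOWN_TOKEN = "⁇"


def known_charset(datasets):
    """Collect every character seen in the train/val splits of the given datasets."""
    return {
        char
        for splits in datasets
        for name in ("train", "val")
        for text in splits.get(name, [])
        for char in text
    }


def replace_unknown(text, charset):
    """Replace each character that is not in the charset with the unknown token."""
    return "".join(char if char in charset else UNKNOWN_TOKEN for char in text)


# Toy data (assumed layout): A only has "P" in test, B has it in train/val.
dataset_a = {"train": ["abc"], "val": ["cab"], "test": ["aPb"]}
dataset_b = {"train": ["bPa"], "val": ["Pc"], "test": ["cP"]}

# Replacement at extraction time, with A alone: "P" is overwritten for good.
early = replace_unknown(dataset_a["test"][0], known_charset([dataset_a]))

# Replacement deferred until both datasets are known: "P" is preserved.
late = replace_unknown(dataset_a["test"][0], known_charset([dataset_a, dataset_b]))
```

In this sketch `early` no longer contains `P`, and nothing in the extracted data lets us recover it once dataset B is merged, whereas `late` keeps the original character because the charset is computed over the full set of datasets.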