Move the unknown token replacement step to download
After #239 is done, we may have an issue with the unknown token.
Example:
- dataset A has no `P` occurrence in its train/val sets but has some in its `test` set.
- all occurrences of `P` in the `test` set are replaced during dataset extraction
- dataset B is being merged with dataset A
- dataset B has occurrences of `P` in its train/val sets. The character `P` should no longer be considered unknown.
- we have no way of reverting that change
That's why the change must be done during download, where the dataset is finite.
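
The scenario above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `<unk>` placeholder, the helper names `charset` and `replace_unknown`, and the toy transcriptions are all hypothetical and do not come from the project's code.

```python
UNK = "<unk>"  # hypothetical placeholder for the unknown token


def charset(texts):
    """Return the set of characters seen in a list of transcriptions."""
    return set("".join(texts))


def replace_unknown(text, known, unk=UNK):
    """Replace every character that is not in `known` with `unk`."""
    return "".join(c if c in known else unk for c in text)


# Dataset A: no "P" in train/val, but one in test.
a_train = ["abc", "cab"]
a_test = ["aPb"]

# Dataset B: "P" appears in train/val.
b_train = ["Pab"]

# Replacement at extraction time: only dataset A is known, so "P" is lost.
a_test_extracted = [replace_unknown(t, charset(a_train)) for t in a_test]

# After merging A and B, "P" is a known character, but the extracted
# test set of A no longer contains it and cannot be restored.
merged_known = charset(a_train + b_train)

# Replacement deferred until the merged charset is known keeps "P".
a_test_deferred = [replace_unknown(t, merged_known) for t in a_test]

print(a_test_extracted)
print(a_test_deferred)
```

Deferring the replacement until the character set of the merged data is known avoids destroying information that a later merge would have made valid.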