
Move the unknown token replacement step to download

After #239 (closed) is done, we may have an issue with the unknown token.

Example:

  • dataset A has no occurrence of the character P in its train/val sets but has some in its test set.
  • all occurrences of P in the test set of dataset A are replaced with the unknown token during dataset extraction
  • dataset B is then merged with dataset A
  • dataset B has occurrences of P in its train/val sets, so the character P should no longer be considered unknown.
  • we have no way of reverting the replacement already applied to dataset A

That's why the change must be done during download, where the dataset is finite.
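
The scenario above can be reproduced with a small sketch. Everything in it is hypothetical and only illustrates the problem: the `UNK` symbol, the `charset` and `replace_unknown` helpers, and the toy datasets are assumptions, not the project's actual code. It compares replacing unknown characters per dataset before merging with replacing them once the merged character set is known.

```python
UNK = "⁇"  # hypothetical unknown-token symbol, chosen for illustration


def charset(splits):
    """Return the characters seen in the train and val splits."""
    return {ch for name in ("train", "val") for text in splits[name] for ch in text}


def replace_unknown(texts, known):
    """Replace every character that is not in `known` with UNK."""
    return ["".join(ch if ch in known else UNK for ch in text) for text in texts]


# Toy datasets: P only appears in the test set of A, but in train/val of B.
dataset_a = {"train": ["abc"], "val": ["cab"], "test": ["aPb"]}
dataset_b = {"train": ["Pab"], "val": ["bP"], "test": ["Pc"]}

# Early replacement: A is processed alone, so P is unknown and gets replaced.
a_test_early = replace_unknown(dataset_a["test"], charset(dataset_a))

# Merge A and B; P is now part of the known character set.
merged = {s: dataset_a[s] + dataset_b[s] for s in ("train", "val", "test")}
merged_known = charset(merged)

# The early result cannot be undone: the original P in A's test set is lost.
early = a_test_early + replace_unknown(dataset_b["test"], merged_known)

# Deferred replacement: applied once, against the merged character set.
late = replace_unknown(merged["test"], merged_known)

print(early)
print(late)
```

With the early replacement, the test sample from dataset A keeps the unknown token even though P is known after the merge. Deferring the replacement until the final set of datasets is fixed avoids this.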