Split by page

Solene Tarride requested to merge split-by-page into master

Issue:

  • currently, individual lines are shuffled to create the splits, so a single page can contribute lines to the training, validation and test sets
  • as a result, the model sees samples from pages that also appear in the validation/test sets during training
  • this makes the evaluation on the test set overly optimistic

How to fix it:

  • split by page, so that we have entire pages in the validation/test set
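
A minimal sketch of the idea, not the code from this merge request: the function name `split_by_page`, the `(page_id, line_id)` input format, the 80/10/10 ratios and the seed are all assumptions made for illustration. Pages are shuffled instead of lines, and every line follows its page into a single split.

```python
import random
from collections import defaultdict


def split_by_page(lines, train_ratio=0.8, val_ratio=0.1, seed=0):
    """Assign whole pages to train/val/test (hypothetical helper).

    `lines` is an iterable of (page_id, line_id) pairs.
    Returns a dict mapping each split name to a list of line ids.
    """
    # Group line ids by the page they belong to.
    pages = defaultdict(list)
    for page_id, line_id in lines:
        pages[page_id].append(line_id)

    # Shuffle page ids (not lines) with a fixed seed for reproducibility.
    page_ids = sorted(pages)
    random.Random(seed).shuffle(page_ids)

    n_train = int(len(page_ids) * train_ratio)
    n_val = int(len(page_ids) * val_ratio)
    page_splits = {
        "train": page_ids[:n_train],
        "val": page_ids[n_train:n_train + n_val],
        "test": page_ids[n_train + n_val:],  # remaining pages
    }

    # Every line follows its page, so no page is shared between splits.
    return {
        name: [line_id for page_id in ids for line_id in pages[page_id]]
        for name, ids in page_splits.items()
    }
```

With 10 pages of 5 lines each, this assigns 8 pages (40 lines) to training and 1 page (5 lines) each to validation and test. Note that the ratios apply to pages, so the line counts per split can drift from the target when pages differ in length.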

What I did:

  • split by page when creating the partitions
  • fix tests
Edited by Solene Tarride