Split by page
Issue:
- currently, individual lines are shuffled to create the splits, so lines from a single page can end up in the training, validation and test sets
- as a result, the model sees lines from validation/test pages during training
- this leakage makes the evaluation on the test set overly optimistic
How to fix it:
- split by page, so that we have entire pages in the validation/test set
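
For illustration, the idea can be sketched as below. This is a minimal sketch and not the code from this change: the function name `split_by_page`, the `(page_id, line_id)` input format, the split ratios and the seed are all assumptions. The point is that pages are shuffled, not lines, and each page is assigned to exactly one split.

```python
import random
from collections import defaultdict


def split_by_page(lines, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split lines into train/val/test so that no page spans two splits.

    `lines` is an iterable of (page_id, line_id) pairs (hypothetical format).
    `ratios` gives the train and val fractions of pages; test gets the rest.
    """
    # Group line IDs by the page they belong to.
    by_page = defaultdict(list)
    for page_id, line_id in lines:
        by_page[page_id].append(line_id)

    # Shuffle pages (not lines) with a fixed seed for reproducibility.
    pages = sorted(by_page)
    random.Random(seed).shuffle(pages)

    n_train = int(len(pages) * ratios[0])
    n_val = int(len(pages) * ratios[1])
    page_splits = {
        "train": pages[:n_train],
        "val": pages[n_train:n_train + n_val],
        "test": pages[n_train + n_val:],
    }

    # Expand each split back to its (page_id, line_id) pairs.
    return {
        name: [(page, line) for page in split_pages for line in by_page[page]]
        for name, split_pages in page_splits.items()
    }
```

Because the ratios apply to pages, the number of lines per split only approximates them when pages have different line counts.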
What I did:
- split by page when creating the partitions
- fix tests
Edited by Solene Tarride