Add a command for dataset sanity check

I would like to create a new command to check a dataset before training.

New command:

pylaia-htr-dataset check --dataset my_dataset/ --model experiment/model

Some checks:

  • check for missing images
  • make sure each image's height equals fixed_input_height when fixed_input_height is not None
  • make sure each image's width is large enough
  • make sure every symbol in train|val|test.txt appears in syms.txt
  • check the dataset structure:
# Images
├── images/
│   ├── train/
│   ├── val/
│   └── test/
# Image ids (used for prediction)
├── train_ids.txt
├── val_ids.txt
├── test_ids.txt
# Tokenized transcriptions (used for training)
├── train.txt
├── val.txt
# Transcriptions (used for evaluation)
├── train_text.txt
├── val_text.txt
├── test_text.txt
# Symbol list
└── syms.txt
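
A few of the checks listed above could be sketched as follows. This is only an illustration under assumptions, not the actual implementation: the function names, the `min_width` parameter, and the assumed file formats (one symbol per line in syms.txt with the symbol in the first column, and `<image id> <token> <token> ...` lines in the tokenized files) are hypothetical. Image sizes are passed in as `(width, height)` tuples so the sketch stays independent of any image library.

```python
from pathlib import Path

# Expected layout, copied from the tree above.
EXPECTED_DIRS = ["images/train", "images/val", "images/test"]
EXPECTED_FILES = [
    "train_ids.txt", "val_ids.txt", "test_ids.txt",
    "train.txt", "val.txt",
    "train_text.txt", "val_text.txt", "test_text.txt",
    "syms.txt",
]


def check_structure(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


def load_symbols(syms_path):
    """Read syms.txt, assuming the symbol is the first column of each line."""
    symbols = set()
    for line in Path(syms_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            symbols.add(line.split()[0])
    return symbols


def check_symbols(root, splits=("train", "val")):
    """Map each split to the sorted tokens that are not listed in syms.txt."""
    root = Path(root)
    known = load_symbols(root / "syms.txt")
    unknown = {}
    for split in splits:
        path = root / f"{split}.txt"
        if not path.is_file():
            continue  # a missing file is reported by check_structure instead
        tokens = set()
        for line in path.read_text(encoding="utf-8").splitlines():
            # Assumed line format: "<image id> <token> <token> ..."
            tokens.update(line.split()[1:])
        extra = sorted(tokens - known)
        if extra:
            unknown[split] = extra
    return unknown


def check_size(size, fixed_input_height=None, min_width=1):
    """Return a list of problems for one image, given its (width, height)."""
    width, height = size
    problems = []
    if fixed_input_height is not None and height != fixed_input_height:
        problems.append(f"height {height} != fixed_input_height {fixed_input_height}")
    if width < min_width:  # min_width is a hypothetical threshold
        problems.append(f"width {width} < minimum width {min_width}")
    return problems
```

Each function returns the problems it found instead of raising, so the command could run every check and report all issues in a single pass.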

Output:

  • log problematic images
  • write a Markdown file with dataset analysis and statistics
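
As a rough sketch of what this output could look like, the snippet below logs each problematic image and renders a small Markdown report. The function name, the report headings, and the shape of the inputs (`stats` as a split-to-count mapping, `problems` as `(image, reason)` pairs) are assumptions made for illustration only.

```python
import logging

logger = logging.getLogger("dataset_check")  # hypothetical logger name


def render_report(stats, problems):
    """Log each problematic image and return the report as a Markdown string."""
    lines = [
        "# Dataset report",
        "",
        "## Statistics",
        "",
        "| Split | Images |",
        "| --- | --- |",
    ]
    for split, count in stats.items():
        lines.append(f"| {split} | {count} |")

    lines += ["", "## Problematic images", ""]
    if problems:
        for image, reason in problems:
            logger.warning("%s: %s", image, reason)
            lines.append(f"- `{image}`: {reason}")
    else:
        lines.append("No problem found.")
    return "\n".join(lines) + "\n"
```

The returned string can then be written to disk, for example with `Path("report.md").write_text(report, encoding="utf-8")`.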
Edited by Solene Tarride