Add a command for dataset sanity check
I would like to create a new command to check a dataset before training.
New command:
pylaia-htr-dataset check --dataset my_dataset/ --model experiment/model
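A minimal sketch of how the proposed command line could be parsed. It uses the standard-library argparse purely for illustration; the parser layout and the `build_parser` helper are assumptions, not existing PyLaia code, and only the `check` subcommand and the `--dataset` / `--model` options come from the proposal above.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the proposed command (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="pylaia-htr-dataset")
    subparsers = parser.add_subparsers(dest="command", required=True)

    check = subparsers.add_parser("check", help="Check a dataset before training")
    # Root directory of the dataset to check.
    check.add_argument("--dataset", type=Path, required=True)
    # Path to the model, e.g. to look up fixed_input_height.
    check.add_argument("--model", type=Path, required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Checking {args.dataset} against {args.model}")
```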
Some checks:
- check for missing images
- make sure image height is equal to fixed_input_height if fixed_input_height is not None
- make sure image width is large enough
- make sure symbols in the train|val|test.txt appear in syms.txt
- check the dataset structure
# Images
├── images
│   ├── train/
│   ├── val/
│   └── test/
# Image ids (used for prediction)
├── train_ids.txt
├── val_ids.txt
├── test_ids.txt
# Tokenized transcriptions (used for training)
├── train.txt
├── val.txt
# Transcriptions (used for evaluation)
├── train_text.txt
├── val_text.txt
├── test_text.txt
# Symbol list
└── syms.txt
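The checks listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing PyLaia API: all function names are hypothetical, image ids are assumed to be paths relative to the images directory (with or without extension), each transcription line is assumed to be an image id followed by space-separated tokens, syms.txt is assumed to hold one symbol per line (optionally followed by an index), and `min_width` is a placeholder because the issue does not define what "large enough" means. Image sizes are passed in as plain integers so that the sketch needs no imaging library.

```python
from pathlib import Path

# Expected layout, taken from the tree above.
EXPECTED_DIRS = ["images/train", "images/val", "images/test"]
EXPECTED_FILES = [
    "train_ids.txt", "val_ids.txt", "test_ids.txt",
    "train.txt", "val.txt",
    "train_text.txt", "val_text.txt", "test_text.txt",
    "syms.txt",
]


def check_structure(dataset: Path) -> list[str]:
    """Return the expected directories and files that are missing."""
    missing = [d for d in EXPECTED_DIRS if not (dataset / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (dataset / f).is_file()]
    return missing


def missing_images(ids_path: Path, images_dir: Path,
                   extensions=(".jpg", ".png")) -> list[str]:
    """Return the ids that have no matching image file (assumed id format)."""
    missing = []
    for line in ids_path.read_text(encoding="utf-8").splitlines():
        image_id = line.strip()
        if not image_id:
            continue
        candidates = [images_dir / image_id]
        candidates += [images_dir / f"{image_id}{ext}" for ext in extensions]
        if not any(path.is_file() for path in candidates):
            missing.append(image_id)
    return missing


def size_problems(width: int, height: int,
                  fixed_input_height=None, min_width: int = 1) -> list[str]:
    """Return messages describing size issues (min_width is a placeholder)."""
    problems = []
    if fixed_input_height is not None and height != fixed_input_height:
        problems.append(f"height {height} != fixed_input_height {fixed_input_height}")
    if width < min_width:
        problems.append(f"width {width} < minimum width {min_width}")
    return problems


def load_symbols(syms_path: Path) -> set[str]:
    """Read the symbol list, keeping the first field of each non-empty line."""
    symbols = set()
    for line in syms_path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            symbols.add(line.split()[0])
    return symbols


def unknown_symbols(table_path: Path, symbols: set[str]) -> dict[str, set[str]]:
    """Map each image id to the tokens that do not appear in the symbol list."""
    problems = {}
    for line in table_path.read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if not parts:
            continue
        image_id, tokens = parts[0], parts[1:]
        unknown = set(tokens) - symbols
        if unknown:
            problems[image_id] = unknown
    return problems
```

Each function returns the problems it found instead of raising, so the command can collect everything in one pass and report it together.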
Output:
- logs problematic images
- Markdown file with dataset analysis/statistics
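A possible shape for the two outputs, as a sketch only: the logger name, the function names and the report layout are assumptions made for illustration. Problems are expected as a mapping from image id to a list of messages, and statistics as a mapping from split name to image count.

```python
import logging
from pathlib import Path

logger = logging.getLogger("dataset_check")  # hypothetical logger name


def log_problems(problems: dict[str, list[str]]) -> None:
    """Log one warning per problem found on an image."""
    for image_id, messages in sorted(problems.items()):
        for message in messages:
            logger.warning("%s: %s", image_id, message)


def build_report(image_counts: dict[str, int],
                 problems: dict[str, list[str]]) -> str:
    """Render the statistics and the problematic images as Markdown."""
    lines = [
        "# Dataset check",
        "",
        "## Statistics",
        "",
        "| Split | Images |",
        "| --- | --- |",
    ]
    for split, count in image_counts.items():
        lines.append(f"| {split} | {count} |")

    lines += ["", "## Problematic images", ""]
    if problems:
        for image_id, messages in sorted(problems.items()):
            lines.append(f"- {image_id}: {'; '.join(messages)}")
    else:
        lines.append("No problems found.")
    return "\n".join(lines) + "\n"


def write_report(path: Path, image_counts: dict[str, int],
                 problems: dict[str, list[str]]) -> None:
    """Log the problems, then write the Markdown report to disk."""
    log_problems(problems)
    path.write_text(build_report(image_counts, problems), encoding="utf-8")
```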
Edited by Solene Tarride