Add a command for dataset sanity checks

I would like to create a new command to check a dataset before training.

New command:

pylaia-htr-dataset check --dataset my_dataset/ --model experiment/model
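
To make the proposed interface concrete, here is a minimal sketch of how the command line could be parsed. It uses plain `argparse` from the standard library, which is an assumption of this sketch and not necessarily how PyLaia wires its commands. Only the `check` subcommand and the `--dataset` and `--model` options come from the proposal above; `build_parser` is a hypothetical helper name.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the proposed `pylaia-htr-dataset check` command."""
    parser = argparse.ArgumentParser(prog="pylaia-htr-dataset")
    subparsers = parser.add_subparsers(dest="command", required=True)

    check = subparsers.add_parser("check", help="Check a dataset before training")
    # Root directory of the dataset to check.
    check.add_argument("--dataset", type=Path, required=True)
    # Path to the model, as in the proposed command.
    check.add_argument("--model", type=Path, required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command, args.dataset, args.model)
```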

Some checks:

check for missing images
make sure every image height equals fixed_input_height when fixed_input_height is set (not None)
make sure image width is large enough
make sure every symbol used in train|val|test.txt appears in syms.txt
check the dataset structure against the expected layout:

# Images
├── images
│   ├── train/
│   ├── val/
│   └── test/
# Image ids (used for prediction)
├── train_ids.txt
├── val_ids.txt
├── test_ids.txt
# Tokenized transcriptions (used for training)
├── train.txt
├── val.txt
# Transcriptions (used for evaluation)
├── train_text.txt
├── val_text.txt
├── test_text.txt
# Symbol list
└── syms.txt
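
Two of these checks (dataset structure and symbol coverage) could be sketched as follows. This is only an illustration, not an existing PyLaia API: the function names are hypothetical, the expected entries are copied from the tree above, and the file formats are assumptions (one `symbol index` pair per line in syms.txt, one `image_id token token ...` line per sample in the tokenized files).

```python
from __future__ import annotations

from pathlib import Path

# Expected entries, copied from the dataset tree above.
EXPECTED_DIRS = ["images/train", "images/val", "images/test"]
EXPECTED_FILES = [
    "train_ids.txt", "val_ids.txt", "test_ids.txt",
    "train.txt", "val.txt",
    "train_text.txt", "val_text.txt", "test_text.txt",
    "syms.txt",
]


def check_structure(dataset: Path) -> list[str]:
    """Return the expected directories and files that are missing."""
    missing = [d for d in EXPECTED_DIRS if not (dataset / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (dataset / f).is_file()]
    return missing


def load_symbols(syms_file: Path) -> set[str]:
    """Read the symbol list (assumed format: `symbol index` on each line)."""
    symbols = set()
    for line in syms_file.read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if parts:
            symbols.add(parts[0])
    return symbols


def check_symbols(dataset: Path, split: str) -> dict[str, set[str]]:
    """Map each image id in `<split>.txt` to its tokens absent from syms.txt.

    Assumed line format: `image_id token token ...`, whitespace separated.
    """
    symbols = load_symbols(dataset / "syms.txt")
    unknown: dict[str, set[str]] = {}
    text = (dataset / f"{split}.txt").read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        image_id, *tokens = line.split()
        extra = set(tokens) - symbols
        if extra:
            unknown[image_id] = extra
    return unknown
```

Returning the problems instead of raising on the first one lets the command report every issue in a single run.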

Output:

a log of problematic images
a Markdown file with dataset analysis and statistics
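
As a rough sketch of both outputs, the code below logs each missing image and writes a small Markdown table per split. Everything here is hypothetical: the logger name, the function names and the report layout are illustrative choices, and it assumes ids look like `train/im01` with images stored as `images/train/im01.<extension>`.

```python
from __future__ import annotations

import logging
from pathlib import Path

logger = logging.getLogger("dataset_check")  # hypothetical logger name


def read_ids(dataset: Path, split: str) -> list[str]:
    """Read `<split>_ids.txt`, assuming one image id per line."""
    text = (dataset / f"{split}_ids.txt").read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def find_missing_images(dataset: Path, ids: list[str]) -> list[str]:
    """Return (and log) the ids with no matching file under `images/`."""
    missing = []
    for image_id in ids:
        target = dataset / "images" / image_id
        folder = target.parent
        # Accept any extension: compare file stems with the id's last part.
        found = folder.is_dir() and any(
            p.stem == target.name for p in folder.iterdir()
        )
        if not found:
            logger.warning("Missing image for id %s", image_id)
            missing.append(image_id)
    return missing


def render_report(stats: dict[str, tuple[int, list[str]]]) -> str:
    """Render a Markdown table from `{split: (number of ids, missing ids)}`."""
    lines = [
        "# Dataset check",
        "",
        "| Split | Images listed | Images missing |",
        "| --- | --- | --- |",
    ]
    for split, (count, missing) in stats.items():
        lines.append(f"| {split} | {count} | {len(missing)} |")
    return "\n".join(lines) + "\n"


def write_report(dataset: Path, output: Path,
                 splits: tuple[str, ...] = ("train", "val", "test")) -> None:
    """Check every split and write the Markdown report to `output`."""
    stats = {}
    for split in splits:
        ids = read_ids(dataset, split)
        stats[split] = (len(ids), find_missing_images(dataset, ids))
    output.write_text(render_report(stats), encoding="utf-8")
```

Other statistics (image sizes, symbol counts) could be added as extra columns or sections in the same report.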

Edited Apr 16, 2024 by Solene Tarride