Export parameters JSON
Context
We would like to log the parameters used to generate a dataset. For instance:
- Which folder/corpus IDs were used?
- What are the splits?
- Which extraction mode was used for training? (The workers should be run with the same extraction mode.)
- When was the dataset generated, and by whom?
To do this, I want to organize the arguments into groups and export them. I would also like to save the splits that were created.
Implementation
Three main things were implemented:
- Grouping arguments into data classes. I think this makes the code easier to read, since this project has about 30 arguments.
- The option to run this code from a configuration file:
kaldi-data-generator --config config.yaml
- Exporting a config.yaml file + a more detailed parameters.json file after running kaldi-data-generator
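As a rough illustration of the first and third points, the sketch below groups a few arguments into nested data classes and serializes them to JSON together with a timestamp and a user name. The class names (ImageArgs, SplitArgs, Config), the export_parameters helper, and the created_at / created_by keys are hypothetical and chosen for this example only; the actual classes and the layout of parameters.json may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


# Hypothetical argument groups; the real class and field names may differ.
@dataclass
class ImageArgs:
    extraction_mode: str = "deskew_min_area_rect"
    max_deskew_angle: int = 45


@dataclass
class SplitArgs:
    train_ratio: float = 0.8
    test_ratio: float = 0.1


@dataclass
class Config:
    format: str = "kaldi"
    dataset_name: str = "balsac"
    out_dir: str = "my_balsac_kaldi"
    image: ImageArgs = field(default_factory=ImageArgs)
    split: SplitArgs = field(default_factory=SplitArgs)


def export_parameters(config: Config, created_by: str) -> str:
    """Serialize the nested arguments plus basic provenance to a JSON string."""
    payload = {
        # asdict() recursively converts nested data classes to plain dicts.
        "config": asdict(config),
        # Assumed provenance keys: when the dataset was generated, and by whom.
        "created_at": datetime.now(timezone.utc).isoformat(),
        "created_by": created_by,
    }
    return json.dumps(payload, indent=2)


params_json = export_parameters(Config(), created_by="some_user")
print(params_json)
```

Grouping the arguments this way means the export step does not need to know about individual arguments: any field added to a data class is picked up by asdict() automatically.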
Example
From this YAML configuration:
---
format: kaldi
dataset_name: balsac
out_dir: my_balsac_kaldi
common:
  cache_dir: "/tmp/kaldi_data_generator_solene/cache/"
  log_parameters: true
image:
  extraction_mode: deskew_min_area_rect
  max_deskew_angle: 45
split:
  train_ratio: 0.8
  test_ratio: 0.1
select:
  pages:
    - 18c1d2d9-72e8-4f7a-a866-78b59dd407dd
    - 901b9c27-1cbe-44ea-94a0-d9c783f17905
    - db9dd27c-e96c-43c2-bf29-991212243453
    - b87999e2-3733-43b1-b8ef-0a297f90bf0f
    - 7fe3d786-068f-48c9-ae63-86db2f986c4c
    - 4fc61e75-4a11-42e3-b317-348451629bda
    - 3e7e37c2-d0cc-41b3-8d8c-6de0bbc69012
    - 63b6e80b-a825-4068-a12a-d12e3edf5f80
    - b11decff-1c07-4c51-a5be-401974ea55ea
    - 735cdde6-e540-4dbd-b271-2206e2498156
filter:
  transcription_type: text_line
Running this command:
kaldi-data-generator --config config.yaml -o my_balsac_kraken --format kraken --image.extraction_mode polygon
will produce the two attached files.
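The command above mixes a configuration file with command-line flags such as --image.extraction_mode. The following sketch shows one way dotted flags could be merged into the nested values read from the YAML file. It does not reproduce the attached files, and it rests on two assumptions that this description does not state: that command-line values take precedence over the file, and that -o maps to out_dir. The apply_override helper is hypothetical.

```python
import copy


def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Return a copy of `config` with the nested key `a.b.c` set to `value`."""
    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        # Walk down the nesting, creating intermediate dicts if missing.
        node = node.setdefault(part, {})
    node[leaf] = value
    return updated


# Values taken from the YAML example above (only a subset is shown).
file_config = {
    "format": "kaldi",
    "out_dir": "my_balsac_kaldi",
    "image": {"extraction_mode": "deskew_min_area_rect", "max_deskew_angle": 45},
}

# Overrides matching the flags of the command above
# (-o is assumed here to map to out_dir).
overrides = {
    "out_dir": "my_balsac_kraken",
    "format": "kraken",
    "image.extraction_mode": "polygon",
}

effective = file_config
for key, value in overrides.items():
    effective = apply_override(effective, key, value)

print(effective)
```

Under these assumptions, the exported configuration would record the effective values (kraken, polygon) rather than the ones written in the input file, which is what makes the export useful for reproducing a run.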
Edited by Solene Tarride