Export parameters json

Solene Tarride requested to merge export-parameters-json into master

Context

We would like to log the parameters used to generate a dataset. For instance:

  • Which folder/corpus IDs were used?
  • What are the splits?
  • Which extraction mode was used for training? (The workers should be run with the same extraction mode.)
  • When was the dataset generated, and by whom?

To do this, I want to organize the arguments and export them. I would also like to save the splits that were created.

Implementation

Three main things were implemented:

  • Grouping arguments into data classes. I feel this makes the code easier to read, as there are ~30 arguments in this project.
  • The possibility to run the code from a configuration file: kaldi-data-generator --config config.yaml
  • Exporting a config.yaml file and a more detailed parameters.json file after running kaldi-data-generator.
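
To give an idea of the approach, here is a minimal sketch of grouping arguments into data classes and serializing them to JSON. It is not the code from this merge request: the class names (ImageParameters, SplitParameters, Parameters), the export_parameters helper, and the created_at / created_by keys are all hypothetical, and only a handful of the ~30 arguments are shown. The default values are taken from the example configuration.

```python
import getpass
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


# Hypothetical groups; the real project may split its arguments differently.
@dataclass
class ImageParameters:
    extraction_mode: str = "deskew_min_area_rect"
    max_deskew_angle: int = 45


@dataclass
class SplitParameters:
    train_ratio: float = 0.8
    test_ratio: float = 0.1


@dataclass
class Parameters:
    format: str = "kaldi"
    dataset_name: str = "balsac"
    out_dir: str = "my_balsac_kaldi"
    image: ImageParameters = field(default_factory=ImageParameters)
    split: SplitParameters = field(default_factory=SplitParameters)


def export_parameters(params, user=None):
    """Return the parameters as a JSON string, with who/when metadata added."""
    record = asdict(params)  # nested data classes become nested dicts
    # Assumed metadata keys, answering "when was it generated, and by whom?"
    record["created_at"] = datetime.now(timezone.utc).isoformat()
    record["created_by"] = user or getpass.getuser()
    return json.dumps(record, indent=2, sort_keys=True)
```

Calling export_parameters(Parameters()) returns a string that could be written to parameters.json next to the generated dataset.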

Example

From this YAML configuration:

---
format: kaldi
dataset_name: balsac
out_dir: my_balsac_kaldi
common:
  cache_dir: "/tmp/kaldi_data_generator_solene/cache/"
  log_parameters: true
image:
  extraction_mode: deskew_min_area_rect
  max_deskew_angle: 45
split:
  train_ratio: 0.8
  test_ratio: 0.1
select:
  pages:
  - 18c1d2d9-72e8-4f7a-a866-78b59dd407dd
  - 901b9c27-1cbe-44ea-94a0-d9c783f17905
  - db9dd27c-e96c-43c2-bf29-991212243453
  - b87999e2-3733-43b1-b8ef-0a297f90bf0f
  - 7fe3d786-068f-48c9-ae63-86db2f986c4c
  - 4fc61e75-4a11-42e3-b317-348451629bda
  - 3e7e37c2-d0cc-41b3-8d8c-6de0bbc69012
  - 63b6e80b-a825-4068-a12a-d12e3edf5f80
  - b11decff-1c07-4c51-a5be-401974ea55ea
  - 735cdde6-e540-4dbd-b271-2206e2498156
filter:
  transcription_type: text_line

Running this command:

kaldi-data-generator --config config.yaml -o my_balsac_kraken --format kraken --image.extraction_mode polygon

produces the two attached files:
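
The command above passes flags that differ from the YAML values (format, output directory, extraction mode). Assuming command-line flags take precedence over config.yaml, which this description does not spell out, the merge of dotted flags such as --image.extraction_mode onto the nested configuration could be sketched as follows. The apply_overrides helper is hypothetical, and mapping -o to out_dir is an assumption.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with dotted-key overrides applied on top."""
    merged = copy.deepcopy(config)  # leave the loaded configuration untouched
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            # Walk down into the nested section, creating it if it is missing.
            node = node.setdefault(name, {})
        node[leaf] = value
    return merged


# Subset of the YAML example, as it would look once loaded into a dict.
config = {
    "format": "kaldi",
    "out_dir": "my_balsac_kaldi",
    "image": {"extraction_mode": "deskew_min_area_rect", "max_deskew_angle": 45},
}

# Flags from the command above (assumes -o corresponds to out_dir).
overrides = {
    "out_dir": "my_balsac_kraken",
    "format": "kraken",
    "image.extraction_mode": "polygon",
}

effective = apply_overrides(config, overrides)
```

Under these assumptions, effective keeps max_deskew_angle from the YAML file while format, out_dir, and image.extraction_mode come from the command line.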

Edited by Solene Tarride