Kaldi training data generator
This script downloads pages with transcriptions from Arkindex and converts the data to Kaldi format. It also generates reproducible train, validation and test splits.
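To illustrate what a reproducible split means in practice, here is a minimal Python sketch. It is not the package's actual implementation: the function name `split_ids`, the `seed` parameter and the rounding strategy are assumptions made for this example. The idea is simply that sorting the ids and shuffling them with a fixed seed yields the same partition on every run.

```python
import random


def split_ids(ids, train_ratio=0.8, test_ratio=0.1, seed=0):
    """Shuffle ids with a fixed seed, then cut them into train/val/test.

    Hypothetical helper, for illustration only.
    """
    ordered = sorted(ids)  # sort first so the input order does not matter
    random.Random(seed).shuffle(ordered)  # seeded, hence repeatable
    n_train = round(len(ordered) * train_ratio)
    n_test = round(len(ordered) * test_ratio)
    train = ordered[:n_train]
    test = ordered[n_train:n_train + n_test]
    val = ordered[n_train + n_test:]  # validation gets whatever remains
    return {"train": train, "val": val, "test": test}
```

With ratios of 0.8 and 0.1, ten ids would end up as eight for training, one for testing and one for validation.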
Usage
Installation
Install it as a package:
virtualenv -p python3 .env
source .env/bin/activate
pip install -e .
Environment variables
The ARKINDEX_API_TOKEN and ARKINDEX_API_URL environment variables must be defined.
You can create an alias by adding this line to your ~/.bashrc:
alias set_demo='export ARKINDEX_API_URL=https://demo.arkindex.org/;export ARKINDEX_API_TOKEN=my_api_token'
Then run:
source ~/.bashrc
set_demo
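Before starting a long extraction, it can be useful to check that both variables are actually set. The following is a small Python sketch; the helper `missing_arkindex_vars` is hypothetical and is not part of atr_data_generator.

```python
import os

# Variable names taken from the section above.
REQUIRED_VARS = ("ARKINDEX_API_URL", "ARKINDEX_API_TOKEN")


def missing_arkindex_vars(environ=None):
    """Return the required variables that are unset or empty.

    Hypothetical helper, for illustration only.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_arkindex_vars()` with no argument inspects the current process environment; an empty list means both variables are defined.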
Arguments
Use --help to list the available parameters (or read atr_data_generator/arguments.py):
atr-data-generator --help
You can also set the arguments using a JSON or YAML configuration file:
---
dataset_name: balsac
out_dir: my_balsac_kaldi
common:
  cache_dir: "/tmp/atr_data_generator_solene/cache/"
  log_parameters: true
image:
  extraction_mode: deskew_min_area_rect
  max_deskew_angle: 45
split:
  train_ratio: 0.8
  test_ratio: 0.1
select:
  pages:
    - 18c1d2d9-72e8-4f7a-a866-78b59dd407dd
    - 901b9c27-1cbe-44ea-94a0-d9c783f17905
    - db9dd27c-e96c-43c2-bf29-991212243453
    - b87999e2-3733-43b1-b8ef-0a297f90bf0f
    - 7fe3d786-068f-48c9-ae63-86db2f986c4c
    - 4fc61e75-4a11-42e3-b317-348451629bda
    - 3e7e37c2-d0cc-41b3-8d8c-6de0bbc69012
    - 63b6e80b-a825-4068-a12a-d12e3edf5f80
    - b11decff-1c07-4c51-a5be-401974ea55ea
    - 735cdde6-e540-4dbd-b271-2206e2498156
filter:
  transcription_type: text_line
In this case, run:
atr-data-generator --config config.yaml
Every run will export a config.yaml file and a param.json that can be used to reproduce the data generation.
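Since the configuration may also be given as JSON, it can be generated from Python with the standard library alone. The sketch below mirrors some of the keys of the YAML example above; the helpers `build_config` and `write_config` are hypothetical, and the nesting of the keys is assumed to match the dotted command-line options (for instance `select.pages`).

```python
import json
from pathlib import Path


def build_config(dataset_name, out_dir, page_ids, train_ratio=0.8, test_ratio=0.1):
    """Build a configuration dict mirroring the YAML example above.

    Hypothetical helper; key names are copied from that example.
    """
    return {
        "dataset_name": dataset_name,
        "out_dir": out_dir,
        "split": {"train_ratio": train_ratio, "test_ratio": test_ratio},
        "select": {"pages": list(page_ids)},
        "filter": {"transcription_type": "text_line"},
    }


def write_config(config, path):
    """Serialize the configuration to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path
```

For example, `write_config(build_config("balsac", "my_balsac_kaldi", page_ids), "config.json")` writes a file that could then be passed with `--config config.json`.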
Examples
📝 These corpus ids are from https://demo.arkindex.org/; use the set_demo alias defined above.
Kaldi format
With page ids
atr-data-generator --dataset_name my_balsac --out_dir balsac --select.pages [18c1d2d9-72e8-4f7a-a866-78b59dd407dd,901b9c27-1cbe-44ea-94a0-d9c783f17905,db9dd27c-e96c-43c2-bf29-991212243453]
With volume ids
atr-data-generator --dataset_name my_balsac --out_dir balsac --select.volumes [1d5a26d8-6a3e-45ed-bbb6-5a33d09782aa,46a3426f-86d4-45f1-bd57-0de43cd63efd,85207944-2230-4b76-a98f-735a11506743]
With corpus ids
atr-data-generator --dataset_name my_balsac --out_dir balsac --select.corpora [135eb31f-2c33-4ae3-be4e-2ae9adfd7c75] --select.volume_type page
The script creates 3 directories (Lines, Transcriptions and Partitions) in the specified out_dir.
The contents of these directories must be copied (or symlinked) to the corresponding directories in data/local/ of the Kaldi recipe.
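The symlinking step can be scripted. The following Python sketch assumes the recipe already has (or may receive) a data/local/ directory containing Lines, Transcriptions and Partitions; the function `link_into_recipe` and its arguments are hypothetical and not provided by this package.

```python
from pathlib import Path

# Directory names produced by the generator, as listed above.
SUBDIRS = ("Lines", "Transcriptions", "Partitions")


def link_into_recipe(out_dir, recipe_dir):
    """Symlink every entry of out_dir/<subdir> into recipe_dir/data/local/<subdir>.

    Returns the list of created links. Entries that already exist in the
    recipe are left untouched. Hypothetical helper, for illustration only.
    """
    out_dir, recipe_dir = Path(out_dir), Path(recipe_dir)
    created = []
    for name in SUBDIRS:
        source = out_dir / name
        target = recipe_dir / "data" / "local" / name
        target.mkdir(parents=True, exist_ok=True)
        for item in sorted(source.iterdir()):
            link = target / item.name
            if link.exists() or link.is_symlink():
                continue  # never overwrite what the recipe already has
            link.symlink_to(item.resolve())  # absolute target keeps the link valid
            created.append(link)
    return created
```

Running it a second time creates nothing new, so it is safe to call after each data generation.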
TODO
- Pylaia format
- DAN format
- Resize image (fixed height, fixed width, rescale...)