### Kaldi training data generator
This script downloads pages and their transcriptions from Arkindex
and converts the data to Kaldi format. It also generates the train, validation, and test splits.
`ARKINDEX_API_TOKEN` and `ARKINDEX_API_URL` environment variables must be defined.
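A minimal pre-flight check for these two variables could look like the sketch below. It is not part of `kaldi_data_generator.py`; the helper name `missing_env_vars` is hypothetical and only illustrates failing early with a clear message before any download starts.

```python
import os

# The two variables the README says must be defined.
REQUIRED_VARS = ("ARKINDEX_API_TOKEN", "ARKINDEX_API_URL")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to exercise without touching os.environ.
    (Hypothetical helper, not part of the script.)
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example use before launching the generator:
# missing = missing_env_vars()
# if missing:
#     raise SystemExit("Missing environment variables: " + ", ".join(missing))
```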
Install the necessary dependencies:
```bash
virtualenv -p python3 .env
source .env/bin/activate
pip install -r requirements.txt
```
Show the script's help message:
```bash
python kaldi_data_generator.py --help
```
Simple example:
```bash
python kaldi_data_generator.py --dataset_name my_balsac --out_dir /tmp/balsac/ --volumes 8f4005e9-1921-47b0-be7b-e27c7fd29486 d2f7c563-1622-4721-bd51-96fab97189f7
```
Example using the `polygon` extraction mode:
```bash
python kaldi_data_generator.py --dataset_name my_balsac2 --extraction_mode polygon --out_dir /tmp/balsac/ --pages 50e1c3c0-2fe9-4216-805e-1a2fd2e7e9f4
```
The script creates three directories, `Lines`, `Transcriptions`, and `Partitions`, in the directory given by `--out_dir`.
The contents of these directories must be copied (or symlinked) to the corresponding directories in `data/local/` of the Kaldi recipe.
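The symlink variant of this step can be sketched as follows. This is an illustration, not part of the script: the function name `link_outputs` is hypothetical, and it assumes the corresponding directories in `data/local/` carry the same names (`Lines`, `Transcriptions`, `Partitions`) as the generated ones.

```python
from pathlib import Path

# Directory names the generator creates in out_dir (from the README).
SUBDIRS = ("Lines", "Transcriptions", "Partitions")


def link_outputs(out_dir, recipe_local_dir):
    """Symlink every entry of out_dir/<subdir> into recipe_local_dir/<subdir>.

    Assumes the recipe uses the same subdirectory names as the generator.
    Entries that already exist at the destination are left untouched.
    Returns the list of symlinks that were created.
    """
    out_dir = Path(out_dir).resolve()  # absolute targets keep links valid
    recipe_local_dir = Path(recipe_local_dir)
    created = []
    for name in SUBDIRS:
        target_dir = recipe_local_dir / name
        target_dir.mkdir(parents=True, exist_ok=True)
        # Raises FileNotFoundError if the generator output is missing.
        for src in sorted((out_dir / name).iterdir()):
            dst = target_dir / src.name
            if dst.exists() or dst.is_symlink():
                continue  # do not overwrite existing recipe data
            dst.symlink_to(src)
            created.append(dst)
    return created


# Example (the recipe path is a placeholder):
# link_outputs("/tmp/balsac/", "path/to/recipe/data/local")
```

Symlinks avoid duplicating the line images, but they break if `out_dir` is later moved or deleted; copy the files instead when the output lives in a temporary location such as `/tmp`.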