### Kaldi training data generator

This script downloads pages and their transcriptions from Arkindex
and converts the data to Kaldi format. It also generates train, validation and test splits.

### Using the script

`ARKINDEX_API_TOKEN` and `ARKINDEX_API_URL` environment variables must be defined.
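
As a minimal sketch, the two variables could be set in the current shell as shown here. Both values are placeholders (assumptions, not real credentials or a real endpoint); use the URL form and token issued by your own Arkindex instance. The two guard lines only check that the variables are non-empty before the script is run.

```bash
# Placeholder values: replace with your own Arkindex URL and API token.
export ARKINDEX_API_URL="https://arkindex.example.com"
export ARKINDEX_API_TOKEN="your-token-here"

# Abort with an error message if either variable is unset or empty.
: "${ARKINDEX_API_URL:?ARKINDEX_API_URL must be defined}"
: "${ARKINDEX_API_TOKEN:?ARKINDEX_API_TOKEN must be defined}"
```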

Install the necessary dependencies:
```bash
virtualenv -p python3 .env
source .env/bin/activate
pip install -r requirements.txt
```

Use help to list possible parameters:
```bash
python kaldi_data_generator.py --help
```

Simple example:
```bash
python kaldi_data_generator.py --dataset_name my_balsac --out_dir /tmp/balsac/ --volumes 8f4005e9-1921-47b0-be7b-e27c7fd29486 d2f7c563-1622-4721-bd51-96fab97189f7
```

Polygon example:
```bash
python kaldi_data_generator.py --dataset_name my_balsac2 --extraction_mode polygon --out_dir /tmp/balsac/ --pages 50e1c3c0-2fe9-4216-805e-1a2fd2e7e9f4
```

The script creates three directories (`Lines`, `Transcriptions` and `Partitions`) in the specified `out_dir`.
The contents of these directories must be copied (or symlinked) to the corresponding directories in `data/local/` of the Kaldi recipe.
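
The copy step could be sketched as follows. `copy_kaldi_data` is a hypothetical helper, not part of this repository, and it assumes the corresponding directories in `data/local/` have the same names as the generated ones (it creates them if they are missing).

```bash
# Hypothetical helper: copy the generated data into a Kaldi recipe.
#   $1 = the out_dir passed to kaldi_data_generator.py
#   $2 = root directory of the Kaldi recipe
copy_kaldi_data() {
    out_dir="$1"
    recipe_dir="$2"
    for name in Lines Transcriptions Partitions; do
        # Assumption: the target directory has the same name as the source.
        mkdir -p "$recipe_dir/data/local/$name"
        # "/." copies the directory contents rather than the directory itself.
        cp -R "$out_dir/$name/." "$recipe_dir/data/local/$name/"
    done
}
```

For example, `copy_kaldi_data /tmp/balsac /path/to/recipe`, where `/path/to/recipe` stands for wherever your Kaldi recipe lives.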