This script downloads pages with transcriptions from Arkindex
and converts the data to Kaldi or kraken format. It also generates train, val and test splits.
`ARKINDEX_API_TOKEN` and `ARKINDEX_API_URL` environment variables must be defined.
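As a quick sanity check before a long download, the two variables can be verified up front. This is a minimal sketch and not part of the script itself: the helper name `missing_env_vars` is hypothetical, and it treats an empty value as missing, which is an assumption.

```python
import os

# The two variable names come from this README; everything else is illustrative.
REQUIRED_VARS = ("ARKINDEX_API_TOKEN", "ARKINDEX_API_URL")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the real process environment; passing a plain dict
    makes the check easy to try out without touching the shell.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demonstration with a dummy mapping: only the token is provided.
print(missing_env_vars({"ARKINDEX_API_TOKEN": "example-token"}))
# → ['ARKINDEX_API_URL']
```

Calling `missing_env_vars()` with no argument checks the current environment; an empty list means both variables are set.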
Install the necessary dependencies:
```bash
virtualenv -p python3 .env
source .env/bin/activate
pip install -r requirements.txt
```
The `--skip_vertical_lines` option skips all vertical transcriptions.
```bash
python kaldi_data_generator.py -f kaldi --dataset_name my_balsac --out_dir /tmp/balsac/ --volumes 8f4005e9-1921-47b0-be7b-e27c7fd29486 d2f7c563-1622-4721-bd51-96fab97189f7
python kaldi_data_generator.py -f kaldi --dataset_name cz --out_dir /tmp/home_cz/ --corpora 1ed45e94-9108-4029-a529-9abe37f55ba0
python kaldi_data_generator.py -f kaldi --dataset_name my_balsac2 --extraction_mode polygon --out_dir /tmp/balsac/ --pages 50e1c3c0-2fe9-4216-805e-1a2fd2e7e9f4
```
The script creates 3 directories, `Lines`, `Transcriptions` and `Partitions`, in the specified `out_dir`.
Their contents must be copied (or symlinked) to the corresponding directories in `data/local/` of the Kaldi recipe.
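The symlink variant of this step could look like the sketch below. It is not part of the script: the function name `link_output_into_recipe` is hypothetical, and it assumes that the directories under `data/local/` carry the same names (`Lines`, `Transcriptions`, `Partitions`) as the generated ones. Adjust the names if your recipe differs.

```python
from pathlib import Path

# Directory names created by the script in `out_dir` (from this README).
SUBDIRS = ("Lines", "Transcriptions", "Partitions")


def link_output_into_recipe(out_dir, recipe_local_dir):
    """Symlink each generated entry into the recipe's data/local/ tree.

    Assumes the recipe uses the same three directory names. Existing
    entries are left untouched, so the function can be re-run safely.
    Returns the list of links that were created.
    """
    out_dir = Path(out_dir).resolve()  # absolute targets keep links valid
    recipe_local_dir = Path(recipe_local_dir)
    created = []
    for name in SUBDIRS:
        target_dir = recipe_local_dir / name
        target_dir.mkdir(parents=True, exist_ok=True)
        for entry in sorted((out_dir / name).iterdir()):
            link = target_dir / entry.name
            if link.is_symlink() or link.exists():
                continue  # do not overwrite anything already in place
            link.symlink_to(entry)
            created.append(link)
    return created
```

For example, `link_output_into_recipe("/tmp/balsac/", "path/to/recipe/data/local")` would link the output of the first command above, where the recipe path is a placeholder.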
### Kraken format
Simple example:
```
$ python3 kaldi_data_generator.py -f kraken -o <output_dir> --volumes <volume_id> --no_split
```
For instance, to download the 4 IAM sets (there are 2 validation sets on Arkindex) into 3 directories:
```
$ python3 kaldi_data_generator.py -f kraken -o iam_training --volumes e7a95479-e5fc-4b20-830c-0c6e38bf8f72 --no_split
$ python3 kaldi_data_generator.py -f kraken -o iam_validation --volumes edc78ee1-09e0-4671-806b-5fc0392707d9 --no_split
$ python3 kaldi_data_generator.py -f kraken -o iam_validation --volumes fefbbfca-a6dd-4e00-8797-0d4628cb024d --no_split
$ python3 kaldi_data_generator.py -f kraken -o iam_test --volumes 0ce2b631-01d7-49bf-b213-ceb6eae74a9b --no_split
```
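The same four downloads can be driven from a small table that maps each output directory to its volume, which makes the "4 sets, 3 directories" layout explicit. This is only a sketch: `IAM_VOLUMES` and `build_command` are hypothetical names, and the code prints the commands instead of running them. It uses only the flags shown above.

```python
import shlex

# (output directory, volume id) pairs taken from the commands above.
# The two validation volumes share the same output directory.
IAM_VOLUMES = [
    ("iam_training", "e7a95479-e5fc-4b20-830c-0c6e38bf8f72"),
    ("iam_validation", "edc78ee1-09e0-4671-806b-5fc0392707d9"),
    ("iam_validation", "fefbbfca-a6dd-4e00-8797-0d4628cb024d"),
    ("iam_test", "0ce2b631-01d7-49bf-b213-ceb6eae74a9b"),
]


def build_command(output_dir, volume_id):
    """Build the argument list for one kraken-format download."""
    return [
        "python3", "kaldi_data_generator.py",
        "-f", "kraken",
        "-o", output_dir,
        "--volumes", volume_id,
        "--no_split",
    ]


# Print each command; shlex.join requires Python 3.8 or newer.
for output_dir, volume_id in IAM_VOLUMES:
    print(shlex.join(build_command(output_dir, volume_id)))
```

To actually run the downloads, each argument list could be passed to `subprocess.run(cmd, check=True)` in place of the `print` call.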