
Kaldi training data generator

This script downloads pages with transcriptions from Arkindex and converts the data to Kaldi format. It also generates reproducible train, val and test splits.
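To illustrate what "reproducible" means here, one common approach is to sort the ids and shuffle them with a fixed seed. The sketch below is only an illustration under that assumption and is not necessarily how atr-data-generator computes its splits. The function name `split_pages` and the `seed` parameter are hypothetical; the ratios mirror the `train_ratio` and `test_ratio` options, and giving the remainder to val is an assumption.

```python
import random


def split_pages(page_ids, train_ratio=0.8, test_ratio=0.1, seed=0):
    """Hypothetical helper: deterministically split ids into train/val/test."""
    # Sort first so the result does not depend on the input order.
    ids = sorted(page_ids)
    # A dedicated, seeded generator gives the same shuffle on every run.
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_ratio)
    n_test = int(len(ids) * test_ratio)
    return {
        "train": ids[:n_train],
        "test": ids[n_train:n_train + n_test],
        # Assumption: validation receives whatever is left over.
        "val": ids[n_train + n_test:],
    }
```

Calling `split_pages` twice with the same ids and the same seed returns the same three lists, whatever order the ids were passed in.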

Usage

Installation

Install it as a package:

virtualenv -p python3 .env
source .env/bin/activate
pip install -e .

Environment variables

The ARKINDEX_API_TOKEN and ARKINDEX_API_URL environment variables must be defined.

You can create an alias by adding this line to your ~/.bashrc:

alias set_demo='export ARKINDEX_API_URL=https://demo.arkindex.org/;export ARKINDEX_API_TOKEN=my_api_token'

Then run:

source ~/.bashrc
set_demo
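If you launch the generator from a Python wrapper, it can help to check that both variables are set before starting a run. This is a minimal sketch; the helper name `missing_arkindex_variables` is hypothetical and not part of the package.

```python
import os

REQUIRED = ("ARKINDEX_API_URL", "ARKINDEX_API_TOKEN")


def missing_arkindex_variables(environ=None):
    """Hypothetical helper: return the required variables that are unset or empty."""
    # Default to the real process environment; a dict can be passed for testing.
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED if not environ.get(name)]
```

An empty list means the environment is ready; otherwise the list names the variables that still need to be exported.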

Arguments

Use --help to list the available parameters (or read atr_data_generator/arguments.py):

atr-data-generator --help

You can also set the arguments using a JSON or YAML configuration file:

---
dataset_name: balsac
out_dir: my_balsac_kaldi
common:
  cache_dir: "/tmp/atr_data_generator_solene/cache/"
  log_parameters: true
image:
  extraction_mode: deskew_min_area_rect
  max_deskew_angle: 45
split:
  train_ratio: 0.8
  test_ratio: 0.1
select:
  pages:
  - 18c1d2d9-72e8-4f7a-a866-78b59dd407dd
  - 901b9c27-1cbe-44ea-94a0-d9c783f17905
  - db9dd27c-e96c-43c2-bf29-991212243453
  - b87999e2-3733-43b1-b8ef-0a297f90bf0f
  - 7fe3d786-068f-48c9-ae63-86db2f986c4c
  - 4fc61e75-4a11-42e3-b317-348451629bda
  - 3e7e37c2-d0cc-41b3-8d8c-6de0bbc69012
  - 63b6e80b-a825-4068-a12a-d12e3edf5f80
  - b11decff-1c07-4c51-a5be-401974ea55ea
  - 735cdde6-e540-4dbd-b271-2206e2498156
filter:
  transcription_type: text_line

In this case, run:

atr-data-generator --config config.yaml

Every run will export a config.yaml file and a param.json that can be used to reproduce the data generation.
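Since the configuration file can also be JSON, the same settings can be written from Python with the standard library. This sketch assumes that a JSON file uses the same keys and nesting as the YAML example above and that --config accepts its path; the helper name `write_config` and the file name are hypothetical.

```python
import json
from pathlib import Path


def write_config(path, dataset_name, out_dir, pages):
    """Hypothetical helper: write a JSON configuration mirroring the YAML example."""
    config = {
        "dataset_name": dataset_name,
        "out_dir": out_dir,
        "split": {"train_ratio": 0.8, "test_ratio": 0.1},
        "select": {"pages": list(pages)},
        "filter": {"transcription_type": "text_line"},
    }
    # Assumption: the JSON layout matches the YAML layout key for key.
    Path(path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Under these assumptions, the resulting file would be passed the same way as the YAML one, for example with `--config config.json`.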

Examples

📝 The ids used in these examples come from https://demo.arkindex.org/; run set_demo before trying them.

Kaldi format

With page ids

atr-data-generator --dataset_name my_balsac --out_dir balsac --select.pages [18c1d2d9-72e8-4f7a-a866-78b59dd407dd,901b9c27-1cbe-44ea-94a0-d9c783f17905,db9dd27c-e96c-43c2-bf29-991212243453]

With volume ids

atr-data-generator --dataset_name my_balsac --out_dir balsac --select.volumes [1d5a26d8-6a3e-45ed-bbb6-5a33d09782aa,46a3426f-86d4-45f1-bd57-0de43cd63efd,85207944-2230-4b76-a98f-735a11506743]

With corpus ids

atr-data-generator --dataset_name my_balsac --out_dir balsac --select.corpora [135eb31f-2c33-4ae3-be4e-2ae9adfd7c75] --select.volume_type page
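The bracketed lists in the commands above can be treated as glob patterns by some shells (zsh, for instance, reports an error when an unquoted pattern matches no file). One way around this is to build the argument list in Python and run it without a shell. The sketch below only reuses the options shown in the page-id example; the helper name `build_command` is hypothetical.

```python
def build_command(dataset_name, out_dir, pages):
    """Hypothetical helper: build the argument list for the page-id example."""
    # Same list syntax as in the examples: [id1,id2,...] with no spaces.
    page_list = "[" + ",".join(pages) + "]"
    return [
        "atr-data-generator",
        "--dataset_name", dataset_name,
        "--out_dir", out_dir,
        "--select.pages", page_list,
    ]
```

The returned list can be handed to `subprocess.run(..., check=True)`, which passes each argument as-is, so no quoting is needed.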

The script creates three directories (Lines, Transcriptions and Partitions) in the specified out_dir. The contents of these directories must be copied (or symlinked) to the corresponding directories in data/local/ of the Kaldi recipe.
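The symlink step can be sketched in Python as follows. This is an illustration only: the helper name `link_into_recipe` is hypothetical, and it assumes the recipe's data/local/ directory should contain Lines, Transcriptions and Partitions subdirectories with the same names.

```python
from pathlib import Path

SUBDIRS = ("Lines", "Transcriptions", "Partitions")


def link_into_recipe(out_dir, recipe_local_dir):
    """Hypothetical helper: symlink each generated file into the recipe's data/local/."""
    created = []
    for name in SUBDIRS:
        source = Path(out_dir) / name
        target = Path(recipe_local_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        for item in sorted(source.iterdir()):
            link = target / item.name
            # Skip entries that are already present, including dangling links.
            if link.is_symlink() or link.exists():
                continue
            link.symlink_to(item.resolve())
            created.append(link)
    return created
```

Running it a second time creates nothing new, so it is safe to call again after regenerating part of the data.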

TODO

  • Pylaia format
  • DAN format
  • Resize image (fixed height, fixed width, rescale...)