From b3a1a22ff439428461ad995c6f107f9a2dae7701 Mon Sep 17 00:00:00 2001
From: Yoann Schneider <yschneider@teklia.com>
Date: Wed, 19 Jul 2023 17:24:34 +0200
Subject: [PATCH] Version 0.2.0-dev1

---
 README.md     | 84 ++-------------------------------------------------
 VERSION       |  2 +-
 docs/dev.md   |  4 ++-
 docs/index.md |  2 +-
 4 files changed, 7 insertions(+), 85 deletions(-)

diff --git a/README.md b/README.md
index 6acfbc2..deec70b 100644
--- a/README.md
+++ b/README.md
@@ -4,17 +4,10 @@ This script downloads pages with transcriptions from Arkindex
 and converts data to ATR format.
 It also generates reproducible train, val and test splits.
 
-## Usage
+A documentation is available at https://teklia.gitlab.io/atr/data-generator/.
 
-### Installation
-Install it as a package:
-```bash
-virtualenv -p python3 .env
-source .env/bin/activate
-pip install -e ./document-processing -e .
-```
 
-### Environment variables
+## Environment variables
 `ARKINDEX_API_TOKEN` and `ARKINDEX_API_URL` environment variables must be defined.
 
 You can create an alias by adding this line to your `~/.bashrc`:
@@ -27,76 +20,3 @@ Then run:
 source ~/.bashrc
 set_demo
 ```
-
-### Arguments
-
-Use help to list possible parameters (or read [`atr_data_generator/arguments.py`](atr_data_generator/arguments.py))
-```bash
-atr-data-generator --help
-```
-
-You can also set the arguments using a JSON or YAML configuration file:
-```yaml
----
-dataset_name: balsac
-out_dir: my_balsac_kaldi
-common:
-  cache_dir: "/tmp/atr_data_generator_solene/cache/"
-  log_parameters: true
-image:
-  extraction_mode: deskew_min_area_rect
-  max_deskew_angle: 45
-split:
-  train_ratio: 0.8
-  test_ratio: 0.1
-select:
-  pages:
-    - 18c1d2d9-72e8-4f7a-a866-78b59dd407dd
-    - 901b9c27-1cbe-44ea-94a0-d9c783f17905
-    - db9dd27c-e96c-43c2-bf29-991212243453
-    - b87999e2-3733-43b1-b8ef-0a297f90bf0f
-    - 7fe3d786-068f-48c9-ae63-86db2f986c4c
-    - 4fc61e75-4a11-42e3-b317-348451629bda
-    - 3e7e37c2-d0cc-41b3-8d8c-6de0bbc69012
-    - 63b6e80b-a825-4068-a12a-d12e3edf5f80
-    - b11decff-1c07-4c51-a5be-401974ea55ea
-    - 735cdde6-e540-4dbd-b271-2206e2498156
-filter:
-  transcription_type: text_line
-```
-In this case, run:
-```sh
-atr-data-generator --config config.yaml
-```
-
-Every run will export a `config.yaml` file and a `param.json` that can be used to reproduce the data generation.
-
-## Examples
-
-> :pencil: these corpus ids are from https://demo.arkindex.org/, use `set_demo`
-
-### Kaldi format
-
-#### With page ids
-```bash
-atr-data-generator --dataset_name my_balsac --out_dir balsac --select.pages [18c1d2d9-72e8-4f7a-a866-78b59dd407dd,901b9c27-1cbe-44ea-94a0-d9c783f17905,db9dd27c-e96c-43c2-bf29-991212243453]
-```
-
-#### With volumes ids
-```bash
-atr-data-generator --dataset_name my_balsac --out_dir balsac --select.volumes [1d5a26d8-6a3e-45ed-bbb6-5a33d09782aa,46a3426f-86d4-45f1-bd57-0de43cd63efd,85207944-2230-4b76-a98f-735a11506743]
-```
-
-#### With corpus ids
-```bash
-atr-data-generator --dataset_name my_balsac --out_dir balsac --select.corpora [135eb31f-2c33-4ae3-be4e-2ae9adfd7c75] --select.volume_type page
-```
-
-The script creates 3 directories `Lines`, `Transcriptions`, `Partitions` in the specified `out_dir`.
-The contents of these directories must be copied (or symlinked) to the corresponding directories in `data/local/` of kaldi recipe.
-
-
-## TODO
-* Pylaia format
-* DAN format
-* Resize image (fixed height, fixed_width, rescale...)
\ No newline at end of file
diff --git a/VERSION b/VERSION
index 70426f8..a872945 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-0.2.0-dev
+0.2.0-dev1
diff --git a/docs/dev.md b/docs/dev.md
index fd2043e..5bb12f5 100644
--- a/docs/dev.md
+++ b/docs/dev.md
@@ -3,7 +3,9 @@
 ## Base setup
 
 * Use a virtualenv (e.g. with virtualenvwrapper `mkvirtualenv -a . atr-data-gen`)
-* Install atr-data-generator as a package (e.g. `pip install -e .`)
+* Install atr-data-generator as a package
+  * The `teklia-document-processing` library is setup via git submodule. Please run `git submodule update --init`.
+  * Then install both packages via `pip install ./document-processing -e .`
 
 ## Unit tests
 
diff --git a/docs/index.md b/docs/index.md
index fb43263..ab64e97 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,6 +4,6 @@ Create datasets from [Arkindex](https://demo.arkindex.org), a platform developed
 
 After installing this Python package, you will gain access to the `atr-data-generator` command. To learn more about it and its subcommands, run `atr-data-generator --help`.
 
-Both subcommands use a YAML configuration file, provided via the `--config` parameter. More details about the structure of this configuration file are available in the respective section.
+Both subcommands use a YAML configuration file, provided via the `--config` parameter. More details about the structure of this configuration file are available in the respective section. Every run will export both a `config.yaml` file and a `param.json` file that can be used to reproduce the data generation.
 
 See the [Development](dev.md) section to learn how to contribute to this project.
-- 
GitLab