Unverified Commit b3a1a22f authored by Yoann Schneider

Version 0.2.0-dev1

parent b16862aa
Pipeline #74931 passed
@@ -4,17 +4,10 @@ This script downloads pages with transcriptions from Arkindex
and converts data to ATR format.
It also generates reproducible train, val and test splits.
## Usage
Documentation is available at https://teklia.gitlab.io/atr/data-generator/.
### Installation
Install it as a package:
```bash
virtualenv -p python3 .env
source .env/bin/activate
pip install -e ./document-processing -e .
```
### Environment variables
## Environment variables
`ARKINDEX_API_TOKEN` and `ARKINDEX_API_URL` environment variables must be defined.
You can create an alias by adding this line to your `~/.bashrc`:
@@ -27,76 +20,3 @@ Then run:
source ~/.bashrc
set_demo
```
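The alias definition itself falls outside this diff hunk. A minimal, hypothetical sketch of what it could look like, using the variables named above and a placeholder token:
```bash
# Hypothetical alias for ~/.bashrc -- replace the placeholder with your own Arkindex API token
alias set_demo='export ARKINDEX_API_URL="https://demo.arkindex.org/" ARKINDEX_API_TOKEN="<your-token>"'
```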
### Arguments
Use the help option to list all available parameters (or read [`atr_data_generator/arguments.py`](atr_data_generator/arguments.py)):
```bash
atr-data-generator --help
```
You can also set the arguments using a JSON or YAML configuration file:
```yaml
---
dataset_name: balsac
out_dir: my_balsac_kaldi
common:
cache_dir: "/tmp/atr_data_generator_solene/cache/"
log_parameters: true
image:
extraction_mode: deskew_min_area_rect
max_deskew_angle: 45
split:
train_ratio: 0.8
test_ratio: 0.1
select:
pages:
- 18c1d2d9-72e8-4f7a-a866-78b59dd407dd
- 901b9c27-1cbe-44ea-94a0-d9c783f17905
- db9dd27c-e96c-43c2-bf29-991212243453
- b87999e2-3733-43b1-b8ef-0a297f90bf0f
- 7fe3d786-068f-48c9-ae63-86db2f986c4c
- 4fc61e75-4a11-42e3-b317-348451629bda
- 3e7e37c2-d0cc-41b3-8d8c-6de0bbc69012
- 63b6e80b-a825-4068-a12a-d12e3edf5f80
- b11decff-1c07-4c51-a5be-401974ea55ea
- 735cdde6-e540-4dbd-b271-2206e2498156
filter:
transcription_type: text_line
```
In this case, run:
```sh
atr-data-generator --config config.yaml
```
Every run exports a `config.yaml` file and a `param.json` file that can be used to reproduce the data generation.
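For example, assuming the exported files land in the output directory from the configuration above (an assumption about the export location), a previous run could be reproduced with:
```bash
# Reproduce a previous run from its exported configuration (path is an assumption)
atr-data-generator --config my_balsac_kaldi/config.yaml
```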
## Examples
> :pencil: These IDs come from https://demo.arkindex.org/, so run `set_demo` first.
### Kaldi format
#### With page ids
```bash
atr-data-generator --dataset_name my_balsac --out_dir balsac --select.pages [18c1d2d9-72e8-4f7a-a866-78b59dd407dd,901b9c27-1cbe-44ea-94a0-d9c783f17905,db9dd27c-e96c-43c2-bf29-991212243453]
```
#### With volume ids
```bash
atr-data-generator --dataset_name my_balsac --out_dir balsac --select.volumes [1d5a26d8-6a3e-45ed-bbb6-5a33d09782aa,46a3426f-86d4-45f1-bd57-0de43cd63efd,85207944-2230-4b76-a98f-735a11506743]
```
#### With corpus ids
```bash
atr-data-generator --dataset_name my_balsac --out_dir balsac --select.corpora [135eb31f-2c33-4ae3-be4e-2ae9adfd7c75] --select.volume_type page
```
The script creates three directories, `Lines`, `Transcriptions` and `Partitions`, in the specified `out_dir`.
Their contents must be copied (or symlinked) to the corresponding directories in `data/local/` of the Kaldi recipe.
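As an illustration, a hedged sketch of symlinking the output of the page-id example into a Kaldi recipe (the recipe location is a placeholder):
```bash
# Link the generated directories into the Kaldi recipe's data/local/ (recipe path is illustrative)
for d in Lines Transcriptions Partitions; do
    ln -s "$(pwd)/balsac/$d" /path/to/kaldi/egs/my_recipe/data/local/"$d"
done
```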
## TODO
* Pylaia format
* DAN format
* Resize images (fixed height, fixed width, rescale...)
\ No newline at end of file
0.2.0-dev
0.2.0-dev1
@@ -3,7 +3,9 @@
## Base setup
* Use a virtualenv (e.g. with virtualenvwrapper `mkvirtualenv -a . atr-data-gen`)
* Install atr-data-generator as a package (e.g. `pip install -e .`)
* Install atr-data-generator as a package
* The `teklia-document-processing` library is set up as a git submodule. Please run `git submodule update --init`.
* Then install both packages via `pip install ./document-processing -e .`
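Putting these steps together, a typical first-time setup might look like this (the virtualenv name follows the example above):
```bash
# Create and activate a virtualenv attached to the project directory
mkvirtualenv -a . atr-data-gen

# Fetch the teklia-document-processing submodule
git submodule update --init

# Install the submodule package and atr-data-generator (editable)
pip install ./document-processing -e .
```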
## Unit tests
@@ -4,6 +4,6 @@ Create datasets from [Arkindex](https://demo.arkindex.org), a platform developed
After installing this Python package, you will gain access to the `atr-data-generator` command. To learn more about it and its subcommands, run `atr-data-generator --help`.
Both subcommands use a YAML configuration file, provided via the `--config` parameter. More details about the structure of this configuration file are available in the respective section.
Both subcommands use a YAML configuration file, provided via the `--config` parameter. More details about the structure of this configuration file are available in the respective section. Every run will export both a `config.yaml` file and a `param.json` file that can be used to reproduce the data generation.
See the [Development](dev.md) section to learn how to contribute to this project.