From b3a1a22ff439428461ad995c6f107f9a2dae7701 Mon Sep 17 00:00:00 2001
From: Yoann Schneider <yschneider@teklia.com>
Date: Wed, 19 Jul 2023 17:24:34 +0200
Subject: [PATCH] Version 0.2.0-dev1

---
 README.md     | 84 ++-------------------------------------------------
 VERSION       |  2 +-
 docs/dev.md   |  4 ++-
 docs/index.md |  2 +-
 4 files changed, 7 insertions(+), 85 deletions(-)
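The README kept by this patch still states that `ARKINDEX_API_TOKEN` and `ARKINDEX_API_URL` must be defined. A minimal sketch of a pre-flight check for that requirement follows, assuming that an empty value should count as missing. `missing_arkindex_variables` is a hypothetical helper introduced for illustration; it is not part of `atr-data-generator`.

```python
import os

# Variable names taken from the README; the helper itself is hypothetical.
REQUIRED_VARIABLES = ("ARKINDEX_API_TOKEN", "ARKINDEX_API_URL")


def missing_arkindex_variables(environ=None):
    """Return the required variable names that are unset or empty."""
    # Default to the real process environment; a plain dict can be passed in tests.
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]
```

Calling `missing_arkindex_variables()` with no argument inspects the current process environment, so a wrapper script could stop early with a clear message whenever the returned list is non-empty.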

diff --git a/README.md b/README.md
index 6acfbc2..deec70b 100644
--- a/README.md
+++ b/README.md
@@ -4,17 +4,10 @@ This script downloads pages with transcriptions from Arkindex
 and converts data to ATR format.
 It also generates reproducible train, val and test splits.
 
-## Usage
+Documentation is available at https://teklia.gitlab.io/atr/data-generator/.
 
-### Installation
-Install it as a package:
-```bash
-virtualenv -p python3 .env
-source .env/bin/activate
-pip install -e ./document-processing -e .
-```
 
-### Environment variables
+## Environment variables
 `ARKINDEX_API_TOKEN` and `ARKINDEX_API_URL` environment variables must be defined.
 
 You can create an alias by adding this line to your `~/.bashrc`:
@@ -27,76 +20,3 @@ Then run:
 source ~/.bashrc
 set_demo
 ```
-
-### Arguments
-
-Use help to list possible parameters (or read [`atr_data_generator/arguments.py`](atr_data_generator/arguments.py))
-```bash
-atr-data-generator --help
-```
-
-You can also set the arguments using a JSON or YAML configuration file:
-```yaml
----
-dataset_name: balsac
-out_dir: my_balsac_kaldi
-common:
-  cache_dir: "/tmp/atr_data_generator_solene/cache/"
-  log_parameters: true
-image:
-  extraction_mode: deskew_min_area_rect
-  max_deskew_angle: 45
-split:
-  train_ratio: 0.8
-  test_ratio: 0.1
-select:
-  pages:
-  - 18c1d2d9-72e8-4f7a-a866-78b59dd407dd
-  - 901b9c27-1cbe-44ea-94a0-d9c783f17905
-  - db9dd27c-e96c-43c2-bf29-991212243453
-  - b87999e2-3733-43b1-b8ef-0a297f90bf0f
-  - 7fe3d786-068f-48c9-ae63-86db2f986c4c
-  - 4fc61e75-4a11-42e3-b317-348451629bda
-  - 3e7e37c2-d0cc-41b3-8d8c-6de0bbc69012
-  - 63b6e80b-a825-4068-a12a-d12e3edf5f80
-  - b11decff-1c07-4c51-a5be-401974ea55ea
-  - 735cdde6-e540-4dbd-b271-2206e2498156
-filter:
-  transcription_type: text_line
-```
-In this case, run:
-```sh
-atr-data-generator --config config.yaml
-```
-
-Every run will export a `config.yaml` file and a `param.json` that can be used to reproduce the data generation.
-
-## Examples
-
-> :pencil: these corpus ids are from https://demo.arkindex.org/, use `set_demo`
-
-### Kaldi format
-
-#### With page ids
-```bash
-atr-data-generator --dataset_name my_balsac --out_dir balsac --select.pages [18c1d2d9-72e8-4f7a-a866-78b59dd407dd,901b9c27-1cbe-44ea-94a0-d9c783f17905,db9dd27c-e96c-43c2-bf29-991212243453]
-```
-
-#### With volumes ids
-```bash
-atr-data-generator --dataset_name my_balsac --out_dir balsac --select.volumes [1d5a26d8-6a3e-45ed-bbb6-5a33d09782aa,46a3426f-86d4-45f1-bd57-0de43cd63efd,85207944-2230-4b76-a98f-735a11506743]
-```
-
-#### With corpus ids
-```bash
-atr-data-generator --dataset_name my_balsac --out_dir balsac --select.corpora [135eb31f-2c33-4ae3-be4e-2ae9adfd7c75] --select.volume_type page
-```
-
-The script creates 3 directories `Lines`, `Transcriptions`, `Partitions` in the specified `out_dir`.
-The contents of these directories must be copied (or symlinked) to the corresponding directories in `data/local/` of kaldi recipe.
-
-
-## TODO
-* Pylaia format
-* DAN format
-* Resize image (fixed height, fixed_width, rescale...)
\ No newline at end of file
diff --git a/VERSION b/VERSION
index 70426f8..a872945 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-0.2.0-dev
+0.2.0-dev1
diff --git a/docs/dev.md b/docs/dev.md
index fd2043e..5bb12f5 100644
--- a/docs/dev.md
+++ b/docs/dev.md
@@ -3,7 +3,9 @@
 ## Base setup
 
 * Use a virtualenv (e.g. with virtualenvwrapper `mkvirtualenv -a . atr-data-gen`)
-* Install atr-data-generator as a package (e.g. `pip install -e .`)
+* Install atr-data-generator as a package
+  * The `teklia-document-processing` library is set up as a git submodule. Please run `git submodule update --init`.
+  * Then install both packages via `pip install ./document-processing -e .`
 
 ## Unit tests
 
diff --git a/docs/index.md b/docs/index.md
index fb43263..ab64e97 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,6 +4,6 @@ Create datasets from [Arkindex](https://demo.arkindex.org), a platform developed
 
 After installing this Python package, you will gain access to the `atr-data-generator` command. To learn more about it and its subcommands, run `atr-data-generator --help`.
 
-Both subcommands use a YAML configuration file, provided via the `--config` parameter. More details about the structure of this configuration file are available in the respective section.
+Both subcommands use a YAML configuration file, provided via the `--config` parameter. More details about the structure of this configuration file are available in the respective section. Every run will export both a `config.yaml` file and a `param.json` file that can be used to reproduce the data generation.
 
 See the [Development](dev.md) section to learn how to contribute to this project.
-- 
GitLab