Unverified Commit b3a1a22f authored by Yoann Schneider

Version 0.2.0-dev1

parent b16862aa
Pipeline #74931 passed
@@ -4,17 +4,10 @@ This script downloads pages with transcriptions from Arkindex
and converts data to ATR format.
It also generates reproducible train, val and test splits.
## Usage
Documentation is available at https://teklia.gitlab.io/atr/data-generator/.
### Installation
Install it as a package:
```bash
virtualenv -p python3 .env
source .env/bin/activate
pip install -e ./document-processing -e .
```
### Environment variables
## Environment variables
`ARKINDEX_API_TOKEN` and `ARKINDEX_API_URL` environment variables must be defined.
You can create an alias by adding this line to your `~/.bashrc`:
@@ -27,76 +20,3 @@ Then run:
source ~/.bashrc
set_demo
```
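The alias definition itself falls outside this diff hunk. A minimal, hypothetical sketch of what it could look like, using the variables named above and a placeholder token:
```bash
# Hypothetical alias for ~/.bashrc -- replace the placeholder with your own Arkindex API token
alias set_demo='export ARKINDEX_API_URL="https://demo.arkindex.org/" ARKINDEX_API_TOKEN="<your-token>"'
```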
### Arguments
Use the help option to list all available parameters (or read [`atr_data_generator/arguments.py`](atr_data_generator/arguments.py)):
```bash
atr-data-generator --help
```
You can also set the arguments using a JSON or YAML configuration file:
```yaml
---
dataset_name: balsac
out_dir: my_balsac_kaldi
common:
cache_dir: "/tmp/atr_data_generator_solene/cache/"
log_parameters: true
image:
extraction_mode: deskew_min_area_rect
max_deskew_angle: 45
split:
train_ratio: 0.8
test_ratio: 0.1
select:
pages:
- 18c1d2d9-72e8-4f7a-a866-78b59dd407dd
- 901b9c27-1cbe-44ea-94a0-d9c783f17905
- db9dd27c-e96c-43c2-bf29-991212243453
- b87999e2-3733-43b1-b8ef-0a297f90bf0f
- 7fe3d786-068f-48c9-ae63-86db2f986c4c
- 4fc61e75-4a11-42e3-b317-348451629bda
- 3e7e37c2-d0cc-41b3-8d8c-6de0bbc69012
- 63b6e80b-a825-4068-a12a-d12e3edf5f80
- b11decff-1c07-4c51-a5be-401974ea55ea
- 735cdde6-e540-4dbd-b271-2206e2498156
filter:
transcription_type: text_line
```
In this case, run:
```sh
atr-data-generator --config config.yaml
```
Every run exports a `config.yaml` file and a `param.json` file that can be used to reproduce the data generation.
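For example, assuming the exported files land in the output directory from the configuration above (an assumption about the export location), a previous run could be reproduced with:
```bash
# Reproduce a previous run from its exported configuration (path is an assumption)
atr-data-generator --config my_balsac_kaldi/config.yaml
```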
## Examples
> :pencil: These IDs come from https://demo.arkindex.org/, so run `set_demo` first.
### Kaldi format
#### With page ids
```bash
atr-data-generator --dataset_name my_balsac --out_dir balsac --select.pages [18c1d2d9-72e8-4f7a-a866-78b59dd407dd,901b9c27-1cbe-44ea-94a0-d9c783f17905,db9dd27c-e96c-43c2-bf29-991212243453]
```
#### With volume ids
```bash
atr-data-generator --dataset_name my_balsac --out_dir balsac --select.volumes [1d5a26d8-6a3e-45ed-bbb6-5a33d09782aa,46a3426f-86d4-45f1-bd57-0de43cd63efd,85207944-2230-4b76-a98f-735a11506743]
```
#### With corpus ids
```bash
atr-data-generator --dataset_name my_balsac --out_dir balsac --select.corpora [135eb31f-2c33-4ae3-be4e-2ae9adfd7c75] --select.volume_type page
```
The script creates three directories, `Lines`, `Transcriptions` and `Partitions`, in the specified `out_dir`.
Their contents must be copied (or symlinked) to the corresponding directories in `data/local/` of the Kaldi recipe.
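As an illustration, a hedged sketch of symlinking the output of the page-id example into a Kaldi recipe (the recipe location is a placeholder):
```bash
# Link the generated directories into the Kaldi recipe's data/local/ (recipe path is illustrative)
for d in Lines Transcriptions Partitions; do
    ln -s "$(pwd)/balsac/$d" /path/to/kaldi/egs/my_recipe/data/local/"$d"
done
```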
## TODO
* Pylaia format
* DAN format
* Resize images (fixed height, fixed width, rescale...)
\ No newline at end of file
0.2.0-dev
0.2.0-dev1
@@ -3,7 +3,9 @@
## Base setup
* Use a virtualenv (e.g. with virtualenvwrapper `mkvirtualenv -a . atr-data-gen`)
* Install atr-data-generator as a package (e.g. `pip install -e .`)
* Install atr-data-generator as a package
* The `teklia-document-processing` library is set up as a git submodule. Please run `git submodule update --init`.
* Then install both packages via `pip install ./document-processing -e .`
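Putting these steps together, a typical first-time setup might look like this (the virtualenv name follows the example above):
```bash
# Create and activate a virtualenv attached to the project directory
mkvirtualenv -a . atr-data-gen

# Fetch the teklia-document-processing submodule
git submodule update --init

# Install the submodule package and atr-data-generator (editable)
pip install ./document-processing -e .
```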
## Unit tests
@@ -4,6 +4,6 @@ Create datasets from [Arkindex](https://demo.arkindex.org), a platform developed
After installing this Python package, you will gain access to the `atr-data-generator` command. To learn more about it and its subcommands, run `atr-data-generator --help`.
Both subcommands use a YAML configuration file, provided via the `--config` parameter. More details about the structure of this configuration file are available in the respective section.
Both subcommands use a YAML configuration file, provided via the `--config` parameter. More details about the structure of this configuration file are available in the respective section. Every run will export both a `config.yaml` file and a `param.json` file that can be used to reproduce the data generation.
See the [Development](dev.md) section to learn how to contribute to this project.