# Dataset extraction

## Description

Use the `teklia-dan dataset extract` command to extract a dataset from an Arkindex export database (SQLite format). This will generate the images and the labels needed to train a DAN model.

| Parameter                        | Description                                                                                                                                                                                                                              | Type            | Default                              |
| -------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | ------------------------------------ |
| `database`                       | Path to an Arkindex export database in SQLite format.                                                                                                                                                                                    | `Path`          |                                      |
| `--parent`                       | UUID of the folder to import from Arkindex. You may specify multiple UUIDs.                                                                                                                                                              | `str` or `uuid` |                                      |
| `--element-type`                 | Type of the elements to extract. You may specify multiple types.                                                                                                                                                                         | `str`           |                                      |
| `--parent-element-type`          | Type of the parent element containing the data.                                                                                                                                                                                          | `str`           | `page`                               |
| `--output`                       | Folder where the data will be generated.                                                                                                                                                                                                 | `Path`          |                                      |
| `--load-entities`                | Extract text with their entities. Needed for NER tasks.                                                                                                                                                                                  | `bool`          | `False`                              |
| `--entity-separators`            | Remove all text that does not appear in an entity or in the given ordered list of characters. When several separators follow each other, keep only the first one to appear in the list. Do not pass any argument to keep the whole text. | `str`           | (see [dedicated section](#examples)) |
| `--tokens`                       | Mapping between entity labels and their starting and ending tokens. Needed for NER tasks.                                                                                                                                                | `Path`          |                                      |
| `--use-existing-split`           | Use the specified folder IDs for the dataset split.                                                                                                                                                                                      | `bool`          |                                      |
| `--train-folder`                 | ID of the training folder to import from Arkindex.                                                                                                                                                                                       | `uuid`          |                                      |
| `--val-folder`                   | ID of the validation folder to import from Arkindex.                                                                                                                                                                                     | `uuid`          |                                      |
| `--test-folder`                  | ID of the testing folder to import from Arkindex.                                                                                                                                                                                        | `uuid`          |                                      |
| `--transcription-worker-version` | Filter transcriptions by worker version. Use `manual` to keep only manual transcriptions.                                                                                                                                                | `str` or `uuid` |                                      |
| `--entity-worker-version`        | Filter transcription entities by worker version. Use `manual` to keep only manual entities.                                                                                                                                              | `str` or `uuid` |                                      |
| `--train-prob`                   | Size of the training set split.                                                                                                                                                                                                          | `float`         | `0.7`                                |
| `--val-prob`                     | Size of the validation set split.                                                                                                                                                                                                        | `float`         | `0.15`                               |
| `--max-width`                    | Images larger than this width will be resized to this width.                                                                                                                                                                             | `int`           |                                      |
| `--max-height`                   | Images larger than this height will be resized to this height.                                                                                                                                                                           | `int`           |                                      |

The `--tokens` argument expects a YAML-formatted file describing the NER entities: each entity label is a key mapping to its starting and (optional) ending tokens.

```yaml
INTITULE: # Type of the entity on Arkindex
  start: ⓘ # Starting token for this entity
  end: Ⓘ # Optional ending token for this entity
DATE:
  start: ⓓ
  end: Ⓓ
COTE_SERIE:
  start: ⓢ
  end: Ⓢ
ANALYSE_COMPL.:
  start: ⓒ
  end: Ⓒ
PRECISIONS_SUR_COTE:
  start: ⓟ
  end: Ⓟ
COTE_ARTICLE:
  start: ⓐ
  end: Ⓐ
CLASSEMENT:
  start: ⓛ
  end: Ⓛ
```

## Examples

### HTR and NER data from one source

To extract HTR+NER data from **pages** in a folder, define an end token for each entity and use the following command:

```shell
teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml
```

with `tokens.yml` compliant with the format described above.
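If you need to consume this token mapping in your own tooling, here is a minimal Python sketch of how such a file could be loaded and applied. The `load_tokens` and `wrap_entity` helpers are hypothetical, written for this illustration only; they are not part of the DAN codebase.

```python
import yaml  # pip install pyyaml


def load_tokens(path: str) -> dict:
    """Load the entity-label-to-token mapping from a tokens.yml file."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


def wrap_entity(tokens: dict, label: str, text: str) -> str:
    """Hypothetical helper: surround an entity's text with its tokens.

    The ending token is optional in the YAML format, hence the .get().
    """
    entry = tokens[label]
    return f"{entry['start']}{text}{entry.get('end', '')}"


tokens = load_tokens("tokens.yml")
print(wrap_entity(tokens, "DATE", "1789"))  # -> ⓓ1789Ⓓ
```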
### HTR and NER data from multiple sources

To do the same but only use the data from three folders, you have to define an end token for each entity and the command becomes:

```shell
teklia-dan dataset extract \
    database.sqlite \
    --parent folder1_uuid folder2_uuid folder3_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml
```

### HTR and NER data with an existing split

To use the data from three folders as the **training**, **validation** and **testing** datasets respectively, you have to define an end token for each entity and the command becomes:

```shell
teklia-dan dataset extract \
    database.sqlite \
    --use-existing-split \
    --train-folder train_folder_uuid \
    --val-folder val_folder_uuid \
    --test-folder test_folder_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml
```

### HTR from multiple element types with some parent filtering

To extract HTR data from **annotations** and **text_zones** in a folder, but only keep those that are children of **single_pages**, use the following command:

```shell
teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type text_zone annotation \
    --parent-element-type single_page \
    --output data
```

### NER data

To extract NER data and keep line breaks and spaces between entities, use the following command:

```shell
teklia-dan dataset extract \
    [...]
    --load-entities \
    --entity-separators $'\n' " " \
    --tokens tokens.yml
```

If several separators follow each other, only one is kept: a line break if the run contains one, otherwise a space. If you swap the order of the `--entity-separators` values, a space is kept if the run contains one, otherwise a line break.
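To make the priority rule concrete, here is a minimal Python sketch of the collapsing behavior described above. It only illustrates the documented rule (runs of consecutive separators are reduced to the one that appears first in the list); it is not DAN's actual implementation.

```python
import re


def collapse_separators(text: str, separators: list[str]) -> str:
    """Collapse each run of consecutive separators into a single one,
    keeping the separator that appears first in `separators`."""
    # Character class matching any run of the given separators.
    run_pattern = "[" + "".join(re.escape(s) for s in separators) + "]+"

    def keep_first(match: re.Match) -> str:
        run = match.group(0)
        # Return the highest-priority separator present in the run.
        for sep in separators:
            if sep in run:
                return sep
        return run

    return re.sub(run_pattern, keep_first, text)


# With ('\n', ' '), a mixed run keeps the line break;
# with (' ', '\n'), the same run keeps the space.
print(repr(collapse_separators("A \n B", ["\n", " "])))  # 'A\nB'
print(repr(collapse_separators("A \n B", [" ", "\n"])))  # 'A B'
```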