# Dataset extraction

## Description

Use the `teklia-dan dataset extract` command to extract a dataset from an Arkindex export database (SQLite format). This will:
- Create a mapping of the elements (identified by their ID) to the image information and the ground-truth transcription (with NER tokens if needed), stored in the `split.json` file.

## Parameters

| Parameter                         | Description                                                                                                                                                                                                                                                               | Type            | Default |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | ------- |
| `database`                        | Path to an Arkindex export database in SQLite format.                                                                                                                                                                                                                     | `pathlib.Path`  |         |
| `--dataset-id`                    | ID of the dataset to extract from Arkindex.                                                                                                                                                                                                                               | `uuid`          |         |
| `--element-type`                  | Type of the elements to extract. You may specify multiple types.                                                                                                                                                                                                          | `str`           |         |
| `--output`                        | Folder where the data will be generated.                                                                                                                                                                                                                                  | `pathlib.Path`  |         |
| `--entity-separators`             | Removes all text that does not appear in an entity or in the list of given ordered characters. If several separators follow each other, only the first one in the list is kept. Do not pass any value to keep the whole text (see [dedicated section](#examples)).        | `str`           |         |
| `--tokens`                        | Mapping between starting tokens and end tokens to extract text with their entities.                                                                                                                                                                                       | `pathlib.Path`  |         |
| `--transcription-worker-versions` | Filter transcriptions by worker_version. Use `manual` for manual filtering.                                                                                                                                                                                               | `str` or `uuid` |         |
| `--entity-worker-versions`        | Filter transcription entities by worker_version. Use `manual` for manual filtering.                                                                                                                                                                                       | `str` or `uuid` |         |
| `--transcription-worker-runs`     | Filter transcriptions by worker_runs. Use `manual` for manual filtering.                                                                                                                                                                                                  | `str` or `uuid` |         |
| `--entity-worker-runs`            | Filter transcription entities by worker_runs. Use `manual` for manual filtering.                                                                                                                                                                                          | `str` or `uuid` |         |
| `--keep-spaces`                   | Transcriptions are trimmed by default. Use this flag to disable this behaviour.                                                                                                                                                                                           | `bool`          | `False` |
| `--allow-empty`                   | Elements with no transcriptions are skipped by default. This flag disables this behaviour.                                                                                                                                                                                | `bool`          | `False` |
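
For example, the worker filters can be combined to keep only manual transcriptions while restricting entities to a specific worker version. This is a hedged sketch: `worker_version_uuid` is a placeholder, not a real value, and the other arguments reuse the names from the examples below.

```shell
teklia-dan dataset extract \
    database.sqlite \
    --dataset-id dataset_uuid \
    --element-type page \
    --output data \
    --tokens tokens.yml \
    --transcription-worker-versions manual \
    --entity-worker-versions worker_version_uuid
```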
The `--tokens` argument expects a YAML-formatted file listing the NER entities: each entry is keyed by the entity label on Arkindex and maps to its starting and ending tokens. This file can be generated by the `teklia-dan dataset tokens` command. More details in the [dedicated page](./tokens.md).
```yaml
INTITULE: # Type of the entity on Arkindex
  start:  # Starting token for this entity
  end:  # Optional ending token for this entity
DATE:
  start: 
  end: 
COTE_SERIE:
  start: 
  end: 
ANALYSE_COMPL.:
  start: 
  end: 
PRECISIONS_SUR_COTE:
  start: 
  end: 
COTE_ARTICLE:
  start: 
  end: 
CLASSEMENT:
  start: 
  end: 
```

## Examples

### HTR and NER data
To extract HTR and NER data from the **training**, **validation** and **testing** sets of a dataset, use the following command:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --dataset-id dataset_uuid \
    --element-type page \
    --output data \
    --tokens tokens.yml
```
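
As a quick sanity check after extraction, the generated `split.json` can be inspected from the command line. The snippet below only assumes it is a JSON object, as described above, and prints its top-level keys with the number of entries under each; the exact layout of the file may differ.

```shell
# Print the top-level keys of data/split.json and how many entries each holds
python3 -c "
import json
with open('data/split.json') as file:
    mapping = json.load(file)
print({key: len(value) for key, value in mapping.items()})
"
```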

If the model should predict only the entities and not the text surrounding them, the `--entity-separators` parameter can be used to list the only characters allowed in the transcription outside of entities. Only one separator is kept between two entities; when several could apply, the first one in the list takes priority.
Here is an example of transcription with entities, on two lines:

<div class="entities-block highlight">
    The
    <span type="adj">great</span>
    king
    <span type="name">Charles</span>
    III has eaten <br />with
    <span type="person">us</span>
    .
</div>

Here is the extraction with `--entity-separators=" "`:

<div class="entities-block highlight">
    <span type="adj">great</span>
    <span type="name">Charles</span>
    <span type="person">us</span>
</div>

Here is the extraction with `--entity-separators="\n" " "`:

<div class="entities-block highlight">
    <span type="adj">great</span>
    <span type="name">Charles</span>
    <br />
    <span type="person">us</span>
</div>

The order of the arguments is important. If whitespaces take priority over line breaks, i.e. `--entity-separators=" " "\n"`, the extraction will result in:

<div class="entities-block highlight">
    <span type="adj">great</span>
    <span type="name">Charles</span>
    <span type="person">us</span>
</div>
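
Putting it together, the separators are passed as separate values on the command line. The sketch below reuses the database and tokens file from the first example, with line breaks taking priority over whitespaces:

```shell
teklia-dan dataset extract \
    database.sqlite \
    --dataset-id dataset_uuid \
    --element-type page \
    --output data \
    --tokens tokens.yml \
    --entity-separators "\n" " "
```
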
### HTR from multiple element types
To extract HTR data from **annotations** and **text_zones**, but only keep those that are children of **single_pages**, use the following command:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --dataset-id dataset_uuid \
    --element-type text_zone annotation \
    --output data
```