# Filter entities by name when extracting data from Arkindex

Merged: Manon Blanco requested to merge `allow-unknown-entities` into `main`.
Use the `teklia-dan dataset extract` command to extract a dataset from an Arkindex export database (SQLite format). This will generate the images and the labels needed to train a DAN model.
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `database` | Path to an Arkindex export database in SQLite format. | `Path` | |
| `--parent` | UUID of the folder to import from Arkindex. You may specify multiple UUIDs. | `str` or `uuid` | |
| `--element-type` | Type of the elements to extract. You may specify multiple types. | `str` | |
| `--parent-element-type` | Type of the parent element containing the data. | `str` | `page` |
| `--output` | Folder where the data will be generated. | `Path` | |
| `--load-entities` | Extract text with their entities. Needed for NER tasks. | `bool` | `False` |
| `--entity-separators` | Remove all text that does not appear in an entity or in the given list of ordered characters. If several separators follow each other, only the first one in the list is kept. Give no argument to keep the whole text. | `str` | (see [dedicated section](#examples)) |
| `--tokens` | Mapping between starting tokens and end tokens. Needed for NER tasks. | `Path` | |
| `--use-existing-split` | Use the specified folder IDs for the dataset split. | `bool` | |
| `--train-folder` | ID of the training folder to import from Arkindex. | `uuid` | |
| `--val-folder` | ID of the validation folder to import from Arkindex. | `uuid` | |
| `--test-folder` | ID of the testing folder to import from Arkindex. | `uuid` | |
| `--transcription-worker-version` | Filter transcriptions by worker version. Use `manual` to keep only manual transcriptions. | `str` or `uuid` | |
| `--entity-worker-version` | Filter transcription entities by worker version. Use `manual` to keep only manual entities. | `str` or `uuid` | |
| `--train-prob` | Proportion of the dataset assigned to the training set. | `float` | `0.7` |
| `--val-prob` | Proportion of the dataset assigned to the validation set. | `float` | `0.15` |
| `--max-width` | Images larger than this width will be resized to this width. | `int` | |
| `--max-height` | Images larger than this height will be resized to this height. | `int` | |
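For orientation, a minimal HTR-only extraction could look like the sketch below; `database.sqlite`, `folder_uuid` and `data` are placeholder values, not names mandated by the tool:

```shell
# Minimal sketch: extract images and labels for HTR only (no entities).
teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type page \
    --output data
```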
The `--tokens` argument expects a YAML-formatted file listing one entry per NER entity: each entity label is a key mapping to a dict that holds the starting and ending tokens respectively.
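For illustration, such a file could look like the sketch below. The entity labels and token characters are examples from a hypothetical project, and the `start`/`end` key names are an assumption based on the description above:

```yaml
# One entry per NER entity: the label maps to its starting and ending tokens.
INTITULE:
  start: ⓘ
  end: Ⓘ
DATE:
  start: ⓓ
  end: Ⓓ
CLASSEMENT:
  start: ⓛ
  end: Ⓛ
```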
### HTR and NER data from one source
To extract HTR+NER data from **pages** in a folder, you have to define an end token for each entity and use the following command:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml
```
with `tokens.yml` compliant with the format described before.
### HTR and NER data from multiple sources
To do the same but only use the data from three folders, you still have to define an end token for each entity, and the command becomes:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --parent folder1_uuid folder2_uuid folder3_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml
```
### HTR and NER data with an existing split
To use the data from three folders as the **training**, **validation** and **testing** datasets respectively, you have to define an end token for each entity, and the command becomes:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --use-existing-split \
    --train-folder train_folder_uuid \
    --val-folder val_folder_uuid \
    --test-folder test_folder_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml
```
### HTR from multiple element types with some parent filtering
To extract HTR data from **annotations** and **text_zones** in a folder, but only keep those that are children of **single_pages**, use the following command:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type text_zone annotation \
    --parent-element-type single_page \
    --output data
```
### NER data
To extract NER data and keep line breaks and spaces between entities, use the following command:
```shell
teklia-dan dataset extract \
    [...]
    --load-entities \
    --entity-separators $'\n' " " \
    --tokens tokens.yml
```
If several separators follow each other, only one is kept: preferably a line break if there is one, otherwise a space. If you reverse the order of the `--entity-separators` values, a space is kept when there is one, otherwise a line break.
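As a hypothetical illustration, consider the transcription `Jean, born in` followed by a line break and `Paris`, where only `Jean` and `Paris` belong to entities (tagged here with made-up tokens ⓝ…Ⓝ and ⓟ…Ⓟ). Everything outside the entities is dropped except the separators, and the run of spaces and line break between the two entities collapses to a single line break, since `$'\n'` comes first in the list. The extracted label would then plausibly read:

```text
ⓝJeanⓃ
ⓟParisⓅ
```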