# Dataset extraction

## Description
Use the `teklia-dan dataset extract` command to extract a dataset from an Arkindex export database (SQLite format). This will generate the images and the labels needed to train a DAN model.
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `database` | Path to an Arkindex export database in SQLite format. | `Path` | |
| `--parent` | UUID of the folder to import from Arkindex. You may specify multiple UUIDs. | `str` or `uuid` | |
| `--element-type` | Type of the elements to extract. You may specify multiple types. | `str` | |
| `--parent-element-type` | Type of the parent element containing the data. | `str` | `page` |
| `--output` | Folder where the data will be generated. | `Path` | |
| `--load-entities` | Extract text with their entities. Needed for NER tasks. | `bool` | `False` |
| `--entity-separators` | Removes all text that does not appear in an entity or in the list of given ordered characters. If several separators follow each other, keep only the first to appear in the list. Do not give any arguments to keep the whole text. | `str` | (see dedicated section) |
| `--tokens` | Mapping between starting tokens and end tokens. Needed for NER tasks. | `Path` | |
| `--use-existing-split` | Use the specified folder IDs for the dataset split. | `bool` | |
| `--train-folder` | ID of the training folder to import from Arkindex. | `uuid` | |
| `--val-folder` | ID of the validation folder to import from Arkindex. | `uuid` | |
| `--test-folder` | ID of the testing folder to import from Arkindex. | `uuid` | |
| `--transcription-worker-version` | Filter transcriptions by `worker_version`. Use `manual` for manual filtering. | `str` or `uuid` | |
| `--entity-worker-version` | Filter transcription entities by `worker_version`. Use `manual` for manual filtering. | `str` or `uuid` | |
| `--train-prob` | Training set split size. | `float` | `0.7` |
| `--val-prob` | Validation set split size. | `float` | `0.15` |
| `--max-width` | Images larger than this width will be resized to this width. | `int` | |
| `--max-height` | Images larger than this height will be resized to this height. | `int` | |
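Note that the default split probabilities do not sum to 1: with `--train-prob 0.7` and `--val-prob 0.15`, the remaining share of the data (1 - 0.7 - 0.15 = 0.15) presumably goes to the test set.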
The `--tokens` argument expects a YAML-formatted file with a specific format: a list of entries, each describing a NER entity. The label of the entity is the key to a dict mapping the starting and ending tokens respectively.
```yaml
INTITULE: # Type of the entity on Arkindex
  start: ⓘ # Starting token for this entity
  end: Ⓘ # Optional ending token for this entity
DATE:
  start: ⓓ
  end: Ⓓ
COTE_SERIE:
  start: ⓢ
  end: Ⓢ
ANALYSE_COMPL.:
  start: ⓒ
  end: Ⓒ
PRECISIONS_SUR_COTE:
  start: ⓟ
  end: Ⓟ
COTE_ARTICLE:
  start: ⓐ
  end: Ⓐ
CLASSEMENT:
  start: ⓛ
  end: Ⓛ
```
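For illustration, with such a token file the extracted labels presumably wrap each entity between its starting and ending tokens. A hypothetical label (entity values invented for this example) could look like:

```text
ⓢ2 B 3Ⓢ ⓘBail de maisonⒾ ⓓ1810Ⓓ
```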
## Examples

### HTR and NER data from one source
To extract HTR+NER data from pages in a folder, you have to define an end token for each entity and use the following command:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml
```
with `tokens.yml` compliant with the format described above.
### HTR and NER data from multiple sources

To do the same, but using only the data from three folders, you have to define an end token for each entity, and the command becomes:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --parent folder1_uuid folder2_uuid folder3_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml
```
### HTR and NER data with an existing split

To use the data from three folders as the training, validation and testing datasets respectively, you have to define an end token for each entity, and the command becomes:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --use-existing-split \
    --train-folder train_folder_uuid \
    --val-folder val_folder_uuid \
    --test-folder test_folder_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml
```
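Note that `--train-prob` and `--val-prob` presumably have no effect when `--use-existing-split` is set, since the split is taken directly from the given folders.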
### HTR from multiple element types with some parent filtering

To extract HTR data from `annotation` and `text_zone` elements in a folder, but only keep those that are children of `single_page` elements, use the following command:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type text_zone annotation \
    --parent-element-type single_page \
    --output data
```
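To make the filter concrete, here is a hypothetical element hierarchy this command would operate on (element names follow the parameters above):

```text
folder_uuid            # --parent
└── single_page        # --parent-element-type
    ├── text_zone      # extracted (matches --element-type)
    └── annotation     # extracted (matches --element-type)
```

Any `text_zone` or `annotation` element that is not a child of a `single_page` would be skipped.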
### NER data

To extract NER data and keep line breaks and spaces between entities, use the following command:
```shell
teklia-dan dataset extract \
    [...]
    --load-entities \
    --entity-separators $'\n' " " \
    --tokens tokens.yml
```
If several separators follow each other, only one is kept: here a line break if there is one, otherwise a space. If you reverse the order of the `--entity-separators` values, a space is kept if there is one, otherwise a line break.
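As a worked illustration (the transcription below is hypothetical, not taken from the tool's documentation): suppose two entities, `1810` and `Paris`, are separated by the non-entity text `, in` followed by a line break and a space. With `--entity-separators $'\n' " "`, all non-entity text is dropped and the run of separators collapses to a single line break, so the extracted label would presumably look like:

```text
ⓓ1810Ⓓ
ⓘParisⒾ
```

With the reversed order `--entity-separators " " $'\n'`, the same run would collapse to a single space instead: `ⓓ1810Ⓓ ⓘParisⒾ`.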