Dataset extraction
Description
Use the teklia-dan dataset extract command to extract a dataset from an Arkindex export database (SQLite format). This will:
- Generate the images of each element (in the images/ folder),
- Create the mapping of the images (each identified by its path) to the ground-truth transcription (with NER tokens if needed) (in the labels.json file),
- Store the set of characters encountered in the dataset (in the charset.pkl file).
If an image download fails for any reason, that image will not appear in the transcriptions file, and the reason is printed to stdout at the end of the process. Before downloading an image, the command checks whether it was already downloaded, so it is safe to run the command again if a few images failed.
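Once the extraction has finished, the generated files can be inspected from Python. The snippet below is a minimal sketch, not part of teklia-dan: the helper names `load_extraction` and `missing_images` are hypothetical, and it assumes that labels.json maps each split name to a dict of image path to transcription, with paths relative to a folder you pass in. Adjust it if your files are laid out differently.

```python
import json
import pickle
from pathlib import Path


def load_extraction(output_dir):
    """Load labels.json and charset.pkl from an extraction output folder."""
    output_dir = Path(output_dir)
    labels = json.loads((output_dir / "labels.json").read_text(encoding="utf-8"))
    # Only unpickle files that you generated yourself.
    with (output_dir / "charset.pkl").open("rb") as file:
        charset = pickle.load(file)
    return labels, charset


def missing_images(labels, root="."):
    """Return the labelled image paths that do not exist on disk.

    Assumption: `labels` looks like {split: {image_path: transcription}}
    and image paths are relative to `root`.
    """
    root = Path(root)
    return [
        path
        for split in labels.values()
        for path in split
        if not (root / path).is_file()
    ]
```

For instance, `labels, charset = load_extraction("data")` followed by `missing_images(labels, root="data")` should return an empty list when every labelled image is present under the assumed layout.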
Parameter | Description | Type | Default |
---|---|---|---|
database | Path to an Arkindex export database in SQLite format. | Path | |
--element-type | Type of the elements to extract. You may specify multiple types. | str | |
--parent-element-type | Type of the parent element containing the data. | str | page |
--output | Folder where the data will be generated. | Path | |
--entity-separators | Removes all text that does not appear in an entity or in the list of given ordered characters. If several separators follow each other, keep only the first to appear in the list. Do not give any arguments to keep the whole text. | str | (see dedicated section) |
--tokens | Mapping between starting tokens and end tokens to extract text with their entities. | Path | |
--train-folder | ID of the training folder to import from Arkindex. | uuid | |
--val-folder | ID of the validation folder to import from Arkindex. | uuid | |
--test-folder | ID of the testing folder to import from Arkindex. | uuid | |
--transcription-worker-version | Filter transcriptions by worker version. Use manual to select manual transcriptions. | str or uuid | |
--entity-worker-version | Filter transcription entities by worker version. Use manual to select manual entities. | str or uuid | |
--max-width | Images larger than this width will be resized to this width. | int | |
--max-height | Images larger than this height will be resized to this height. | int | |
--keep-spaces | Transcriptions are trimmed by default. Use this flag to disable this behaviour. | bool | False |
--image-format | Images will be saved under this format. | str | .jpg |
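To get a feel for the effect of --max-width and --max-height, the arithmetic can be sketched as follows. This is an illustration only: the table does not say how the other dimension is handled, so the sketch assumes that the aspect ratio is preserved and that smaller images are left untouched. The function name `fit_within` is hypothetical.

```python
def fit_within(width, height, max_width=None, max_height=None):
    """Return the (width, height) of an image after capping its size.

    Assumptions: the aspect ratio is preserved and images are never enlarged.
    """
    scale = 1.0
    if max_width is not None and width > max_width:
        scale = min(scale, max_width / width)
    if max_height is not None and height > max_height:
        scale = min(scale, max_height / height)
    if scale == 1.0:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000, max_width=2000))  # (2000, 1500)
print(fit_within(800, 600, max_width=2000, max_height=1000))  # (800, 600)
```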
The --tokens argument expects a YAML file with one entry per NER entity. Each entry is keyed by the label of the entity and maps to a dict holding its starting token and, optionally, its ending token.
INTITULE: # Type of the entity on Arkindex
  start: ⓘ # Starting token for this entity
  end: Ⓘ # Optional ending token for this entity
DATE:
  start: ⓓ
  end: Ⓓ
COTE_SERIE:
  start: ⓢ
  end: Ⓢ
ANALYSE_COMPL.:
  start: ⓒ
  end: Ⓒ
PRECISIONS_SUR_COTE:
  start: ⓟ
  end: Ⓟ
COTE_ARTICLE:
  start: ⓐ
  end: Ⓐ
CLASSEMENT:
  start: ⓛ
  end: Ⓛ
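The role of these tokens can be illustrated with a short Python sketch. It is not the teklia-dan implementation: `TOKENS` is a hand-written dict mirroring what two entries of the file above would look like once parsed, and `tag_entity` is a hypothetical helper that assumes the start token is placed right before the entity text and the optional end token right after it.

```python
# Parsed equivalent of two entries of the tokens file (written by hand here).
TOKENS = {
    "INTITULE": {"start": "ⓘ", "end": "Ⓘ"},
    "DATE": {"start": "ⓓ", "end": "Ⓓ"},
}


def tag_entity(text, label, tokens):
    """Surround `text` with the start token and optional end token of `label`."""
    entry = tokens[label]
    # The end token is optional, so fall back to an empty string.
    return entry["start"] + text + entry.get("end", "")


print(tag_entity("1890", "DATE", TOKENS))  # ⓓ1890Ⓓ
```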
Examples
HTR and NER data
To use the data from three folders as the training, validation and testing datasets respectively, use the following:
teklia-dan dataset extract \
    database.sqlite \
    --train-folder train_folder_uuid \
    --val-folder val_folder_uuid \
    --test-folder test_folder_uuid \
    --element-type page \
    --output data \
    --tokens tokens.yml
HTR from multiple element types
To extract HTR data from the annotation and text_zone elements of each folder, keeping only those that are children of single_page elements, use the following:
teklia-dan dataset extract \
    database.sqlite \
    --train-folder train_folder_uuid \
    --val-folder val_folder_uuid \
    --test-folder test_folder_uuid \
    --element-type text_zone annotation \
    --parent-element-type single_page \
    --output data
HTR + NER data
To extract NER data and keep line breaks and spaces between entities, use the following command:
teklia-dan dataset extract \
    [...]
    --entity-separators $'\n' " " \
    --tokens tokens.yml
If several separators follow each other, only one is kept: a line break if there is one, otherwise a space. If you swap the order of the --entity-separators arguments, a space is kept if there is one, otherwise a line break.
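The separator rule can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the code used by teklia-dan: the function name `keep_entities_and_separators` is hypothetical, entities are given as (start, end) character offsets, and separators are treated as consecutive once the text between entities has been removed.

```python
def keep_entities_and_separators(text, entities, separators):
    """Keep entity text, plus one separator between entities.

    `entities` is a list of (start, end) offsets into `text`.
    `separators` is an ordered list: earlier items win when several
    separators follow each other. An empty list keeps the whole text.
    """
    if not separators:
        return text

    inside = [False] * len(text)
    for start, end in entities:
        for index in range(start, end):
            inside[index] = True

    output = []
    pending = []  # separators seen since the last entity character
    for char, in_entity in zip(text, inside):
        if in_entity:
            if pending:
                # Keep only the separator that comes first in the list.
                output.append(min(pending, key=separators.index))
                pending = []
            output.append(char)
        elif char in separators:
            pending.append(char)
        # Any other character outside an entity is dropped.
    if pending:
        output.append(min(pending, key=separators.index))
    return "".join(output)


text = "Paris, le 3 mai\n 1890"
entities = []
for value in ("Paris", "3 mai", "1890"):
    start = text.index(value)
    entities.append((start, start + len(value)))

print(repr(keep_entities_and_separators(text, entities, ["\n", " "])))
# 'Paris 3 mai\n1890'
print(repr(keep_entities_and_separators(text, entities, [" ", "\n"])))
# 'Paris 3 mai 1890'
```

In the first call the line break wins over the space that follows it; in the second call the order is swapped, so the space is kept instead.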