# Dataset extraction

## Description
Use the `teklia-dan dataset extract` command to extract a dataset from an Arkindex export database (SQLite format). This will:

- create a mapping of the elements (identified by their ID) to the image information and the ground-truth transcription (with NER tokens if needed), in the `split.json` file,
- store the set of characters encountered in the dataset, in the `charset.pkl` file,
- generate the resources needed to build an n-gram language model at character, subword or word level with kenlm, in the `language_model/` folder.
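As a quick sanity check after extraction, the generated files can be inspected from Python. The exact internal structure of `split.json` and `charset.pkl` is not documented here, so the sketch below only assumes that the former is standard JSON and the latter a pickled collection of characters; the `data` folder name matches the `--output` value used in the examples further down.

```python
import json
import pickle
from pathlib import Path

# Output folder passed to --output (assumed to be "data" here).
output = Path("data")

# split.json maps element IDs to image information and ground-truth transcriptions.
with (output / "split.json").open() as f:
    splits = json.load(f)
print(type(splits), len(splits))

# charset.pkl stores the set of characters encountered in the dataset.
with (output / "charset.pkl").open("rb") as f:
    charset = pickle.load(f)
print(f"{len(charset)} characters in the charset")
```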
| Parameter | Description | Type | Default |
|---|---|---|---|
| `database` | Path to an Arkindex export database in SQLite format. | `pathlib.Path` | |
| `--dataset-id` | ID of the dataset to extract from Arkindex. | `uuid` | |
| `--element-type` | Type of the elements to extract. You may specify multiple types. | `str` | |
| `--output` | Folder where the data will be generated. | `pathlib.Path` | |
| `--entity-separators` | Removes all text that does not appear in an entity or in the list of given ordered characters. If several separators follow each other, only the first one to appear in the list is kept. Do not give any argument to keep the whole text (see the dedicated section). | `str` | |
| `--unknown-token` | Token used to replace characters in the validation/test sets that are not included in the training set. | `str` | `⁇` |
| `--tokens` | Mapping between starting tokens and ending tokens to extract text with their entities. | `pathlib.Path` | |
| `--transcription-worker-version` | Filter transcriptions by `worker_version`. Use `manual` for manual filtering. | `str` or `uuid` | |
| `--entity-worker-version` | Filter transcription entities by `worker_version`. Use `manual` for manual filtering. | `str` or `uuid` | |
| `--keep-spaces` | Transcriptions are trimmed by default. Use this flag to disable this behaviour. | `bool` | `False` |
| `--allow-empty` | Elements with no transcriptions are skipped by default. Use this flag to disable this behaviour. | `bool` | `False` |
| `--subword-vocab-size` | Size of the vocabulary used to train the sentencepiece subword tokenizer needed to train the optional language model. | `int` | `1000` |
The `--tokens` argument expects a YAML-formatted file with a specific format: a list of entries, each describing a NER entity. The label of the entity is the key of a dict mapping to its starting and ending tokens. This file can be generated by the `teklia-dan dataset tokens` command. More details are available in the dedicated page.
```yaml
INTITULE: # Type of the entity on Arkindex
  start: ⓘ # Starting token for this entity
  end: Ⓘ # Optional ending token for this entity
DATE:
  start: ⓓ
  end: Ⓓ
COTE_SERIE:
  start: ⓢ
  end: Ⓢ
ANALYSE_COMPL.:
  start: ⓒ
  end: Ⓒ
PRECISIONS_SUR_COTE:
  start: ⓟ
  end: Ⓟ
COTE_ARTICLE:
  start: ⓐ
  end: Ⓐ
CLASSEMENT:
  start: ⓛ
  end: Ⓛ
```
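To double-check such a file before running the extraction, it can be loaded from Python. This is only a sketch: the documentation does not prescribe a reader, so PyYAML and the file name `tokens.yml` (matching the example command below) are assumptions.

```python
import yaml  # PyYAML, assumed to be installed

# Load the entity-to-token mapping described above.
with open("tokens.yml", encoding="utf-8") as f:
    tokens = yaml.safe_load(f)

# Each entry maps an Arkindex entity type to its starting and (optional) ending token.
for entity, marks in tokens.items():
    print(f"{entity}: start={marks['start']!r}, end={marks.get('end')!r}")
```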
## Examples

### HTR and NER data
To use the data from three folders as training, validation and test datasets respectively, use the following:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --dataset-id dataset_uuid \
    --element-type page \
    --output data \
    --tokens tokens.yml
```
If the model should predict entities only, and not the text surrounding them, the `--entity-separators` parameter can be used to list the only characters allowed in the transcription outside of entities. Only one of them is kept between two entities; the priority is given by the order of the characters in the list.
For a two-line transcription with entities, the extraction with `--entity-separators=" "` keeps the entities joined by spaces, while `--entity-separators="\n" " "` keeps the line break between entities that originally sat on different lines. The order of the arguments is important: if whitespace has a higher priority than line breaks, i.e. `--entity-separators=" " "\n"`, the entities are joined by spaces instead. A sketch of this behaviour is shown below.
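The following Python sketch mimics the rule described above on a hypothetical transcription; it is an illustration only, not the actual extraction code, and the text, entity offsets and helper name are made up for the example.

```python
def keep_entities(text: str, spans: list[tuple[int, int]], separators: list[str]) -> str:
    """Keep only entity text; collapse everything between two entities to a single
    separator, choosing the one that appears first in `separators` (highest priority)."""
    allowed = set(separators)
    pieces, cursor = [], 0
    for start, end in sorted(spans):
        between = [c for c in text[cursor:start] if c in allowed]
        if between and pieces:
            pieces.append(min(between, key=separators.index))
        pieces.append(text[start:end])  # entity text is kept verbatim
        cursor = end
    return "".join(pieces)


# Hypothetical two-line transcription where "Charles III" and "us" are the entities.
text = "the great king Charles III has eaten\nwith us ."
spans = [(15, 26), (42, 44)]

print(keep_entities(text, spans, [" "]))        # Charles III us
print(keep_entities(text, spans, ["\n", " "]))  # Charles III\nus  (line break kept)
print(keep_entities(text, spans, [" ", "\n"]))  # Charles III us   (space has priority)
```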
### HTR from multiple element types

To extract HTR data from `annotation` and `text_zone` elements of each folder, but only keep those that are children of `single_page` elements, use the following:
```shell
teklia-dan dataset extract \
    database.sqlite \
    --dataset-id dataset_uuid \
    --element-type text_zone annotation \
    --output data
```