
Dataset extraction

Description

Use the teklia-dan dataset extract command to extract a dataset from an Arkindex export database (SQLite format). This will generate the images and the labels needed to train a DAN model.

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `database` | Path to an Arkindex export database in SQLite format. | `Path` | |
| `--parent` | UUID of the folder to import from Arkindex. You may specify multiple UUIDs. | `str` or `uuid` | |
| `--element-type` | Type of the elements to extract. You may specify multiple types. | `str` | |
| `--parent-element-type` | Type of the parent element containing the data. | `str` | `page` |
| `--output` | Folder where the data will be generated. | `Path` | |
| `--load-entities` | Extract the text with its entities. Needed for NER tasks. | `bool` | `False` |
| `--entity-separators` | Remove all text that does not appear in an entity or in the given ordered list of characters. If several separators follow each other, only the first one in the list is kept. Do not pass any argument to keep the whole text. | `str` | (see dedicated section) |
| `--tokens` | Mapping between starting and ending tokens. Needed for NER tasks. | `Path` | |
| `--use-existing-split` | Use the specified folder IDs for the dataset split. | `bool` | |
| `--train-folder` | ID of the training folder to import from Arkindex. | `uuid` | |
| `--val-folder` | ID of the validation folder to import from Arkindex. | `uuid` | |
| `--test-folder` | ID of the testing folder to import from Arkindex. | `uuid` | |
| `--transcription-worker-version` | Filter transcriptions by worker_version. Use `manual` for manual filtering. | `str` or `uuid` | |
| `--entity-worker-version` | Filter transcription entities by worker_version. Use `manual` for manual filtering. | `str` or `uuid` | |
| `--train-prob` | Training set split size. | `float` | `0.7` |
| `--val-prob` | Validation set split size. | `float` | `0.15` |
| `--max-width` | Images larger than this width will be resized to this width. | `int` | |
| `--max-height` | Images larger than this height will be resized to this height. | `int` | |
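
For example, the resizing and filtering options can be combined in a single call; a minimal sketch, where the UUID and the 2000-pixel limits are placeholder values to adapt to your own data:

teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type page \
    --output data \
    --max-width 2000 \
    --max-height 2000 \
    --transcription-worker-version manual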

The --tokens argument expects a YAML file containing one entry per NER entity. The label of the entity (its type on Arkindex) is a key mapping to a dict with its starting and, optionally, ending tokens, as in the example below (the circled single-character tokens are example placeholders).

INTITULE: # Type of the entity on Arkindex
  start: ⓘ # Starting token for this entity
  end: Ⓘ # Optional ending token for this entity
DATE:
  start: ⓓ
  end: Ⓓ
COTE_SERIE:
  start: ⓢ
  end: Ⓢ
ANALYSE_COMPL.:
  start: ⓒ
  end: Ⓒ
PRECISIONS_SUR_COTE:
  start: ⓟ
  end: Ⓟ
COTE_ARTICLE:
  start: ⓐ
  end: Ⓐ
CLASSEMENT:
  start: ⓛ
  end: Ⓛ
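
Since the file must be valid YAML, a quick way to check that it parses is to load it with any YAML library; a minimal sketch using PyYAML (assuming it is installed):

python -c 'import yaml; print(yaml.safe_load(open("tokens.yml")))'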

Examples

HTR and NER data from one source

To extract HTR+NER data from the pages of a folder, define an end token for each entity and use the following command:

teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml

where tokens.yml follows the format described above.

HTR and NER data from multiple sources

To do the same with the data from three folders, the command becomes:

teklia-dan dataset extract \
    database.sqlite \
    --parent folder1_uuid folder2_uuid folder3_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml

HTR and NER data with an existing split

To use the data from three folders as the training, validation and testing sets respectively, the command becomes:

teklia-dan dataset extract \
    database.sqlite \
    --use-existing-split \
    --train-folder train_folder_uuid \
    --val-folder val_folder_uuid \
    --test-folder test_folder_uuid \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml

HTR from multiple element types with some parent filtering

To extract HTR data from annotations and text_zones in a folder, keeping only those that are children of single_pages, use the following command:

teklia-dan dataset extract \
    database.sqlite \
    --parent folder_uuid \
    --element-type text_zone annotation \
    --parent-element-type single_page \
    --output data

NER data

To extract NER data and keep line breaks and spaces between entities, use the following command:

teklia-dan dataset extract \
    [...]
    --load-entities \
    --entity-separators $'\n' " " \
    --tokens tokens.yml

If several separators follow each other, only the first one in the list is kept: here a line break if there is one, otherwise a space. If you reverse the order of the --entity-separators values, a space is kept if there is one, otherwise a line break.
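
For instance, to prefer spaces over line breaks, reverse the order of the separators; the rest of the command is unchanged:

teklia-dan dataset extract \
    [...]
    --load-entities \
    --entity-separators " " $'\n' \
    --tokens tokens.yml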