
Dataset extraction

Description

Use the teklia-dan dataset extract command to extract a dataset from an Arkindex export database (SQLite format). This will:

  • Generate the images of each element (in the images/ folder),
  • Create the mapping of each image (identified by its path) to its ground-truth transcription (with NER tokens if needed) (in the labels.json file),
  • Store the set of characters encountered in the dataset (in the charset.pkl file).
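
The generated files can be inspected from Python. The following is a minimal sketch, assuming that labels.json is plain JSON and that charset.pkl is a pickled collection of characters; `load_outputs` is a hypothetical helper, not part of teklia-dan.

```python
import json
import pickle
from pathlib import Path


def load_outputs(output_dir):
    """Load labels.json and charset.pkl from an extraction output folder."""
    output_dir = Path(output_dir)
    # labels.json maps image paths to transcriptions; any nesting is left untouched.
    labels = json.loads((output_dir / "labels.json").read_text(encoding="utf-8"))
    # charset.pkl is assumed to be a pickled collection of characters.
    with (output_dir / "charset.pkl").open("rb") as handle:
        charset = pickle.load(handle)
    return labels, charset
```

For example, `labels, charset = load_outputs("data")` would read the files written with `--output data`.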

If an image download fails for any reason, that image will not appear in the transcriptions file, and the reason is printed to stdout at the end of the process. Before downloading an image, the command checks whether it was already downloaded. It is therefore safe to run the command again if a few images failed.
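
The "check before download" behaviour described above can be pictured with the following sketch. It is not the tool's actual implementation: `download_if_missing` and the `fetch` callable are hypothetical names used only for illustration.

```python
from pathlib import Path


def download_if_missing(image_path, fetch):
    """Return True if the image is on disk after the call, False on failure.

    `fetch` is a hypothetical callable that returns the image bytes or raises.
    """
    image_path = Path(image_path)
    if image_path.exists():
        # Already downloaded by a previous run: nothing to do.
        return True
    try:
        data = fetch()
    except Exception as error:
        # A failed download leaves no file behind, so a rerun will retry it.
        print(f"Failed to download {image_path}: {error}")
        return False
    image_path.parent.mkdir(parents=True, exist_ok=True)
    image_path.write_bytes(data)
    return True
```

Because existing files are skipped and failures write nothing, calling this twice only retries the images that failed the first time.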

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| database | Path to an Arkindex export database in SQLite format. | Path | |
| --element-type | Type of the elements to extract. You may specify multiple types. | str | |
| --parent-element-type | Type of the parent element containing the data. | str | page |
| --output | Folder where the data will be generated. | Path | |
| --entity-separators | Removes all text that does not appear in an entity or in the list of given ordered characters. If several separators follow each other, keep only the first to appear in the list. Do not give any arguments to keep the whole text. | str | (see dedicated section) |
| --tokens | Mapping between starting tokens and end tokens to extract text with their entities. | Path | |
| --train-folder | ID of the training folder to import from Arkindex. | uuid | |
| --val-folder | ID of the validation folder to import from Arkindex. | uuid | |
| --test-folder | ID of the testing folder to import from Arkindex. | uuid | |
| --transcription-worker-version | Filter transcriptions by worker_version. Use manual for manual filtering. | str or uuid | |
| --entity-worker-version | Filter transcription entities by worker_version. Use manual for manual filtering. | str or uuid | |
| --max-width | Images larger than this width will be resized to this width. | int | |
| --max-height | Images larger than this height will be resized to this height. | int | |
| --keep-spaces | Transcriptions are trimmed by default. Use this flag to disable this behaviour. | bool | False |
| --image-format | Images will be saved under this format. | str | .jpg |

The --tokens argument expects a YAML file in which each entry describes a NER entity. The label of the entity is the key, and its value is a mapping that gives the starting token and, optionally, the ending token for that entity.

INTITULE: # Type of the entity on Arkindex
  start:  # Starting token for this entity
  end:  # Optional ending token for this entity
DATE:
  start: 
  end: 
COTE_SERIE:
  start: 
  end: 
ANALYSE_COMPL.:
  start: 
  end: 
PRECISIONS_SUR_COTE:
  start: 
  end: 
COTE_ARTICLE:
  start: 
  end: 
CLASSEMENT:
  start: 
  end: 
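
To illustrate how such a mapping is meant to be used, here is a small sketch that wraps an entity's text with its tokens. The token strings below are placeholders chosen for readability, not the values used in the file above, and `wrap_entity` is a hypothetical helper rather than a teklia-dan function.

```python
# Hypothetical token mapping with the same shape as the YAML file above.
# The token strings are placeholders, not the values used by the tool.
TOKENS = {
    "DATE": {"start": "<date>", "end": "</date>"},
    "INTITULE": {"start": "<intitule>", "end": ""},  # no ending token
}


def wrap_entity(text, label, tokens):
    """Surround an entity's text with its starting and (optional) ending token."""
    entry = tokens[label]
    # A missing or empty "end" value means the entity has no ending token.
    return f"{entry['start']}{text}{entry.get('end') or ''}"
```

With this mapping, `wrap_entity("1789", "DATE", TOKENS)` gives `<date>1789</date>`, while an entity without an ending token is only prefixed.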

Examples

HTR and NER data

To use the data from three folders as the training, validation and testing datasets respectively, please use the following:

teklia-dan dataset extract \
    database.sqlite \
    --train-folder train_folder_uuid \
    --val-folder val_folder_uuid \
    --test-folder test_folder_uuid \
    --element-type page \
    --output data \
    --tokens tokens.yml

HTR from multiple element types

To extract HTR data from annotations and text_zones from each folder, but only keep those that are children of single_pages, please use the following:

teklia-dan dataset extract \
    database.sqlite \
    --train-folder train_folder_uuid \
    --val-folder val_folder_uuid \
    --test-folder test_folder_uuid \
    --element-type text_zone annotation \
    --parent-element-type single_page \
    --output data
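
When the command is launched from a script, the argument list can be assembled programmatically. The sketch below only uses the flags documented on this page; `build_extract_command` is a hypothetical helper and the folder IDs are the same placeholders as above.

```python
import shlex


def build_extract_command(database, output, element_types,
                          parent_element_type=None, folders=None):
    """Build the argument list for `teklia-dan dataset extract`."""
    command = ["teklia-dan", "dataset", "extract", database]
    # `folders` maps a split name ("train", "val", "test") to a folder ID.
    for split, folder_id in (folders or {}).items():
        command += [f"--{split}-folder", folder_id]
    # Several element types are passed as separate values of the same flag.
    command += ["--element-type", *element_types]
    if parent_element_type:
        command += ["--parent-element-type", parent_element_type]
    command += ["--output", output]
    return command


command = build_extract_command(
    "database.sqlite",
    "data",
    ["text_zone", "annotation"],
    parent_element_type="single_page",
    folders={"train": "train_folder_uuid", "val": "val_folder_uuid", "test": "test_folder_uuid"},
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues.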

HTR + NER data

To extract NER data and keep line breaks and spaces between entities, use the following command:

teklia-dan dataset extract \
    [...]
    --entity-separators $'\n' " " \
    --tokens tokens.yml

If several separators follow each other, only one is kept: the first one in the --entity-separators list that occurs in the sequence. With the order above, this is a line break if there is one, otherwise a space. If you swap the order of the --entity-separators values, a space is kept if there is one, otherwise a line break.
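
The ordering rule can be illustrated with a short sketch. It only covers the collapsing of consecutive separators, not the removal of text outside entities, and it is an illustration of the documented behaviour rather than the tool's own code; `collapse_separators` is a hypothetical name.

```python
import re


def collapse_separators(text, separators):
    """Replace each run of separators with the first listed separator it contains."""
    # Match one or more consecutive separators, whatever their order.
    pattern = "(?:" + "|".join(re.escape(sep) for sep in separators) + ")+"

    def keep_first(match):
        run = match.group(0)
        # Separators are checked in the order given by the caller.
        for separator in separators:
            if separator in run:
                return separator
        return run

    return re.sub(pattern, keep_first, text)
```

For the text `"Paris \n 1789"`, the order `["\n", " "]` keeps the line break between the two words, whereas `[" ", "\n"]` keeps a single space.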