Dataset extraction

Description

Use the teklia-dan dataset extract command to extract a dataset from an Arkindex export database (SQLite format). This will:

  • Create a mapping of the elements (identified by their ID) to the image information and the ground-truth transcription (with NER tokens if needed), stored in the split.json file. A quick way to inspect this file is sketched after the parameter table below.
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| database | Path to an Arkindex export database in SQLite format. | pathlib.Path | |
| --dataset-id | ID of the dataset to extract from Arkindex. | uuid | |
| --element-type | Type of the elements to extract. You may specify multiple types. | str | |
| --output | Folder where the data will be generated. | pathlib.Path | |
| --entity-separators | Remove all text that does not appear in an entity or in the list of given ordered characters. If several separators follow each other, keep only the first to appear in the list. Do not give any arguments to keep the whole text (see the dedicated section). | str | |
| --tokens | Mapping between starting tokens and ending tokens to extract text with their entities. | pathlib.Path | |
| --transcription-worker-versions | Filter transcriptions by worker_version. Use manual for manual filtering. | str or uuid | |
| --entity-worker-versions | Filter transcription entities by worker_version. Use manual for manual filtering. | str or uuid | |
| --transcription-worker-runs | Filter transcriptions by worker_runs. Use manual for manual filtering. | str or uuid | |
| --entity-worker-runs | Filter transcription entities by worker_runs. Use manual for manual filtering. | str or uuid | |
| --keep-spaces | Transcriptions are trimmed by default. Use this flag to disable this behaviour. | bool | False |
| --allow-empty | Elements with no transcriptions are skipped by default. Use this flag to disable this behaviour. | bool | False |
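
After extraction, you can quickly confirm what was generated by loading split.json and counting the entries per split. This is a minimal sketch: the data/split.json path assumes the command was run with --output data, and the top-level layout (split names mapping to per-element entries) is an assumption made for illustration, not a documented guarantee.

import json
from pathlib import Path

# Path assumes the command was run with `--output data`.
split = json.loads(Path("data/split.json").read_text())

# Assumed layout: split names mapping to per-element entries holding
# the image information and the ground-truth transcription.
for split_name, elements in split.items():
    print(f"{split_name}: {len(elements)} elements")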

The --tokens argument expects a YAML-formatted file with a specific structure: a list of entries, each describing a NER entity. The label of the entity is the key to a dict mapping to the starting and ending tokens respectively. This file can be generated by the teklia-dan dataset tokens command. More details are available on the dedicated page.

INTITULE: # Type of the entity on Arkindex
  start:  # Starting token for this entity
  end:  # Optional ending token for this entity
DATE:
  start: 
  end: 
COTE_SERIE:
  start: 
  end: 
ANALYSE_COMPL.:
  start: 
  end: 
PRECISIONS_SUR_COTE:
  start: 
  end: 
COTE_ARTICLE:
  start: 
  end: 
CLASSEMENT:
  start: 
  end: 
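
As a quick sanity check, such a file can be parsed with PyYAML to verify that every entity defines a starting token. This is only a sketch; the tokens.yml path is an assumption for the example.

import yaml  # PyYAML

# Parse the tokens file (the path is an assumption for this example).
with open("tokens.yml", encoding="utf-8") as f:
    tokens = yaml.safe_load(f)

# Expected shape: {entity_label: {"start": str, "end": str or None}}.
for label, marks in tokens.items():
    if not marks or not marks.get("start"):
        raise ValueError(f"Entity {label} is missing a starting token")
    print(label, "->", marks["start"], marks.get("end") or "(no end token)")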

Examples

HTR and NER data

To use the data from three folders as training, validation and testing datasets respectively, please use the following:

teklia-dan dataset extract \
    database.sqlite \
    --dataset-id dataset_uuid \
    --element-type page \
    --output data \
    --tokens tokens.yml

If the model should predict entities only, and not the text surrounding them, the --entity-separators parameter can be used to list the only characters allowed in the transcription outside of entities. Only one of them is kept between two entities; the priority follows the order in which the characters are given.

Here is an example of a transcription on two lines, where the entities are great, Charles and us:

The great king Charles III has eaten
with us .

Here is the extraction with --entity-separators=" ":

great Charles us

Here is the extraction with --entity-separators="\n" " ":

great Charles
us

The order of the arguments is important. If whitespaces take priority over linebreaks, i.e. --entity-separators=" " "\n", the extraction will result in:

great Charles us
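
To make the priority rule concrete, here is a rough re-implementation of the behaviour described above, written in Python. It is an illustration only, not the tool's actual code: the entity spans are hard-coded by hand for this example.

def keep_entities(text, spans, separators):
    # Keep entity spans only; between two consecutive entities, keep the
    # first separator (in priority order) found in the intervening text.
    parts = []
    for i, (start, end) in enumerate(spans):
        if i > 0:
            between = text[spans[i - 1][1]:start]
            parts.append(next((s for s in separators if s in between), ""))
        parts.append(text[start:end])
    return "".join(parts)

text = "The great king Charles III has eaten\nwith us ."
spans = [(4, 9), (15, 22), (42, 44)]  # "great", "Charles", "us"
print(keep_entities(text, spans, [" "]))        # great Charles us
print(keep_entities(text, spans, ["\n", " "]))  # great Charles, then us on a new line
print(keep_entities(text, spans, [" ", "\n"]))  # great Charles us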

HTR from multiple element types

To extract HTR data from annotation and text_zone elements from each folder, but only keep those that are children of single_page elements, please use the following:

teklia-dan dataset extract \
    database.sqlite \
    --dataset-id dataset_uuid \
    --element-type text_zone annotation \
    --output data