Dataset tokens

Description

Use the teklia-dan dataset tokens command to generate a YAML file that lists entities and their token(s), for use when training a DAN model.

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| entities | Path to a YAML file containing the extracted entities. | pathlib.Path | |
| --end-tokens | Whether to generate end tokens along with starting tokens. | bool | False |
| --output-file | Path to a YAML file to save the entities and their token(s). | pathlib.Path | tokens.yml |

The entities argument expects a YAML file containing the list of entity names. This file can be generated by the teklia-dan dataset entities command. See the dedicated page for more details.

entities:
  - INTITULE
  - DATE
  - ANALYSE_COMPL.
  - PRECISIONS_SUR_COTE
  - COTE_ARTICLE
  - CLASSEMENT

Examples

Start tokens

teklia-dan dataset tokens \
    entities.yml

This command creates a YAML file named tokens.yml. The file contains one entry per NER entity: the label of the entity is the key, and its value is a dict holding the starting and ending tokens. Because --end-tokens is not set, every end value is an empty string.

INTITULE: # Type of the entity on Arkindex
  start:  # Starting token for this entity
  end: ''
DATE:
  start: 
  end: ''
ANALYSE_COMPL.:
  start: 
  end: ''
PRECISIONS_SUR_COTE:
  start: 
  end: ''
COTE_ARTICLE:
  start: 
  end: ''
CLASSEMENT:
  start: 
  end: ''
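
The shape of this file can be checked with a few lines of Python. The snippet below is a minimal sketch, not part of teklia-dan: `check_tokens` is a hypothetical helper, the tokens `<A>` and `<B>` are placeholders standing in for the real generated tokens, and the rule that a token should not be reused across entities is an assumption. The mapping is written as a plain dict that mirrors the YAML structure.

```python
def check_tokens(mapping, expect_end_tokens=False):
    """Return a list of problems found in a tokens mapping (empty if none).

    `mapping` mirrors the YAML file: {entity_label: {"start": ..., "end": ...}}.
    """
    problems = []
    seen = set()
    for entity, tokens in mapping.items():
        # Every entry is expected to hold exactly the two documented keys.
        if set(tokens) != {"start", "end"}:
            problems.append(f"{entity}: expected exactly 'start' and 'end' keys")
            continue
        start, end = tokens["start"], tokens["end"]
        if not start:
            problems.append(f"{entity}: missing start token")
        if expect_end_tokens and not end:
            problems.append(f"{entity}: missing end token")
        if not expect_end_tokens and end != "":
            problems.append(f"{entity}: unexpected end token")
        # Assumption: a token should identify a single entity boundary.
        for token in (start, end):
            if token and token in seen:
                problems.append(f"{entity}: token {token!r} is reused")
            if token:
                seen.add(token)
    return problems


# Placeholder tokens; the real file contains the tokens chosen by the command.
start_only = {
    "INTITULE": {"start": "<A>", "end": ""},
    "DATE": {"start": "<B>", "end": ""},
}
print(check_tokens(start_only))  # prints []
```

Passing `expect_end_tokens=True` on the same mapping reports a missing end token for each entity, which matches a file generated without --end-tokens.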

Start tokens + End tokens

teklia-dan dataset tokens \
    entities.yml \
    --end-tokens

This command creates a YAML file named tokens.yml. The file contains one entry per NER entity: the label of the entity is the key, and its value is a dict holding the starting and ending tokens. With --end-tokens, each entity gets both a starting token and an ending token.

INTITULE: # Type of the entity on Arkindex
  start:  # Starting token for this entity
  end:  # Ending token for this entity
DATE:
  start: 
  end: 
ANALYSE_COMPL.:
  start: 
  end: 
PRECISIONS_SUR_COTE:
  start: 
  end: 
COTE_ARTICLE:
  start: 
  end: 
CLASSEMENT:
  start: 
  end:
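
To illustrate the practical difference between the two modes, the sketch below wraps an entity's text with its tokens. It rests on assumptions: this page does not describe how DAN inserts tokens into transcriptions, `tag_entity` is a hypothetical helper, and `<D>` and `</D>` are placeholder tokens rather than the ones the command generates.

```python
def tag_entity(text, entity, tokens):
    """Surround `text` with the start and end tokens of `entity`.

    `tokens` mirrors the YAML file: {entity_label: {"start": ..., "end": ...}}.
    An empty end token (the default mode) leaves the entity open-ended.
    """
    entry = tokens[entity]
    return f"{entry['start']}{text}{entry['end']}"


# Placeholder mappings for the two modes of the command.
start_only = {"DATE": {"start": "<D>", "end": ""}}
start_and_end = {"DATE": {"start": "<D>", "end": "</D>"}}

print(tag_entity("1789", "DATE", start_only))     # prints <D>1789
print(tag_entity("1789", "DATE", start_and_end))  # prints <D>1789</D>
```

With start tokens only, the end of an entity is implicit; with end tokens, both boundaries of the entity are marked explicitly.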