# Dataset tokens

## Description

Use the `teklia-dan dataset tokens` command to generate a YAML file containing entities and their token(s) to train a DAN model.

| Parameter       | Description                                                  | Type           | Default      |
| --------------- | ------------------------------------------------------------ | -------------- | ------------ |
| `entities`      | Path to a YAML file containing the extracted entities.       | `pathlib.Path` |              |
| `--end-tokens`  | Whether to generate end tokens along with starting tokens.   | `bool`         | `False`      |
| `--output-file` | Path to a YAML file to save the entities and their token(s). | `pathlib.Path` | `tokens.yml` |

The `entities` argument expects a YAML-formatted file with the list of entity names. This file can be generated by the `teklia-dan dataset entities` command. More details in the [dedicated page](./entities.md).

```yaml
entities:
  - INTITULE
  - DATE
  - ANALYSE_COMPL.
  - PRECISIONS_SUR_COTE
  - COTE_ARTICLE
  - CLASSEMENT
```

## Examples

### Start tokens

```shell
teklia-dan dataset tokens \
    entities.yml
```

This command creates a `tokens.yml` YAML file in which each entry describes one NER entity. The label of the entity is the key to a dict mapping its starting and ending tokens. Since `--end-tokens` is not set, every `end` value is an empty string.

```yaml
INTITULE: # Type of the entity on Arkindex
  start: Ⓐ # Starting token for this entity
  end: ''
DATE:
  start: Ⓑ
  end: ''
ANALYSE_COMPL.:
  start: Ⓒ
  end: ''
PRECISIONS_SUR_COTE:
  start: Ⓓ
  end: ''
COTE_ARTICLE:
  start: Ⓔ
  end: ''
CLASSEMENT:
  start: Ⓕ
  end: ''
```

### Start tokens + End tokens

```shell
teklia-dan dataset tokens \
    entities.yml \
    --end-tokens
```

This command creates a `tokens.yml` YAML file in which each entry describes one NER entity. The label of the entity is the key to a dict mapping its starting and ending tokens. With `--end-tokens`, each entity gets a distinct `end` token in addition to its `start` token.

```yaml
INTITULE: # Type of the entity on Arkindex
  start: Ⓐ # Starting token for this entity
  end: Ⓑ # Ending token for this entity
DATE:
  start: Ⓒ
  end: Ⓓ
ANALYSE_COMPL.:
  start: Ⓔ
  end: Ⓕ
PRECISIONS_SUR_COTE:
  start: Ⓖ
  end: Ⓗ
COTE_ARTICLE:
  start: Ⓘ
  end: Ⓙ
CLASSEMENT:
  start: Ⓚ
  end: Ⓛ
```
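
The two outputs above suggest that tokens are handed out in order from a run of consecutive circled letters: one per entity without `--end-tokens`, two per entity with it. The Python sketch below reproduces that pattern for illustration only. It is not taken from the `teklia-dan` source; the function name `assign_tokens`, the constant `FIRST_TOKEN`, and the assumption that tokens are consecutive Unicode code points starting at `Ⓐ` are all hypothetical.

```python
from itertools import count

# Assumption: tokens are consecutive code points starting at "Ⓐ" (U+24B6).
FIRST_TOKEN = 0x24B6


def assign_tokens(entities, end_tokens=False):
    """Map each entity name to a dict holding its start and end tokens.

    Hypothetical helper mirroring the example outputs, not the real CLI code.
    """
    code_points = count(FIRST_TOKEN)
    tokens = {}
    for name in entities:
        # The start token is always drawn first.
        start = chr(next(code_points))
        # The end token is drawn only when requested, else left empty.
        end = chr(next(code_points)) if end_tokens else ""
        tokens[name] = {"start": start, "end": end}
    return tokens


entities = ["INTITULE", "DATE", "ANALYSE_COMPL."]
print(assign_tokens(entities))
print(assign_tokens(entities, end_tokens=True))
```

Under this assumption, enabling end tokens doubles the number of characters consumed, which is why `DATE` starts at `Ⓑ` in the first example and at `Ⓒ` in the second.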