# Dataset tokens

## Description

Use the `teklia-dan dataset tokens` command to generate a YAML file containing entities and their token(s) to train a DAN model.

| Parameter       | Description                                                  | Type           | Default      |
| --------------- | ------------------------------------------------------------ | -------------- | ------------ |
| `entities`      | Path to a YAML file containing the extracted entities.       | `pathlib.Path` |              |
| `--end-tokens`  | Whether to generate end tokens along with starting tokens.   | `bool`         | `False`      |
| `--output-file` | Path to a YAML file to save the entities and their token(s). | `pathlib.Path` | `tokens.yml` |

The `entities` argument expects a YAML-formatted file containing the list of entity names. This file can be generated by the `teklia-dan dataset entities` command; see the [dedicated page](./entities.md) for more details.

```yaml
entities:
  - INTITULE
  - DATE
  - ANALYSE_COMPL.
  - PRECISIONS_SUR_COTE
  - COTE_ARTICLE
  - CLASSEMENT
```

## Examples

### Start tokens

```shell
teklia-dan dataset tokens \
    entities.yml
```

This command creates a YAML-formatted `tokens.yml` file with one entry per NER entity. Each entity label is a key mapped to a dict holding its `start` and `end` tokens. Since `--end-tokens` is not set, every `end` value is an empty string.

```yaml
INTITULE: # Type of the entity on Arkindex
  start: Ⓐ # Starting token for this entity
  end: ''
DATE:
  start: Ⓑ
  end: ''
ANALYSE_COMPL.:
  start: Ⓒ
  end: ''
PRECISIONS_SUR_COTE:
  start: Ⓓ
  end: ''
COTE_ARTICLE:
  start: Ⓔ
  end: ''
CLASSEMENT:
  start: Ⓕ
  end: ''
```

### Start tokens + End tokens

```shell
teklia-dan dataset tokens \
    entities.yml \
    --end-tokens
```

This command creates a YAML-formatted `tokens.yml` file with one entry per NER entity. Each entity label is a key mapped to a dict holding its `start` and `end` tokens. With `--end-tokens`, each entity gets its own ending token in addition to its starting token.

```yaml
INTITULE: # Type of the entity on Arkindex
  start: Ⓐ # Starting token for this entity
  end: Ⓑ # Ending token for this entity
DATE:
  start: Ⓒ
  end: Ⓓ
ANALYSE_COMPL.:
  start: Ⓔ
  end: Ⓕ
PRECISIONS_SUR_COTE:
  start: Ⓖ
  end: Ⓗ
COTE_ARTICLE:
  start: Ⓘ
  end: Ⓙ
CLASSEMENT:
  start: Ⓚ
  end: Ⓛ
```