Convert
Use the teklia-dan convert command to convert DAN predictions to BIO format. This is also the code used during evaluation.
BIO format
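The BIO (or IOB) format is a representation used for the Named Entity Recognition task. Each token is written on its own line, followed by a tag: B- marks the first token of an entity, I- marks a token inside the same entity, and O marks a token outside any entity.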
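To illustrate how the tags delimit entities, here is a minimal sketch that groups BIO-tagged lines back into entities. It is not part of teklia-dan: the function name `bio_to_entities` and the `Place` entity are hypothetical and used for illustration only.

```python
def bio_to_entities(lines):
    """Group "token tag" lines into (entity_type, text) pairs.

    Hypothetical helper, not part of teklia-dan.
    """
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity being built, if any.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))

    for line in lines:
        token, tag = line.rsplit(" ", 1)
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and tag[2:] == current_type:
            # An I- tag continues the entity that is currently open.
            current_tokens.append(token)
        elif tag.startswith("I-"):
            # An I- tag with no matching open entity is treated as a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        else:
            # An O tag closes any open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


sample = [
    "Born O",
    "on O",
    "27 B-Date",
    "aout I-Date",
    "1858 I-Date",
    "in O",
    "Paris B-Place",  # "Place" is an illustrative entity name
]
print(bio_to_entities(sample))
```

The sketch splits each line on its last space, so it assumes tokens and tags are separated by a single space as in the examples on this page.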
Description
This command is meant to be used on DAN predictions. Make sure the predict command has been used first. The first argument of this command is the path to a folder with the predictions in JSON format. The other required arguments are described in the table below.
Parameter | Description | Type | Default |
---|---|---|---|
--output | Where BIO files are saved. Will be created if missing. | pathlib.Path | |
--tokens | Mapping between starting tokens and end tokens to extract text with their entities. | pathlib.Path | |
!!! note
    The --tokens argument expects the same file as the one used during dataset extraction, generated by the tokens subcommand.
Examples
Take a simple prediction from DAN.
{
  "text": "Ⓐ27 aout 1858\nⒶ27 aout 1858\nⒶ27 aout 1858\nⒶ28 aout 1858\nⒶ30 aout 1858",
  "confidences": {},
  "language_model": {},
  "objects": [...]
}
With this tokens map:
Date:
  start: Ⓐ
  end:
Then you can create the corresponding BIO file using:
teklia-dan convert predictions --tokens tokens.yml --output bio
The folder pointed to by --output will be created if missing. This command will generate one BIO file per JSON prediction, under the same name. For the prediction above, the generated file contains:
27 B-Date
aout I-Date
1858 I-Date
27 B-Date
aout I-Date
1858 I-Date
27 B-Date
aout I-Date
1858 I-Date
28 B-Date
aout I-Date
1858 I-Date
30 B-Date
aout I-Date
1858 I-Date
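The mapping from the predicted text to the lines above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by teklia-dan: the helper `text_to_bio` is hypothetical, it only handles start tokens placed at the beginning of a word, and it assumes that an entity with an empty end token runs until the next start token or the end of the text.

```python
def text_to_bio(text, starts):
    """Convert text with entity start tokens into BIO lines.

    `starts` maps a start token to an entity name, e.g. {"Ⓐ": "Date"}.
    Hypothetical helper, not part of teklia-dan.
    """
    lines = []
    current = None  # name of the entity currently open
    first = False   # True when the next word is the first of the entity
    for word in text.split():
        # A word may begin with a start token, e.g. "Ⓐ27".
        if word[0] in starts:
            current = starts[word[0]]
            first = True
            word = word[1:]
            if not word:
                # The token stood alone: the entity starts at the next word.
                continue
        if current is None:
            lines.append(f"{word} O")
        else:
            prefix = "B" if first else "I"
            lines.append(f"{word} {prefix}-{current}")
            first = False
    return lines


prediction = "Ⓐ27 aout 1858\nⒶ28 aout 1858"
print("\n".join(text_to_bio(prediction, {"Ⓐ": "Date"})))
```

Words that appear before any start token are tagged O. Entities with a non-empty end token are not handled by this sketch.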