Spaces between `B-` entities should be labelled as `O`
Ref: https://redmine.teklia.com/issues/5503#note-3
There is an issue with the `Document.char_labels` method, which affects the `atr-ner-eval nerval` evaluation command.
Character-level labels are wrong when two `B-` tokens of the same entity type follow each other. Since each `B-` token starts a separate entity, the space between them should be labelled `O`.
- Current output
>>> doc = Document("dog B-Animal\ncat B-Animal")
>>> doc.chars
['d', 'o', 'g', ' ', 'c', 'a', 't']
>>> doc.char_labels
['B-Animal', 'I-Animal', 'I-Animal', 'B-Animal', 'B-Animal', 'I-Animal', 'I-Animal']
- Expected output
>>> doc = Document("dog B-Animal\ncat B-Animal")
>>> doc.chars
['d', 'o', 'g', ' ', 'c', 'a', 't']
>>> doc.char_labels
['B-Animal', 'I-Animal', 'I-Animal', 'O', 'B-Animal', 'I-Animal', 'I-Animal']
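The expected behaviour can be illustrated with a minimal standalone sketch. This is not the actual `Document.char_labels` implementation: the helper name `char_labels_from_bio` is hypothetical, and it assumes that each line is `token label`, that tokens are joined by a single space, and that a space is labelled `I-` only when the next token continues the same entity with an `I-` tag.

```python
def char_labels_from_bio(bio_text):
    """Hypothetical helper: expand token-level BIO tags to character level.

    Assumes one "token label" pair per line and a single space between tokens.
    """
    chars, labels = [], []
    prev_type = None  # entity type of the previous token, None for "O"
    for line in bio_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token, tag = line.rsplit(" ", 1)
        ent_type = None if tag == "O" else tag[2:]

        if chars:
            # Label the separating space. It belongs to an entity only when
            # the next token continues it (I- tag with the same type).
            # A new B- token starts a separate entity, so the space is "O".
            continues = tag.startswith("I-") and ent_type == prev_type
            chars.append(" ")
            labels.append(f"I-{ent_type}" if continues else "O")

        for i, char in enumerate(token):
            chars.append(char)
            if tag == "O":
                labels.append("O")
            elif tag.startswith("B-") and i == 0:
                labels.append(tag)  # only the first character keeps B-
            else:
                labels.append(f"I-{ent_type}")
        prev_type = ent_type
    return chars, labels


chars, labels = char_labels_from_bio("dog B-Animal\ncat B-Animal")
print(chars)
print(labels)
```

With the input from this issue, the space at index 3 is labelled `O`, while a continuation such as `New B-Loc` followed by `York I-Loc` keeps the space inside the entity.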
Edited by Solene Tarride