New module to convert DAN NER predictions to BIO
We need a new module, dan.bio, with functions that convert NER-encoded strings (with tokens) to BIO.
Basically, we need a function that:
- has the signature convert(text: str, ner_tokens: Dict[str, EntityType]) -> str
- splits the text on whitespace (\s)
- iterates over the split tokens, each of which gets an IOB tag
- applies the algorithm described below
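As a rough illustration of the splitting step, the sketch below labels each whitespace-separated piece as a starting token, an ending token or a plain word. EntityType is modeled here as a hypothetical dataclass with start and end fields (its real definition is not given in this issue), and classify and the <per>-style tokens are made-up names used only for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class EntityType:
    """Hypothetical stand-in: starting token and optional ending token."""
    start: str
    end: str = ""


def classify(text: str, ner_tokens: Dict[str, EntityType]) -> List[Tuple[str, str]]:
    """Split on whitespace and label each piece as "start", "end" or "word"."""
    starts = {entity.start for entity in ner_tokens.values()}
    # An empty ending token is taken to mean "no ending token for this entity"
    ends = {entity.end for entity in ner_tokens.values() if entity.end}

    kinds: List[Tuple[str, str]] = []
    # str.split() without arguments splits on any run of whitespace
    for token in text.split():
        if token in starts:
            kinds.append((token, "start"))
        elif token in ends:
            kinds.append((token, "end"))
        else:
            kinds.append((token, "word"))
    return kinds


ner = {"person": EntityType("<per>", "</per>")}
print(classify("<per> John </per> sings", ner))
# [('<per>', 'start'), ('John', 'word'), ('</per>', 'end'), ('sings', 'word')]
```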
Algorithm
entity_types: list[str] = []  # Encountered entity types
tokens: list[str]  # List of tokens split based on charset
started: bool = False  # Whether we are inside an entity
iob_string: str = ""  # Full IOB formatted string

for token in tokens:
    if is_NER_starting_token:
        entity_types.append(token)
        if has_ending_tokens:
            # Stop any current entity type
            started = False
        continue
    elif has_ending_tokens and is_NER_ending_token:
        # Make sure the token closes the current entity
        assert entity_types[-1] == starting_token
        # Remove it from the stack
        entity_types.pop()
        # If no entity is left, we are no longer inside one;
        # otherwise we continue the parent entity
        started = bool(entity_types)
        continue

    # Not a NER token
    # If there is no current entity type
    if not entity_types:
        iob_string += f"{token} O\n"
        continue

    # There is at least one entity type
    if started:
        # We are inside an entity
        iob_string += f"{token} I-{ner_token}\n"
    else:
        # We are starting an entity
        started = True
        iob_string += f"{token} B-{ner_token}\n"
This should mostly work, even for nested entities (though not fully), with or without ending tokens.
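A minimal runnable sketch of the algorithm above, under a few assumptions: EntityType is modeled as a hypothetical dataclass with start and end fields (its real definition is not shown in this issue), the stack stores entity names rather than starting tokens, and started is reset on every starting token so that entities without ending tokens still open with a B- tag. The <per>-style tokens in the usage example are placeholders, not real DAN tokens.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EntityType:
    """Hypothetical stand-in: starting token and optional ending token."""
    start: str
    end: str = ""


def convert(text: str, ner_tokens: Dict[str, EntityType]) -> str:
    # Map each starting/ending token back to its entity name
    starting = {entity.start: name for name, entity in ner_tokens.items()}
    ending = {entity.end: name for name, entity in ner_tokens.items() if entity.end}

    entity_types: List[str] = []  # Stack of currently open entity names
    started = False  # Whether the current entity already emitted its B- tag
    lines: List[str] = []

    for token in text.split():
        if token in starting:
            entity_types.append(starting[token])
            started = False  # the next word opens this entity with a B- tag
            continue
        if token in ending:
            # The ending token must close the innermost open entity
            assert entity_types and entity_types[-1] == ending[token]
            entity_types.pop()
            # Resume the parent entity if one is still open
            started = bool(entity_types)
            continue
        if not entity_types:
            lines.append(f"{token} O")
        elif started:
            lines.append(f"{token} I-{entity_types[-1]}")
        else:
            started = True
            lines.append(f"{token} B-{entity_types[-1]}")

    return "\n".join(lines)


ner = {
    "person": EntityType("<per>", "</per>"),
    "location": EntityType("<loc>", "</loc>"),
}
print(convert("Hello <per> John Smith </per> from <loc> Paris </loc>", ner))
# Hello O
# John B-person
# Smith I-person
# from O
# Paris B-location
```

A parent entity that contains a nested entity before its own first word never receives a B- tag in this sketch, which is one way the nesting support is only partial.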
Our NER evaluation lib (nerval) explains what the BIO format is.
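As a quick illustration of the format itself (not of nerval's API), each line holds a word and a tag: B-<type> opens an entity, I-<type> continues it and O marks a word outside any entity. The hypothetical helper below, bio_to_spans, groups such lines back into entities. Treating an I- tag with no matching open entity as the start of a new one is a lenient choice made for this sketch.

```python
from typing import List, Optional, Tuple


def bio_to_spans(bio: str) -> List[Tuple[str, str]]:
    """Group "word tag" lines into (entity type, entity text) pairs."""
    spans: List[Tuple[str, List[str]]] = []
    current: Optional[Tuple[str, List[str]]] = None  # Entity being extended, if any

    for line in bio.splitlines():
        word, tag = line.rsplit(" ", 1)
        if tag == "O":
            # Outside any entity: close the current one
            current = None
        elif tag.startswith("B-") or current is None or current[0] != tag[2:]:
            # B- tag, or an I- tag that does not continue the open entity
            current = (tag[2:], [word])
            spans.append(current)
        else:
            # I- tag continuing the open entity
            current[1].append(word)

    return [(label, " ".join(words)) for label, words in spans]


print(bio_to_spans("Hello O\nJohn B-person\nSmith I-person\nfrom O\nParis B-location"))
# [('person', 'John Smith'), ('location', 'Paris')]
```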
Edited by Yoann Schneider