New module to convert DAN NER predictions to BIO

We need a new module, dan.bio, with functions that convert NER-encoded strings (text that contains entity tokens) to the BIO format.

Basically, we need a function that:

has the signature convert(text: str, ner_tokens: Dict[str, EntityType]) -> str
splits the text on whitespace (\s)
iterates over the split tokens, giving each one an IOB tag
applies the algorithm described below

Algorithm

entity_types: list[str] = []  # Encountered entity types (innermost last)
tokens: list[str]             # List of tokens split based on charset
started: bool = False         # Whether we are inside an entity
iob_string: str = ""          # Full IOB formatted string

for token in tokens:
    if is_NER_starting_token:
        entity_types.append(token)
        if has_ending_tokens:
            # Stop any current entity type
            started = False
        continue
    elif has_ending_tokens and is_NER_ending_token:
        # Make sure the token is the closing of the current entity
        # (starting_token: the starting token paired with this ending token)
        assert entity_types[-1] == starting_token

        # Remove from the stack
        entity_types.pop()

        # If no entity is left open, we are outside any entity;
        # otherwise we continue the parent entity
        started = bool(entity_types)
        continue

    # Not a NER token

    # If there is no current entity type
    if not entity_types:
        iob_string += f"{token} O\n"
        continue

    # There is at least one entity type
    ner_token = entity_types[-1]  # Innermost open entity type
    if started:
        # We are inside an entity
        iob_string += f"{token} I-{ner_token}\n"
    else:
        # We are starting an entity
        started = True
        iob_string += f"{token} B-{ner_token}\n"
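The algorithm above could be sketched in runnable Python roughly as follows. This is only an illustration under several assumptions: the EntityType class here is a hypothetical stand-in with a start token and an optional end token, the <P>, </P>, <L> and </L> tokens are made up for the example, and entity tokens are assumed to be separated from words by whitespace. The stack stores entity names rather than raw tokens so that the names can be used directly in the B-/I- tags.

```python
from typing import Dict, List, NamedTuple


class EntityType(NamedTuple):
    """Hypothetical stand-in: a start token and an optional end token."""

    start: str
    end: str = ""


def convert(text: str, ner_tokens: Dict[str, EntityType]) -> str:
    # Map each special token back to the name of its entity type.
    starting = {ent.start: name for name, ent in ner_tokens.items()}
    ending = {ent.end: name for name, ent in ner_tokens.items() if ent.end}
    has_ending_tokens = bool(ending)

    entity_types: List[str] = []  # Stack of open entity names, innermost last
    started = False  # Whether the innermost entity already got its B- tag
    lines: List[str] = []  # One "token tag" pair per output line

    for token in text.split():  # Split on any whitespace
        if token in starting:
            entity_types.append(starting[token])
            if has_ending_tokens:
                # The next word begins the newly opened entity
                started = False
            continue
        if has_ending_tokens and token in ending:
            # The ending token must close the innermost open entity
            assert entity_types and entity_types[-1] == ending[token]
            entity_types.pop()
            # Resume the parent entity if one is still open
            started = bool(entity_types)
            continue

        if not entity_types:
            lines.append(f"{token} O")
        elif started:
            lines.append(f"{token} I-{entity_types[-1]}")
        else:
            started = True
            lines.append(f"{token} B-{entity_types[-1]}")

    return "\n".join(lines)


example_tokens = {
    "PER": EntityType("<P>", "</P>"),
    "LOC": EntityType("<L>", "</L>"),
}
print(convert("Hello <P> John <L> Paris </L> Doe </P> !", example_tokens))
# Hello O
# John B-PER
# Paris B-LOC
# Doe I-PER
# ! O
```

The example only exercises the case with ending tokens, including one nested entity. The sketch follows the pseudocode as written, so the case without ending tokens would still need to be checked separately.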

This should mostly work, even for partially nested entities, with or without ending tokens.

Our NER evaluation lib (nerval) explains what the BIO format is.
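As a quick illustration of the format, a BIO string has one token per line followed by its tag: B-<type> for the first token of an entity, I-<type> for the following tokens of that entity, and O for tokens outside any entity. The small reader below is a hypothetical helper, not part of nerval or DAN, and assumes a single space separates the token from its tag.

```python
from typing import List, Tuple


def parse_bio(bio: str) -> List[Tuple[str, str]]:
    """Read a BIO string into (token, tag) pairs, skipping blank lines."""
    pairs: List[Tuple[str, str]] = []
    for line in bio.splitlines():
        if not line.strip():
            continue
        # The tag is the last space-separated field on the line
        token, tag = line.rsplit(" ", 1)
        pairs.append((token, tag))
    return pairs


sample = "John B-PER\nDoe I-PER\nlives O\nin O\nParis B-LOC"
print(parse_bio(sample))
# [('John', 'B-PER'), ('Doe', 'I-PER'), ('lives', 'O'), ('in', 'O'), ('Paris', 'B-LOC')]
```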

Edited Nov 21, 2023 by Yoann Schneider