Offset issue during dataset extraction using start_token only
Dataset extraction currently handles named entities using both start_token and end_token. However, we observed that training with only a start_token was simpler and more efficient. As a result, the extraction code should be able to handle the case where end_token is None.
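Currently, the code expects two tokens (start_token and end_token), so if end_token is None, the insertion offset is wrong and the error accumulates across the text. In the example below, each successive token lands one character further to the right than the previous one.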
current output: Dit dia rebere de ⒽBⓝernat DeⒽvⓢant pageⒽsⓞ del regne Ⓗdⓛe françⓛaⒽ habⒽiⓛtant en lo Prat ab MariⒽaⓛ viuda deⓌ ⓝBarthomⓣeⓌu Pages moⓝrⓄi en BaraⓢⓄⓛⓌ
desired output: Dit dia rebere de ⒽⓝBernat ⒽⓢDevant Ⓗⓞpages del Ⓗⓛregne ⓛⒽde Ⓗⓛfrança habitant en lo ⒽⓛPrat ab ⓌⓝMaria ⓣⓌviuda de ⓝⓄBarthomeu ⓢⓄPages mori en ⓛⓌBara
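
One way to avoid the drift is to compute every insertion position against the original text and only count tokens that are actually inserted. The sketch below illustrates that idea; it is not the project's extraction code. The names Entity and insert_tokens are hypothetical, and the offsets and the pairing of two tokens per entity are assumptions chosen to reproduce the beginning of the desired output above.

```python
from typing import List, NamedTuple, Optional


class Entity(NamedTuple):
    """Hypothetical entity record; field names are assumptions."""
    offset: int                      # start position in the ORIGINAL text
    length: int                      # entity length in the ORIGINAL text
    start_token: str
    end_token: Optional[str] = None  # None when training with start tokens only


def insert_tokens(text: str, entities: List[Entity]) -> str:
    """Insert entity tokens without letting a missing end_token shift offsets."""
    # Collect insertions as (position in original text, token).
    insertions = []
    for entity in entities:
        insertions.append((entity.offset, entity.start_token))
        if entity.end_token is not None:
            # An end token is only scheduled when it exists.
            insertions.append((entity.offset + entity.length, entity.end_token))

    # Stable sort: tokens sharing a position keep their input order.
    insertions.sort(key=lambda item: item[0])

    # Rebuild the text in a single pass, so no running offset is needed.
    parts = []
    cursor = 0
    for position, token in insertions:
        parts.append(text[cursor:position])
        parts.append(token)
        cursor = position
    parts.append(text[cursor:])
    return "".join(parts)


text = "Dit dia rebere de Bernat Devant pages"
entities = [
    Entity(18, 6, "Ⓗ"), Entity(18, 6, "ⓝ"),  # Bernat
    Entity(25, 6, "Ⓗ"), Entity(25, 6, "ⓢ"),  # Devant
    Entity(32, 5, "Ⓗ"), Entity(32, 5, "ⓞ"),  # pages
]
print(insert_tokens(text, entities))
```

Because positions are never adjusted incrementally, the same function handles both the start-token-only case and the start/end case. It does not try to order several closing tokens that fall on the same position, which nested entities with end tokens would require.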