
Offset issue during dataset extraction using start_token only

Dataset extraction currently marks named entities with both a start_token and an end_token. However, we observed that training with only a start_token is simpler and more efficient, so the extraction code should also handle the case where end_token is None.

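The code currently assumes two tokens per entity (start_token and end_token). When end_token is None, the insertion offset is computed incorrectly: tokens land inside words instead of in front of them, and the error accumulates through the rest of the text.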

current output: Dit dia rebere de ⒽBⓝernat DeⒽvⓢant pageⒽsⓞ del regne Ⓗdⓛe françⓛaⒽ habⒽiⓛtant en lo Prat ab MariⒽaⓛ viuda deⓌ ⓝBarthomⓣeⓌu Pages moⓝrⓄi en BaraⓢⓄⓛⓌ
desired output: Dit dia rebere de ⒽⓝBernat ⒽⓢDevant Ⓗⓞpages del Ⓗⓛregne ⓛⒽde Ⓗⓛfrança habitant en lo ⒽⓛPrat ab ⓌⓝMaria ⓣⓌviuda de ⓝⓄBarthomeu ⓢⓄPages mori en ⓛⓌBara
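
One way to avoid the shift is to never adjust offsets by an assumed token count, and instead place every token at its position in the original, unmodified text. The following is only a minimal sketch of that idea, not the project's actual extraction code: the `Entity` tuple, the `insert_entity_tokens` function, the label names and the shape of the `tokens` mapping are all hypothetical, and it assumes that entities are given as character offsets into the raw text, with outer entities listed before the entities nested in them.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Entity(NamedTuple):
    """Hypothetical entity record: a character span in the raw text."""
    offset: int  # index of the entity's first character
    length: int  # number of characters covered by the entity
    label: str   # entity type, used to look up its tokens


def insert_entity_tokens(
    text: str,
    entities: List[Entity],
    tokens: Dict[str, Tuple[str, Optional[str]]],
) -> str:
    """Insert entity tokens into `text`.

    `tokens` maps a label to (start_token, end_token); end_token may be None,
    in which case only the start token is emitted for that label.
    """
    starts: Dict[int, List[str]] = {}
    ends: Dict[int, List[str]] = {}
    for entity in entities:
        start_token, end_token = tokens[entity.label]
        starts.setdefault(entity.offset, []).append(start_token)
        if end_token is not None:
            # Inner entities close first, so prepend to keep nesting balanced.
            ends.setdefault(entity.offset + entity.length, []).insert(0, end_token)

    # Walk the original text once; positions always refer to the raw text,
    # so the number of tokens already emitted cannot shift later insertions.
    pieces: List[str] = []
    for position in range(len(text) + 1):
        pieces.extend(ends.get(position, []))
        pieces.extend(starts.get(position, []))
        if position < len(text):
            pieces.append(text[position])
    return "".join(pieces)


if __name__ == "__main__":
    text = "de Bernat Devant"
    entities = [
        Entity(3, 6, "person"),    # hypothetical outer entity
        Entity(3, 6, "name"),      # hypothetical nested entity
        Entity(10, 6, "surname"),
    ]
    start_only = {"person": ("Ⓗ", None), "name": ("ⓝ", None), "surname": ("ⓢ", None)}
    print(insert_entity_tokens(text, entities, start_only))
    # de ⒽⓝBernat ⓢDevant
```

Because positions are looked up in the raw text, the same function covers both configurations: with end_token set it emits balanced pairs, and with end_token set to None it emits only the start tokens in front of each entity.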