Do not edit NER text
Refs https://redmine.teklia.com/issues/5614
Here and there, we replace \n with spaces. But when the worker then searches for this edited text, it is not found in the original transcription, and all the following entities end up with an offset of 0. This can create duplicate entities.
We therefore need to keep the original text.
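
A minimal sketch of the failure mode, not the worker's actual code. The function `locate_entities`, the fallback to offset 0 when the search fails, and the sample transcription are all assumptions made up for illustration. The sketch only reproduces the duplicate case: the same entity text appearing twice collapses onto the same (text, offset) pair once \n has been replaced.

```python
def locate_entities(transcription, entity_texts):
    """Hypothetical helper: find each entity text in reading order.

    Assumption: when the text is not found, the offset falls back to 0.
    """
    located = []
    cursor = 0
    for text in entity_texts:
        found = transcription.find(text, cursor)
        if found == -1:
            # Edited text does not exist in the original transcription.
            located.append((text, 0))
        else:
            located.append((text, found))
            cursor = found + len(text)
    return located


# Made-up transcription where an entity spans a line break.
transcription = "Jean\nDupont born in Paris\nJean\nDupont died in Lyon"
original_texts = ["Jean\nDupont", "Paris", "Jean\nDupont", "Lyon"]

# What happens when \n is replaced by a space before the search.
edited_texts = [text.replace("\n", " ") for text in original_texts]

with_original = locate_entities(transcription, original_texts)
with_edited = locate_entities(transcription, edited_texts)

# Original text: every (text, offset) pair is distinct.
print(len(with_original) == len(set(with_original)))
# Edited text: both "Jean Dupont" entries land on offset 0.
print(len(with_edited) == len(set(with_edited)))
```

Under these assumptions, the edited entities contain two identical pairs, which is the kind of input that would trip a uniqueness check such as the 'entities should be unique' assertion in the logs below.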
Process: https://arkindex.teklia.com/process/c88df85f-5d03-4edb-b48a-afe93e6f1d28/0
Example logs:
2023-12-15 11:04:59,618 INFO/arkindex_worker: Processing single_page 10-1 (13091945-3c21-4ac7-973f-0a24de014b4f) (205/2658)
2023-12-15 11:04:59,756 INFO/worker_dan.worker: Downloading image...
2023-12-15 11:05:00,601 INFO/dan.ocr.predict.inference: Loading images...
2023-12-15 11:05:00,644 INFO/dan.ocr.predict.inference: Images preprocessed!
2023-12-15 11:05:00,644 INFO/dan.ocr.predict.inference: Predicting...
2023-12-15 11:05:18,442 INFO/dan.ocr.predict.inference: Images processed
2023-12-15 11:05:18,962 INFO/dan.ocr.predict.inference: Prediction parsing...
2023-12-15 11:05:18,973 INFO/dan.ocr.predict.inference: Saving JSON prediction in /tmp/tmp6byyuyqi.teklia/tmp8jutilh6.json
2023-12-15 11:05:18,990 INFO/worker_dan.worker: Creating transcription on page
2023-12-15 11:05:19,149 WARNING/arkindex_worker: Failed running worker on element 13091945-3c21-4ac7-973f-0a24de014b4f: AssertionError('entities should be unique')