Support empty lines in BIO parser

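Nerval evaluation fails on Socface images when the BIO input contains empty lines: `parse_line` in `nerval/parse.py` rejects any line that does not match the IOB regex, and an empty line never does. Log and traceback from a DAN evaluation run: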

Evaluation:   1%| | 1/126 [00:04<09:08,  4.39s/it, values={'cer': nan, 'cer_no_token': nan, 'wer': 0.0, 
2024-02-08 10:03:08,759 INFO/dan.ocr.evaluate: Evaluating on set `test`
Evaluation:   1%| | 1/109 [00:22<40:30, 22.51s/it, values={'cer': 0.0599, 'cer_no_token': 0.0651, 'wer': 

#### DAN evaluation

| Split | CER (HTR-NER) | CER (HTR) | WER (HTR-NER) | WER (HTR) | WER (HTR no punct) | NER  |
|:-----:|:-------------:|:---------:|:-------------:|:---------:|:------------------:|:----:|
|  val  |      nan      |    nan    |      0.0      |    0.0    |        0.0         | nan  |
|  test |      5.99     |    6.51   |     17.24     |   16.14   |       16.14        | 1.84 |

#### Nerval evaluation
Traceback (most recent call last):
  File "/gpfsdswork/projects/rech/rxm/ulb79yw/nerval/nerval/parse.py", line 45, in parse_line
    assert match_iob, f"Line {line} does not match IOB regex"
AssertionError: Line  does not match IOB regex

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfsdswork/projects/rech/rxm/ulb79yw/dan/env/bin/teklia-dan", line 8, in <module>
    sys.exit(main())
  File "/gpfsdswork/projects/rech/rxm/ulb79yw/dan/dan/cli.py", line 26, in main
    status = args.pop("func")(**args)
  File "/gpfsdswork/projects/rech/rxm/ulb79yw/dan/dan/ocr/evaluate.py", line 221, in run
    eval(0, config, nerval_threshold, mlflow_logging)
  File "/gpfsdswork/projects/rech/rxm/ulb79yw/dan/dan/ocr/evaluate.py", line 193, in eval
    eval_nerval(
  File "/gpfsdswork/projects/rech/rxm/ulb79yw/dan/dan/ocr/evaluate.py", line 126, in eval_nerval
    ground_truths = inferences_to_parsed_bio("ground_truth")
  File "/gpfsdswork/projects/rech/rxm/ulb79yw/dan/dan/ocr/evaluate.py", line 121, in inferences_to_parsed_bio
    return parse_bio(bio_values)
  File "/gpfsdswork/projects/rech/rxm/ulb79yw/nerval/nerval/parse.py", line 75, in parse_bio
    word, label = parse_line(index, line)
  File "/gpfsdswork/projects/rech/rxm/ulb79yw/nerval/nerval/parse.py", line 49, in parse_line
    raise Exception(f"The file is not in BIO format: check line {index} ({line})")
Exception: The file is not in BIO format: check line 0 ()
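
One possible way to support empty lines is to skip them before applying the regex. The sketch below is only an illustration, not Nerval's actual code: the `BIO_LINE` pattern, the `parse_bio_lines` function and the sample labels are all hypothetical, and the real regex in `nerval/parse.py` may differ.

```python
import re

# Hypothetical pattern (assumption, not Nerval's regex):
# a token, one space, then "O" or a label prefixed with "B-" or "I-".
BIO_LINE = re.compile(r"^(\S+) (O|[BI]-\S+)$")


def parse_bio_lines(lines):
    """Return (word, label) pairs, ignoring empty or whitespace-only lines."""
    pairs = []
    for index, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            # An empty line carries no token: skip it instead of failing.
            continue
        match = BIO_LINE.match(line)
        if match is None:
            raise ValueError(
                f"The file is not in BIO format: check line {index} ({line})"
            )
        pairs.append((match.group(1), match.group(2)))
    return pairs


# Made-up sample with a leading and an intermediate empty line.
sample = ["", "Jean B-firstname", "Dupont B-surname", "", "cultivateur O"]
pairs = parse_bio_lines(sample)
```

Skipping blank lines keeps the reported line index aligned with the input, so error messages for genuinely malformed lines still point at the right place. If empty lines are meant to mark sentence or document boundaries, the parser would need to record them rather than drop them.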
Edited by Mélodie Boillet