Evaluate predictions with nerval
Depends on #229 (closed) and ner/nerval#11 (closed).
We'll use the new dan.bio module to evaluate batches. We will support a new metric name, ner, to trigger that new computation. Only trigger this behaviour when NER tokens are defined.
We will compute scores for each prediction during evaluation and store them, then compute averages per split and display them in a single Markdown table. We need scores per split, per entity, and macro-averaged. To evaluate, we will use the code from ner/nerval#11 (closed).
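The macro-averaging step could look like the sketch below. The score layout and the `macro_average` helper are illustrative assumptions, not the actual DAN implementation:

```python
# Sketch of macro-averaging, assuming per-entity scores are stored as
# {entity: {"precision": ..., "recall": ..., "f1": ...}} for one split.
# All names here are hypothetical.

def macro_average(scores: dict) -> dict:
    """Average each metric over all entity types (unweighted)."""
    metrics = ("precision", "recall", "f1")
    return {
        m: sum(entity[m] for entity in scores.values()) / len(scores)
        for m in metrics
    }

per_entity = {
    "PER": {"precision": 1.0, "recall": 0.5, "f1": 2 / 3},
    "LOC": {"precision": 0.5, "recall": 1.0, "f1": 2 / 3},
}
print(macro_average(per_entity))
# → {'precision': 0.75, 'recall': 0.75, 'f1': 0.6666666666666666}
```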
Store predictions during the evaluation step (key = "str_x"; also keep "str_y") and return them so that we can aggregate them at the upper level.
For each split:
- compute the list of BIO strings for the prediction ("str_x")
- compute the same for the ground truth ("str_y")
- call nerval.evaluate.evaluate with nerval.parse.parse_bio applied to each list
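The per-split steps above could be sketched as follows. The `to_bio` helper is a hypothetical stand-in for the conversion the new dan.bio module will provide, and the nerval call is shown only as a comment, mirroring the step described above:

```python
# Hedged sketch of building the BIO strings for one sample. The real
# conversion lives in the new dan.bio module; names here are illustrative.

def to_bio(tokens_with_tags) -> str:
    """Render (token, tag) pairs as a BIO string, one token per line."""
    lines = []
    for token, tag in tokens_with_tags:
        bio = "O" if tag is None else tag  # untagged tokens are outside entities
        lines.append(f"{token} {bio}")
    return "\n".join(lines)

# Ground truth ("str_y") and prediction ("str_x") for one sample.
str_y = to_bio([("John", "B-PER"), ("lives", None), ("in", None), ("Paris", "B-LOC")])
str_x = to_bio([("John", "B-PER"), ("lives", None), ("in", None), ("Lyon", "B-LOC")])

# These BIO strings would then be parsed and scored per split, e.g.:
#   from nerval.evaluate import evaluate
#   from nerval.parse import parse_bio
#   scores = evaluate(parse_bio(str_y), parse_bio(str_x), threshold)
print(str_x)
```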
Expose the nerval threshold as a new CLI argument, defaulting to 0.3.
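A minimal sketch of that CLI argument, assuming an argparse-based entry point (the actual flag name in DAN may differ):

```python
import argparse

# Hypothetical flag name; only the 0.3 default comes from this issue.
parser = argparse.ArgumentParser(description="Evaluate predictions")
parser.add_argument(
    "--nerval-threshold",
    type=float,
    default=0.3,  # default required by this issue
    help="Match threshold forwarded to nerval during NER evaluation",
)
args = parser.parse_args([])  # no CLI args: keeps the default
print(args.nerval_threshold)
# → 0.3
```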
Edited by Yoann Schneider