Add an argument for entity tokens during prediction

Entity tokens are currently hard-coded during prediction. These tokens are specific to the CICR dataset:

index = [pos for pos, char in enumerate(text) if char in ["", "", "", ""]]

We should add a new --tokens argument that takes the path to a tokens.yaml file, so that the entity tokens are read from that file instead of being hard-coded. Here is an example of tokens.yaml:

---
Accusations:
  start: "ⓐ"
  end: ""
Ages:
  start: "Ⓐ"
  end: ""
Arrêtés de la chambre:
  start: "Ⓣ"
  end: ""
Dates des arrêts:
  start: "Ⓓ"
  end: ""
Demeures:
  start: "ⓓ"
  end: ""
Juridictions:
  start: "Ⓙ"
  end: ""
N° de carton:
  start: "Ⓒ"
  end: ""
N° de registre:
  start: "Ⓡ"
  end: ""
Noms des accusés (Prénoms):
  start: "Ⓝ"
  end: ""
Noms des accusés (Prénoms) epouse:
  start: "Ⓔ"
  end: ""
Peines prononcées par les sentences:
  start: "Ⓟ"
  end: ""
Qualités:
  start: "Ⓠ"
  end: ""
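
A minimal sketch of how the proposed argument could be wired in, assuming the prediction command uses argparse and that PyYAML is available for parsing. The function names (build_parser, load_tokens, entity_characters, token_positions) are hypothetical and only illustrate the idea; they are not taken from the existing code base.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the proposed --tokens argument."""
    parser = argparse.ArgumentParser(description="Prediction (sketch)")
    parser.add_argument(
        "--tokens",
        type=Path,
        default=None,
        help="Path to a tokens.yaml file describing the entity tokens.",
    )
    return parser


def load_tokens(path: Path) -> dict:
    """Parse tokens.yaml into a mapping of entity name -> start/end tokens."""
    # Assumption: PyYAML is installed. It is imported here so that the
    # helpers below stay usable without it.
    import yaml

    with path.open(encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def entity_characters(tokens: dict) -> set:
    """Collect every non-empty start and end token from the parsed mapping."""
    chars = set()
    for entity in tokens.values():
        for key in ("start", "end"):
            value = entity.get(key)
            if value:  # skip empty end tokens such as end: ""
                chars.add(value)
    return chars


def token_positions(text: str, chars: set) -> list:
    """Same lookup as the hard-coded version, driven by the loaded tokens."""
    return [pos for pos, char in enumerate(text) if char in chars]
```

With the mapping {"Ages": {"start": "Ⓐ", "end": ""}, "Demeures": {"start": "ⓓ", "end": ""}}, calling token_positions("Ⓐ42 ⓓParis", entity_characters(mapping)) returns [0, 4]. Empty end tokens are skipped, so they never match a character in the text.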