Add an argument for entity tokens during prediction
Entity tokens are currently hard-coded during prediction. These tokens are specific to the CICR dataset:
index = [pos for pos, char in enumerate(text) if char in ["ⓝ", "ⓟ", "ⓓ", "ⓡ"]]
We should add a new --tokens argument that takes the tokens.yaml file. Here is an example of tokens.yaml:
---
Accusations:
  start: "ⓐ"
  end: ""
Ages:
  start: "Ⓐ"
  end: ""
Arrêtés de la chambre:
  start: "Ⓣ"
  end: ""
Dates des arrêts:
  start: "Ⓓ"
  end: ""
Demeures:
  start: "ⓓ"
  end: ""
Juridictions:
  start: "Ⓙ"
  end: ""
N° de carton:
  start: "Ⓒ"
  end: ""
N° de registre:
  start: "Ⓡ"
  end: ""
Noms des accusés (Prénoms):
  start: "Ⓝ"
  end: ""
Noms des accusés (Prénoms) epouse:
  start: "Ⓔ"
  end: ""
Peines prononcées par les sentences:
  start: "Ⓟ"
  end: ""
Qualités:
  start: "Ⓠ"
  end: ""
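
A possible shape for the change, as a rough sketch only. The function names (build_parser, load_tokens, entity_chars, entity_positions) are hypothetical and not taken from the existing code; only the --tokens argument and the file layout come from this issue. It assumes PyYAML is available for reading the file, that every token is a single character, and that non-empty end tokens should be matched as well as start tokens.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only --tokens comes from this issue."""
    parser = argparse.ArgumentParser(description="Prediction (sketch)")
    parser.add_argument(
        "--tokens",
        type=Path,
        default=None,
        help="Path to a tokens.yaml file mapping entity names to start/end tokens.",
    )
    return parser


def load_tokens(path: Path) -> dict:
    """Read tokens.yaml into a dict. Assumes PyYAML is installed."""
    import yaml  # imported lazily so the helpers below work without it

    with path.open(encoding="utf-8") as f:
        return yaml.safe_load(f)


def entity_chars(tokens: dict) -> set:
    """Collect every non-empty start and end token from the mapping."""
    chars = set()
    for entity in tokens.values():
        for key in ("start", "end"):
            token = entity.get(key, "")
            if token:  # skip empty end tokens such as end: ""
                chars.add(token)
    return chars


def entity_positions(text: str, tokens: dict) -> list:
    """Same comprehension as the hard-coded version, driven by the YAML content."""
    chars = entity_chars(tokens)
    return [pos for pos, char in enumerate(text) if char in chars]
```

With this, the hard-coded list would be replaced by something like entity_positions(text, load_tokens(args.tokens)). What should happen when --tokens is omitted (fall back to the CICR tokens, or skip entity handling) is left open here.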