Create two charsets for HTR and NER tokens
teklia-dan dataset format currently creates a charset.pkl file that contains every character and NER token. It would be useful to create two different charsets:
- charset_htr.pkl, containing only characters, punctuation, etc.
- charset_ner.pkl, containing only NER tokens
To do that, we could add a new --tokens argument to the format subcommand, pointing to the entity token mapping.
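
A rough sketch of what the split could look like once the NER tokens are known. The function names (split_charset, save_charsets) are hypothetical and not part of the current code. It also assumes the NER tokens have already been read from whatever --tokens points to and are passed in as a plain list of characters.

```python
import pickle
from pathlib import Path


def split_charset(charset, ner_tokens):
    """Split a combined charset into (HTR characters, NER tokens).

    Hypothetical helper: `charset` is any iterable of characters and
    `ner_tokens` is the list of entity tokens taken from the mapping.
    """
    ner = set(ner_tokens)
    charset_htr = sorted(c for c in charset if c not in ner)
    charset_ner = sorted(c for c in charset if c in ner)
    return charset_htr, charset_ner


def save_charsets(charset, ner_tokens, output_dir):
    """Write charset_htr.pkl and charset_ner.pkl into `output_dir`."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    charset_htr, charset_ner = split_charset(charset, ner_tokens)
    (output_dir / "charset_htr.pkl").write_bytes(pickle.dumps(charset_htr))
    (output_dir / "charset_ner.pkl").write_bytes(pickle.dumps(charset_ner))
    return charset_htr, charset_ner
```

Only tokens that actually appear in the dataset end up in charset_ner.pkl here. Whether unused tokens from the mapping should be included as well is an open question.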
Other things to update:
- Prediction (loading the charset)
- DAN worker (loading the charset)
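
For the loading side, something along these lines might work in both places. This is only a sketch: load_charsets is a hypothetical name, and the fallback to the old charset.pkl is an assumption about how we would want to handle existing models, not something decided yet.

```python
import pickle
from pathlib import Path


def load_charsets(model_dir):
    """Return (charset_htr, charset_ner) as two lists.

    Hypothetical loader: prefers the two new files and falls back to
    the legacy combined charset.pkl, in which case the NER tokens
    cannot be told apart and the second list is left empty.
    """
    model_dir = Path(model_dir)
    htr_path = model_dir / "charset_htr.pkl"
    ner_path = model_dir / "charset_ner.pkl"
    if htr_path.exists() and ner_path.exists():
        charset_htr = pickle.loads(htr_path.read_bytes())
        charset_ner = pickle.loads(ner_path.read_bytes())
        return list(charset_htr), list(charset_ner)
    # Assumed legacy behaviour: a single file holding everything.
    combined = pickle.loads((model_dir / "charset.pkl").read_bytes())
    return list(combined), []
```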
Edited by Solene Tarride