Create two charsets for HTR and NER tokens

teklia-dan dataset format currently creates a charset.pkl file that contains every character and NER token. It would be useful to create two different charsets:

  • charset_htr.pkl containing only characters, punctuation, etc
  • charset_ner.pkl containing only NER tokens

To do that, we could add a new --tokens argument to the format subcommand that provides the entity token mapping.
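As a rough illustration of the split, here is a minimal sketch. It assumes the combined charset is an iterable of symbols and that the NER tokens are available as a collection of strings once the token mapping has been read. The helper names `split_charset` and `save_charsets` are hypothetical and are not part of teklia-dan; the actual layout of the token mapping is not specified here.

```python
import pickle
from pathlib import Path


def split_charset(charset, ner_tokens):
    """Split a combined charset into HTR symbols and NER tokens.

    Hypothetical helper: `charset` is assumed to be an iterable of symbols
    and `ner_tokens` a collection of NER token strings. Order is preserved.
    """
    ner_tokens = set(ner_tokens)
    charset_htr = [symbol for symbol in charset if symbol not in ner_tokens]
    charset_ner = [symbol for symbol in charset if symbol in ner_tokens]
    return charset_htr, charset_ner


def save_charsets(output_dir, charset_htr, charset_ner):
    """Write the two charsets as charset_htr.pkl and charset_ner.pkl."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    files = (("charset_htr.pkl", charset_htr), ("charset_ner.pkl", charset_ner))
    for name, symbols in files:
        with (output_dir / name).open("wb") as handle:
            pickle.dump(symbols, handle)
```

Prediction and the DAN worker would then load both files instead of the single charset.pkl.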

Other things to update:

  • Prediction (loading the charset)
  • DAN worker (loading the charset)
Edited by Solene Tarride