Charset should only include training characters

Merged Manon Blanco requested to merge training-charset into main
@@ -18,6 +18,7 @@ If an image download fails for whatever reason, it won't appear in the transcrip
| `--output` | Folder where the data will be generated. | `Path` | |
| `--load-entities` | Extract text with their entities. Needed for NER tasks. | `bool` | `False` |
| `--entity-separators` | Removes all text that does not appear in an entity or in the list of given ordered characters. If several separators follow each other, keep only the first to appear in the list. Do not give any arguments to keep the whole text. | `str` | (see [dedicated section](#examples)) |
| `--unknown-token` | Token used to replace any character in the validation/test sets that does not appear in the training set. | `str` | `?` |
| `--tokens` | Mapping between starting tokens and end tokens. Needed for NER tasks. | `Path` | |
| `--train-folder` | ID of the training folder to import from Arkindex. | `uuid` | |
| `--val-folder` | ID of the validation folder to import from Arkindex. | `uuid` | |
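
The behavior described for `--unknown-token` can be illustrated with a minimal sketch. It assumes the charset is simply the set of characters seen in the training transcriptions; the function names `build_charset` and `replace_unknown` are hypothetical and are not taken from the project's code.

```python
def build_charset(train_texts):
    """Collect every character seen in the training transcriptions."""
    return set("".join(train_texts))


def replace_unknown(text, charset, unknown_token="?"):
    """Replace each character that is missing from the training charset.

    `unknown_token` mirrors the documented default of `--unknown-token`.
    """
    return "".join(char if char in charset else unknown_token for char in text)


# Hypothetical data: the training set only contains the characters a, b and c.
train_texts = ["abc", "cab"]
charset = build_charset(train_texts)

# A validation/test transcription with two characters unseen during training.
cleaned = replace_unknown("abcd!", charset)
```

Here `cleaned` keeps `a`, `b` and `c` and substitutes the token for `d` and `!`, so the charset stays limited to training characters.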