Compute dataset statistics after extraction/formatting
Write a script to compute and display dataset statistics.
- compute min/mean/max number of tokens for each page (useful to set max_char_prediction)
- compute min/mean/max occurrences of each entity (useful to detect potential transcription issues)
The script could be called with teklia-dan dataset analyze
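
As a starting point for discussion, here is a minimal sketch of the two statistics. It is not the actual implementation and rests on several assumptions: the labels are available as a dict mapping a page name to its transcription, tokens are counted at character level, each entity is marked in the text by a single dedicated start character, and "occurrences of each entity" means occurrences per page. The function names (`summarize`, `page_length_stats`, `entity_stats`) are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize(values):
    """Return min/mean/max of a non-empty list of numbers."""
    # Note: min()/max() raise ValueError on an empty list, so an empty
    # dataset has to be handled by the caller.
    return {"min": min(values), "mean": mean(values), "max": max(values)}


def page_length_stats(labels):
    """Min/mean/max number of tokens per page.

    Assumption: one token per character of the transcription.
    """
    return summarize([len(text) for text in labels.values()])


def entity_stats(labels, entity_tokens):
    """Min/mean/max occurrences per page, for each entity.

    Assumption: every entity is introduced by a single start character
    listed in `entity_tokens`.
    """
    counts = {token: [] for token in entity_tokens}
    for text in labels.values():
        page_counts = Counter(text)
        for token in entity_tokens:
            # Counter returns 0 for characters absent from the page.
            counts[token].append(page_counts[token])
    return {token: summarize(values) for token, values in counts.items()}


if __name__ == "__main__":
    # Made-up example data, only to show the expected input shape.
    labels = {
        "page_1.jpg": "ⓐAlice ⓑParis",
        "page_2.jpg": "ⓐBob ⓐCarol",
    }
    print(page_length_stats(labels))
    print(entity_stats(labels, ["ⓐ", "ⓑ"]))
```

An entity whose minimum is 0 or whose maximum is far from its mean would be a candidate for a closer look at the transcriptions.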