Compute dataset statistics after extraction/formatting

Write a script to compute and display dataset statistics.

  • compute the min/mean/max number of tokens per page (useful to set max_char_prediction)
  • compute the min/mean/max number of occurrences of each entity (useful to detect potential transcription issues)
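
The two statistics above could be sketched as follows. This is only an illustration under assumptions, not the actual implementation: the function names (summarize, token_stats, entity_stats) and the input layout (a mapping from page id to a list of tokens, and a mapping from page id to a list of entity names) are hypothetical, and entity occurrences are assumed to be counted per page.

```python
from collections import Counter
from statistics import mean


def summarize(values):
    """Return the min/mean/max of a non-empty sequence of numbers."""
    return {"min": min(values), "mean": mean(values), "max": max(values)}


def token_stats(pages):
    """Min/mean/max number of tokens per page.

    `pages` is a hypothetical mapping: page id -> list of tokens.
    """
    return summarize([len(tokens) for tokens in pages.values()])


def entity_stats(page_entities):
    """Min/mean/max number of occurrences of each entity per page.

    `page_entities` is a hypothetical mapping: page id -> list of entity
    names found on that page. A page without a given entity counts as 0.
    """
    per_page = [Counter(entities) for entities in page_entities.values()]
    names = sorted({name for counter in per_page for name in counter})
    return {
        name: summarize([counter[name] for counter in per_page])
        for name in names
    }


if __name__ == "__main__":
    # Toy data, for illustration only.
    pages = {"page_1": ["a", "b", "c"], "page_2": ["a"]}
    page_entities = {"page_1": ["name", "name", "date"], "page_2": ["name"]}
    print("Tokens per page:", token_stats(pages))
    for entity, stats in entity_stats(page_entities).items():
        print(f"{entity}: {stats}")
```

An entity whose minimum is 0 or whose maximum is far above its mean would be a candidate for a closer look at the transcriptions.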

The script could be called with the command teklia-dan dataset analyze.