DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition

Documentation

For more details about this package, see the documentation available at https://teklia.gitlab.io/atr/dan/.

Installation

To use DAN in your own scripts, install it with pip from a local checkout of the repository:

pip install -e .
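
Since this is an editable install from the current directory, clone the repository first. A minimal sketch (the clone URL is inferred from the documentation address and may differ):

git clone https://gitlab.com/teklia/atr/dan.git
cd dan
pip install -e .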

Inference

To apply DAN to an image, first add a few imports and load an image. Note that the image must be in RGB.

import cv2
from dan.predict import DAN

# Read the image with OpenCV and convert it from BGR to RGB.
image = cv2.cvtColor(cv2.imread(IMAGE_PATH), cv2.COLOR_BGR2RGB)

Then one can initialize and load the trained model with the parameters used during training.

model_path = 'model.pt'          # trained model weights
params_path = 'parameters.yml'   # parameters used during training
charset_path = 'charset.pkl'     # character set built from the training data

model = DAN('cpu')
model.load(model_path, params_path, charset_path, mode="eval")

To run the inference on a GPU, one can replace cpu with the name of the device, for instance (device naming is assumed to follow PyTorch conventions):
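
model = DAN('cuda:0')  # hypothetical device name; use the GPU visible on your machine

In the end, one can run the prediction: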

text, confidence_scores = model.predict(image, confidences=True)
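
The loaded model can be reused across images. A minimal sketch that transcribes every JPEG in a folder (the directory name is a placeholder):

import glob

for path in sorted(glob.glob("pages/*.jpg")):
    image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    text, confidence_scores = model.predict(image, confidences=True)
    print(path, text)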

Training

This package provides three subcommands. To get more information about any subcommand, use the --help option.

Data extraction from Arkindex

Use the teklia-dan dataset extract command to extract a dataset from Arkindex. This will generate the images and the labels needed to train a DAN model. The available arguments are:

| Parameter | Description | Type | Default |
| --------- | ----------- | ---- | ------- |
| --parent | UUID of the folder to import from Arkindex. You may specify multiple UUIDs. | str/uuid | |
| --element-type | Type of the elements to extract. You may specify multiple types. | str | |
| --parent-element-type | Type of the parent element containing the data. | str | page |
| --output | Folder where the data will be generated. | Path | |
| --load-entities | Extract text with their entities. Needed for NER tasks. | bool | False |
| --tokens | Mapping between starting tokens and end tokens. Needed for NER tasks. | Path | |
| --use-existing-split | Use the specified folder IDs for the dataset split. | bool | |
| --train-folder | ID of the training folder to import from Arkindex. | uuid | |
| --val-folder | ID of the validation folder to import from Arkindex. | uuid | |
| --test-folder | ID of the testing folder to import from Arkindex. | uuid | |
| --transcription-worker-version | Filter transcriptions by worker_version. Use 'manual' for manual filtering. | str/uuid | |
| --entity-worker-version | Filter transcription entities by worker_version. Use 'manual' for manual filtering. | str/uuid | |
| --train-prob | Proportion of elements used for the training split. | float | 0.7 |
| --val-prob | Proportion of elements used for the validation split. | float | 0.15 |

The --tokens argument expects a file with the following format.

---
INTITULE:
  start: 
  end: 
DATE:
  start: 
  end: 
COTE_SERIE:
  start: 
  end: 
ANALYSE_COMPL.:
  start: 
  end: 
PRECISIONS_SUR_COTE:
  start: 
  end: 
COTE_ARTICLE:
  start: 
  end: 
CLASSEMENT:
  start: 
  end: 
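
For NER training, these start and end tokens presumably delimit entity spans inside the transcriptions. A minimal sketch of that idea, assuming PyYAML is installed and the start/end values in tokens.yml have been filled in (the tag_entity helper is hypothetical, not part of teklia-dan):

import yaml

with open("tokens.yml") as f:
    tokens = yaml.safe_load(f)

def tag_entity(text, name):
    # Wrap an entity span with its start and end tokens.
    return f"{tokens[name]['start']}{text}{tokens[name]['end']}"

# e.g. tag_entity("1858", "DATE") inserts the DATE tokens around "1858"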

To extract HTR+NER data from the pages of a folder, use the following command:

teklia-dan dataset extract \
    --parent 665e84ea-bd97-4912-91b0-1f4a844287ff \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml

where tokens.yml has the content described above.

To do the same but only use the data from three folders, the command becomes:

teklia-dan dataset extract \
    --parent 2275529a-1ec5-40ce-a516-42ea7ada858c af9b38b5-5d95-417d-87ec-730537cb1898 6ff44957-0e65-48c5-9d77-a178116405b2 \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml

To use the data from three folders as the training, validation and testing datasets respectively, the command becomes:

teklia-dan dataset extract \
    --use-existing-split \
    --train-folder 2275529a-1ec5-40ce-a516-42ea7ada858c \
    --val-folder af9b38b5-5d95-417d-87ec-730537cb1898 \
    --test-folder 6ff44957-0e65-48c5-9d77-a178116405b2 \
    --element-type page \
    --output data \
    --load-entities \
    --tokens tokens.yml

To extract HTR data from text_zone and annotation elements of a folder that are children of single_page elements, use the following command:

teklia-dan dataset extract \
    --parent 48852284-fc02-41bb-9a42-4458e5a51615 \
    --element-type text_zone annotation \
    --parent-element-type single_page \
    --output data

Dataset formatting

Use the teklia-dan dataset format command to format a dataset. This will generate two important files to train a DAN model:

  • labels.json
  • charset.pkl

The available arguments are:

| Parameter | Description | Type | Default |
| --------- | ----------- | ---- | ------- |
| --dataset | Path to the folder containing the dataset. | str | |
| --image-format | Format under which the images were generated. | str | |
| --keep-spaces | Transcriptions are trimmed by default. Use this flag to disable this behaviour. | bool | |

For example:

teklia-dan dataset format \
    --dataset path/to/dataset \
    --image-format png

The created files will be stored at the root of your dataset.
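
To sanity-check the generated files, one can load them directly. A minimal sketch (the dataset path is a placeholder, and the printed structure of labels.json is an assumption):

import json
import pickle

with open("path/to/dataset/charset.pkl", "rb") as f:
    charset = pickle.load(f)
print(len(charset), "characters in the charset")

with open("path/to/dataset/labels.json") as f:
    labels = json.load(f)
print(type(labels), "loaded from labels.json")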

Model training

Use the teklia-dan train command to train a DAN model. It accepts multiple arguments; use the --help option to list them.

Remarks (for pre-training and training)

All hyperparameters are specified and editable in the training scripts, where they are documented in comments.
Evaluation is performed right after training ends; training stops when the maximum elapsed time is reached or after the maximum number of epochs specified in the training script.
The output files are split into two subfolders: "checkpoints" and "results".

  • "checkpoints" contains the model weights for the last trained epoch and for the epoch with the best validation CER.
  • "results" contains the TensorBoard logs for the loss and metrics, as well as a text file listing the hyperparameters used and the evaluation results.

Synthetic data generation

Use the teklia-dan generate command to generate synthetic data. It accepts multiple arguments; use the --help option to list them.