Training configuration
All hyperparameters are specified and editable in the training scripts (their meaning is given in comments). This page describes the most useful keys.
Dataset parameters
Parameter | Description | Type | Default |
dataset_name | Name of the dataset. | str | |
dataset_level | Level of the dataset. Should be named after the element type. | str | |
dataset_variant | Variant of the dataset. Usually empty for HTR datasets, "_sem" for HTR+NER datasets. | str | |
dataset_path | Path to the dataset. | str | |
dataset_params.config.dataset_manager | Dataset manager class. | custom class | OCRDatasetManager |
dataset_params.config.dataset_class | Dataset class. | custom class | OCRDataset |
dataset_params.config.datasets | Dataset dictionary with the dataset name as key and dataset path as value. | dict | |
dataset_params.config.load_in_memory | Load all images in CPU memory. | bool | True |
dataset_params.config.worker_per_gpu | Number of parallel processes per GPU for data loading. | int | 4 |
dataset_params.config.width_divisor | Factor to reduce the width of the feature vector before feeding the decoder. | int | 8 |
dataset_params.config.height_divisor | Factor to reduce the height of the feature vector before feeding the decoder. | int | 32 |
dataset_params.config.padding_value | Image padding value. | int | 0 |
dataset_params.config.padding_token | Transcription padding value. | int | None |
dataset_params.config.charset_mode | Whether to add end-of-transcription and start-of-transcription tokens to charset. | str | seq2seq |
dataset_params.config.constraints | Whether to add end-of-transcription and start-of-transcription tokens in labels. | list | ["add_eot", "add_sot"] |
dataset_params.config.normalize | Normalize with mean and variance of training dataset. | bool | True |
dataset_params.config.preprocessings | List of pre-processing functions to apply to input images. | list | (see dedicated section) |
dataset_params.config.augmentation | Configuration for data augmentation. | dict | (see dedicated section) |
dataset_params.config.synthetic_data | Configuration to generate synthetic data. | dict | (see dedicated section) |
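The dotted names in the table presumably map to nested dictionary keys in the training script. The sketch below illustrates that reading with a plain nested `dict`. `get_by_path` is a hypothetical helper written for this example, not part of DAN, and the fragment only copies a few values from the table.

```python
from functools import reduce
from typing import Any


def get_by_path(config: dict, dotted_key: str) -> Any:
    """Return the value stored under a dotted key such as 'a.b.c'."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), config)


# Illustrative fragment only: a few values copied from the table above.
config = {
    "dataset_params": {
        "config": {
            "load_in_memory": True,
            "worker_per_gpu": 4,
            "charset_mode": "seq2seq",
            "constraints": ["add_eot", "add_sot"],
        }
    }
}

workers = get_by_path(config, "dataset_params.config.worker_per_gpu")
print(workers)
```

A missing key raises `KeyError`, which is usually what you want when checking a configuration for typos.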
Data preprocessing
Preprocessing is applied before training the network (see dan/manager/).
The following transformations are implemented:
- DPI adjustment
    {
        "type": "dpi",
        "source": 300,
        "target": 150,
    }
- Convert to grayscale
    {"type": "to_grayscaled"}
- Convert to RGB
    {"type": "to_RGB"}
- Resize to a fixed height
    {
        "type": "fixed_height",
        "fixed_height": 1000,
    }
- Resize to a maximum size
    {
        "type": "resize",
        "keep_ratio": True,
        "max_height": 1000,
        "max_width": None,
    }
Multiple transformations can be combined. For example, to resize an image to a fixed height of 1000 pixels and convert it to RGB, use the following configuration in dataset_params.config.preprocessings:
    [
        {"type": "fixed_height", "fixed_height": 1000},
        {"type": "to_RGB"},
    ]
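To see what a chain of preprocessing steps does to image dimensions, the arithmetic can be sketched in plain Python. This is not DAN's implementation: `output_size` is a hypothetical helper, and the behaviour of `resize` (shrink only, aspect ratio kept) is an assumption made for the example.

```python
def output_size(width: int, height: int, op: dict) -> tuple[int, int]:
    """Return the (width, height) an image would have after one step."""
    kind = op["type"]
    if kind == "dpi":
        # Scale both sides by the ratio of target to source resolution.
        scale = op["target"] / op["source"]
        return round(width * scale), round(height * scale)
    if kind == "fixed_height":
        # Height is forced; width follows to keep the aspect ratio.
        scale = op["fixed_height"] / height
        return round(width * scale), op["fixed_height"]
    if kind == "resize":
        # Assumption: the image is only ever shrunk and the ratio is kept.
        max_w = op.get("max_width") or width
        max_h = op.get("max_height") or height
        scale = min(1.0, max_w / width, max_h / height)
        return round(width * scale), round(height * scale)
    # Color conversions ("to_grayscaled", "to_RGB") leave the size unchanged.
    return width, height


preprocessings = [
    {"type": "fixed_height", "fixed_height": 1000},
    {"type": "to_RGB"},
]

size = (3000, 2000)  # (width, height) of a made-up input image
for op in preprocessings:
    size = output_size(*size, op)
print(size)  # (1500, 1000)
```

Steps are applied in list order, so placing a resize before a DPI adjustment can yield a different final size than the reverse.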
Data augmentation
Augmentation transformations are applied on-the-fly during training to artificially increase data variability.
The following transformations are implemented in dan/
- Color inversion
- DPI adjusting
- Dilation and erosion
- Elastic distortion
- Reducing interline spacing
- Gaussian blur
- Gaussian noise
DAN also takes advantage of transforms from torchvision:
- ColorJitter
- GaussianBlur
- RandomCrop
- RandomPerspective
The following configuration is used by default when using the teklia-dan train document command. Data augmentation is applied with a probability of 0.9, and each transformation has a 0.1 probability of being used.
{
    "order": "random",
    "proba": 0.9,
    "augmentations": [
        {
            "type": "dpi",
            "proba": 0.1,
            "min_factor": 0.75,
            "max_factor": 1,
            "preserve_ratio": True,
        },
        {
            "type": "perspective",
            "proba": 0.1,
            "min_factor": 0,
            "max_factor": 0.4,
        },
        {
            "type": "elastic_distortion",
            "proba": 0.1,
            "min_alpha": 0.5,
            "max_alpha": 1,
            "min_sigma": 1,
            "max_sigma": 10,
            "min_kernel_size": 3,
            "max_kernel_size": 9,
        },
        {
            "type": "dilation_erosion",
            "proba": 0.1,
            "min_kernel": 1,
            "max_kernel": 3,
            "iterations": 1,
        },
        {
            "type": "color_jittering",
            "proba": 0.1,
            "factor_hue": 0.2,
            "factor_brightness": 0.4,
            "factor_contrast": 0.4,
            "factor_saturation": 0.4,
        },
        {
            "type": "gaussian_blur",
            "proba": 0.1,
            "min_kernel": 3,
            "max_kernel": 5,
            "min_sigma": 3,
            "max_sigma": 5,
        },
        {
            "type": "gaussian_noise",
            "proba": 0.1,
            "std": 0.5,
        },
        {
            "type": "sharpen",
            "proba": 0.1,
            "min_alpha": 0,
            "max_alpha": 1,
            "min_strength": 0,
            "max_strength": 1,
        },
    ],
}
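One plausible reading of the two probability levels is a global gate followed by an independent draw per transformation. The sketch below follows that reading. `pick_augmentations` is a hypothetical function, the sampling logic is an assumption rather than DAN's actual code, and the configuration is shortened to three entries.

```python
import random


def pick_augmentations(config: dict, rng: random.Random) -> list[str]:
    """Return the names of the transformations selected for one image."""
    # Global gate: skip augmentation entirely with probability 1 - proba.
    if rng.random() >= config["proba"]:
        return []
    candidates = list(config["augmentations"])
    if config["order"] == "random":
        rng.shuffle(candidates)
    # Each transformation is then kept independently with its own probability.
    return [aug["type"] for aug in candidates if rng.random() < aug["proba"]]


# Shortened version of the default configuration, for illustration only.
config = {
    "order": "random",
    "proba": 0.9,
    "augmentations": [
        {"type": "dpi", "proba": 0.1},
        {"type": "perspective", "proba": 0.1},
        {"type": "gaussian_blur", "proba": 0.1},
    ],
}

rng = random.Random(0)  # fixed seed so runs are repeatable
selected = pick_augmentations(config, rng)
print(selected)
```

Under this reading, the full default configuration (eight transformations at 0.1 each, gated at 0.9) would apply about 0.9 × 8 × 0.1 = 0.72 transformations per image on average.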
Synthetic data
In most cases, loading pre-trained weights from the RIMES model is sufficient for DAN to converge on a new dataset.
However, training DAN on documents written in other scripts (Arabic, Chinese, ...) can be more challenging. In such cases, it is useful to train DAN on a combination of real and synthetic documents. It is recommended to start training with 90% synthetic documents and then gradually decrease that proportion. Since synthetic documents are "easy" to recognize, they mostly help adapt the decoder to the new language. As training continues, the ratio of real documents increases, so that the encoder learns to extract relevant features from real documents.
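The gradual decrease can be pictured as a linear interpolation between a start and an end proportion. The sketch below assumes that form and uses the default values of the init_proba, end_proba and num_steps_proba keys listed in the table below. `synthetic_proportion` is a hypothetical function, not DAN's scheduler, and it ignores the start_scheduler_at_max_line option.

```python
def synthetic_proportion(
    step: int,
    init_proba: float = 0.9,
    end_proba: float = 0.2,
    num_steps_proba: int = 200_000,
) -> float:
    """Linearly move from init_proba to end_proba, then stay at end_proba."""
    # Clamp progress to [0, 1] so the value is constant after the last step.
    progress = min(max(step / num_steps_proba, 0.0), 1.0)
    return init_proba + (end_proba - init_proba) * progress


for step in (0, 50_000, 100_000, 200_000, 300_000):
    print(step, round(synthetic_proportion(step), 3))
```

Halfway through the schedule (step 100,000 with the defaults), roughly 55% of the training documents would be synthetic.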
The following configuration can be used by default. It must be defined in dataset_params.config.synthetic_data.
Name | Description | Type | Default |
font_path | Path to a directory containing fonts. | str | |
init_proba | Initial proportion of synthetic documents. | float | 0.9 |
end_proba | End proportion of synthetic documents. | float | 0.2 |
num_steps_proba | Number of steps during which the ratio decreases from init_proba to end_proba. | int | 200000 |
proba_scheduler_function | Scheduler function to decrease the ratio (see dan/). | dan.scheduler | linear_scheduler |
start_scheduler_at_max_line | Whether to start decreasing the ratio only after the curriculum reaches the maximum number of lines. | bool | True |
curriculum | Whether to use curriculum learning (increase number of lines in synthetic documents). | bool | True |
crop_curriculum | Whether to crop images under the last text line. | bool | True |
curr_start | Step from which curriculum starts. | int | 0 |
curr_step | Number of steps before increasing the number of lines for curriculum learning. | int | 10000 |
min_nb_lines | Initial number of lines for curriculum learning. | int | 1 |
max_nb_lines | Maximum number of lines for curriculum learning. | int | 10 |
padding_value | Padding value. | int | 255 |
config.background_color_default | Background color. | tuple | (255, 255, 255) |
config.background_color_eps | Epsilon for the background color. | int | 15 |
config.text_color_default | Text color. | tuple | (0, 0, 0) |
config.text_color_eps | Epsilon for the text color. | int | 15 |
config.font_size_min | Minimum font size. | int | 35 |
config.font_size_max | Maximum font size. | int | 45 |
config.color_mode | Color mode of synthetic documents. | str | "RGB" |
config.padding_left_ratio_min | Min ratio for padding on the left side. | float | 0.00 |
config.padding_left_ratio_max | Max ratio for padding on the left side. | float | 0.05 |
config.padding_right_ratio_min | Min ratio for padding on the right side. | float | 0.02 |
config.padding_right_ratio_max | Max ratio for padding on the right side. | float | 0.2 |
config.padding_top_ratio_min | Min ratio for padding at the top. | float | 0.02 |
config.padding_top_ratio_max | Max ratio for padding at the top. | float | 0.1 |
config.padding_bottom_ratio_min | Min ratio for padding at the bottom. | float | 0.02 |
config.padding_bottom_ratio_max | Max ratio for padding at the bottom. | float | 0.1 |
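The curriculum keys suggest that the number of text lines in synthetic documents grows in regular increments. The sketch below assumes one extra line every curr_step steps, starting at curr_start. `curriculum_nb_lines` is a hypothetical function, and the real increment rule may differ.

```python
def curriculum_nb_lines(
    step: int,
    curr_start: int = 0,
    curr_step: int = 10_000,
    min_nb_lines: int = 1,
    max_nb_lines: int = 10,
) -> int:
    """Return the maximum number of lines allowed at a given training step."""
    if step < curr_start:
        return min_nb_lines
    # Assumption: one more line for every curr_step steps completed.
    increments = (step - curr_start) // curr_step
    return min(min_nb_lines + increments, max_nb_lines)


for step in (0, 9_999, 10_000, 45_000, 90_000, 500_000):
    print(step, curriculum_nb_lines(step))
```

With the default values and this rule, the maximum of 10 lines is reached at step 90,000. With start_scheduler_at_max_line enabled, that is also when the proportion of synthetic documents would start to decrease.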
Model parameters
Name | Description | Type | Default |
model_params.models.encoder | Encoder class. | custom class | FCN_encoder |
model_params.models.decoder | Decoder class. | custom class | GlobalHTADecoder |
model_params.transfer_learning.encoder | Model to load for the encoder [state_dict_name, checkpoint_path, learnable, strict]. | list | ["encoder", "pretrained_models/", True, True] |
model_params.transfer_learning.decoder | Model to load for the decoder [state_dict_name, checkpoint_path, learnable, strict]. | list | ["decoder", "pretrained_models/", True, False] |
model_params.transfered_charset | Transfer learning of the decision layer based on charset of the model to transfer. | bool | True |
model_params.additional_tokens | For decision layer = [<eot>, ], only for transferred charset. | int | 1 |
model_params.input_channels | Number of channels of input image. | int | 3 |
model_params.dropout | Dropout probability in the encoder. | float | 0.5 |
model_params.enc_dim | Dimension of features extracted by the encoder. | int | 256 |
model_params.nb_layers | Number of layers in the encoder. | int | 5 |
model_params.h_max | Maximum height for encoder output (for 2D positional embedding). | int | 500 |
model_params.w_max | Maximum width for encoder output (for 2D positional embedding). | int | 1000 |
model_params.l_max | Maximum predicted sequence length (for 1D positional embedding). | int | 15000 |
model_params.dec_num_layers | Number of transformer decoder layers. | int | 8 |
model_params.dec_num_heads | Number of heads in transformer decoder layers. | int | 4 |
model_params.dec_res_dropout | Dropout probability in transformer decoder layers. | float | 0.1 |
model_params.dec_pred_dropout | Dropout rate before decision layer. | float | 0.1 |
model_params.dec_att_dropout | Dropout rate in multi head attention. | float | 0.1 |
model_params.dec_dim_feedforward | Number of dimensions for feedforward layer in transformer decoder layers. | int | 256 |
model_params.use_2d_pe | Whether to use 2D positional embedding. | bool | True |
model_params.use_1d_pe | Whether to use 1D positional embedding. | bool | True |
model_params.use_lstm | Whether to use a LSTM layer in the decoder. | bool | False |
model_params.attention_win | Length of attention window. | int | 100 |
model_params.dropout_scheduler.function | Curriculum dropout scheduler. | custom class | exponential_dropout_scheduler |
model_params.dropout_scheduler.T | Exponential factor. | float | 5e4 |
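The transfer_learning entries are positional lists, which are easy to get wrong. The sketch below gives the four positions names, following the order stated in the table ([state_dict_name, checkpoint_path, learnable, strict]). `TransferSpec` and `parse_transfer_learning` are hypothetical names, the comments on `learnable` and `strict` are interpretations, and "checkpoint.pt" is a placeholder path.

```python
from typing import NamedTuple


class TransferSpec(NamedTuple):
    state_dict_name: str  # which sub-module's weights to read from the checkpoint
    checkpoint_path: str  # file holding the pre-trained weights
    learnable: bool       # presumably: whether the loaded weights stay trainable
    strict: bool          # presumably: whether the weights must match exactly


def parse_transfer_learning(entry: list) -> TransferSpec:
    """Turn a 4-item transfer_learning list into a named record."""
    if len(entry) != 4:
        raise ValueError(f"expected 4 items, got {len(entry)}")
    return TransferSpec(*entry)


# "checkpoint.pt" is a placeholder, not a real DAN checkpoint name.
spec = parse_transfer_learning(["encoder", "checkpoint.pt", True, True])
print(spec.state_dict_name, spec.strict)
```

Checking the list length up front turns a silent misconfiguration into an immediate error.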
Training parameters
Name | Description | Type | Default |
training_params.output_folder | Directory for checkpoint and results. | str | |
training_params.max_nb_epochs | Maximum number of epochs before stopping training. | int | 800 |
training_params.max_training_time | Maximum time (in seconds) before stopping training. | int | 350000 |
training_params.load_epoch | Model to load. Should be either "best" (evaluation) or "last" (training). | str | "last" |
training_params.interval_save_weights | Step to save weights. Set to None to keep only best and last epochs. | int | None |
training_params.batch_size | Mini-batch size for the training loop. | int | 2 |
training_params.valid_batch_size | Mini-batch size for the validation loop. | int | 4 |
training_params.use_ddp | Whether to use DistributedDataParallel. | bool | False |
training_params.ddp_port | DDP port. | int | 20027 |
training_params.use_amp | Whether to enable automatic mixed precision. | bool | |
training_params.nb_gpu | Number of GPUs to train DAN. | int | torch.cuda.device_count() |
training_params.optimizers.all.class | Optimizer class. | custom class | Adam |
training_params.optimizers.all.args.lr | Learning rate for the optimizer. | float | 0.0001 |
training_params.optimizers.all.args.amsgrad | Whether to use AMSGrad optimization. | bool | False |
training_params.lr_schedulers | Learning rate schedulers. | custom class | None |
training_params.eval_on_valid | Whether to evaluate and log metrics on the validation set during training. | bool | True |
training_params.eval_on_valid_interval | Interval (in epochs) to evaluate during training. | int | 5 |
training_params.focus_metric | Metrics to focus on to determine best epoch. | str | cer |
training_params.expected_metric_value | Best value for the focus metric. Should be either "high" or "low". | str | low |
training_params.set_name_focus_metric | Dataset to focus on to select best weights. | str | |
training_params.train_metrics | List of metrics to compute during training. | list | ["loss_ce", "cer", "wer", "wer_no_punct"] |
training_params.eval_metrics | List of metrics to compute during validation. | list | ["cer", "wer", "wer_no_punct"] |
training_params.force_cpu | Whether to train on CPU (for debugging). | bool | False |
training_params.max_char_prediction | Maximum number of characters to predict. | int | 1000 |
training_params.label_noise_scheduler.min_error_rate | Minimum ratio of teacher forcing. | float | 0.2 |
training_params.label_noise_scheduler.max_error_rate | Maximum ratio of teacher forcing. | float | 0.2 |
training_params.label_noise_scheduler.total_num_steps | Number of steps before stopping teacher forcing. | float | 5e4 |
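Several training keys interact to decide when to evaluate, which checkpoint counts as "best", and when to stop. The sketch below shows one way these rules could fit together. The three functions are hypothetical, epochs are assumed to be counted from 1, and DAN's training loop may apply the rules differently.

```python
from typing import Optional


def should_evaluate(epoch: int, eval_on_valid: bool = True,
                    eval_on_valid_interval: int = 5) -> bool:
    """Evaluate on the validation set every eval_on_valid_interval epochs."""
    return eval_on_valid and epoch % eval_on_valid_interval == 0


def is_better(value: float, best: Optional[float],
              expected_metric_value: str = "low") -> bool:
    """Compare a new focus-metric value with the best one seen so far."""
    if best is None:  # first evaluation: anything is an improvement
        return True
    if expected_metric_value == "low":  # e.g. cer: lower is better
        return value < best
    if expected_metric_value == "high":
        return value > best
    raise ValueError('expected_metric_value must be "high" or "low"')


def should_stop(epoch: int, elapsed_seconds: float,
                max_nb_epochs: int = 800,
                max_training_time: int = 350_000) -> bool:
    """Stop when either the epoch budget or the time budget is exhausted."""
    return epoch >= max_nb_epochs or elapsed_seconds >= max_training_time


print(should_evaluate(5), is_better(0.08, 0.10), should_stop(12, 3_600))
```

For an error rate such as cer, expected_metric_value is "low", so a validation value of 0.08 replaces a previous best of 0.10.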
MLflow logging
To log your experiment on MLflow, update the following arguments.
Name | Description | Type | Default |
mlflow.run_id | Name of the current run in MLflow. | str | |
mlflow.s3_endpoint_url | URL of S3 endpoint. | str | |
mlflow.aws_access_key_id | Access key id to the AWS server. | str | |
mlflow.aws_secret_access_key | Secret access key to the AWS server. | str | |
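Credentials are better kept out of the training script. The sketch below fills the four mlflow keys from environment variables. `mlflow_config_from_env` is a hypothetical helper, and the variable names are an assumption chosen for this example (they follow common MLflow and AWS naming), not something DAN requires.

```python
import os
from typing import Mapping, Optional


def mlflow_config_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the mlflow section of the configuration from environment variables."""
    if env is None:
        env = os.environ
    # Variable names are assumptions for this sketch; missing ones become None.
    return {
        "run_id": env.get("MLFLOW_RUN_ID"),
        "s3_endpoint_url": env.get("MLFLOW_S3_ENDPOINT_URL"),
        "aws_access_key_id": env.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY"),
    }


# Fake values, passed explicitly so the example does not depend on the shell.
fake_env = {
    "MLFLOW_RUN_ID": "demo-run",
    "MLFLOW_S3_ENDPOINT_URL": "https://s3.example.org",
    "AWS_ACCESS_KEY_ID": "example-key-id",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
}
mlflow_config = mlflow_config_from_env(fake_env)
print(sorted(mlflow_config))
```

Passing the environment as an argument keeps the helper easy to test without touching real credentials.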