# Training configuration

To train a model, you need to write a JSON configuration file. The list of fields is described in the [next section](#dataset-parameters). An empty configuration file is available at `configs/quickstart.json`. You will need to fill in the paths.

## Dataset parameters

| Parameter                     | Description                                                                                                            | Type           | Default |
| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------- | -------------- | ------- |
| `dataset.max_char_prediction` | Maximum number of characters to predict.                                                                               | `int`          | `1000`  |
| `dataset.tokens`              | Path to a NER tokens configuration file similar to [the one used for extraction](../datasets/extract.md#description).  | `pathlib.Path` |         |

To determine the value to use for `dataset.max_char_prediction`, you can use the [analyze command](../datasets/analyze.md) to find the maximum number of characters in a label of the dataset.

!!! note
    You must replace the pseudo-variables `$dataset_name` and `$dataset_path` with the name and the relative/absolute path to your dataset, respectively.

## Model parameters

| Name                       | Description                                                                                                | Type   | Default |
| -------------------------- | ------------------------------------------------------------------------------------------------------------ | ------ | ------- |
| `model.transfered_charset` | Transfer learning of the decision layer, based on the charset of the model to transfer.                   | `bool` | `True`  |
| `model.additional_tokens`  | Number of additional tokens for the decision layer (e.g. `<eot>`); only used with a transferred charset.  | `int`  | `1`     |
| `model.h_max`              | Maximum height of the encoder output (for 2D positional embedding).                                       | `int`  | `500`   |
| `model.w_max`              | Maximum width of the encoder output (for 2D positional embedding).                                        | `int`  | `1000`  |

### Encoder

| Name                      | Description                         | Type    | Default |
| ------------------------- | ------------------------------------- | ------- | ------- |
| `model.encoder.dropout`   | Dropout probability in the encoder. | `float` | `0.5`   |
| `model.encoder.nb_layers` | Number of layers in the encoder.    | `int`   | `5`     |

### Decoder

| Name                                | Description                                                                    | Type    | Default |
| ----------------------------------- | -------------------------------------------------------------------------------- | ------- | ------- |
| `model.decoder.enc_dim`             | Dimension of the features extracted by the encoder.                           | `int`   | `256`   |
| `model.decoder.l_max`               | Maximum predicted sequence length (for 1D positional embedding).              | `int`   | `15000` |
| `model.decoder.dec_num_layers`      | Number of transformer decoder layers.                                         | `int`   | `8`     |
| `model.decoder.dec_num_heads`       | Number of heads in transformer decoder layers.                                | `int`   | `4`     |
| `model.decoder.dec_res_dropout`     | Dropout probability in transformer decoder layers.                            | `float` | `0.1`   |
| `model.decoder.dec_pred_dropout`    | Dropout rate before the decision layer.                                       | `float` | `0.1`   |
| `model.decoder.dec_att_dropout`     | Dropout rate in multi-head attention.                                          | `float` | `0.1`   |
| `model.decoder.dec_dim_feedforward` | Number of dimensions of the feedforward layer in transformer decoder layers.  | `int`   | `256`   |
| `model.decoder.attention_win`       | Length of the attention window.                                                | `int`   | `100`   |
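The dotted parameter names in these tables map to nested keys in the JSON configuration file (as in the W&B examples later on this page). As a minimal sketch, assuming this nesting and using placeholder values mostly taken from the defaults above (the tokens path is purely illustrative), the dataset and model blocks could look like:

```json
{
    "dataset": {
        "max_char_prediction": 1000,
        "tokens": "<path/to/tokens/file>"
    },
    "model": {
        "transfered_charset": true,
        "additional_tokens": 1,
        "encoder": {
            "dropout": 0.5,
            "nb_layers": 5
        },
        "decoder": {
            "enc_dim": 256,
            "l_max": 15000,
            "dec_num_layers": 8,
            "attention_win": 100
        }
    }
}
```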
### Language model

This assumes that you have already [trained a language model](../train/language_model.md).

| Name              | Description                                                                                                                                                 | Type    | Default |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------- | ------- |
| `model.lm.path`   | Path to the language model.                                                                                                                                 | `str`   |         |
| `model.lm.weight` | How much weight to give to the language model. It should be set carefully (usually between 0.5 and 2.0) as it will affect the quality of the predictions.  | `float` |         |

!!! note
    Linebreaks are treated as spaces by language models. As a result, predictions will not include linebreaks.

The `model.lm.path` argument expects a path to the language model, but the parent folder should also contain:

- a `lexicon.txt` file,
- a `tokens.txt` file.

You should get the following tree structure:

```
folder/
├── <model.lm.path> # Path to the language model
├── lexicon.txt
└── tokens.txt
```

## Training parameters

| Name                     | Description                                                                    | Type         | Default  |
| ------------------------ | -------------------------------------------------------------------------------- | ------------ | -------- |
| `training.output_folder` | Directory for checkpoints and results.                                         | `str`        |          |
| `training.max_nb_epochs` | Maximum number of epochs before stopping training.                             | `int`        | `800`    |
| `training.load_epoch`    | Model to load. Should be either `"best"` (evaluation) or `"last"` (training).  | `str`        | `"last"` |
| `training.lr_schedulers` | Learning rate schedulers.                                                       | custom class |          |

### Device

| Name                       | Description                                                                                                                   | Type   | Default |
| -------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | ------ | ------- |
| `training.device.use_ddp`  | Whether to use DistributedDataParallel.                                                                                       | `bool` | `False` |
| `training.device.ddp_port` | DDP port.                                                                                                                     | `int`  | `20027` |
| `training.device.use_amp`  | Whether to enable automatic mixed precision.                                                                                  | `bool` | `True`  |
| `training.device.nb_gpu`   | Number of GPUs used to train DAN. Set to `null` to use all available GPUs.                                                   | `int`  |         |
| `training.device.force`    | Use a specific device if available. Use `cpu` to train on CPU (for debugging) or `cuda`/`cuda:$gpu_device` to train on GPU.  | `str`  |         |

To train on several GPUs, simply set the `training.device.use_ddp` parameter to `True`. By default, the model will use all available GPUs. To restrict training to fewer GPUs, set the `training.device.nb_gpu` parameter, as in the sketch below.
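For example, assuming the same nesting convention, a device block enabling DDP on two GPUs could look like the following sketch (the number of GPUs is an illustrative value, the other values are the defaults above):

```json
{
    "training": {
        "device": {
            "use_ddp": true,
            "ddp_port": 20027,
            "use_amp": true,
            "nb_gpu": 2
        }
    }
}
```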
### Optimizers

| Name                                   | Description                          | Type    | Default  |
| -------------------------------------- | -------------------------------------- | ------- | -------- |
| `training.optimizers.all.args.lr`      | Learning rate for the optimizer.     | `float` | `0.0001` |
| `training.optimizers.all.args.amsgrad` | Whether to use AMSGrad optimization. | `bool`  | `False`  |

### Validation

| Name                                         | Description                                                                   | Type   | Default                    |
| -------------------------------------------- | -------------------------------------------------------------------------------- | ------ | -------------------------- |
| `training.validation.eval_on_valid`          | Whether to evaluate and log metrics on the validation set during training.   | `bool` | `True`                     |
| `training.validation.eval_on_valid_interval` | Interval (in epochs) at which to evaluate during training.                   | `int`  | `5`                        |
| `training.validation.eval_on_valid_start`    | Wait until this epoch before evaluating.                                     | `int`  | `0`                        |
| `training.validation.set_name_focus_metric`  | Dataset to focus on to select the best weights.                              | `str`  |                            |
| `training.validation.font`                   | Path to the font used in the logged images.                                  | `str`  | `fonts/LinuxLibertine.ttf` |
| `training.validation.maximum_font_size`      | Maximum font size used in the logged images.                                 | `int`  |                            |
| `training.validation.nb_logged_images`       | Number of images to log during validation.                                   | `int`  | `5`                        |
| `training.validation.limit_val_steps`        | Number of validation steps within an epoch.                                  | `int`  | `500`                      |

During the validation stage, the batch size is set to 1. This avoids issues caused by images of very different sizes within a batch, which would require significant padding and degrade performance.

### Metrics

| Name                     | Description                                   | Type   | Default                                                                     |
| ------------------------ | ----------------------------------------------- | ------ | --------------------------------------------------------------------------- |
| `training.metrics.train` | List of metrics to compute during training.   | `list` | `["loss_ce", "cer", "cer_no_token", "wer", "wer_no_punct", "wer_no_token"]` |
| `training.metrics.eval`  | List of metrics to compute during validation. | `list` | `["cer", "cer_no_token", "wer", "wer_no_punct", "wer_no_token"]`            |

### Label noise scheduler

| Name                                             | Description                                       | Type    | Default |
| ------------------------------------------------ | --------------------------------------------------- | ------- | ------- |
| `training.label_noise_scheduler.min_error_rate`  | Minimum ratio of teacher forcing.                 | `float` | `0.2`   |
| `training.label_noise_scheduler.max_error_rate`  | Maximum ratio of teacher forcing.                 | `float` | `0.2`   |
| `training.label_noise_scheduler.total_num_steps` | Number of steps before stopping teacher forcing.  | `float` | `5e4`   |

### Transfer learning

| Name                                 | Description                                                                                  | Type   | Default                                                            |
| ------------------------------------ | ------------------------------------------------------------------------------------------------ | ------ | ------------------------------------------------------------------ |
| `training.transfer_learning.encoder` | Model to load for the encoder, as `[state_dict_name, checkpoint_path, learnable, strict]`.  | `list` | `["encoder", "pretrained_models/dan_rimes_page.pt", True, True]`  |
| `training.transfer_learning.decoder` | Model to load for the decoder, as `[state_dict_name, checkpoint_path, learnable, strict]`.  | `list` | `["decoder", "pretrained_models/dan_rimes_page.pt", True, False]` |

### Data

| Name                              | Description                                                  | Type   | Default                                         |
| --------------------------------- | --------------------------------------------------------------- | ------ | ----------------------------------------------- |
| `training.data.batch_size`        | Mini-batch size for the training loop.                       | `int`  | `2`                                             |
| `training.data.load_in_memory`    | Load all images in CPU memory.                               | `bool` | `True`                                          |
| `training.data.worker_per_gpu`    | Number of parallel processes per GPU for data loading.       | `int`  | `4`                                             |
| `training.data.preprocessings`    | List of pre-processing functions to apply to input images.   | `list` | (see [dedicated section](#preprocessing))       |
| `training.data.augmentation`      | Whether to use data augmentation on the training set.        | `bool` | `True` (see [dedicated section](#augmentation)) |
| `training.data.limit_train_steps` | Number of training steps within an epoch.                    | `int`  | `500`                                           |
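As a hypothetical sketch, the keys above nest under `training.data`, with the pre-processing list (described in the [dedicated section](#preprocessing) below) placed under `preprocessings`; the values are placeholders taken from the defaults and examples on this page:

```json
{
    "training": {
        "data": {
            "batch_size": 2,
            "load_in_memory": true,
            "worker_per_gpu": 4,
            "augmentation": true,
            "preprocessings": [
                {
                    "type": "max_resize",
                    "max_height": 2000,
                    "max_width": 2000
                }
            ]
        }
    }
}
```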
#### Preprocessing

Preprocessing is applied before training the network (see the [dedicated references](../../ref/ocr/managers/dataset.md)). The list of accepted transforms is defined in the [dedicated references](../../ref/ocr/transforms.md#dan.ocr.transforms.Preprocessing).

Usage:

- Resize to a fixed height

```py
[
    {
        "type": "fixed_height_resize",
        "fixed_height": 1500,
    }
]
```

- Resize to a fixed width

```py
[
    {
        "type": "fixed_width_resize",
        "fixed_width": 1500,
    }
]
```

- Resize to a fixed width and a fixed height

```py
[
    {
        "type": "fixed_resize",
        "fixed_height": 1900,
        "fixed_width": 1250,
    }
]
```

- Resize to a maximum size (only if the image is bigger than the given size)

```py
[
    {
        "type": "max_resize",
        "max_height": 2000,
        "max_width": 2000,
    }
]
```

- Combine these pre-processings

```py
[
    {
        "type": "fixed_height_resize",
        "fixed_height": 2000,
    },
    {
        "type": "fixed_width_resize",
        "fixed_width": 2000,
    }
]
```

#### Augmentation

Augmentation transformations are applied on-the-fly during training to artificially increase data variability.

DAN takes advantage of transforms from [albumentations](https://albumentations.ai/). The following configuration is used by default when using the `teklia-dan train` command. Data augmentation is applied with a probability of 0.9; when it is applied, two transformations are randomly selected.

```py
transforms = A.Compose(
    [
        # Scale between 0.75 and 1.0
        RandomScale(scale_limit=[-0.25, 0], p=1, interpolation=cv2.INTER_AREA),
        A.SomeOf(
            [
                ErosionDilation(min_kernel=1, max_kernel=4, iterations=1),
                Perspective(scale=(0.05, 0.09), fit_output=True, p=0.4),
                GaussianBlur(sigma_limit=2.5, p=1),
                GaussNoise(var_limit=50**2, p=1),
                ColorJitter(contrast=0.2, brightness=0.2, saturation=0.2, hue=0.2, p=1),
                ElasticTransform(alpha=20.0, sigma=5.0, border_mode=0, p=1),
                Sharpen(alpha=(0.0, 1.0), p=1),
                Affine(shear={"x": (-20, 20), "y": (0, 0)}, p=1),
                CoarseDropout(p=1),
                ToGray(p=0.5),
            ],
            n=2,
            p=0.9,
        ),
    ],
    p=0.9,
)
```

For a detailed description of all augmentation transforms, see the [dedicated page](augmentation.md).

## MLFlow logging

To log your experiment on MLFlow, you need to:

- install the extra requirements via

```shell
$ pip install .[mlflow]
```

- update the following arguments:

| Name                           | Description                              | Type  | Default |
| ------------------------------ | ------------------------------------------ | ----- | ------- |
| `mlflow.run_id`                | ID of the current run in MLflow.         | `int` |         |
| `mlflow.run_name`              | Name of the current run in MLflow.       | `str` |         |
| `mlflow.s3_endpoint_url`       | URL of the S3 endpoint.                  | `str` |         |
| `mlflow.tracking_uri`          | URI of the tracking server.              | `str` |         |
| `mlflow.experiment_id`         | ID of the current experiment in MLFlow.  | `str` |         |
| `mlflow.aws_access_key_id`     | Access key ID for the AWS server.        | `str` |         |
| `mlflow.aws_secret_access_key` | Secret access key for the AWS server.    | `str` |         |
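As a hypothetical sketch, assuming the same nesting convention as the other blocks on this page (every value below is a placeholder, not a working endpoint or credential):

```json
{
    "mlflow": {
        "run_name": "<run_name>",
        "tracking_uri": "https://<mlflow_server>",
        "experiment_id": "<experiment_id>",
        "s3_endpoint_url": "https://<s3_endpoint>",
        "aws_access_key_id": "<access_key_id>",
        "aws_secret_access_key": "<secret_access_key>"
    }
}
```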
## Weights & Biases logging

To log your run on [Weights & Biases](https://wandb.ai/) (W&B), you need to:

- [login to W&B](https://docs.wandb.ai/ref/cli/wandb-login) via

```shell
wandb login
```

- update the following arguments:

| Name               | Description                                                                                                                                                               | Type   | Default |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ | ------- |
| `wandb.init`       | Keys and values to use to initialise your experiment on W&B. See the full list of available keys on [the official documentation](https://docs.wandb.ai/ref/python/init). | `dict` |         |
| `wandb.images`     | Whether to log images during validation with their predicted transcription.                                                                                              | `bool` | `False` |
| `wandb.inferences` | Whether to log inferences during evaluation.                                                                                                                              | `bool` | `False` |

Using W&B during DAN training will allow you to follow the DAN training with a W&B run. This run will automatically record:

- a **configuration** using the DAN training configuration. Any `wandb.init.config.*` keys and values found in the DAN training configuration will be added to the W&B run configuration.
- **metrics** listed in the `training.metrics` key of the DAN training configuration. To edit the metrics to log to W&B, see [the dedicated section](#metrics).
- **images** according to the `wandb.images` and `training.validation.*` keys of the DAN training configuration. To edit the images to log to W&B, see [the dedicated section](#validation).

### Resume run

To make sure that your DAN training produces only one W&B run even if the training has been resumed, we strongly recommend that you either reuse [the `--wandb` parameter of your `analyze` command](../datasets/analyze.md#weights-biases-logging) or define these two keys **before** starting your DAN training:

- `wandb.init.id` with a unique ID that has never been used on your W&B project. We recommend generating a random 8-character string of letters and numbers using [the Short Unique ID (UUID) Generating Library](https://shortunique.id/).
- `wandb.init.resume` with the value `auto`.

The final configuration should look like:

```json
{
    "wandb": {
        "init": {
            "id": "<unique_ID>",
            "resume": "auto"
        }
    }
}
```

Otherwise, W&B will create a new run for each DAN training session, even if the DAN training has been resumed.

### Offline mode

If you do not have Internet access during the DAN training, you can set the `wandb.init.mode` key to `offline` to use W&B's offline mode. W&B will create a `wandb` folder in the `training.output_folder` defined in the DAN training configuration. To use another location, see [the dedicated section](#training-parameters).

The final configuration should look like:

```json
{
    "wandb": {
        "init": {
            "mode": "offline"
        }
    }
}
```

Once your DAN training is complete, you can publish your W&B run with the [`wandb sync`](https://docs.wandb.ai/ref/cli/wandb-sync) command and **the `--append` parameter**:

```shell
wandb sync --project <wandb_project> --sync-all --append
```

If you prefer, you can publish your W&B run regularly using a script similar to:

```shell
#!/bin/bash

while :
do
    echo "[`date +%Y-%m-%d\ %H:%M:%S`] Publishing W&B runs...";
    wandb sync --project <wandb_project> --sync-all --append;
    echo "[`date +%Y-%m-%d\ %H:%M:%S`] W&B runs published.";

    # Publish W&B runs every 5 minutes
    sleep 5m
done
```

As in online mode, we recommend setting up run resumption for your W&B runs (see [the dedicated section](#resume-run)).