Compare revisions: atr/dan

Showing 214 additions and 61 deletions
@@ -144,7 +144,8 @@ def start_mlflow_run(config: dict):
     assert experiment_id, "Missing MLflow experiment ID in the configuration"
     # Start run
-    yield mlflow.start_run(
-        run_id=run_id, run_name=run_name, experiment_id=experiment_id
-    ), run_id is None
+    yield (
+        mlflow.start_run(run_id=run_id, run_name=run_name, experiment_id=experiment_id),
+        run_id is None,
+    )
     mlflow.end_run()
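Both versions yield the same two-element tuple, the active run and a flag telling whether a new run was created; the rewrite only makes the tuple explicit. A minimal sketch of how a caller might consume it, assuming `start_mlflow_run` is wrapped with `contextlib.contextmanager` (implied by the `yield`/`mlflow.end_run()` pattern, but not shown in this hunk):

```python
# Assumption: start_mlflow_run is decorated with @contextlib.contextmanager.
with start_mlflow_run(config) as (run, is_created):
    if is_created:
        print(f"Started new MLflow run {run.info.run_id}")
```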
@@ -363,7 +363,7 @@ def get_polygon(
     max_value: np.float32,
     offset: int,
     weights: np.ndarray,
-    size: Tuple[int, int] = None,
+    size: Tuple[int, int] | None = None,
     max_object_height: int = 50,
 ) -> Tuple[dict, np.ndarray]:
     """
@@ -5,7 +5,7 @@ import logging
 import pickle
 import re
 from pathlib import Path
-from typing import Dict, List, Optional, Tuple
+from typing import Dict, List, Tuple

 import numpy as np
 import torch
@@ -159,7 +159,7 @@ class DAN:
         word_separators: re.Pattern = parse_delimiters(["\n", " "]),
         line_separators: re.Pattern = parse_delimiters(["\n"]),
         tokens: Dict[str, EntityType] = {},
-        start_token: str = None,
+        start_token: str | None = None,
         max_object_height: int = 50,
     ) -> dict:
         """
@@ -426,7 +426,7 @@ def process_batch(
 def run(
-    image_dir: Optional[Path],
+    image_dir: Path,
     model: Path,
     output: Path,
     confidence_score: bool,
@@ -34,6 +34,12 @@ def train(rank, params, mlflow_logging=False):
     model = Manager(params)
     model.load_model()

+    if params["dataset"]["tokens"] is not None:
+        if "ner" not in params["training"]["metrics"]["train"]:
+            params["training"]["metrics"]["train"].append("ner")
+        if "ner" not in params["training"]["metrics"]["eval"]:
+            params["training"]["metrics"]["eval"].append("ner")
+
     if mlflow_logging:
         logger.info("MLflow logging enabled")
# -*- coding: utf-8 -*-
from pathlib import Path
from typing import Dict, List, Optional

import torch
from prettytable import MARKDOWN, PrettyTable
from torch.optim import Adam

from dan.ocr.decoder import GlobalHTADecoder
from dan.ocr.encoder import FCN_Encoder
from dan.ocr.transforms import Preprocessing

METRICS_TABLE_HEADER = {
    "cer": "CER (HTR-NER)",
    "cer_no_token": "CER (HTR)",
    "wer": "WER (HTR-NER)",
    "wer_no_token": "WER (HTR)",
    "wer_no_punct": "WER (HTR no punct)",
    "ner": "NER",
}
REVERSE_HEADER = {column: metric for metric, column in METRICS_TABLE_HEADER.items()}


def update_config(config: dict):
    """
@@ -51,3 +63,36 @@ def update_config(config: dict):
     # set nb_gpu if not present
     if config["training"]["device"]["nb_gpu"] is None:
         config["training"]["device"]["nb_gpu"] = torch.cuda.device_count()
+
+
+def create_metrics_table(metrics: List[str]) -> PrettyTable:
+    """
+    Create a Markdown table to display metrics (CER, WER, NER, etc.)
+    for each evaluated split.
+    """
+    table = PrettyTable(
+        field_names=["Split"]
+        + [title for metric, title in METRICS_TABLE_HEADER.items() if metric in metrics]
+    )
+    table.set_style(MARKDOWN)
+
+    return table
+
+
+def add_metrics_table_row(
+    table: PrettyTable, split: str, metrics: Optional[Dict[str, int | float]]
+) -> PrettyTable:
+    """
+    Add a row to an existing metrics Markdown table for the currently evaluated split.
+    To create such a table, please refer to the
+    [create_metrics_table][dan.ocr.utils.create_metrics_table] function.
+    """
+    row = [split]
+    for column in table.field_names:
+        if column not in REVERSE_HEADER:
+            continue
+
+        metric_name = REVERSE_HEADER[column]
+        row.append(metrics.get(metric_name, ""))
+
+    table.add_row(row)
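For orientation, a minimal sketch of how these two helpers combine (metric values are placeholders, not real results):

```python
# Build the evaluation table; only metrics listed here get a column.
table = create_metrics_table(["cer", "wer", "ner"])
for split, metrics in {
    "train": {"cer": 0.07, "wer": 0.21, "ner": 0.05},
    "val": {"cer": 0.09, "wer": 0.24, "ner": 0.06},
}.items():
    add_metrics_table_row(table, split, metrics)
print(table)  # the MARKDOWN style renders it as a Markdown table
```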
@@ -6,11 +6,12 @@ There are several steps to follow when training a DAN model.

 To extract the data, DAN uses an Arkindex export database in SQLite format. You will need to:

-1. Structure the data into folders (`train` / `val` / `test`) in [Arkindex](https://demo.arkindex.org/).
+1. Structure the data into splits (`train` / `val` / `test`) in a project dataset in [Arkindex](https://demo.arkindex.org/).
 1. [Export the project](https://doc.arkindex.org/howto/export/) in SQLite format.
 1. Extract the data with the [extract command](../usage/datasets/extract.md).
+1. Download images with the [download command](../usage/datasets/download.md).

-This command will extract and format the images and labels needed to train DAN. It will also tokenize the training corpus at character, subword, and word levels, allowing you to combine DAN with an explicit statistical language model to improve performance.
+These commands will extract and format the images and labels needed to train DAN. They will also tokenize the training corpus at character, subword, and word levels, allowing you to combine DAN with an explicit statistical language model to improve performance.

 At the end, you should get the following tree structure:
# Exceptions

::: dan.datasets.download.exceptions

# Image

::: dan.datasets.download.images

# Download

::: dan.datasets.download

# Utils

::: dan.datasets.download.utils

# Exceptions

::: dan.datasets.extract.exceptions
    options:
      show_source: false

# Utils

::: dan.ocr.utils
# Dataset download

## Description

Use the `teklia-dan dataset download` command to download the images of a dataset from a split extracted by DAN. This will:

- Generate the image of each element (in the `images/` folder),
- Create the mapping of the images that were correctly downloaded (identified by their paths) to the ground-truth transcriptions (with NER tokens if needed) (in the `labels.json` file).

If an image download fails for whatever reason, it won't appear in the transcriptions file, and the reason will be printed to stdout at the end of the process. Before trying to download an image, the command checks that it wasn't downloaded previously; it is thus safe to run this command twice if a few images failed.
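A minimal sketch of that skip-if-present behaviour (`fetch_iiif_image` and the surrounding names are hypothetical, not the actual implementation):

```python
from pathlib import Path


def fetch_iiif_image(iiif_url: str, destination: Path) -> None:
    """Placeholder for the real IIIF download logic."""
    ...


def download_images(urls: dict[str, str], output: Path) -> dict[str, str]:
    """Download each image unless it is already on disk; collect failure reasons."""
    failed = {}
    for element_id, iiif_url in urls.items():
        destination = output / "images" / f"{element_id}.jpg"
        if destination.exists():
            continue  # downloaded by a previous run, safe to skip
        try:
            fetch_iiif_image(iiif_url, destination)
        except Exception as reason:
            failed[element_id] = str(reason)  # reasons are printed at the end
    return failed
```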
| Parameter        | Description                                                                       | Type           | Default |
| ---------------- | --------------------------------------------------------------------------------- | -------------- | ------- |
| `--output`       | Path where the `split.json` file is stored and where the data will be generated.  | `pathlib.Path` |         |
| `--max-width`    | Images larger than this width will be resized to this width.                      | `int`          |         |
| `--max-height`   | Images larger than this height will be resized to this height.                    | `int`          |         |
| `--image-format` | Images will be saved under this format.                                           | `str`          | `.jpg`  |

The `--output` directory must contain a `split.json` file mapping each element (identified by its ID) to its image information and its ground-truth transcription (with NER tokens if needed). This file can be generated by the `teklia-dan dataset extract` command. More details in the [dedicated page](./extract.md).
```json
{
  "train": {
    "<element_id>": {
      "dataset_id": "<dataset_id>",
      "image": {
        "iiif_url": "https://<iiif_server>/iiif/2/<path>",
        "polygon": [
          [37, 191],
          [37, 339],
          [767, 339],
          [767, 191],
          [37, 191]
        ]
      },
      "text": "ⓢCoufet ⓕBouis ⓑ07.12.14"
    }
  },
  "val": {},
  "test": {}
}
```
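Since `split.json` is plain JSON, it is easy to inspect before downloading; a quick sketch:

```python
import json
from pathlib import Path

split = json.loads((Path("data") / "split.json").read_text())
for name in ("train", "val", "test"):
    print(name, len(split.get(name, {})), "elements")

# IIIF URL of one training element
first_element = next(iter(split["train"].values()))
print(first_element["image"]["iiif_url"])
```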
## Examples
### Download full images
To download images from an extracted split, please use the following:
```shell
teklia-dan dataset download \
--output data
```
### Download resized images

To download resized images from an extracted split, limiting the width and/or the height of the images, please use the following:
```shell
teklia-dan dataset download \
--output data \
--max-width 1800
```
or
```shell
teklia-dan dataset download \
--output data \
--max-height 2000
```
or
```shell
teklia-dan dataset download \
--output data \
--max-width 1800 \
--max-height 2000
```
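The exact resizing rule is not spelled out here; a sketch of the behaviour the two flags suggest, assuming the aspect ratio is preserved (an assumption, not a documented guarantee):

```python
def target_size(
    width: int,
    height: int,
    max_width: int | None = None,
    max_height: int | None = None,
) -> tuple[int, int]:
    """Downscale only when a dimension exceeds its limit (assumed behaviour)."""
    scale = 1.0
    if max_width and width > max_width:
        scale = min(scale, max_width / width)
    if max_height and height > max_height:
        scale = min(scale, max_height / height)
    return round(width * scale), round(height * scale)


print(target_size(3600, 2400, max_width=1800))  # (1800, 1200)
```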
@@ -4,31 +4,22 @@
 Use the `teklia-dan dataset extract` command to extract a dataset from an Arkindex export database (SQLite format). This will:

-- Generate the images of each element (in the `images/` folder),
-- Create the mapping of the images (identified by their paths) to the ground-truth transcriptions (with NER tokens if needed) (in the `labels.json` file),
+- Create a mapping of the elements (identified by their IDs) to the image information and the ground-truth transcription (with NER tokens if needed) (in the `split.json` file),
 - Store the set of characters encountered in the dataset (in the `charset.pkl` file),
 - Generate the resources needed to build an n-gram language model at character, subword or word level with [kenlm](https://github.com/kpu/kenlm) (in the `language_model/` folder).

-If an image download fails for whatever reason, it won't appear in the transcriptions file. The reason will be printed to stdout at the end of the process. Before trying to download the image, it checks that it wasn't downloaded previously. It is thus safe to run this command twice if a few images failed.

 | Parameter                        | Description                                                                                                                                                                                                                                                                | Type            | Default |
 | -------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | ------- |
 | `database`                       | Path to an Arkindex export database in SQLite format.                                                                                                                                                                                                                     | `pathlib.Path`  |         |
+| `--dataset-id`                   | ID of the dataset to extract from Arkindex.                                                                                                                                                                                                                               | `uuid`          |         |
 | `--element-type`                 | Type of the elements to extract. You may specify multiple types.                                                                                                                                                                                                          | `str`           |         |
 | `--parent-element-type`          | Type of the parent element containing the data.                                                                                                                                                                                                                           | `str`           | `page`  |
 | `--output`                       | Folder where the data will be generated.                                                                                                                                                                                                                                  | `pathlib.Path`  |         |
 | `--entity-separators`            | Removes all text that does not appear in an entity or in the list of given ordered characters. If several separators follow each other, keep only the first to appear in the list. Do not give any arguments to keep the whole text (see [dedicated section](#examples)). | `str`           |         |
 | `--unknown-token`                | Token used to replace any character in the validation/test sets that is not included in the training set.                                                                                                                                                                 | `str`           | `⁇`     |
 | `--tokens`                       | Mapping between starting tokens and end tokens to extract text with their entities.                                                                                                                                                                                       | `pathlib.Path`  |         |
-| `--train-folder`                 | ID of the training folder to extract from Arkindex.                                                                                                                                                                                                                       | `uuid`          |         |
-| `--val-folder`                   | ID of the validation folder to extract from Arkindex.                                                                                                                                                                                                                     | `uuid`          |         |
-| `--test-folder`                  | ID of the testing folder to extract from Arkindex.                                                                                                                                                                                                                        | `uuid`          |         |
 | `--transcription-worker-version` | Filter transcriptions by worker_version. Use `manual` for manual filtering.                                                                                                                                                                                               | `str` or `uuid` |         |
 | `--entity-worker-version`        | Filter transcription entities by worker_version. Use `manual` for manual filtering.                                                                                                                                                                                       | `str` or `uuid` |         |
-| `--max-width`                    | Images larger than this width will be resized to this width.                                                                                                                                                                                                              | `int`           |         |
-| `--max-height`                   | Images larger than this height will be resized to this height.                                                                                                                                                                                                            | `int`           |         |
 | `--keep-spaces`                  | Transcriptions are trimmed by default. Use this flag to disable this behaviour.                                                                                                                                                                                           | `bool`          | `False` |
-| `--image-format`                 | Images will be saved under this format.                                                                                                                                                                                                                                   | `str`           | `.jpg`  |
 | `--allow-empty`                  | Elements with no transcriptions are skipped by default. This flag disables this behaviour.                                                                                                                                                                                | `bool`          | `False` |
 | `--subword-vocab-size`           | Size of the vocabulary used to train the sentencepiece subword tokenizer used to train the optional language model.                                                                                                                                                       | `int`           | `1000`  |
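To choose sensible values for `--element-type` and `--parent-element-type`, it can help to inspect the export first. A sketch assuming the standard Arkindex export schema (an `element` table with a `type` column; verify against your own export):

```python
import sqlite3

con = sqlite3.connect("database.sqlite")
for element_type, count in con.execute(
    "SELECT type, COUNT(*) FROM element GROUP BY type ORDER BY COUNT(*) DESC"
):
    print(f"{element_type}: {count}")
```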
@@ -67,9 +58,7 @@ To use the data from three folders as **training**, **validation** and **testing
 ```shell
 teklia-dan dataset extract \
     database.sqlite \
-    --train-folder train_folder_uuid \
-    --val-folder val_folder_uuid \
-    --test-folder test_folder_uuid \
+    --dataset-id dataset_uuid \
     --element-type page \
     --output data \
     --tokens tokens.yml
@@ -121,10 +110,7 @@ To extract HTR data from **annotations** and **text_zones** from each folder, bu
 ```shell
 teklia-dan dataset extract \
     database.sqlite \
-    --train-folder train_folder_uuid \
-    --val-folder val_folder_uuid \
-    --test-folder test_folder_uuid \
+    --dataset-id dataset_uuid \
     --element-type text_zone annotation \
     --parent-element-type single_page \
     --output data
 ```
@@ -13,3 +13,6 @@ Two operations are available through subcommands:
 `teklia-dan dataset extract`
 : To extract a dataset from an [Arkindex export](https://doc.arkindex.org/howto/export/). More details in the [dedicated page](./extract.md).
+
+`teklia-dan dataset download`
+: To download images of a dataset. More details in the [dedicated page](./download.md).
@@ -7,3 +7,12 @@ To evaluate DAN on your dataset:
 1. Create a JSON configuration file. You can base the configuration file off the training one. Refer to the [dedicated page](../train/config.md) for a description of the parameters.
 1. Run `teklia-dan evaluate --config path/to/your/config.json`.
 1. Evaluation results for every split are available in the `results` subfolder of the output folder indicated in your configuration.
+1. A metrics Markdown table, providing results for each evaluated split, is also printed in the console (see the example table below).
+
+### Example output - Metrics Markdown table
+
+| Split | CER (HTR-NER) | CER (HTR) | WER (HTR-NER) | WER (HTR) | WER (HTR no punct) | NER |
+| :---: | :-----------: | :-------: | :-----------: | :-------: | :----------------: | :-: |
+| train |       x       |     x     |       x       |     x     |         x          |  x  |
+|  val  |       x       |     x     |       x       |     x     |         x          |  x  |
+| test  |       x       |     x     |       x       |     x     |         x          |  x  |
@@ -26,14 +26,14 @@ Use the `teklia-dan predict` command to apply a trained DAN model on an image.
 | `--start-token`        | Use a specific starting token at the beginning of the prediction. Useful when making predictions on different single pages. | `str`  |         |
 | `--use-language-model` | Whether to use an explicit language model to rescore text hypotheses.                                                       | `bool` | `False` |

-## Examples
-
-In the following examples the `models` directory should have:
+The `--model` argument expects a directory with the following files:

 - a `model.pt` file,
 - a `charset.pkl` file,
 - a `parameters.yml` file corresponding to the `inference_parameters.yml` file generated during training.
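A quick sanity check of that directory before running predictions (a sketch, not part of the CLI):

```python
from pathlib import Path

model_dir = Path("models")
for required in ("model.pt", "charset.pkl", "parameters.yml"):
    if not (model_dir / required).exists():
        raise FileNotFoundError(f"{required} is missing from {model_dir}")
```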
+## Examples
+
 ### Predict with confidence scores

 To run a prediction with confidence scores, run this command:
@@ -37,34 +37,34 @@ To determine the value to use for `dataset.max_char_prediction`, you can use the
 ## Training parameters

 | Name                                              | Description                                                                                                                  | Type         | Default                                                                     |
 | ------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- | ------------ | --------------------------------------------------------------------------- |
 | `training.data.batch_size`                        | Mini-batch size for the training loop.                                                                                      | `int`        | `2`                                                                         |
 | `training.data.load_in_memory`                    | Load all images in CPU memory.                                                                                              | `bool`       | `True`                                                                      |
 | `training.data.worker_per_gpu`                    | Number of parallel processes per GPU for data loading.                                                                      | `int`        | `4`                                                                         |
 | `training.data.preprocessings`                    | List of pre-processing functions to apply to input images.                                                                  | `list`       | (see [dedicated section](#data-preprocessing))                              |
 | `training.data.augmentation`                      | Whether to use data augmentation on the training set.                                                                       | `bool`       | `True` (see [dedicated section](#data-augmentation))                        |
 | `training.output_folder`                          | Directory for checkpoints and results.                                                                                      | `str`        |                                                                             |
 | `training.max_nb_epochs`                          | Maximum number of epochs before stopping training.                                                                          | `int`        | `800`                                                                       |
 | `training.load_epoch`                             | Model to load. Should be either `"best"` (evaluation) or `"last"` (training).                                               | `str`        | `"last"`                                                                    |
 | `training.device.use_ddp`                         | Whether to use DistributedDataParallel.                                                                                     | `bool`       | `False`                                                                     |
 | `training.device.ddp_port`                        | DDP port.                                                                                                                   | `int`        | `20027`                                                                     |
 | `training.device.use_amp`                         | Whether to enable automatic mixed precision.                                                                                | `bool`       | `True`                                                                      |
 | `training.device.nb_gpu`                          | Number of GPUs to train DAN. Set to `null` to use all available GPUs.                                                       | `int`        |                                                                             |
 | `training.device.force`                           | Use a specific device if available. Use `cpu` to train on CPU (for debugging) or `cuda`/`cuda:$gpu_device` to train on GPU. | `str`        |                                                                             |
 | `training.optimizers.all.args.lr`                 | Learning rate for the optimizer.                                                                                            | `float`      | `0.0001`                                                                    |
 | `training.optimizers.all.args.amsgrad`            | Whether to use AMSGrad optimization.                                                                                        | `bool`       | `False`                                                                     |
 | `training.lr_schedulers`                          | Learning rate schedulers.                                                                                                   | custom class |                                                                             |
 | `training.validation.eval_on_valid`               | Whether to evaluate and log metrics on the validation set during training.                                                  | `bool`       | `True`                                                                      |
 | `training.validation.eval_on_valid_interval`      | Interval (in epochs) to evaluate during training.                                                                           | `int`        | `5`                                                                         |
 | `training.validation.set_name_focus_metric`       | Dataset to focus on to select best weights.                                                                                 | `str`        |                                                                             |
-| `training.metrics.train`                          | List of metrics to compute during training.                                                                                 | `list`       | `["loss_ce", "cer", "wer", "wer_no_punct"]`                                 |
-| `training.metrics.eval`                           | List of metrics to compute during validation.                                                                               | `list`       | `["cer", "wer", "wer_no_punct"]`                                            |
+| `training.metrics.train`                          | List of metrics to compute during training.                                                                                 | `list`       | `["loss_ce", "cer", "cer_no_token", "wer", "wer_no_punct", "wer_no_token"]` |
+| `training.metrics.eval`                           | List of metrics to compute during validation.                                                                               | `list`       | `["cer", "cer_no_token", "wer", "wer_no_punct", "wer_no_token"]`            |
 | `training.label_noise_scheduler.min_error_rate`   | Minimum ratio of teacher forcing.                                                                                           | `float`      | `0.2`                                                                       |
 | `training.label_noise_scheduler.max_error_rate`   | Maximum ratio of teacher forcing.                                                                                           | `float`      | `0.2`                                                                       |
 | `training.label_noise_scheduler.total_num_steps`  | Number of steps before stopping teacher forcing.                                                                            | `float`      | `5e4`                                                                       |
 | `training.transfer_learning.encoder`              | Model to load for the encoder \[state_dict_name, checkpoint_path, learnable, strict\].                                      | `list`       | `["encoder", "pretrained_models/dan_rimes_page.pt", True, True]`            |
 | `training.transfer_learning.decoder`              | Model to load for the decoder \[state_dict_name, checkpoint_path, learnable, strict\].                                      | `list`       | `["encoder", "pretrained_models/dan_rimes_page.pt", True, False]`           |
 - To train on several GPUs, simply set the `training.device.use_ddp` parameter to `True`. By default, the model will use all available GPUs. To restrict access to fewer GPUs, set the `training.device.nb_gpu` parameter.
 - During the validation stage, the batch size is set to 1. This avoids problems associated with image sizes that can vary considerably within batches and lead to significant padding, resulting in performance degradation.
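A sketch of the GPU-defaulting behaviour described above, mirroring the `update_config` logic shown earlier (`nb_gpu: null` in JSON becomes `None` in Python and falls back to every visible GPU):

```python
import torch

config = {"training": {"device": {"use_ddp": True, "nb_gpu": None}}}
if config["training"]["device"]["nb_gpu"] is None:
    config["training"]["device"]["nb_gpu"] = torch.cuda.device_count()
```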
@@ -65,6 +65,7 @@ nav:
       - Dataset entities: usage/datasets/entities.md
       - Dataset tokens: usage/datasets/tokens.md
       - Dataset extraction: usage/datasets/extract.md
+      - Dataset download: usage/datasets/download.md
   - Training:
       - usage/train/index.md
       - Configuration: usage/train/config.md
@@ -81,6 +82,11 @@ nav:
       - Analyze:
           - ref/datasets/analyze/index.md
           - Statistics: ref/datasets/analyze/statistics.md
+      - Download:
+          - ref/datasets/download/index.md
+          - Images: ref/datasets/download/images.md
+          - Utils: ref/datasets/download/utils.md
+          - Exceptions: ref/datasets/download/exceptions.md
       - Entities:
           - ref/datasets/entities/index.md
           - Extract: ref/datasets/entities/extract.md
@@ -112,6 +118,7 @@ nav:
       - MLflow: ref/ocr/mlflow.md
       - Schedulers: ref/ocr/schedulers.md
       - Transformations: ref/ocr/transforms.md
+      - Utils: ref/ocr/utils.md
   - CLI: ref/cli.md
   - Utils: ref/utils.md
@@ -12,7 +12,9 @@ select = [
     # Isort
     "I",
     # Pathlib usage
-    "PTH"
+    "PTH",
+    # Implicit Optional
+    "RUF013"
 ]

 [tool.ruff.isort]
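Enabling `RUF013` is what drives the annotation changes above: ruff flags parameters whose `None` default makes them implicitly `Optional`. A minimal illustration of what the rule flags and the fix applied throughout this revision:

```python
def implicit(size: tuple[int, int] = None):  # flagged by RUF013: implicit Optional
    ...


def explicit(size: tuple[int, int] | None = None):  # explicit, passes the check
    ...
```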