# Arkindex
::: dan.datasets.extract.arkindex
# Extraction
::: dan.datasets.extract
# Datasets
::: dan.datasets
# Generate
::: dan.datasets.tokens.generate
# Tokens
::: dan.datasets.tokens
# OCR
::: dan.ocr
# Dataset manager
::: dan.ocr.manager.dataset
# Manager
::: dan.ocr.manager
# Predict
::: dan.ocr.predict
# Inference
::: dan.ocr.predict.prediction
# Dataset analysis
## Description
Use the `teklia-dan dataset analyze` command to analyze a dataset. This will display statistics in [Markdown](https://www.markdownguide.org/) format.
The available arguments are:
| Parameter | Description | Type | Default |
| --------------- | -------------------------------- | -------------- | ------- |
| `--labels` | Path to the `labels.json` file. | `pathlib.Path` | |
| `--tokens` | Path to the `tokens.yml` file. | `pathlib.Path` | |
| `--output-file` | Where the summary will be saved. | `pathlib.Path` | |
## Examples
```shell
teklia-dan dataset analyze \
--labels path/to/dataset/labels.json \
--tokens path/to/tokens.yml \
--output-file statistics.md
```
# Dataset entities
## Description
Use the `teklia-dan dataset entities` command to extract entities from an Arkindex export database (SQLite format). This will create a YAML file with all the entity names found.
| Parameter | Description | Type | Default |
| --------------- | --------------------------------------------------- | -------------- | -------------- |
| `database`      | Path to an Arkindex export database in SQLite format.  | `pathlib.Path` |                |
| `--output-file` | Path to a YAML file to save the extracted entities. | `pathlib.Path` | `entities.yml` |
## Examples
```shell
teklia-dan dataset entities \
database.sqlite
```
This command will create an `entities.yml` YAML-formatted file with the list of entity names.
```yaml
entities:
- INTITULE
- DATE
- ANALYSE_COMPL.
- PRECISIONS_SUR_COTE
- COTE_ARTICLE
- CLASSEMENT
```
# Dataset extraction
## Description
Use the `teklia-dan dataset extract` command to extract a dataset from an Arkindex export database (SQLite format). This will:
- Generate the images of each element (in the `images/` folder),
- Create the mapping of each image (identified by its path) to its ground-truth transcription, with NER tokens if needed (in the `labels.json` file, sketched after this list),
- Store the set of characters encountered in the dataset (in the `charset.pkl` file),
- Generate the resources needed to build an n-gram language model at character, subword or word level with [kenlm](https://github.com/kpu/kenlm) (in the `language_model/` folder).
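As an illustration of the `labels.json` mapping described above, here is a hypothetical excerpt (the file name and transcription are invented, and the exact layout may differ slightly):

```json
{
  "train": {
    "images/train/0a1b2c3d.jpg": "Le 14 juin 1865.\nActe de naissance de Jean Dupont."
  },
  "val": {},
  "test": {}
}
```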
If an image download fails for any reason, the image will not appear in the transcriptions file and the reason will be printed to stdout at the end of the process. Before trying to download an image, the command checks whether it was already downloaded, so it is safe to run this command twice if a few images failed.
| Parameter | Description | Type | Default |
| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------- | -------------------------------------------------- |
| `database` | Path to an Arkindex export database in SQLite format. | `pathlib.Path` | |
| `--element-type` | Type of the elements to extract. You may specify multiple types. | `str` | |
| `--parent-element-type` | Type of the parent element containing the data. | `str` | `page` |
| `--output` | Folder where the data will be generated. | `pathlib.Path` | |
| `--entity-separators` | Removes all text that does not appear in an entity or in the list of given ordered characters. If several separators follow each other, keep only the first to appear in the list. Do not give any arguments to keep the whole text. | `str` | `["\n", " "]` (see [dedicated section](#examples)) |
| `--unknown-token` | Token to use to replace character in the validation/test sets that is not included in the training set. | `str` | `⁇` |
| `--tokens` | Mapping between starting tokens and end tokens to extract text with their entities. | `pathlib.Path` | |
| `--train-folder` | ID of the training folder to extract from Arkindex. | `uuid` | |
| `--val-folder` | ID of the validation folder to extract from Arkindex. | `uuid` | |
| `--test-folder`                 | ID of the testing folder to extract from Arkindex.                                                                                                                                                                                    | `uuid`          |                                                    |
| `--transcription-worker-version` | Filter transcriptions by worker_version. Use `manual` for manual filtering. | `str` or `uuid` | |
| `--entity-worker-version`       | Filter transcription entities by worker_version. Use `manual` for manual filtering.                                                                                                                                                   | `str` or `uuid` |                                                    |
| `--max-width` | Images larger than this width will be resized to this width. | `int` | |
| `--max-height` | Images larger than this height will be resized to this height. | `int` | |
| `--keep-spaces` | Transcriptions are trimmed by default. Use this flag to disable this behaviour. | `bool` | `False` |
| `--image-format` | Images will be saved under this format. | `str` | `.jpg` |
| `--allow-empty` | Elements with no transcriptions are skipped by default. This flag disables this behaviour. | `bool` | `False` |
| `--subword-vocab-size` | Size of the vocabulary used to train the sentencepiece subword tokenizer used to train the optional language model. | `int` | `1000` |
The `--tokens` argument expects a YAML-formatted file with a specific format: a list of entries, each describing a NER entity. The label of the entity is the key to a dict mapping the starting and ending tokens respectively. This file can be generated by the `teklia-dan dataset tokens` command. More details in the [dedicated page](./tokens.md).
```yaml
INTITULE: # Type of the entity on Arkindex
  start: # Starting token for this entity
  end: # Ending token for this entity
DATE:
  start:
  end:
...
```

## Examples

### HTR + NER data

To extract both the text and the NER entities, provide the `--tokens` parameter:

```shell
teklia-dan dataset extract \
    [...] \
    --tokens tokens.yml
```
If there is no end token, it is possible to define the characters to keep with the `--entity-separators` parameter:
```shell
teklia-dan dataset extract \
[...] \
--entity-separators $'\n' " "
```
If several separators follow each other, only the first one in the list is kept: with the order above, a line break when there is one, otherwise a space. Reversing the order of the `--entity-separators` values keeps a space when there is one, otherwise a line break.
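As a hypothetical illustration, suppose the text between two entities is a space followed by a line break:

```
Text between entities: " \n"
--entity-separators $'\n' " "   keeps "\n" (the line break appears first in the list)
--entity-separators " " $'\n'   keeps " "  (the space appears first in the list)
```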
### HTR from multiple element types
To extract HTR data from **annotations** and **text_zones** from each folder, but only keep those that are children of **single_pages**, please use the following:
```shell
teklia-dan dataset extract \
    [...] \
    --element-type annotation text_zone \
    --parent-element-type single_page \
    --output data
```
# Datasets
Four operations are available through subcommands:
`teklia-dan dataset analyze`
: To analyze datasets and display statistics. More details in the [dedicated page](./analyze.md).
`teklia-dan dataset entities`
: To extract entities from an [Arkindex export](https://doc.arkindex.org/howto/export/). More details in the [dedicated page](./entities.md).
`teklia-dan dataset tokens`
: To generate a YAML file containing entities and their token(s) to train a DAN model. More details in the [dedicated page](./tokens.md).
`teklia-dan dataset extract`
: To extract a dataset from an [Arkindex export](https://doc.arkindex.org/howto/export/). More details in the [dedicated page](./extract.md).
# Dataset tokens
## Description
Use the `teklia-dan dataset tokens` command to generate a YAML file containing entities and their token(s) to train a DAN model.
| Parameter | Description | Type | Default |
| --------------- | ------------------------------------------------------------ | -------------- | ------------ |
| `entities` | Path to a YAML file containing the extracted entities. | `pathlib.Path` | |
| `--end-tokens` | Whether to generate end tokens along with starting tokens. | `bool` | `False` |
| `--output-file` | Path to a YAML file to save the entities and their token(s). | `pathlib.Path` | `tokens.yml` |
The `entities` argument expects a YAML-formatted file with the list of entity names. This file can be generated by the `teklia-dan dataset entities` command. More details in the [dedicated page](./entities.md).
```yaml
entities:
- INTITULE
- DATE
- ANALYSE_COMPL.
- PRECISIONS_SUR_COTE
- COTE_ARTICLE
- CLASSEMENT
```
## Examples
### Start tokens
```shell
teklia-dan dataset tokens \
entities.yml
```
This command will create a `tokens.yml` YAML-formatted file with a specific format: a list of entries, each describing a NER entity. The label of the entity is the key to a dict mapping the starting and ending tokens respectively.
```yaml
INTITULE: # Type of the entity on Arkindex
start: # Starting token for this entity
end: ''
DATE:
start:
end: ''
ANALYSE_COMPL.:
start:
end: ''
PRECISIONS_SUR_COTE:
start:
end: ''
COTE_ARTICLE:
start:
end: ''
CLASSEMENT:
start:
end: ''
```
### Start tokens + End tokens
```shell
teklia-dan dataset tokens \
entities.yml \
--end-tokens
```
This command will create a `tokens.yml` YAML-formatted file with a specific format: a list of entries, each describing a NER entity. The label of the entity is the key to a dict mapping the starting and ending tokens respectively.
```yaml
INTITULE: # Type of the entity on Arkindex
start: # Starting token for this entity
end: # Ending token for this entity
DATE:
start:
end:
ANALYSE_COMPL.:
start:
end:
PRECISIONS_SUR_COTE:
start:
end:
COTE_ARTICLE:
start:
end:
CLASSEMENT:
start:
end:
```
When `teklia-dan` is installed in your environment, you may use the following commands:
`teklia-dan dataset`
: To preprocess datasets from Arkindex for training. More details in the [dedicated page](./datasets/index.md).
`teklia-dan train`
: To train a new DAN model. More details in the [dedicated page](./train/index.md).
`teklia-dan predict`
: To predict an image using a trained DAN model. More details in the [dedicated page](./predict/index.md).
# Predict
Use the `teklia-dan predict` command to apply a trained DAN model on an image.
- [Training a statistical language model](training_lm.md)
- [Parameters](parameters.md)
- [Examples](examples.md)
## Description
| Parameter | Description | Type | Default |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | -------------- | ------------- |
| `--image` | Path to the image to predict. Must not be provided with `--image-dir`. | `pathlib.Path` | |
| `--image-dir` | Path to the folder where the images to predict are stored. Must not be provided with `--image`. | `pathlib.Path` | |
| `--image-extension`         | The extension of the images in the folder. Ignored if `--image-dir` is not provided.                                                   | `str`          | `.jpg`        |
| `--model` | Path to the model to use for prediction | `pathlib.Path` | |
| `--parameters` | Path to the YAML parameters file. | `pathlib.Path` | |
| `--charset` | Path to the charset file. | `pathlib.Path` | |
| `--output` | Path to the output folder. Results will be saved in this directory. | `pathlib.Path` | |
| `--tokens`                  | Path to a YAML file containing a mapping between starting tokens and end tokens. Needed for entities.                                  | `pathlib.Path` |               |
| `--temperature`             | Temperature scaling scalar parameter (see the note below the table).                                                                   | `float`        | `1.0`         |
| `--confidence-score` | Whether to return confidence scores. | `bool` | `False` |
| `--confidence-score-levels` | Level to return confidence scores. Should be any combination of `["line", "word", "char", "ner"]`. | `str` | |
| `--attention-map` | Whether to plot attention maps. | `bool` | `False` |
| `--attention-map-scale` | Image scaling factor before creating the GIF. | `float` | `0.5` |
| `--attention-map-level` | Level to plot the attention maps. Should be in `["line", "word", "char", "ner"]`. | `str` | `"line"` |
| `--predict-objects` | Whether to return polygons coordinates. | `bool` | `False` |
| `--max-object-height` | Maximum height for predicted objects. If set, grid search segmentation will be applied and width will be normalized to element width. | `int` | |
| `--word-separators` | List of word separators. | `list` | `[" ", "\n"]` |
| `--line-separators` | List of line separators. | `list` | `["\n"]` |
| `--threshold-method` | Method to use for attention mask thresholding. Should be in `["otsu", "simple"]`. | `str` | `"otsu"` |
| `--threshold-value`         | Threshold to use for the `simple` thresholding method.                                                                                 | `int`          | `0`           |
| `--gpu-device` | Use a specific GPU if available. | `int` | |
| `--batch-size` | Size of the batches for prediction. | `int` | `1` |
| `--start-token` | Use a specific starting token at the beginning of the prediction. Useful when making predictions on different single pages. | `str` | |
| `--use-language-model` | Whether to use an explicit language model to rescore text hypotheses. | `bool` | `False` |
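A note on `--temperature`: in the usual temperature-scaling formulation, which this parameter is assumed to follow, the logits are divided by the temperature before the softmax, i.e. probabilities are computed as `softmax(logits / T)`. Values above `1.0` flatten the confidence scores and values below `1.0` sharpen them; `1.0` leaves them unchanged.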
## Examples
### Predict with confidence scores
To run a prediction with confidence scores, run this command:
```shell
teklia-dan predict \
--image example.jpg \
--model model.pt \
--parameters inference_parameters.yml \
--charset charset.pkl \
--output predict/ \
--confidence-score
```
It will create the following JSON file named `predict/example.json`:
```json
{
"text": "Hansteensgt. 2 IV 28/4 - 19\nKj\u00e6re Gerhard.\nTak for Brevet om Boken og Haven\nog Crokus og Blaaveis og tak fordi\nDu vilde be mig derut sammen\nmed Kris og Ragna. Men vet Du\nda ikke, at Kris reiste med sin S\u00f8-\nster Fru Cr\u00f8ger til Lillehammer\nnogle Dage efter Begravelsen? Hen\ndes Address er Amtsingeni\u00f8r\nCr\u00f8ger. Hun skriver at de blir\nder til lidt ut i Mai. Nu er hun\nnoksaa medtat skj\u00f8nner jeg af Sorg\nog af L\u00e6ngsel, skriver saameget r\u00f8-\nrende om Oluf. Ragna har det\nherligt, skriver hun. Hun er bare\ngla, og det vet jeg, at \"Oluf er gla over,\nder hvor han nu er. Jeg har saa in-\nderlig ondt af hende, og om Du skrev\net Par Ord tror jeg det vilde gj\u00f8re\nhende godt. - Jeg gl\u00e6der mig over,\nat Du har skrevet en Bok, og\njeg er vis paa, at den er god.",
"confidences": {
"total": 0.99
}
}
```
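### Predict all images in a folder

The examples in this page predict a single image, but the same options apply to a whole folder. Here is a sketch using the folder-related parameters from the table above (folder and file names are illustrative):

```shell
teklia-dan predict \
    --image-dir images/ \
    --image-extension .jpg \
    --model model.pt \
    --parameters inference_parameters.yml \
    --charset charset.pkl \
    --output predict/ \
    --batch-size 4
```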
### Predict with confidence scores and line-level attention maps
To run a prediction with confidence scores and plot line-level attention maps, run this command:
```shell
teklia-dan predict \
--image example.jpg \
--model model.pt \
--parameters inference_parameters.yml \
--charset charset.pkl \
--output predict/ \
--confidence-score \
--attention-map
```
It will create the following JSON file named `predict/example.json` and a GIF showing a line-level attention map `predict/example_line.gif`:
```json
{
"text": "Hansteensgt. 2 IV 28/4 - 19\nKj\u00e6re Gerhard.\nTak for Brevet om Boken og Haven\nog Crokus og Blaaveis og tak fordi\nDu vilde be mig derut sammen\nmed Kris og Ragna. Men vet Du\nda ikke, at Kris reiste med sin S\u00f8-\nster Fru Cr\u00f8ger til Lillehammer\nnogle Dage efter Begravelsen? Hen\ndes Address er Amtsingeni\u00f8r\nCr\u00f8ger. Hun skriver at de blir\nder til lidt ut i Mai. Nu er hun\nnoksaa medtat skj\u00f8nner jeg af Sorg\nog af L\u00e6ngsel, skriver saameget r\u00f8-\nrende om Oluf. Ragna har det\nherligt, skriver hun. Hun er bare\ngla, og det vet jeg, at \"Oluf er gla over,\nder hvor han nu er. Jeg har saa in-\nderlig ondt af hende, og om Du skrev\net Par Ord tror jeg det vilde gj\u00f8re\nhende godt. - Jeg gl\u00e6der mig over,\nat Du har skrevet en Bok, og\njeg er vis paa, at den er god.",
"confidences": {
"total": 0.99
},
"attention_gif": "predict/example_line.gif"
}
```
<img src="../../../assets/example_line.gif" />
### Predict with confidence scores and word-level attention maps
To run a prediction with confidence scores and plot word-level attention maps, run this command:
```shell
teklia-dan predict \
--image example.jpg \
--model model.pt \
--parameters inference_parameters.yml \
--charset charset.pkl \
--output predict/ \
--confidence-score \
--attention-map \
--attention-map-level word \
--attention-map-scale 0.5
```
It will create the following JSON file named `predict/example.json` and a GIF showing a word-level attention map `predict/example_word.gif`:
```json
{
"text": "Hansteensgt. 2 IV 28/4 - 19\nKj\u00e6re Gerhard.\nTak for Brevet om Boken og Haven\nog Crokus og Blaaveis og tak fordi\nDu vilde be mig derut sammen\nmed Kris og Ragna. Men vet Du\nda ikke, at Kris reiste med sin S\u00f8-\nster Fru Cr\u00f8ger til Lillehammer\nnogle Dage efter Begravelsen? Hen\ndes Address er Amtsingeni\u00f8r\nCr\u00f8ger. Hun skriver at de blir\nder til lidt ut i Mai. Nu er hun\nnoksaa medtat skj\u00f8nner jeg af Sorg\nog af L\u00e6ngsel, skriver saameget r\u00f8-\nrende om Oluf. Ragna har det\nherligt, skriver hun. Hun er bare\ngla, og det vet jeg, at \"Oluf er gla over,\nder hvor han nu er. Jeg har saa in-\nderlig ondt af hende, og om Du skrev\net Par Ord tror jeg det vilde gj\u00f8re\nhende godt. - Jeg gl\u00e6der mig over,\nat Du har skrevet en Bok, og\njeg er vis paa, at den er god.",
"confidences": {
"total": 0.99
},
"attention_gif": "predict/example_word.gif"
}
```
<img src="../../../assets/example_word.gif" />
### Predict with line-level attention maps and extract polygons
To run a prediction, plot line-level attention maps, and extract polygons, run this command:
```shell
teklia-dan predict \
--image example.jpg \
--model model.pt \
--parameters inference_parameters.yml \
--charset charset.pkl \
--output predict/ \
--attention-map \
--predict-objects \
--threshold-method otsu
```
It will create the following JSON file named `predict/example.json` and a GIF showing a line-level attention map with extracted polygons `predict/example_line.gif`:
```json
{
"text": "Oslo\n39 \nOresden den 24te Rasser!\nH\u00f8jst\u00e6redesherr Hartvig - assert!\nUllereder fra den f\u00f8rste tide da\njeg havder den tilfredsstillelser at vide den ar-\ndistiske ledelser af Kristiania theater i Deres\nhronder, har jeg g\u00e5t hernede med et stille\nh\u00e5b om fra Dem at modtage et forelag, sig -\nsende tils at lade \"K\u00e6rlighedens \u00abKomedie\u00bb\nopf\u00f8re fore det norske purblikum.\nEt s\u00e5dant forslag er imidlertid, imod\nforventning; ikke fremkommet, og jeg n\u00f8des der-\nfor tils self at grivbe initiativet, hvilket hervede\nsker, idet jeg\nbeder\nbet\nragte stigkket some ved denne\nskrivelse officielde indleveret til theatret. No-\nget exemplar af bogen vedlagger jeg ikke da\ndenne (i 2den udgave) med Lethed kan er -\nholdet deroppe.\nDe bet\u00e6nkeligheder, jeg i sin tid n\u00e6-\nrede mod stykkets opf\u00f8relse, er for l\u00e6nge si -\ndem forsvundne. Af mange begn er jeg kom-\nmen til den overbevisning at almenlreden\naru har f\u00e5tt sine \u00f8gne opladte for den sand -\nMed at dette arbejde i sin indersten id\u00e9 hviler\np\u00e5 et ubedinget meralsk grundlag, og brad\nstykkets hele kunstneriske struktuve ang\u00e5r,",
"objects": [
{
"confidence": 0.68,
"polygon": [
[
264,
118
],
[
410,
118
],
[
410,
185
],
[
264,
185
]
],
"text": "Oslo",
"text_confidence": 0.8
}
],
"attention_gif": "predict/example_line.gif"
}
```
<img src="../../../assets/example_line_polygon.gif" />
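To use a fixed threshold instead of Otsu's method, switch to the `simple` method and pass a threshold value (the value below is illustrative):

```shell
teklia-dan predict \
    [...] \
    --attention-map \
    --predict-objects \
    --threshold-method simple \
    --threshold-value 128
```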
### Predict with an external n-gram language model
This example assumes that you have already [trained a language model](../train/language_model.md).
Note that:
- the `weight` parameter defines how much weight to give to the language model. It should be set carefully (usually between 0.5 and 2.0) as it will affect the quality of the predictions.
- line breaks are treated as spaces by language models; as a result, predictions will not include line breaks.
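Hypotheses are rescored by combining the DAN score with the language model score scaled by `weight` (a form of shallow fusion). If you still need to build the ARPA models referenced below, here is a minimal sketch using [kenlm](https://github.com/kpu/kenlm)'s `lmplz` (assuming kenlm is installed and that extraction produced a character-level corpus; the file names are indicative):

```shell
# Build a 6-gram character-level ARPA model from the extracted corpus
lmplz --order 6 \
    --text my_dataset/language_model/corpus_characters.txt \
    --arpa my_dataset/language_model/model_characters.arpa
```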
#### Language model at character level
First, update the `inference_parameters.yml` file obtained during DAN training.
```yaml
parameters:
...
language_model:
model: my_dataset/language_model/model_characters.arpa
lexicon: my_dataset/language_model/lexicon_characters.txt
tokens: my_dataset/language_model/tokens.txt
weight: 0.5
```
Then, run this command:
```shell
teklia-dan predict \
--image dan_humu_page/6e830f23-e70d-4399-8b94-f36ed3198575.jpg \
--model dan_humu_page/model.pt \
--parameters dan_humu_page/inference_parameters_char_lm.yml \
--charset dan_humu_page/charset.pkl \
--use-language-model \
--output dan_humu_page/predict_char_lm/
```
It will create the following JSON file named `dan_humu_page/predict_char_lm/6e830f23-e70d-4399-8b94-f36ed3198575.json`:
```json
{
"text": "etc., some jeg netop idag\nholder Vask paa.\nLeien af Skj\u00f8rterne\nbestad i at jeg kj\u00f8bte\net Forkl\u00e6de til hver\naf de to Piger, some\nhavde laant os dem.\nResten var Vask af Hardan-\ngerskj\u00f8rter og et Forkl\u00e6de,\nsamt Fragt paa det Gods\n(N\u00f8i) some man sendte\nmig ubet\u00e6lt.\nIdag fik jeg hyggeligt\nFrimarkebrev fra Fosvold\nMed Hilsen\nDeres\nHulda Garborg",
"language_model": {
"text": "eet., some jeg netop idag holder Vask paa. Leien af Skj\u00f8rterne bestad i at jeg kj\u00f8bte et Forkl\u00e6de til hver af de to Piger, some havde laant os dem. Resten var Vask af Hardan- gerskj\u00f8rter og et Forkl\u00e6de, samt Fragt paa det Gods (T\u00f8i) some man sendte mig ubet\u00e6lt. Idag fik jeg hyggeligt Frimarkebrev fra Fosvold Med Hilsen Deres Hulda Garborg",
"confidence": 0.9
}
}
```
#### Language model at subword level
Update the `inference_parameters.yml` file obtained during DAN training.
```yaml
parameters:
...
language_model:
model: my_dataset/language_model/model_subwords.arpa
lexicon: my_dataset/language_model/lexicon_subwords.txt
tokens: my_dataset/language_model/tokens.txt
weight: 0.5
```
Then, run this command:
```shell
teklia-dan predict \
--image dan_humu_page/6e830f23-e70d-4399-8b94-f36ed3198575.jpg \
--model dan_humu_page/model.pt \
--parameters dan_humu_page/inference_parameters_subword_lm.yml \
--charset dan_humu_page/charset.pkl \
--use-language-model \
--output dan_humu_page/predict_subword_lm/
```
It will create the following JSON file named `dan_humu_page/predict_subword_lm/6e830f23-e70d-4399-8b94-f36ed3198575.json`:
```json
{
"text": "etc., some jeg netop idag\nholder Vask paa.\nLeien af Skj\u00f8rterne\nbestad i at jeg kj\u00f8bte\net Forkl\u00e6de til hver\naf de to Piger, some\nhavde laant os dem.\nResten var Vask af Hardan-\ngerskj\u00f8rter og et Forkl\u00e6de,\nsamt Fragt paa det Gods\n(N\u00f8i) some man sendte\nmig ubet\u00e6lt.\nIdag fik jeg hyggeligt\nFrimarkebrev fra Fosvold\nMed Hilsen\nDeres\nHulda Garborg",
"language_model": {
"text": "eet., some jeg netop idag holder Vask paa. Leien af Skj\u00f8rterne bestad i at jeg kj\u00f8bte et Forkl\u00e6de til hver af de to Piger, some havde laant os dem. Resten var Vask af Hardan- gerskj\u00f8rter og et Forkl\u00e6de, samt Fragt paa det Gods (T\u00f8i) some man sendte mig ubet\u00e6lt. Idag fik jeg hyggeligt Frim\u00e6rkebrev fra Fosvold Med Hilsen Deres Hulda Garborg",
"confidence": 0.84
}
}
```
#### Language model at word level
Update the `inference_parameters.yml` file obtained during DAN training.
```yaml
parameters:
...
language_model:
model: my_dataset/language_model/model_words.arpa
lexicon: my_dataset/language_model/lexicon_words.txt
tokens: my_dataset/language_model/tokens.txt
weight: 0.5
```
Then, run this command:
```shell
teklia-dan predict \
--image dan_humu_page/6e830f23-e70d-4399-8b94-f36ed3198575.jpg \
--model dan_humu_page/model.pt \
--parameters dan_humu_page/inference_parameters_word_lm.yml \
--charset dan_humu_page/charset.pkl \
--use-language-model \
--output dan_humu_page/predict_word_lm/
```
It will create the following JSON file named `dan_humu_page/predict_word_lm/6e830f23-e70d-4399-8b94-f36ed3198575.json`:
```json
{
"text": "etc., some jeg netop idag\nholder Vask paa.\nLeien af Skj\u00f8rterne\nbestad i at jeg kj\u00f8bte\net Forkl\u00e6de til hver\naf de to Piger, some\nhavde laant os dem.\nResten var Vask af Hardan-\ngerskj\u00f8rter og et Forkl\u00e6de,\nsamt Fragt paa det Gods\n(N\u00f8i) some man sendte\nmig ubet\u00e6lt.\nIdag fik jeg hyggeligt\nFrimarkebrev fra Fosvold\nMed Hilsen\nDeres\nHulda Garborg",
"language_model": {
"text": "etc., some jeg netop idag holder Vask paa. Leien af Skj\u00f8rterne bestad i at jeg kj\u00f8bte et Forkl\u00e6de til hver af de to Piger, some havde laant os dem. Resten var Vask af Hardan- gerskj\u00f8rter og et Forkl\u00e6de, samt Fragt paa det Gods (T\u00f8i) some man sendte mig ubetalt. Idag fik jeg hyggeligt Frim\u00e6rkebrev fra Fosvold Med Hilsen Deres Hulda Garborg",
"confidence": 0.77
}
}
```