Skip to content
Snippets Groups Projects

Support subword and word language models

Merged Solene Tarride requested to merge subword-and-word-lm into main
All threads resolved!
1 file
+ 101
21
Compare changes
  • Side-by-side
  • Inline
+ 101
21
@@ -4,27 +4,57 @@ Use the `teklia-dan predict` command to apply a trained DAN model on an image.
## Description of parameters
| Parameter | Description | Type | Default |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ------- | ------------- |
| `--image` | Path to the image to predict. Must not be provided with `--image-dir`. | `Path` | |
| `--image-dir` | Path to the folder where the images to predict are stored. Must not be provided with `--image`. | `Path` | |
| `--image-extension` | The extension of the images in the folder. Ignored if `--image-dir` is not provided. | `str` | .jpg |
| `--model` | Path to the model to use for prediction | `Path` | |
| `--parameters` | Path to the YAML parameters file. | `Path` | |
| `--charset` | Path to the charset file. | `Path` | |
| `--output` | Path to the output folder. Results will be saved in this directory. | `Path` | |
| `--confidence-score` | Whether to return confidence scores. | `bool` | `False` |
| `--confidence-score-levels` | Level to return confidence scores. Should be any combination of `["line", "word", "char"]`. | `str` | |
| `--attention-map` | Whether to plot attention maps. | `bool` | `False` |
| `--attention-map-scale` | Image scaling factor before creating the GIF. | `float` | `0.5` |
| `--attention-map-level` | Level to plot the attention maps. Should be in `["line", "word", "char"]`. | `str` | `"line"` |
| `--predict-objects` | Whether to return polygons coordinates. | `bool` | `False` |
| `--word-separators` | List of word separators. | `list` | `[" ", "\n"]` |
| `--line-separators` | List of line separators. | `list` | `["\n"]` |
| `--threshold-method` | Method to use for attention mask thresholding. Should be in `["otsu", "simple"]`. | `str` | `"otsu"` |
| `--threshold-value ` | Threshold to use for the "simple" thresholding method. | `int` | `0` |
| `--batch-size ` | Size of the batches for prediction. | `int` | `1` |
| `--start-token ` | Use a specific starting token at the beginning of the prediction. Useful when making predictions on different single pages. | `str` | `None` |
| Parameter | Description | Type | Default |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | ------------- |
| `--image` | Path to the image to predict. Must not be provided with `--image-dir`. | `Path` | |
| `--image-dir` | Path to the folder where the images to predict are stored. Must not be provided with `--image`. | `Path` | |
| `--image-extension` | The extension of the images in the folder. Ignored if `--image-dir` is not provided. | `str` | .jpg |
| `--model` | Path to the model to use for prediction | `Path` | |
| `--parameters` | Path to the YAML parameters file. | `Path` | |
| `--charset` | Path to the charset file. | `Path` | |
| `--output` | Path to the output folder. Results will be saved in this directory. | `Path` | |
| `--confidence-score` | Whether to return confidence scores. | `bool` | `False` |
| `--confidence-score-levels` | Level to return confidence scores. Should be any combination of `["line", "word", "char"]`. | `str` | |
| `--attention-map` | Whether to plot attention maps. | `bool` | `False` |
| `--attention-map-scale` | Image scaling factor before creating the GIF. | `float` | `0.5` |
| `--attention-map-level` | Level to plot the attention maps. Should be in `["line", "word", "char"]`. | `str` | `"line"` |
| `--predict-objects` | Whether to return polygons coordinates. | `bool` | `False` |
| `--word-separators` | List of word separators. | `list` | `[" ", "\n"]` |
| `--line-separators` | List of line separators. | `list` | `["\n"]` |
| `--threshold-method` | Method to use for attention mask thresholding. Should be in `["otsu", "simple"]`. | `str` | `"otsu"` |
| `--threshold-value ` | Threshold to use for the "simple" thresholding method. | `int` | `0` |
| `--batch-size ` | Size of the batches for prediction. | `int` | `1` |
| `--start-token ` | Use a specific starting token at the beginning of the prediction. Useful when making predictions on different single pages. | `str` | `None` |
| `--use-language-model` | Whether to use an external n-gram language model to rescore hypotheses. See [the next section](#rescoring-hypotheses-with-a-n-gram-language-model) for details. | `bool` | `False` |
## Rescoring hypotheses with a N-gram language model
A dataset extracted with the `teklia-dan dataset extract` command should contain the files required to build a language model (in the `language_model` folder).
To refine DAN's predictions with a language model, follow these steps:
1. Install and build [kenlm](https://github.com/kpu/kenlm)
1. Build a 6-gram language model using the following command
```sh
bin/lmplz --order 6 \
--text my_dataset/language_model/corpus.txt \
--arpa my_dataset/language_model/model.arpa
```
1. Update `inference_parameters.yml`. The `weight` parameter defines how much weight to give to the language model. It should be set carefully (usually between 0.5 and 2.0) as it will affect the quality of the predictions.
```yaml
parameters:
...
language_model:
model: my_dataset/language_model/model.arpa
lexicon: my_dataset/language_model/lexicon.txt
tokens: my_dataset/language_model/tokens.txt
weight: 1.0
```
1. Predict with the `--use-language-model` argument.
## Examples
@@ -158,3 +188,53 @@ It will create the following JSON file named `dan_humu_page/predict/example.json
```
<img src="../../assets/example_line_polygon.gif" >
### Predict with an external n-gram language model
To run a prediction with the n-gram language model, run this command:
```shell
teklia-dan predict \
--image dan_humu_page/example.jpg \
--model dan_humu_page/model.pt \
--parameters dan_humu_page/parameters.yml \
--charset dan_humu_page/charset.pkl \
--use-language-model \
--output dan_humu_page/predict/
```
It will create the following JSON file named `dan_humu_page/predict/example.json`
```json
{
"text": "Oslo\n39 \nOresden den 24te Rasser!\nH\u00f8jst\u00e6redesherr Hartvig - assert!\nUllereder fra den f\u00f8rste tide da\njeg havder den tilfredsstillelser at vide den ar-\ndistiske ledelser af Kristiania theater i Deres\nhronder, har jeg g\u00e5t hernede med et stille\nh\u00e5b om fra Dem at modtage et forelag, sig -\nsende tils at lade \"K\u00e6rlighedens \u00abKomedie\u00bb\nopf\u00f8re fore det norske purblikum.\nEt s\u00e5dant forslag er imidlertid, imod\nforventning; ikke fremkommet, og jeg n\u00f8des der-\nfor tils self at grivbe initiativet, hvilket hervede\nsker, idet jeg\nbeder\nbet\nragte stigkket some ved denne\nskrivelse officielde indleveret til theatret. No-\nget exemplar af bogen vedlagger jeg ikke da\ndenne (i 2den udgave) med Lethed kan er -\nholdet deroppe.\nDe bet\u00e6nkeligheder, jeg i sin tid n\u00e6-\nrede mod stykkets opf\u00f8relse, er for l\u00e6nge si -\ndem forsvundne. Af mange begn er jeg kom-\nmen til den overbevisning at almenlreden\naru har f\u00e5tt sine \u00f8gne opladte for den sand -\nMed at dette arbejde i sin indersten id\u00e9 hviler\np\u00e5 et ubedinget meralsk grundlag, og brad\nstykkets hele kunstneriske struktuve ang\u00e5r,",
"language_model": [
{
"confidence": 0.68,
"polygon": [
[
264,
118
],
[
410,
118
],
[
410,
185
],
[
264,
185
]
],
"text": "Oslo",
"text_confidence": 0.8
}
],
"attention_gif": "dan_humu_page/predict/example_line.gif"
}
```
<img src="../../assets/example_line_polygon.gif" >
Loading