# Dataset download
## Description
Use the `teklia-dan dataset download` command to download the images of a dataset from a split extracted by DAN. This will:
- Store the set of characters encountered in the dataset (in the `charset.pkl` file),
- Generate the resources needed to build an n-gram language model at character, subword or word level with [kenlm](https://github.com/kpu/kenlm) (in the `language_model/` folder),
- Generate the image of each element (in the `images/` folder),
- Create the mapping of the images that were correctly downloaded (identified by their paths) to the ground-truth transcriptions (with NER tokens if needed) (in the `labels.json` file).
If an image download fails for any reason, the image won't appear in the transcriptions file (`labels.json`), and the reason for the failure is printed to stdout at the end of the process. Before trying to download an image, the command checks whether it was already downloaded, so it is safe to run the command again if a few images failed.
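Once the command has finished, the generated files can be inspected directly. The snippet below is a minimal sketch: the file names (`charset.pkl`, `labels.json`) come from the list above, but the exact nesting of `labels.json` (per split, keyed by image path) is an assumption to verify against your own output.

```python
import json
import pickle
from pathlib import Path

output = Path("data")  # same value as --output

# Set of characters encountered in the dataset (file name from the list above)
charset = pickle.loads((output / "charset.pkl").read_bytes())
print(f"{len(charset)} characters in the charset")

# Mapping of downloaded images to ground-truth transcriptions.
# Assumed layout: {split: {image_path: transcription}} - check your own output.
labels = json.loads((output / "labels.json").read_text())
for split, items in labels.items():
    print(f"{split}: {len(items)} images downloaded")
```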
| Parameter | Description | Type | Default |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------- | -------------- | ------- |
| `--output` | Path where the `split.json` file is stored and where the data will be generated. | `pathlib.Path` | |
| `--max-width` | Images larger than this width will be resized to this width. | `int` | |
| `--max-height` | Images larger than this height will be resized to this height. | `int` | |
| `--image-format`       | Format in which the images will be saved.                                                                             | `str`          | `.jpg`  |
| `--unknown-token`      | Token used to replace characters in the validation/test sets that do not appear in the training set.                 | `str`          | `⁇`     |
| `--tokens` | Mapping between starting tokens and end tokens to extract text with their entities. | `pathlib.Path` | |
| `--subword-vocab-size` | Vocabulary size of the sentencepiece subword tokenizer used to build the optional language model.                    | `int`          | `1000`  |
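The `--subword-vocab-size` parameter controls a [sentencepiece](https://github.com/google/sentencepiece) subword tokenizer. The sketch below is not DAN's internal code; it only illustrates, under an assumed corpus file name, what training such a tokenizer with a given vocabulary size looks like.

```python
import sentencepiece as spm

# Hypothetical corpus file; the actual resources live in <output>/language_model/.
spm.SentencePieceTrainer.train(
    input="language_model/corpus_words.txt",
    model_prefix="subword_tokenizer",
    vocab_size=1000,  # plays the same role as --subword-vocab-size
)

tokenizer = spm.SentencePieceProcessor(model_file="subword_tokenizer.model")
print(tokenizer.encode("ⓢCoufet ⓕBouis ⓑ07.12.14", out_type=str))
```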
The `--output` directory must contain a JSON-formatted `split.json` file mapping each element (identified by its ID) to its image information and ground-truth transcription (with NER tokens if needed). This file can be generated by the `teklia-dan dataset extract` command. More details in the [dedicated page](./extract.md).
```json
{
  "train": {
    "<element_id>": {
      "dataset_id": "<dataset_id>",
      "image": {
        "iiif_url": "https://<iiif_server>/iiif/2/<path>",
        "polygon": [
          [37, 191],
          [37, 339],
          [767, 339],
          [767, 191],
          [37, 191]
        ]
      },
      "text": "ⓢCoufet ⓕBouis ⓑ07.12.14"
    }
  },
  "val": {},
  "test": {}
}
```
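If the split was produced some other way, a quick structural check can catch formatting mistakes before downloading. This is a minimal sketch that only assumes the fields shown in the example above:

```python
import json
from pathlib import Path

split = json.loads(Path("data/split.json").read_text())

for set_name in ("train", "val", "test"):
    for element_id, element in split.get(set_name, {}).items():
        # Every element needs an image (IIIF URL + polygon) and a transcription
        assert "iiif_url" in element["image"], f"{element_id}: missing iiif_url"
        assert len(element["image"]["polygon"]) >= 3, f"{element_id}: invalid polygon"
        assert element["text"], f"{element_id}: empty transcription"
print("split.json looks consistent")
```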
The `--tokens` argument expects a YAML-formatted file describing the NER entities: each entity label maps to its starting token and an optional ending token. This file can be generated by the `teklia-dan dataset tokens` command. More details in the [dedicated page](./tokens.md).
```yaml
INTITULE: # Type of the entity on Arkindex
  start: ⓘ # Starting token for this entity
  end: Ⓘ # Optional ending token for this entity
DATE:
  start: ⓓ
  end: Ⓓ
COTE_SERIE:
  start: ⓢ
  end: Ⓢ
ANALYSE_COMPL.:
  start: ⓒ
  end: Ⓒ
PRECISIONS_SUR_COTE:
  start: ⓟ
  end: Ⓟ
COTE_ARTICLE:
  start: ⓐ
  end: Ⓐ
CLASSEMENT:
  start: ⓛ
  end: Ⓛ
```
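For reference, the transcription `ⓢCoufet ⓕBouis ⓑ07.12.14` in the `split.json` example above only uses starting tokens. The sketch below, assuming the PyYAML package and a hypothetical `tokens.yml` file name, shows how such a file can be read and how an entity would be wrapped when an ending token is defined:

```python
import yaml

# Hypothetical file name; use the file generated by `teklia-dan dataset tokens`
with open("tokens.yml") as f:
    tokens = yaml.safe_load(f)

def tag(entity_type: str, text: str) -> str:
    """Wrap a piece of text with the tokens of the given entity type."""
    entry = tokens[entity_type]
    return f"{entry['start']}{text}{entry.get('end') or ''}"

print(tag("DATE", "07.12.14"))  # ⓓ07.12.14Ⓓ
```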
## Examples
### Download full images
To download images from an extracted split, please use the following:
```shell
teklia-dan dataset download \
--output data
```
### Download resized images
To download images from an extracted split while limiting their width and/or height, please use the following:
```shell
teklia-dan dataset download \
--output data \
--max-width 1800
```
or
```shell
teklia-dan dataset download \
--output data \
--max-height 2000
```
or
```shell
teklia-dan dataset download \
--output data \
--max-width 1800 \
--max-height 2000
```
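For context, constraining both dimensions means an image wider than 1800 px or taller than 2000 px is scaled down. Whether DAN preserves the aspect ratio exactly as below is an assumption; this Pillow sketch only illustrates the general idea behind `--max-width` and `--max-height`:

```python
from PIL import Image

def fit_within(path: str, max_width: int, max_height: int) -> Image.Image:
    """Shrink an image so that it fits within the given bounds, keeping aspect ratio."""
    image = Image.open(path)
    # thumbnail() only downsizes; smaller images are left untouched
    image.thumbnail((max_width, max_height))
    return image

resized = fit_within("images/example.jpg", 1800, 2000)
print(resized.size)
```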