Dataset download
Description
Use the teklia-dan dataset download command to download the images of a dataset from a split extracted by DAN. This will:
- Generate the image of each element (in the images/ folder),
- Create the mapping of the images that have been correctly downloaded (each identified by its path) to the ground-truth transcription (with NER tokens if needed) (in the labels.json file).
If an image download fails for any reason, the image won't appear in the labels.json file. The reason will be printed to stdout at the end of the process. Before trying to download an image, the command checks that it wasn't downloaded previously. It is thus safe to run this command again if a few images failed.
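The retry behaviour described above can be pictured with a minimal Python sketch. It is not DAN's actual implementation: `download_missing`, the `fetch` callable, and the file layout are hypothetical names chosen for illustration. The sketch skips images already on disk, records failures instead of aborting, and reports them once every download has been attempted.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def download_missing(
    targets: Dict[str, str],
    output_dir: Path,
    fetch: Callable[[str], bytes],
) -> Tuple[List[Path], List[Tuple[str, str]]]:
    """Download every image that is not on disk yet.

    `targets` maps a relative image path to its URL. `fetch` is a
    hypothetical callable that returns the image bytes or raises.
    Returns the paths available on disk and the (url, reason) failures.
    """
    available: List[Path] = []
    failed: List[Tuple[str, str]] = []
    for relative_path, url in targets.items():
        destination = output_dir / relative_path
        if destination.exists():
            # Already downloaded by a previous run: nothing to do.
            available.append(destination)
            continue
        try:
            content = fetch(url)
        except Exception as error:  # any failure is recorded, not fatal
            failed.append((url, str(error)))
            continue
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_bytes(content)
        available.append(destination)
    # Report the failures once all downloads have been attempted.
    for url, reason in failed:
        print(f"Failed to download {url}: {reason}")
    return available, failed
```

Calling such a function a second time with the same arguments only retries the entries that failed, which is why rerunning the command is harmless.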
Parameter | Description | Type | Default |
---|---|---|---|
--output | Path where the split.json file is stored and where the data will be generated. | pathlib.Path | |
--max-width | Images larger than this width will be resized to this width. | int | |
--max-height | Images larger than this height will be resized to this height. | int | |
--image-format | Images will be saved under this format. | str | .jpg |
The --output directory should contain a split.json file: a JSON mapping of the elements (each identified by its ID) to the image information and the ground-truth transcription (with NER tokens if needed). This file can be generated by the teklia-dan dataset extract command. More details are available on the dedicated page.
{
  "train": {
    "<element_id>": {
      "image": {
        "iiif_url": "https://<iiif_server>/iiif/2/<path>",
        "polygon": [
          [37, 191],
          [37, 339],
          [767, 339],
          [767, 191],
          [37, 191]
        ]
      },
      "text": "ⓢCou⁇e⁇ ⓕBouis ⓑ⁇.12.14"
    }
  },
  "val": {},
  "test": {}
}
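To make this structure concrete, here is a small Python sketch that reads a file shaped like the example above and derives the bounding box of an element's polygon. The helper names (`bounding_box`, `summarize_split`) are hypothetical and not part of teklia-dan; the sketch only assumes the keys shown in the example.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


def bounding_box(polygon: List[List[int]]) -> Tuple[int, int, int, int]:
    """Return (x, y, width, height) of the box enclosing the polygon."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


def summarize_split(split_path: Path) -> Dict[str, int]:
    """Count the elements listed under each split (train, val, test)."""
    splits = json.loads(split_path.read_text(encoding="utf-8"))
    return {name: len(elements) for name, elements in splits.items()}


# Polygon taken from the example above.
polygon = [[37, 191], [37, 339], [767, 339], [767, 191], [37, 191]]
print(bounding_box(polygon))  # (37, 191, 730, 148)
```

Running `summarize_split(Path("data/split.json"))` on an extracted split is a quick way to check how many elements each split holds before starting the download.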
Examples
Download full images
To download images from an extracted split, please use the following:
teklia-dan dataset download \
--output data
Download resized images
To download cropped images from an extracted split and limit the width and/or the height of images, please use the following:
teklia-dan dataset download \
--output data \
--max-width 1800
or
teklia-dan dataset download \
--output data \
--max-height 2000
or
teklia-dan dataset download \
--output data \
--max-width 1800 \
--max-height 2000
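
As a rough illustration of how the two limits might combine, the following Python sketch computes the size an image would be resized to. It assumes that the aspect ratio is preserved and that the more restrictive limit wins; neither point is stated on this page, and `fit_within` is a hypothetical helper, not a teklia-dan function.

```python
from typing import Optional, Tuple


def fit_within(
    width: int,
    height: int,
    max_width: Optional[int] = None,
    max_height: Optional[int] = None,
) -> Tuple[int, int]:
    """Scale (width, height) down so that it respects both limits.

    Assumption: the aspect ratio is kept and images are never enlarged.
    """
    ratio = 1.0
    if max_width is not None and width > max_width:
        ratio = min(ratio, max_width / width)
    if max_height is not None and height > max_height:
        ratio = min(ratio, max_height / height)
    # Round to whole pixels and keep each side at least one pixel long.
    return max(1, round(width * ratio)), max(1, round(height * ratio))


print(fit_within(3600, 2400, max_width=1800))  # (1800, 1200)
print(fit_within(3600, 5000, max_width=1800, max_height=2000))  # (1440, 2000)
print(fit_within(1000, 500, max_width=1800, max_height=2000))  # (1000, 500)
```

Under these assumptions, an image that already fits within both limits is left unchanged.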