Separate command to download images
We need a new command to download images, and the current teklia-dan dataset extract command must be updated accordingly.
Specification
teklia-dan dataset extract
- Same as before (without image downloading)
- Generate a split.json file, which will look like:
{
  "train": {
    "<path_to_image>": {
      "text": "...",
      "url": "<download_url>"
    }
  },
  ...
}
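As an illustration of the layout above, here is a minimal sketch of how the extract step could assemble and write this file. It assumes the extracted data is available as (split name, image path, text, URL) tuples; `build_split`, `save_json` and `rows` are hypothetical names, not existing teklia-dan functions.

```python
import json
from pathlib import Path


def build_split(rows):
    """Group (split, image_path, text, url) tuples into the split.json layout."""
    data = {}
    for split, image_path, text, url in rows:
        # One entry per image, keyed by its future local path
        data.setdefault(split, {})[image_path] = {"text": text, "url": url}
    return data


def save_json(data, path):
    """Write data to path as indented UTF-8 JSON."""
    Path(path).write_text(
        json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Keying each entry by image path keeps the file directly usable by the download command, which only has to look up the URL for each path.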
teklia-dan dataset download
- split.json lists the images that need to be downloaded.
- labels.json lists the images already downloaded.
- Takes a split.json
- Takes a path to labels.json, optional. If present, load its data
- Iterate over the splits, and over the image paths in each split
  - Try downloading through the URL, with multithreading
  - If successful:
    - remove the entry from the split.json data
    - add the entry to the labels.json data
- Save the labels.json data
{
  "train": {
    "<path_to_image>": "<transcription>"
  }
}
- Save the split.json data
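The download steps above can be sketched as follows. This is only an illustration under stated assumptions, not the planned implementation: `fetch`, `download_split` and `run` are hypothetical names, the downloader uses plain `urllib.request` with an arbitrary 30-second timeout, and the default `labels.json` output path is an assumption since the specification does not say where the file is written when no path is given.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def fetch(url, dest):
    """Download url to dest, creating parent directories as needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=30) as response:
        dest.write_bytes(response.read())


def download_split(split, labels, fetch=fetch, max_workers=8):
    """Download every image listed in split and move successes into labels.

    Both dicts are modified in place. Failed downloads stay in split so
    that a later run can retry them.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads first, remembering which entry each one belongs to
        futures = {
            pool.submit(fetch, entry["url"], image_path): (name, image_path)
            for name, images in split.items()
            for image_path, entry in images.items()
        }
        for future in as_completed(futures):
            name, image_path = futures[future]
            if future.exception() is not None:
                continue  # keep the entry in split for a retry
            entry = split[name].pop(image_path)
            labels.setdefault(name, {})[image_path] = entry["text"]
    return split, labels


def run(split_path, labels_path="labels.json"):
    """Load both files, download, then save labels.json and split.json."""
    split_file, labels_file = Path(split_path), Path(labels_path)
    split = json.loads(split_file.read_text(encoding="utf-8"))
    # labels.json is optional: start empty when it does not exist yet
    labels = (
        json.loads(labels_file.read_text(encoding="utf-8"))
        if labels_file.exists()
        else {}
    )
    download_split(split, labels)
    for data, path in ((labels, labels_file), (split, split_file)):
        path.write_text(
            json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
        )
```

Only the main thread mutates the two dictionaries; worker threads just download. Because failed entries remain in split.json, re-running the command resumes where the previous run stopped.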