
Separate command to download images

We need a new command to download images, and the current teklia-dan dataset extract command needs to be updated accordingly.

Specification

teklia-dan dataset extract

  • Same as before, but without downloading the images
  • Generate a split.json file that looks like:
{
    "train": {
        "<path_to_image>": {
            "text": "...",
            "url": "<download_url>"
        }
    },
    ...
}
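
As a rough illustration of the structure above, here is a minimal Python sketch that builds the split.json data. The function names (`build_split`, `write_split`) and the shape of the input rows are assumptions made for this example, not the existing teklia-dan code.

```python
import json
from pathlib import Path


def build_split(rows):
    """Build the split.json structure.

    `rows` is a hypothetical input: an iterable of
    (split_name, image_path, text, url) tuples.
    """
    split = {}
    for split_name, image_path, text, url in rows:
        # One entry per image path, holding its transcription and download URL.
        split.setdefault(split_name, {})[image_path] = {"text": text, "url": url}
    return split


def write_split(split, path):
    """Write the structure as JSON (hypothetical helper)."""
    Path(path).write_text(
        json.dumps(split, indent=4, ensure_ascii=False), encoding="utf-8"
    )
```

Unlike the schematic above, the real file has to be strict JSON, so the "..." placeholder stands for further splits and is not written literally.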

teklia-dan dataset download

  • split.json lists the images that still need to be downloaded.
  • labels.json lists the images that have already been downloaded.
  • Takes the path to a split.json file.
  • Optionally takes a path to labels.json. If present, load its data.
  • Iterate over each split, then over the image paths in it.
  • Try to download each image from its URL, using multithreading.
    • If the download succeeds:
      • remove the entry from the split.json data
      • add it to the labels.json data
  • Save the labels.json data, which looks like:
{
    "train": {
        "<path_to_image>": "<transcription>"
    }
}
  • Save split.json data
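
The download steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch`, `download_split`, the `fetch_fn` and `max_workers` parameters, and the 30-second timeout are hypothetical names and values chosen for this example, and the standard-library `ThreadPoolExecutor` and `urlopen` stand in for whatever the real command ends up using.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.request import urlopen


def fetch(url, destination):
    """Download one image to `destination`; return True on success."""
    try:
        with urlopen(url, timeout=30) as response:
            content = response.read()
        destination = Path(destination)
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_bytes(content)
        return True
    except Exception:
        # Any failure (network, HTTP, filesystem) leaves the image in split.json.
        return False


def download_split(split, labels=None, fetch_fn=fetch, max_workers=8):
    """Move successfully downloaded images from `split` to `labels`.

    `split` follows the split.json layout, `labels` the labels.json layout.
    Both dicts are updated in place and returned.
    """
    labels = labels if labels is not None else {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every pending image; remember which entry each future belongs to.
        futures = {
            pool.submit(fetch_fn, entry["url"], image_path): (split_name, image_path)
            for split_name, images in split.items()
            for image_path, entry in images.items()
        }
        # Results are handled in the main thread, so the dicts need no locking.
        for future in as_completed(futures):
            split_name, image_path = futures[future]
            if future.result():
                entry = split[split_name].pop(image_path)
                labels.setdefault(split_name, {})[image_path] = entry["text"]
    return split, labels
```

After the call, both dictionaries would be written back to their JSON files. Since failed downloads stay in split.json, running the command again retries only those images.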