Use multithreading to download images at the end, remove cache as we now...

Yoann Schneider requested to merge support-offset-pages into main

We had an issue with image downloading/extraction using the code from teklia_line_image_extractor: it could not correctly extract right-hand single pages because they are offset. That code is also more complicated than we need, so I decided to do something different and easier to understand.

We're back to downloading every element's image independently, without caching the parent's image, as we no longer do any cropping in Python. Instead, we rely on IIIF to do the cropping and maximum resizing (if needed). This means more download calls, but smaller (and therefore faster) ones.
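To illustrate what "letting IIIF do the cropping" means, here is a minimal sketch of building an Image API URL that asks the server for a cropped, size-capped region. The helper name and the example base URL are hypothetical; the `{region}/{size}/{rotation}/{quality}.{format}` path structure comes from the IIIF Image API.

```python
def iiif_url(base, x, y, w, h, max_size=None):
    """Build a IIIF Image API URL that crops server-side.

    `base` is the image's IIIF endpoint (hypothetical here); the region
    is an `x,y,w,h` pixel box, and `!w,h` asks the server to scale the
    result down so it fits inside a `max_size` square.
    """
    region = f"{x},{y},{w},{h}"
    size = f"!{max_size},{max_size}" if max_size else "max"
    return f"{base}/{region}/{size}/0/default.jpg"

print(iiif_url("https://iiif.example.com/image-id", 100, 200, 800, 600, max_size=2000))
# https://iiif.example.com/image-id/100,200,800,600/!2000,2000/0/default.jpg
```

The server returns only the cropped (and possibly downscaled) bytes, which is why each call is smaller and faster than fetching the full parent image.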

I decided to use a multi-threading pool for these operations so we use the machine's full capacity. I decided against multi-processing (using more CPUs) since we only do I/O operations. I could have had the pool start working as soon as the first element is processed, but decided against it because:

  1. waiting until all transcriptions are extracted is really not an issue,
  2. it would overcomplicate the code, in my opinion.

That's why the pool only starts working after all elements' transcriptions have been parsed. If there is an issue with an image (maximum retries exceeded, anything), the corresponding transcription won't be in labels.json. The full list of failed downloads, along with the errors encountered, is printed at the end of the process.
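The failure handling described above can be sketched with `concurrent.futures.ThreadPoolExecutor`: submit every download once all transcriptions are parsed, collect per-element errors instead of aborting, and report them at the very end. The `tasks` mapping and `download_one` callable are assumptions for illustration, not the actual code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(tasks, download_one):
    """Run all downloads in a thread pool, collecting failures.

    `tasks` maps an element ID to its image URL; `download_one` is a
    (hypothetical) function performing a single download. Threads, not
    processes, are enough here since the work is I/O-bound.
    """
    failed = {}
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(download_one, url): elem_id for elem_id, url in tasks.items()}
        for future in as_completed(futures):
            elem_id = futures[future]
            try:
                future.result()
            except Exception as exc:  # e.g. maximum retries exceeded
                failed[elem_id] = str(exc)
    # Report every failed download and its error at the end of the process
    for elem_id, error in failed.items():
        print(f"Failed to download {elem_id}: {error}")
    return failed
```

Elements listed in `failed` would then simply be skipped when writing labels.json.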

There is no cache mechanism anymore, but an image won't be downloaded twice if it's already present on disk. This means that if 10 images failed to download on the first try because of network issues (50x errors), one can simply restart the command and the images already downloaded won't be fetched again. It's up to the user to delete any old images if they want to start anew.
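The restart behaviour amounts to a simple existence check before each download. A minimal sketch (the helper name is hypothetical, and `urllib` stands in for whatever HTTP client the project actually uses):

```python
from pathlib import Path
import urllib.request

def download_if_missing(url, dest):
    """Download `url` to `dest` unless the file is already on disk.

    Returns True when a download actually happened, so a re-run after
    transient 50x failures only fetches the images that are missing.
    """
    dest = Path(dest)
    if dest.exists():
        return False  # already downloaded on a previous run
    data = urllib.request.urlopen(url).read()
    dest.write_bytes(data)
    return True
```

Deleting the output directory is then all it takes to force a full re-download.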
