Merge the Transkribus and file imports (!360) · Merge requests · Arkindex / Tasks

I had to change the export archive detection logic to look for a mets.xml file anywhere in the archive, not only at the root: a fresh export from an arbitrary Transkribus collection contained one mets.xml file per Transkribus document.
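
As an illustration of that detection rule, here is a minimal sketch and not the code from this merge request. The function name `contains_mets` is hypothetical; it only assumes that a Transkribus export is recognised by a member named mets.xml at any depth in the ZIP archive.

```python
import zipfile
from pathlib import PurePosixPath


def contains_mets(archive) -> bool:
    """Return True if any file in the ZIP archive is named mets.xml.

    `archive` may be a path or a seekable binary file-like object.
    Hypothetical helper, written only to illustrate the rule above.
    """
    with zipfile.ZipFile(archive) as zf:
        return any(
            # Member names in a ZIP archive always use forward slashes.
            PurePosixPath(name).name == "mets.xml"
            for name in zf.namelist()
            # Skip directory entries, which end with a slash.
            if not name.endswith("/")
        )
```

With this check, both an archive with mets.xml at its root and one with a mets.xml inside each document folder are detected.
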
While looking up the MIME type of a ZIP archive, I found conflicting statements about whether another MIME type is used on Windows. I checked with a Windows 10 VM: Firefox, Chromium and Edge all upload their archives as application/x-zip-compressed on Windows, and as application/zip on Linux. I handled both MIME types.
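
A minimal sketch of accepting both MIME types could look like the following. The names `ZIP_MIME_TYPES` and `is_zip_content_type` are hypothetical and do not come from this merge request; stripping parameters and ignoring case are extra assumptions on my part.

```python
# The two MIME types observed in the browser tests described above.
ZIP_MIME_TYPES = frozenset({
    "application/zip",               # Firefox, Chromium on Linux
    "application/x-zip-compressed",  # Firefox, Chromium, Edge on Windows 10
})


def is_zip_content_type(content_type: str) -> bool:
    """Return True if the uploaded file's content type denotes a ZIP archive.

    Parameters after a semicolon are dropped and case is ignored
    (both are assumptions, not behaviour confirmed by the source).
    """
    return content_type.split(";", 1)[0].strip().lower() in ZIP_MIME_TYPES
```
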
It took me 5 hours to get my first successful import of ~90 pages. I noticed that extracting a single image could take up to 30 minutes, even on master. It turns out that the file-like object returned by ZipFile.open() copes very poorly with Pillow's random accesses, so I now extract the file first before editing the image. This brought my test imports down from 5 hours to 10 to 20 minutes.
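
The workaround can be sketched as follows, using only the standard library. The context manager `extracted_member` is a hypothetical name, and extracting to a temporary directory is an assumption about where the file goes; the actual change in this merge request may differ.

```python
import tempfile
import zipfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def extracted_member(archive, member: str):
    """Extract one archive member to a temporary directory and yield its path.

    The file on disk supports cheap random access, unlike the stream
    returned by ZipFile.open(). The temporary directory is deleted when
    the context exits. Hypothetical helper, for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(archive) as zf:
            # ZipFile.extract() returns the path of the file it created.
            yield Path(zf.extract(member, path=tmp_dir))
```

Inside the `with` block, the yielded path can be passed to Pillow's `Image.open()` instead of the object returned by `ZipFile.open()`, so that Pillow seeks within a regular file rather than a compressed stream.
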
My test imports quickly filled up the disk: #153 (closed) still occurs, so each test import copies the export as an artifact.

Edited Oct 12, 2023 by Erwan Rouchet
