Merge the Transkribus and file imports
Closes #158 (closed)
- I had to change the export archive detection logic to look for a `mets.xml` file anywhere in the archive, not necessarily at the root. A fresh export from a random Transkribus collection had one `mets.xml` file per Transkribus document.
- While looking for the MIME type of a ZIP archive, I got some confusing statements about the possibility of another MIME type being used on Windows. I checked with a Windows 10 VM: Firefox, Chromium and Edge all upload their archives as `application/x-zip-compressed` on Windows, and as `application/zip` on Linux. I handled both MIME types.
- It took me 5 hours to get my first successful import of ~90 pages. I noticed that extracting an image could take up to 30 minutes, even on `master`. Turns out the file-like object of `ZipFile.open()` was really not happy with Pillow's random accesses, so I extracted the file first before editing the image, making my test imports take 10 to 20 minutes instead of 5 hours.
- My test imports were quickly filling up the disk, as #153 (closed) still occurs, causing each of my test imports to copy the export as an artifact.
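
For reviewers, here is a minimal sketch of the "`mets.xml` anywhere in the archive" detection, using only the standard library. The function names `find_mets_files` and `is_transkribus_export` are hypothetical and are not the names used in this merge request; the actual detection code may differ.

```python
import zipfile
from pathlib import PurePosixPath


def find_mets_files(archive):
    """Return the names of all mets.xml entries, at any depth of the archive.

    `archive` is a path or a seekable binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return [
            name
            for name in zf.namelist()
            # Skip directory entries; ZIP entry names use forward slashes.
            if not name.endswith("/") and PurePosixPath(name).name == "mets.xml"
        ]


def is_transkribus_export(archive):
    """Hypothetical check: an export is any archive with at least one mets.xml."""
    return len(find_mets_files(archive)) > 0
```

With one `mets.xml` per Transkribus document, `find_mets_files` returns one entry per document, which could also serve as the list of documents to import.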
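
The MIME type handling mentioned above can be sketched as a small allow-list check. The names `ZIP_MIME_TYPES` and `is_zip_content_type` are assumptions for illustration, not the identifiers used in this merge request.

```python
# Both values were observed in the browser test described above:
# application/x-zip-compressed on Windows, application/zip on Linux.
ZIP_MIME_TYPES = frozenset({"application/zip", "application/x-zip-compressed"})


def is_zip_content_type(content_type):
    """Return True if an uploaded file's Content-Type denotes a ZIP archive."""
    if not content_type:
        return False
    # Drop any parameters (e.g. "; charset=...") and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ZIP_MIME_TYPES
```

Since the content type is supplied by the browser, a check like this is only a first filter; opening the file as a ZIP archive remains the real validation.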
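
The extraction workaround can be sketched as follows: copy the archive member to a regular file on disk with one sequential read, then hand that file to the image library. The helper name `extract_member` is hypothetical, and this is a sketch of the idea rather than the code in this merge request. A plausible explanation for the slowdown is that seeking backwards in a compressed member forces `zipfile` to decompress it again from the start, but I did not verify that here.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def extract_member(archive, member, dest_dir):
    """Copy one archive member to dest_dir and return the path of the copy.

    The member is read sequentially once, so later random accesses hit a
    regular file instead of the object returned by ZipFile.open().
    """
    # Keep only the base name so that entry names cannot escape dest_dir.
    dest = Path(dest_dir) / PurePosixPath(member).name
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return dest
```

The returned path can then be passed to Pillow's `Image.open()` in place of the file-like object from `ZipFile.open()`.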