Merge the Transkribus and file imports

Erwan Rouchet requested to merge import-files-transkribus into master

Closes #158 (closed)

  • I had to change the export archive detection logic to look for a mets.xml file anywhere in the archive, not necessarily at the root. A fresh export from a random Transkribus collection had one mets.xml file per Transkribus document.

  • While looking up the MIME type of a ZIP archive, I found conflicting statements about whether a different MIME type is used on Windows. I checked with a Windows 10 VM: Firefox, Chromium and Edge all upload archives as application/x-zip-compressed on Windows, and as application/zip on Linux. The import now accepts both MIME types.

  • It took me 5 hours to get my first successful import of ~90 pages. I noticed that extracting a single image could take up to 30 minutes, even on master. It turns out that the file-like object returned by ZipFile.open() copes very poorly with Pillow's random accesses, so I now extract the file first, before editing the image. This brought my test imports down from 5 hours to 10 to 20 minutes.

  • My test imports quickly filled up the disk: #153 (closed) still occurs, so each test import copies the export as an artifact.
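
For reviewers, here is a minimal sketch of the three changes described above, using only the standard library. It is not the code in this branch: the helper names (`is_zip_upload`, `find_mets_files`, `extract_member`) and the idea of passing a destination directory are assumptions made for illustration.

```python
import zipfile
from pathlib import Path, PurePosixPath
from typing import List

# The two MIME types observed for ZIP uploads (Linux and Windows browsers).
ZIP_MIME_TYPES = {"application/zip", "application/x-zip-compressed"}


def is_zip_upload(content_type: str) -> bool:
    """Accept either ZIP MIME type, ignoring case and any ';' parameters."""
    return content_type.split(";")[0].strip().lower() in ZIP_MIME_TYPES


def find_mets_files(archive: zipfile.ZipFile) -> List[str]:
    """Return every file named mets.xml, at any depth in the archive."""
    return [
        name
        for name in archive.namelist()
        if not name.endswith("/") and PurePosixPath(name).name == "mets.xml"
    ]


def extract_member(archive: zipfile.ZipFile, member: str, dest_dir: str) -> Path:
    """Write one member to disk and return its path.

    An image library can then seek freely in a regular file instead of
    seeking in the compressed stream returned by ZipFile.open().
    """
    return Path(archive.extract(member, path=dest_dir))
```

An archive would count as a Transkribus export when `find_mets_files()` returns at least one entry, and the path returned by `extract_member()` is what would be handed to Pillow instead of the `ZipFile.open()` stream.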

Edited by Erwan Rouchet
