Support extracting archives in file import
Requires backend!2184 (merged), frontend!1599 (merged), and a base-libmagic tag.

Closes #159
The file import has been refactored to split the main workflow into multiple functions, both to make it more readable and to allow treating the files within archives as if they were actual DataFiles. As a result, archives can be nested arbitrarily (an archive containing an archive containing an archive), and the import will extract everything.
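The recursive handling described above could be sketched roughly as follows. This is an illustrative outline only, not the MR's actual code: the names `extract_recursively` and `MAX_DEPTH` are hypothetical, and the real implementation dispatches extracted files back through the DataFile import workflow rather than just walking the filesystem.

```python
import tarfile
import zipfile
from pathlib import Path

MAX_DEPTH = 3  # hypothetical limit on nested archives


def extract_recursively(path: Path, dest: Path, depth: int = 0) -> None:
    """Extract an archive, then recurse into any archives it contains."""
    if depth > MAX_DEPTH:
        raise ValueError(f"Maximum extraction depth exceeded: {path}")
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            archive.extractall(dest)
    elif tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            # 'data' applies tarfile.data_filter to each member
            archive.extractall(dest, filter="data")
    else:
        # A regular file: in the real workflow this would become a DataFile
        return
    # Recurse into everything we just extracted, in case it is also an archive
    for child in dest.rglob("*"):
        if child.is_file():
            extract_recursively(
                child, child.parent / f"{child.name}_contents", depth + 1
            )
```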
This adds a dependency on python-magic to detect the MIME types of the files contained within archives; without it, the file import would not know what to do with each file.
This uses tarfile.data_filter to defend against tarbombs (members that attempt to escape the extraction directory), and enforces a maximum depth for recursive extractions to protect against ZIP quines.
This does not include any defenses against ZIP bombs. Resource limits would be more easily configured at the Docker level; doing it here would likely require adding backend settings and passing them to tasks, since a legitimate import (a very large Transkribus export, or a ZIP of huge PDFs) can look exactly like a bomb. Testing shows the maximum damage is some high CPU usage, then the hard disk filling up, then the task crashing due to disk space exhaustion, after which Ponos should destroy the container and free up the disk again.