PDF support in S3 imports
Closes #170
On the question of whether to use Cantaloupe or download the PDF, this replies with yes. Cantaloupe is used to retrieve images, and the PDF is still downloaded so that pdfminer can be run on it to get the transcriptions. This means we don't need write permissions to import, and we don't fill up the buckets with lots of JPEGs.
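As a rough illustration of the "Cantaloupe retrieves the images" half, here is a minimal sketch of building an IIIF Image API URL for one page of a PDF. The function name, base URL and identifier are hypothetical and not taken from this MR. The `page` query argument is an assumption about how the Cantaloupe deployment selects a PDF page, and `full` as the size keyword assumes IIIF Image API 2.x; both should be checked against the Cantaloupe version actually in use.

```python
from urllib.parse import quote, urlencode


def pdf_page_image_url(base_url, identifier, page, size="full", fmt="jpg"):
    """Build an IIIF Image API URL for one page of a PDF (hypothetical helper).

    Assumptions: the image server selects the PDF page through a `page`
    query argument, and the size keyword follows IIIF Image API 2.x.
    """
    # The identifier is a single path segment, so slashes must be escaped.
    encoded = quote(identifier, safe="")
    path = f"{base_url.rstrip('/')}/{encoded}/full/{size}/0/default.{fmt}"
    return f"{path}?{urlencode({'page': page})}"


# Hypothetical server and S3 key, for illustration only.
url = pdf_page_image_url("https://iiif.example.com/iiif/2", "bucket/docs/report.pdf", 3)
print(url)
```

Nothing is written back to the bucket in this scheme: each page image is only referenced by URL, which is why the import needs no write permissions.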
Just to be able to count pages using pdfminer, I had to bump it to its latest version, so this closes !290. Cantaloupe does not provide any way to count pages, so I didn't have much of a choice.
The rebuilt base image uses the base-0.4.5 tag, assuming the next version of tasks will be 0.4.5.
The base image bump also required me to switch the base image's own base image to python:3.8-alpine3.17, as the original python:3.8-alpine uses Alpine 3.13. The cryptography package documents that Alpine ≤3.14 has an outdated version of Rust that prevents it from compiling, so I could no longer rebuild the image, despite not touching the version of cryptography at all. Who needs reproducible builds anyway?
This tries to reuse as much code from the file import as possible, hence some updates to it to move some functions around.
I had some bugs with the existing S3 import tests that were a little harder to debug because pytest-style assert was used instead of the unittest-style self.assertEqual.
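To show why this matters when the tests are not run through pytest's assertion rewriting, here is a small sketch comparing the failure messages of the two styles. The dictionaries and the helper name are made up for the example and do not come from the S3 import tests.

```python
import unittest


def failure_message(check):
    """Run a check and return the AssertionError message, or None if it passes."""
    try:
        check()
    except AssertionError as exc:
        return str(exc)
    return None


# Hypothetical values standing in for an import result.
expected = {"name": "doc.pdf", "pages": 3}
actual = {"name": "doc.pdf", "pages": 2}


def bare_check():
    # Plain assert: without pytest's rewriting, the error carries no details.
    assert actual == expected


def unittest_check():
    # unittest-style: the error message includes both values and a diff.
    unittest.TestCase().assertEqual(actual, expected)


bare_message = failure_message(bare_check)
unittest_message = failure_message(unittest_check)

print("bare assert:", repr(bare_message))
print("assertEqual:", unittest_message)
```

With a plain unittest runner, the bare assert only reports that something failed, while self.assertEqual reports which values differed.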