
PDF support in S3 imports

Erwan Rouchet requested to merge s3-pdf into master

Closes #170 (closed)

On the question of whether to use Cantaloupe or download the PDF, the answer is "yes": this does both. Cantaloupe is used to retrieve the page images, and the PDF is still downloaded so that pdfminer can extract the transcriptions from it. This means we don't need write permissions on the bucket to import, and we don't fill up the buckets with lots of JPEGs.

Just to be able to count pages using pdfminer, I had to bump it to its latest version, so this closes !290. Cantaloupe does not provide any way to count pages, so I didn't have much of a choice.
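As a rough illustration of how the two pieces fit together: once pdfminer has given a page count, one image URL per page can be built for Cantaloupe. This is only a sketch, not the code in this MR. The `page_image_urls` helper, the base URL and the identifier are hypothetical, and it assumes a Cantaloupe version that selects a PDF page through a `page` query argument on an IIIF Image API 2 URL (newer versions may use a different scheme).

```python
from urllib.parse import quote


def page_image_urls(base_url, identifier, page_count):
    """Build one IIIF image URL per PDF page (hypothetical helper).

    Assumes the image server picks a PDF page via a ``page`` query
    argument, numbered from 1.
    """
    # The identifier is a single path segment, so slashes must be escaped.
    encoded = quote(identifier, safe="")
    prefix = base_url.rstrip("/")
    return [
        f"{prefix}/{encoded}/full/full/0/default.jpg?page={page}"
        for page in range(1, page_count + 1)
    ]
```

Nothing is written to the bucket here: the images are only ever fetched from the image server, which is what makes a read-only import possible.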

This pdfminer bump requires a base-0.4.5 tag, assuming the next version of tasks will be 0.4.5.

The base image bump also required me to switch the base image's base image to python:3.8-alpine3.17, as the original python:3.8-alpine uses Alpine 3.13. cryptography documents that Alpine ≤3.14 has an outdated version of Rust that prevents it from compiling, so I could not rebuild the image anymore, despite not touching the version of cryptography at all. Who needs reproducible builds anyway?

This tries to reuse as much code from the file import as possible, which is why the file import is also updated: some of its functions were moved around so that both imports can share them.

I had some bugs with the existing S3 import tests that were a little harder to debug because pytest-style assert was used instead of the unittest-style self.assertEqual, so the failures did not show the values being compared.
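To illustrate the difference outside of pytest's assertion rewriting, here is a small standalone sketch. The two helper functions are made up for the example and are not part of the test suite. A bare assert fails with no detail, while assertEqual reports both values.

```python
import unittest


def bare_assert_message(first, second):
    """Return the failure message of a plain assert, or None on success."""
    try:
        assert first == second
    except AssertionError as e:
        return str(e)
    return None


def assert_equal_message(first, second):
    """Return the failure message of unittest's assertEqual, or None on success."""
    try:
        unittest.TestCase().assertEqual(first, second)
    except AssertionError as e:
        return str(e)
    return None
```

With a test runner that does not rewrite assertions, only the second form tells you what the mismatching values were.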

Edited by Erwan Rouchet
