Support PDF in S3-compatible ingestion
Refs https://redmine.teklia.com/issues/2851
We need to support PDF files when ingesting from Ceph/S3-compatible buckets into Arkindex.
The files are currently listed, but only the first page is ingested (since that is what Cantaloupe serves for a PDF), which leads to confusing results for the end user.
There are two implementation options:
- through Cantaloupe
- through download & local parsing of the PDF
Cantaloupe
The Cantaloupe server supports PDFs, and its source code mentions a page index to browse the pages.
I was not able to find any reference to such a page index in the IIIF 3.0 spec, nor was I able to access the various pages of a sample PDF.
There does not seem to be any page information in the related info.json: the tiles listed there are not the pages we are looking for.
If you find out how to specify the page index, this option may be worth pursuing. Its big downside is that we would not get the PDF's potential transcriptions, although we could live without them if this turns out to be quick to implement.
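A small sketch for probing the page index, under assumptions that need checking against our deployed Cantaloupe version. As far as I recall from the Cantaloupe documentation, the 4.x releases take a `page` query argument and the 5.x releases take the page number as a `;N` suffix on the identifier (a "meta-identifier"). Neither is part of the IIIF spec. The helper name `candidate_page_urls` and the example host are hypothetical; it only builds the two URLs to try by hand.

```python
from urllib.parse import quote


def candidate_page_urls(base_url: str, identifier: str, page: int) -> dict:
    """Build the two IIIF image URLs to try for a given 1-based PDF page.

    Both URL forms are assumptions about Cantaloupe's behaviour, not IIIF features.
    """
    # IIIF identifiers must be URL-encoded, including any slash in the bucket path.
    encoded = quote(identifier, safe="")
    base = base_url.rstrip("/")
    # Full region, maximum size, no rotation, default quality (IIIF 3.0 syntax).
    suffix = "full/max/0/default.jpg"
    return {
        # Assumed Cantaloupe 4.x style: page passed as a query argument.
        "query_argument": f"{base}/{encoded}/{suffix}?page={page}",
        # Assumed Cantaloupe 5.x style: page appended to the identifier.
        "meta_identifier": f"{base}/{encoded};{page}/{suffix}",
    }


if __name__ == "__main__":
    urls = candidate_page_urls("https://iiif.example.com/iiif/3", "bucket/doc.pdf", 2)
    for style, url in urls.items():
        print(style, url)
```

If one of the two URLs returns a different image than the first page, that form is the one our server understands.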
Download & parse
The most feature-complete solution is then to:
- download the file from the bucket
- extract its images using Poppler, as we already do in tasks
- upload each image onto the bucket
- get the potential text transcriptions
- create images in Arkindex
- create page elements in Arkindex
- create transcriptions & their elements in Arkindex
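The steps above could be chained roughly as follows. This is only a sketch of the ordering, not the actual Arkindex code: every callable (`download`, `extract`, `upload`, `create_image`, `create_page`, `create_transcription`) is a hypothetical placeholder for the existing bucket, Poppler and Arkindex helpers, and the `page_image_key` naming scheme is an assumption. Transcriptions are simplified to one text per page, without their sub-elements.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path, PurePosixPath
from typing import Callable, List, Optional


@dataclass
class ExtractedPage:
    number: int            # 1-based page number in the PDF
    image_path: Path       # local path of the rendered page image
    text: Optional[str]    # embedded text of the page, if any


def page_image_key(pdf_key: str, number: int, suffix: str = ".jpg") -> str:
    """Hypothetical naming scheme: store page images in a folder named after the PDF."""
    folder = PurePosixPath(pdf_key).with_suffix("")
    return str(folder / f"page_{number:04d}{suffix}")


def ingest_pdf(
    pdf_key: str,
    download: Callable[[str, Path], None],
    extract: Callable[[Path, Path], List[ExtractedPage]],
    upload: Callable[[Path, str], None],
    create_image: Callable[[str], str],
    create_page: Callable[[str, int], str],
    create_transcription: Callable[[str, str], None],
) -> List[str]:
    """Run the ingestion steps in order and return the created page element IDs."""
    page_ids: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        local_pdf = workdir / "source.pdf"
        # 1. Download the PDF from the bucket.
        download(pdf_key, local_pdf)
        # 2. Extract page images (and their text) locally, e.g. with Poppler.
        for page in extract(local_pdf, workdir):
            # 3. Upload the page image back to the bucket (needs write access).
            key = page_image_key(pdf_key, page.number, page.image_path.suffix)
            upload(page.image_path, key)
            # 4. Create the image, then the page element that uses it.
            image_id = create_image(key)
            page_id = create_page(image_id, page.number)
            # 5. Create a transcription only when the page has some text.
            if page.text and page.text.strip():
                create_transcription(page_id, page.text)
            page_ids.append(page_id)
    return page_ids
```

Passing the steps in as callables keeps the ordering testable without a bucket or an Arkindex instance, and lets the existing extraction and element-creation code be plugged in as-is.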
We already have the PDF parsing/extraction, and the code for image and element creation.
This breaks with the existing workflow, as we only rely on remote files + IIIF right now.
The Ceph credentials would need to be read-write (they are read-only right now, but that is an infra detail).