Extract text from PDFs during import (!187) · Arkindex / Tasks

Bastien Abadie requested to merge pdf-text-extraction into master Jan 13, 2021

This is easier than previously thought, thanks to pdfminer.six:

- load the PDF with the library
- the library returns a nice layout tree that includes the text lines
- only use the text lines (rects and other layout features are too verbose)
- publish the lines that have content as transcriptions, using a bulk endpoint
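
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this merge request: `Line`, `iter_text_lines`, `collect_lines`, `extract_pdf_lines` and `batches` are hypothetical names, and the bulk endpoint itself is left out because the description does not name it. Only `extract_pages` and `LTTextLine` come from pdfminer.six.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Line:
    """One non-empty text line (hypothetical container, not an Arkindex type)."""
    page: int
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def iter_text_lines(node, line_type) -> Iterator:
    """Walk a layout tree depth-first, yielding only nodes of `line_type`."""
    if isinstance(node, line_type):
        yield node
        return
    # Leaf objects such as rects, curves and images are not iterable: skip them.
    try:
        children = iter(node)
    except TypeError:
        return
    for child in children:
        yield from iter_text_lines(child, line_type)


def collect_lines(pages, line_type) -> List[Line]:
    """Keep the lines that have content, with a 1-based page number."""
    lines = []
    for page_number, page in enumerate(pages, start=1):
        for node in iter_text_lines(page, line_type):
            text = node.get_text().strip()
            if text:
                lines.append(Line(page_number, text, tuple(node.bbox)))
    return lines


def extract_pdf_lines(path: str) -> List[Line]:
    """Load a PDF with pdfminer.six and return its non-empty text lines."""
    # Imported here so the helpers above stay usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextLine

    return collect_lines(extract_pages(path), LTTextLine)


def batches(items: list, size: int) -> Iterator[list]:
    """Split `items` into chunks, e.g. one chunk per bulk request (assumption)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each chunk produced by `batches(extract_pdf_lines("document.pdf"), 50)` could then be sent in a single bulk request; the file name and the chunk size of 50 are placeholders.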

Edited Jan 14, 2021 by Bastien Abadie