Extract text from PDFs during import

Bastien Abadie requested to merge pdf-text-extraction into master

This is easier than previously thought, thanks to pdfminer.six:

  • load the PDF with the library
  • the library returns a clean layout tree that includes the text lines
  • only use the text lines (rectangles and other layout features are too verbose)
  • publish the non-empty lines as transcriptions, using a bulk endpoint
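
The steps above could be sketched roughly as follows. This is not the code from this merge request: `iter_text_lines`, `build_bulk_payloads`, the payload field names and the batch size are all hypothetical, and the bulk endpoint itself is left out since its URL and schema are not described here. Only `extract_pages`, `LTTextContainer` and `LTTextLine` come from pdfminer.six.

```python
from typing import Iterable, Iterator, List, Tuple

# One text line: (page number, (x0, y0, x1, y1) bounding box, text)
Line = Tuple[int, Tuple[float, float, float, float], str]


def iter_text_lines(pdf_path: str) -> Iterator[Line]:
    """Yield every non-empty text line found in the PDF layout tree."""
    # Imported lazily so the payload helper below works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            # Rects, curves, images and figures are skipped: only text boxes are kept.
            if not isinstance(element, LTTextContainer):
                continue
            for child in element:
                if isinstance(child, LTTextLine):
                    text = child.get_text().strip()
                    if text:  # ignore lines without content
                        yield page_number, child.bbox, text


def build_bulk_payloads(lines: Iterable[Line], batch_size: int = 100) -> List[dict]:
    """Group lines into request bodies for a (hypothetical) bulk endpoint."""
    lines = list(lines)
    payloads = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        payloads.append({
            # Field names are assumptions, not the real API schema.
            "transcriptions": [
                {"page": page, "bbox": list(bbox), "text": text}
                for page, bbox, text in batch
            ]
        })
    return payloads
```

Usage would look like `build_bulk_payloads(iter_text_lines("document.pdf"))`, with each returned dict sent as one request to the bulk endpoint. Batching keeps request bodies bounded for large documents.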
Edited by Bastien Abadie
