ALTO import command

Erwan Rouchet requested to merge upload-alto into master

Closes #51 (closed)

This is unoptimized, but it successfully imports every .alto.xml file found in POPP-datasets. I had to deviate slightly from the process described in the issue:

  • The issue did not specify how to pass the path to the XML files, so I added an argument that only accepts a directory and behaves the same way as in the PAGE XML import.
  • <Page> elements have no position (neither HPOS nor VPOS) and cannot contain <Polygon> elements, so I used <PrintSpace> instead, which sits directly below <Page> and does have a position. <PrintSpace> is not supposed to include the page's margins, which are specified using the Left/Top/Right/BottomMargin elements, but most ALTO files I have encountered have no margins.
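
To illustrate the <PrintSpace> choice, here is a minimal sketch of reading its position with the standard library. It is not the code from this merge request: the function names, the sample document and the use of a single ALTO 3 namespace are assumptions made for the example, and the corner ordering of the polygon is only one possible convention.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses only the ALTO 3 namespace.
NS = {"a": "http://www.loc.gov/standards/alto/ns-v3#"}

# Hypothetical sample document, reduced to the elements discussed above.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="120" VPOS="80" WIDTH="1800" HEIGHT="2900"/>
    </Page>
  </Layout>
</alto>"""


def print_space_boxes(xml_text):
    """Return one (x, y, width, height) tuple per <Page> that has a <PrintSpace>."""
    root = ET.fromstring(xml_text)
    boxes = []
    for page in root.findall(".//a:Page", NS):
        print_space = page.find("a:PrintSpace", NS)
        if print_space is None:
            continue
        values = [print_space.get(k) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")]
        if any(v is None for v in values):
            # Skip elements without a complete position rather than guessing one.
            continue
        # ALTO allows decimal values, so parse as float before rounding.
        boxes.append(tuple(round(float(v)) for v in values))
    return boxes


def box_to_polygon(box):
    """Turn a box into its four corners, clockwise from the top-left."""
    x, y, w, h = box
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
```

Calling print_space_boxes(SAMPLE) yields a single box for the page, which box_to_polygon can then turn into a rectangle for the imported element.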

Note that all of those files use the ALTO 1.4 and 3.0 namespaces simultaneously. They validate against ALTO 3.0 and 3.1 and, with some editing of the namespace URLs, they actually validate against every ALTO version.
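
One way to cope with files that mix namespaces is to match elements on their local name only. The sketch below shows that idea with the standard library. It is not how transkribus-client's XML layer works, and the helper names and the two namespace URIs in the sample are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing two namespaces (placeholder URIs).
MIXED = """<alto xmlns="urn:example:alto-a" xmlns:b="urn:example:alto-b">
  <Layout>
    <Page ID="p1">
      <b:PrintSpace HPOS="0" VPOS="0" WIDTH="100" HEIGHT="200"/>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def elements_named(root, name):
    """Find every element with the given local name, whatever its namespace."""
    return [
        el
        for el in root.iter()
        # Comments and processing instructions have non-string tags.
        if isinstance(el.tag, str) and local_name(el.tag) == name
    ]


root = ET.fromstring(MIXED)
pages = elements_named(root, "Page")
print_spaces = elements_named(root, "PrintSpace")
```

The trade-off is that this ignores the schema version entirely, so it would also accept elements from unrelated namespaces that happen to share a name.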

This cannot import the other document mentioned in the request, because it uses String differently (as actual word elements), uses an unsupported type CompoundBlock, uses an absolute path as the <fileName>, and seems to describe a dozen pages at once on the same image.

This uses the XML abstraction layer of transkribus-client, which could probably be moved to teklia-toolbox instead.

Edited by ml bonhomme