ALTO import command
Closes #51
This is unoptimized, but successfully imports every `.alto.xml` file found in POPP-datasets. I had to stray a little from the process described in the issue:
- There was nothing specified for the path to the XML files, so I added an argument that can only accept a directory, which behaves the same as the PAGE XML import.
- Instead of `<Page>` elements, which do not have a position (no `HPOS` and no `VPOS`) and cannot have `<Polygon>` elements either, I used `<PrintSpace>`, which is right below the `<Page>` and has a position. It is supposed to exclude the page's margins, which are specified using `Left/Top/Right/BottomMargin` elements, but there are no margins in most ALTO files I've encountered.
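To illustrate the `<PrintSpace>` choice, here is a minimal sketch of reading its position with the standard library. It is not the code from this merge request: the sample document, `local_name` and `print_space_box` are made up for the example, and tags are matched by local name only so that the namespace version does not matter.

```python
import xml.etree.ElementTree as ET

# Made-up sample; real files contain many more elements.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="10" VPOS="20" WIDTH="1900" HEIGHT="2900"/>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def print_space_box(root):
    """Return (x, y, width, height) of the first PrintSpace, or None."""
    for element in root.iter():
        if local_name(element.tag) != "PrintSpace":
            continue
        values = [element.get(name) for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")]
        if None in values:
            # Incomplete position: treat it as missing rather than guess.
            return None
        # ALTO coordinates may be floats; round them to whole pixels.
        return tuple(round(float(value)) for value in values)
    return None


box = print_space_box(ET.fromstring(SAMPLE_ALTO))
print(box)
```

The same loop would find nothing usable on `<Page>`, since it carries only a size and no `HPOS`/`VPOS`.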
Note that all of those files simultaneously use ALTO 1.4 and 3.0 namespaces. They validate against ALTO 3.0 and 3.1 and, with some editing of the namespace URLs, against every ALTO version.
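For anyone who wants to check which namespaces a given file declares, a rough sketch using `xml.etree.ElementTree.iterparse` with the `start-ns` event follows. The helper name `declared_namespaces` and the sample bytes are assumptions for illustration; in particular, the second namespace URI is a placeholder and not the one found in the POPP files.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample with two namespace declarations; the "legacy" URI is a placeholder.
SAMPLE = (
    b'<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" '
    b'xmlns:legacy="http://example.com/hypothetical-legacy-alto">'
    b"<Layout/></alto>"
)


def declared_namespaces(xml_bytes):
    """Return the set of namespace URIs declared anywhere in the document."""
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    # Each "start-ns" event carries a (prefix, uri) pair.
    return {uri for _event, (_prefix, uri) in events}


for uri in sorted(declared_namespaces(SAMPLE)):
    print(uri)
```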
This cannot import the other document mentioned in the request, as it uses `String` differently (as actual word elements), uses an unsupported type `CompoundBlock`, uses an absolute path as the `<fileName>`, and seems to describe a dozen pages at once on the same image.
This uses the XML abstraction layer of `transkribus-client`, which could probably be moved to teklia-toolbox instead.