New command to import ALTO files
Refs https://gitlab.com/arkindex/requests/-/issues/646
Goal: Be able to import https://git.litislab.fr/tconstum/popp-datasets/-/tree/main/
We need a new command, `arkindex upload alto`, that allows a user to import one or more ALTO XML files.
This tool will assume that images are already available on an IIIF server, whose base URL will be provided as a CLI argument.
To simplify, we'll only support creating elements in an existing folder, without support for any hierarchy: each ALTO file produces a page that is stored in the parent provided to the command.
Workflow should be:
- build the full IIIF URL using the base URL and `fileName` from the header
- create the image on Arkindex using the IIIF endpoint
- iterate over Page elements and create `page-type` elements with the mentioned coordinates
- iterate over TextBlock elements and create `text-block-type` elements with the mentioned coordinates
- iterate over TextLine elements and create `line-type` elements with the mentioned coordinates
- iterate over `String` elements and create transcriptions
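The parsing half of this workflow could be sketched as follows. This is only an illustration, not the final implementation: `plan_import`, `iiif_url` and `polygon` are hypothetical names, the rectangle-as-closed-polygon format is an assumption, ALTO coordinates are assumed to be in pixels, and joining all `String` contents of a line into one transcription is one possible reading of the last step. No Arkindex API call is made here; the function only returns what would need to be created.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def local(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def find_all(node, name):
    """Return all descendants of `node` (and `node` itself) whose local tag is `name`."""
    return [el for el in node.iter() if local(el.tag) == name]


def iiif_url(base_url, file_name):
    # Assumption: the IIIF identifier is the percent-encoded fileName.
    return base_url.rstrip("/") + "/" + quote(file_name, safe="")


def polygon(el):
    # Assumption: coordinates are pixels; missing HPOS/VPOS (e.g. on Page) mean 0.
    x = round(float(el.get("HPOS", 0)))
    y = round(float(el.get("VPOS", 0)))
    w = round(float(el.get("WIDTH", 0)))
    h = round(float(el.get("HEIGHT", 0)))
    return [[x, y], [x + w, y], [x + w, y + h], [x, y + h], [x, y]]


def plan_import(xml_text, base_url, page_type="page",
                block_type="paragraph", line_type="text_line"):
    """List the image URL and the elements one ALTO file would produce."""
    root = ET.fromstring(xml_text)
    names = [el.text for el in find_all(root, "fileName") if el.text]
    if not names:
        raise ValueError("No fileName found in the ALTO header")
    plan = {"image_url": iiif_url(base_url, names[0].strip()), "elements": []}
    for page in find_all(root, "Page"):
        plan["elements"].append({"type": page_type, "polygon": polygon(page)})
        for block in find_all(page, "TextBlock"):
            plan["elements"].append({"type": block_type, "polygon": polygon(block)})
            for line in find_all(block, "TextLine"):
                # One transcription per line, built from its String children.
                text = " ".join(s.get("CONTENT", "") for s in find_all(line, "String"))
                plan["elements"].append(
                    {"type": line_type, "polygon": polygon(line), "text": text}
                )
    return plan
```

The returned plan keeps document order (page, then its blocks, then each block's lines), which is the order in which the elements would be created on Arkindex.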
CLI arguments:
- `--iiif-base-url`: required
- `--parent-id`: required
- `--page-type`: optional, defaults to `page`
- `--line-type`: optional, defaults to `text_line`
- `--text-block-type`: optional, defaults to `paragraph`
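As a rough sketch, these arguments could be declared with `argparse` as shown below. The positional `paths` argument and the `build_parser` helper are assumptions made for illustration; how the command is actually registered in the CLI is not covered here.

```python
import argparse
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(
        prog="arkindex upload alto",
        description="Import one or more ALTO XML files into an existing folder.",
    )
    # Assumption: the ALTO files are passed as positional arguments.
    parser.add_argument("paths", nargs="+", type=Path, help="ALTO XML files to import")
    parser.add_argument("--iiif-base-url", required=True,
                        help="Base URL of the IIIF server hosting the images")
    parser.add_argument("--parent-id", required=True,
                        help="ID of the existing folder that receives the pages")
    parser.add_argument("--page-type", default="page",
                        help="Arkindex type for ALTO Page elements")
    parser.add_argument("--line-type", default="text_line",
                        help="Arkindex type for ALTO TextLine elements")
    parser.add_argument("--text-block-type", default="paragraph",
                        help="Arkindex type for ALTO TextBlock elements")
    return parser
```

For example, `build_parser().parse_args(["a.xml", "--iiif-base-url", "https://iiif.example.com/", "--parent-id", "some-folder-id"])` leaves the three type options at their defaults.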
This is not a generic solution and most likely will not support a lot of ALTO files, but it's a start. We could later replace the `--xxx-type` options with a dynamic configuration that matches each type from the XML with a type on Arkindex, and thus reproduce the full hierarchy (or skip levels).