Faster ALTO publication

Bastien Abadie requested to merge mets-fixes into master

Refs https://redmine.teklia.com/issues/3512

This MR brings bulk endpoint publication to ALTO files. The bulk path will eventually fully replace the create_elements method.

There are a few constraints:

  • bulk endpoints are only available with a worker_run_id, and not all users will have access to one
    • this means a slow mode must remain, publishing elements and transcriptions one by one
    • a new CLI option --worker-run-id is added
  • there are no bulk endpoints for publishing metadata across a range of element IDs (only multiple metadata on a single element)
    • this means publishing ALTO IDs becomes optional, as it is rarely needed for production corpora (but useful for debug ones)
    • a new CLI option --skip-metadatas is added
  • CreateElements only supports elements linked to an image with a polygon, so structural elements still need to be published one by one
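
The constraints above can be illustrated with a rough sketch. It is not the code in this MR: the Element class, choose_mode and split_elements are hypothetical names, and no Arkindex API call is made. It only shows how the fast/slow mode could be picked from the presence of a worker run ID, and how elements could be split between the bulk path (polygon present) and the one-by-one path (structural elements).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    # Hypothetical stand-in for a parsed ALTO node, not the real AltoElement class.
    name: str
    polygon: Optional[List[Tuple[int, int]]] = None  # None for structural elements


def choose_mode(worker_run_id: Optional[str]) -> str:
    """Bulk endpoints need a worker run ID; without one, fall back to slow mode."""
    return "bulk" if worker_run_id else "one-by-one"


def split_elements(elements: List[Element]) -> Tuple[List[Element], List[Element]]:
    """Separate bulk-eligible elements (with a polygon) from structural ones."""
    with_polygon = [e for e in elements if e.polygon]
    structural = [e for e in elements if not e.polygon]
    return with_polygon, structural


# Example input: one structural element and two elements located on the image.
elements = [
    Element("page_block"),
    Element("line_1", polygon=[(0, 0), (10, 0), (10, 5), (0, 5)]),
    Element("line_2", polygon=[(0, 6), (10, 6), (10, 11), (0, 11)]),
]
bulk_candidates, one_by_one = split_elements(elements)
```

Even in bulk mode, the structural list would still go through the one-by-one path, which is why that path cannot be dropped entirely.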

Remaining steps:

  • use create_elements_fast instead of create_elements in ALTO publication
  • add both new CLI options to alto tool
  • fix unit tests, do not introduce new ones (some may even disappear)
  • remove AltoElement.serialize as it will be unused
  • document CLI options
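
As a minimal sketch of how the two new options might be declared, assuming the alto tool uses argparse (the parser name, help texts and build_parser helper are illustrative assumptions, not taken from this MR):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real alto tool may register its options differently.
    parser = argparse.ArgumentParser(prog="alto")
    parser.add_argument(
        "--worker-run-id",
        default=None,
        help="Worker run ID enabling publication through bulk endpoints. "
        "When omitted, elements and transcriptions are published one by one.",
    )
    parser.add_argument(
        "--skip-metadatas",
        action="store_true",
        help="Do not publish ALTO IDs as metadata on the created elements.",
    )
    return parser


# Fast mode without metadata publication.
fast_args = build_parser().parse_args(["--worker-run-id", "1234", "--skip-metadatas"])

# Slow mode with metadata publication (no option given).
slow_args = build_parser().parse_args([])
```

The worker run ID is kept as a plain string here; validating its format is left out of the sketch.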
Edited by Bastien Abadie
