
PageXML export

Depends on arkindex/cli#186 (closed) and workers/base-worker#356 (closed).

This worker will use the Docker image of release 0.4.4a1 of https://gitlab.teklia.com/arkindex/cli.

Name: PageXML export
Slug: pagexml-export
Type: export

Implement this in a worker_export.pagexml module. In configure, download the latest corpus export and store the path to the export as an instance attribute. This worker processes folder elements and passes each processed element as the parent argument to arkindex_cli.commands.export.pagexml.run. Expose as user configuration parameter:

Generate files in a temporary folder, with one directory per processed element. The following is an example of the file structure when processing two folders in the same process:

├── <temp_folder>
│   ├── <folder_id_1>
│   │   ├── *.xml
│   ├── <folder_id_2>
│   │   ├── *.xml
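
The layout above could be produced with a small helper like the one below. This is only a sketch using the standard library: the function name `element_output_dir` and the placeholder file `page.xml` are hypothetical, and the actual XML files would be written by the export command rather than by hand.

```python
import tempfile
from pathlib import Path


def element_output_dir(temp_folder: Path, folder_id: str) -> Path:
    """Create and return the directory holding the XML files of one folder element.

    Hypothetical helper: one sub-directory per processed element, named after its ID.
    """
    output_dir = temp_folder / folder_id
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        temp_folder = Path(tmp)
        for folder_id in ("folder_id_1", "folder_id_2"):
            out = element_output_dir(temp_folder, folder_id)
            # Placeholder standing in for the PageXML files generated by the export
            (out / "page.xml").write_text("<PcGts/>")
        # List the resulting tree relative to the temporary folder
        for path in sorted(temp_folder.rglob("*")):
            print(path.relative_to(temp_folder).as_posix())
```

Keeping the element ID as the directory name means the exported files can be traced back to their folder after the archive is extracted.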

At the end of run, archive the content of <temp_folder> and store the resulting TAR+ZST archive as page_xml.tar.zst in self.work_dir using create_tar_zst_archive.
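
As a rough illustration of the archiving step, the sketch below packs a temporary folder into a plain tar file with the standard library only. It deliberately leaves out the Zstandard compression, which the worker is expected to get from create_tar_zst_archive. The function name `tar_temp_folder` and the `page_xml.tar` file name are assumptions made for this example.

```python
import tarfile
import tempfile
from pathlib import Path


def tar_temp_folder(temp_folder: Path, destination: Path) -> Path:
    """Pack the content of temp_folder into an uncompressed tar at destination.

    Hypothetical stand-in for the tar half of the TAR+ZST step; no zstd here.
    The destination must live outside temp_folder, or it would be archived too.
    """
    with tarfile.open(destination, "w") as archive:
        for child in sorted(temp_folder.iterdir()):
            # arcname drops the temp folder prefix, so members start at <folder_id>/
            archive.add(child, arcname=child.name)
    return destination


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as work_dir:
        temp_folder = Path(src)
        for folder_id in ("folder_id_1", "folder_id_2"):
            (temp_folder / folder_id).mkdir()
            (temp_folder / folder_id / "page.xml").write_text("<PcGts/>")
        archive_path = tar_temp_folder(temp_folder, Path(work_dir) / "page_xml.tar")
        with tarfile.open(archive_path) as archive:
            print(sorted(archive.getnames()))
```

Archiving the children of the temporary folder, rather than the folder itself, keeps the per-element directories at the root of the archive.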

Refs https://redmine.teklia.com/issues/7549

Edited by Yoann Schneider