PageXML export
Depends on arkindex/cli#186 (closed) and workers/base-worker#356 (closed)
This worker will use the Docker image of version 0.4.4a1 of https://gitlab.teklia.com/arkindex/cli.
Name: PageXML export
Slug: pagexml-export
Type: export
Implement this in a worker_export.pagexml module.
Download the latest corpus export in configure and store the path to the export as an instance attribute.
This worker processes folder elements and uses each processed element as the parent argument to arkindex_cli.commands.export.pagexml.run.
Expose the following as user configuration parameters:
- line_type
- paragraph_type
- transcription source (use parse_source_id)
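A minimal sketch of how the three parameters above could be collected from the user configuration. The class name, the function name, the transcription_source key and the default values are all hypothetical and only for illustration. The call to parse_source_id is deliberately left out, since its signature is not described in this issue.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class PageXMLExportConfig:
    """Hypothetical container for the three user-facing parameters."""

    line_type: str
    paragraph_type: str
    # Raw value as entered by the user. The issue asks for it to be
    # parsed with parse_source_id, which is not reproduced here.
    transcription_source: Optional[str] = None


def read_user_configuration(raw: Mapping[str, Any]) -> PageXMLExportConfig:
    """Build the config from a plain mapping of user-supplied values."""
    return PageXMLExportConfig(
        # Placeholder defaults, not taken from the issue.
        line_type=str(raw.get("line_type", "text_line")),
        paragraph_type=str(raw.get("paragraph_type", "paragraph")),
        # An empty or missing value is normalised to None.
        transcription_source=raw.get("transcription_source") or None,
    )
```

Keeping the raw source value separate from its parsed form makes it easy to plug in parse_source_id at the point where the export command is called.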
Generate files in a temporary folder, with one directory per processed element. The following is an example of the file structure when processing two folders in the same process:
├── <temp_folder>
│ ├── <folder_id_1>
│ │ ├── *.xml
│ ├── <folder_id_2>
│ │ ├── *.xml
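The layout above could be produced along these lines. This is only a sketch using the standard library; element_output_dir and list_layout are hypothetical helper names, and the XML content written in the demo is a placeholder, not real export output.

```python
import tempfile
from pathlib import Path


def element_output_dir(temp_folder: Path, element_id: str) -> Path:
    """Create and return the directory dedicated to one processed element."""
    out = temp_folder / element_id
    out.mkdir(parents=True, exist_ok=True)
    return out


def list_layout(temp_folder: Path) -> list[str]:
    """Return every path under temp_folder, relative to it, sorted."""
    return sorted(
        p.relative_to(temp_folder).as_posix() for p in temp_folder.rglob("*")
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for folder_id in ("folder_id_1", "folder_id_2"):
            target = element_output_dir(root, folder_id)
            # Placeholder file standing in for the exported PageXML.
            (target / "page.xml").write_text("<PcGts/>")
        print("\n".join(list_layout(root)))
```

Using the element ID as the directory name keeps the output of each folder isolated, even when several folders are handled by the same process.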
At the end of run, archive the content of <temp_folder> and store the resulting TAR+ZST archive as page_xml.tar.zst in self.work_dir using create_tar_zst_archive.
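To illustrate what "the content of <temp_folder>" means for the archive members, here is a stand-in sketch that writes a plain, uncompressed tar with the standard library. It is not create_tar_zst_archive and applies no Zstandard compression. archive_folder_as_tar is a hypothetical name, and the demo writes page_xml.tar rather than page_xml.tar.zst.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_folder_as_tar(source: Path, destination: Path) -> list[str]:
    """Write everything under source to a plain tar and return member names.

    destination must live outside source, otherwise the archive would
    try to include itself.
    """
    with tarfile.open(destination, "w") as tar:
        for path in sorted(source.rglob("*")):
            # Member names are relative to source, so the archive starts
            # directly at the per-element directories.
            tar.add(
                path,
                arcname=path.relative_to(source).as_posix(),
                recursive=False,
            )
    with tarfile.open(destination, "r") as tar:
        return tar.getnames()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "temp_folder"
        (src / "folder_id_1").mkdir(parents=True)
        (src / "folder_id_1" / "a.xml").write_text("<PcGts/>")
        print(archive_folder_as_tar(src, Path(tmp) / "page_xml.tar"))
```

In the worker itself the same directory would be handed to create_tar_zst_archive instead, with the result placed in self.work_dir.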