PageXML export
Depends on arkindex/cli#186 (closed) and workers/base-worker#356 (closed).
This worker will use the Docker image of version 0.4.4a1 of https://gitlab.teklia.com/arkindex/cli.
Name: PageXML export
Slug: pagexml-export
Type: export
Implement this in a worker_export.pagexml module.
Download the latest corpus export in configure, and store the path to the export as an instance attribute.
This worker processes folder elements and uses each processed element as the parent argument to arkindex_cli.commands.export.pagexml.run.
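The configure / process flow described above could be sketched as follows. This is only an illustration, not the actual worker: PageXMLExportSketch, download_export and export_pagexml are hypothetical stand-ins (the real worker would presumably subclass the base-worker class and call arkindex_cli.commands.export.pagexml.run), and the database_path and output keyword names are assumptions. Only the parent argument comes from this issue.

```python
from pathlib import Path
from typing import Callable, Optional


class PageXMLExportSketch:
    """Hypothetical stand-in for the worker, with its dependencies injected."""

    def __init__(
        self,
        download_export: Callable[[], Path],
        export_pagexml: Callable[..., None],
    ) -> None:
        # Both callables are assumptions: one fetches the latest corpus export,
        # the other wraps the PageXML export command.
        self._download_export = download_export
        self._export_pagexml = export_pagexml
        self.export_path: Optional[Path] = None

    def configure(self) -> None:
        # Download the export once and keep its path as an instance attribute.
        self.export_path = self._download_export()

    def process_element(self, folder_id: str, output_dir: Path) -> None:
        if self.export_path is None:
            raise RuntimeError("configure() must be called before processing")
        # The processed folder is passed as the parent of the export.
        # database_path and output are hypothetical keyword names.
        self._export_pagexml(
            database_path=self.export_path,
            parent=folder_id,
            output=output_dir,
        )
```

Injecting the two callables keeps the sketch testable without network access or a real export database.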
Expose as user configuration parameters:
- line_type
- paragraph_type
- transcription source (use parse_source_id)
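A minimal sketch of how these three parameters might be read from the user configuration. PageXMLExportConfig and read_user_configuration are hypothetical names, the transcription_source key is an assumed spelling, and treating the two type parameters as required is also an assumption. The raw source value is kept as-is here, since parsing it is left to parse_source_id, which is not reproduced.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class PageXMLExportConfig:
    """Hypothetical container for the user configuration of the worker."""

    line_type: str
    paragraph_type: str
    # Raw value, to be handed to parse_source_id by the real worker.
    transcription_source: Optional[str] = None


def read_user_configuration(raw: Mapping[str, Any]) -> PageXMLExportConfig:
    # Assumption: both element types must be provided by the user.
    missing = [key for key in ("line_type", "paragraph_type") if not raw.get(key)]
    if missing:
        raise ValueError(f"Missing user configuration: {', '.join(missing)}")
    return PageXMLExportConfig(
        line_type=str(raw["line_type"]),
        paragraph_type=str(raw["paragraph_type"]),
        transcription_source=raw.get("transcription_source"),
    )
```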
Generate files in a temporary folder, with one directory per processed element. The following is an example of the file structure when processing two folders in the same process:
├── <temp_folder>
│ ├── <folder_id_1>
│ │ ├── *.xml
│ ├── <folder_id_2>
│ │ ├── *.xml
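The layout above could be produced along these lines, using only the standard library. element_output_dir and build_layout are hypothetical helpers, and page.xml is a placeholder file standing in for whatever the PageXML export actually writes.

```python
import tempfile
from pathlib import Path
from typing import List


def element_output_dir(temp_folder: Path, folder_id: str) -> Path:
    """Create and return the directory holding the XML files of one folder."""
    output_dir = temp_folder / folder_id
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


def build_layout(folder_ids: List[str]) -> List[str]:
    """Create one directory per folder in a fresh temporary folder and list it."""
    with tempfile.TemporaryDirectory() as tmp:
        temp_folder = Path(tmp)
        for folder_id in folder_ids:
            output_dir = element_output_dir(temp_folder, folder_id)
            # Placeholder standing in for the files written by the export.
            (output_dir / "page.xml").write_text("<PcGts/>", encoding="utf-8")
        # Paths relative to the temporary folder, in a stable order.
        return sorted(
            path.relative_to(temp_folder).as_posix()
            for path in temp_folder.rglob("*")
        )
```

In the real worker the temporary folder would need to outlive the processing of all elements, so it would be created once and cleaned up after the archive is written.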
At the end of run, archive the content of <temp_folder> and store the resulting TAR+ZST archive as page_xml.tar.zst in self.work_dir, using create_tar_zst_archive.
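To illustrate the expected archive content, here is a sketch of the tar step only, written with the standard library. It does not call create_tar_zst_archive and does not apply Zstandard compression, so it writes a plain .tar file; tar_folder_content is a hypothetical helper. The point is that entries are stored relative to <temp_folder>, so the archive contains <folder_id>/*.xml paths.

```python
import tarfile
from pathlib import Path
from typing import List


def tar_folder_content(temp_folder: Path, destination: Path) -> List[str]:
    """Write the files under temp_folder to an uncompressed tar archive.

    destination is assumed to be outside temp_folder, otherwise the archive
    would end up listing itself.
    """
    with tarfile.open(destination, "w") as archive:
        for path in sorted(temp_folder.rglob("*")):
            if path.is_file():
                # Store paths relative to the temporary folder.
                archive.add(path, arcname=path.relative_to(temp_folder).as_posix())
    # Re-open the archive to report what was actually stored.
    with tarfile.open(destination, "r") as archive:
        return archive.getnames()
```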