PageXML export

We will implement a new export mode in arkindex_cli.commands.export.pagexml. We are targeting the XML schema from https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd.

This is a first version of the export. We only want to support pages, paragraphs and text lines with their transcription.

CLI Args

  • --line-type: defaults to text_line
  • --paragraph-type: if unset, defaults to --line-type
  • --page-type: defaults to page
  • --parent: limit search to an element
  • --transcription-source: worker run (WK run) source used to filter transcriptions
  • --output: directory where files are generated
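
The arguments above could be declared roughly as follows. This is a minimal sketch using the standard argparse module, not the actual arkindex_cli code: the build_parser and parse_args names, the pagexml-export program name, and the current-directory default for --output are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real command may register its arguments differently.
    parser = argparse.ArgumentParser(prog="pagexml-export")
    parser.add_argument("--line-type", default="text_line")
    parser.add_argument("--paragraph-type", default=None)
    parser.add_argument("--page-type", default="page")
    parser.add_argument("--parent", default=None, help="limit search to an element")
    parser.add_argument("--transcription-source", default=None)
    # Assumption: the spec does not give a default for --output.
    parser.add_argument("--output", type=Path, default=Path.cwd())
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # --paragraph-type falls back to --line-type when unset.
    if args.paragraph_type is None:
        args.paragraph_type = args.line_type
    return args
```

Resolving the --paragraph-type fallback after parsing keeps the default tied to whatever value --line-type ends up with.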

Workflow

  • 1 XML file per page element (CLI argument --page-type), named after page.id
  • For each page:
    • iterate over direct children
    • create <TextRegion> with element coordinates
      • if child is of paragraph-type
        • iterate over direct children
        • look for lines there instead
      • TextRegion.readingDirection is deduced from orientation of transcription of first line
      • export each line as <TextLine>
      • write text in <Unicode> child node
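
One possible reading of the traversal above, sketched with a stand-in Element class rather than the real Arkindex API. The Element dataclass and plan_regions function are hypothetical, and treating a direct child of line type as a region without nested lines is an interpretation based on the "Output with only text lines" sample.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for an Arkindex element, for illustration only."""
    id: str
    type: str
    name: str
    children: list = field(default_factory=list)


def plan_regions(page, line_type="text_line", paragraph_type=None):
    """Return one (region, lines) pair per exported direct child of the page."""
    paragraph_type = paragraph_type or line_type
    regions = []
    for child in page.children:
        if child.type == paragraph_type and paragraph_type != line_type:
            # Paragraph: look for lines among its direct children instead.
            lines = [c for c in child.children if c.type == line_type]
            regions.append((child, lines))
        elif child.type == line_type:
            # Line directly under the page: exported as a region of its own.
            regions.append((child, []))
    return regions
```

Children of any other type are skipped in this sketch, since the first version only covers pages, paragraphs and text lines.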

Node description

  • root node is <PcGts>
    • <Metadata> children
      • <Creator>: Arkindex CLI
      • <Created>: datetime.now().isoformat(timespec='seconds')
      • <LastChange>: datetime.now().isoformat(timespec='seconds')
    • <Page> attributes
      • imageFilename: element.name
      • imageWidth: element.image.width
      • imageHeight: element.image.height
    • <TextRegion>
      • id attribute: element.type + "-" + element.name
      • type attribute: paragraph
      • orientation attribute: parsed from element.orientation ([+90, -89])
      • readingDirection and textLineOrder attribute: parsed from transcription.orientation (see documentation)
      • <Coords>: <Coords points="x1,y1 x2,y2 ..." /> built from element.polygon[:4]
      • list of <TextLine> if element.type==paragraph_type else list of <TextEquiv>.
    • <TextLine>
      • id attribute: element.type + "-" + element.name
      • <Coords> child
      • <TextEquiv> child node with <Unicode> node with transcription
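
The node layout above could be assembled with the standard xml.etree.ElementTree module along these lines. This is a sketch under assumptions, not the planned implementation: pages, regions and lines are plain dicts with hypothetical keys (name, width, height, type, polygon, text, lines), and the xsi:schemaLocation, orientation, readingDirection and textLineOrder attributes are left out for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def coords_points(polygon):
    """Format the first four polygon points as 'x1,y1 x2,y2 ...'."""
    return " ".join(f"{x},{y}" for x, y in polygon[:4])


def add_text_equiv(parent, text):
    equiv = ET.SubElement(parent, "TextEquiv")
    ET.SubElement(equiv, "Unicode").text = text


def build_page(page, regions):
    """Build a <PcGts> tree from plain dicts (hypothetical input shape)."""
    now = datetime.now().isoformat(timespec="seconds")
    root = ET.Element("PcGts", {"xmlns": PAGE_NS})
    metadata = ET.SubElement(root, "Metadata")
    ET.SubElement(metadata, "Creator").text = "Arkindex CLI"
    ET.SubElement(metadata, "Created").text = now
    ET.SubElement(metadata, "LastChange").text = now
    page_node = ET.SubElement(root, "Page", {
        "imageFilename": page["name"],
        "imageWidth": str(page["width"]),
        "imageHeight": str(page["height"]),
    })
    for region in regions:
        region_node = ET.SubElement(page_node, "TextRegion", {
            "id": f"{region['type']}-{region['name']}",
            "type": "paragraph",
        })
        ET.SubElement(region_node, "Coords", {"points": coords_points(region["polygon"])})
        for line in region.get("lines", []):
            line_node = ET.SubElement(
                region_node, "TextLine", {"id": f"{line['type']}-{line['name']}"}
            )
            ET.SubElement(line_node, "Coords", {"points": coords_points(line["polygon"])})
            add_text_equiv(line_node, line["text"])
        if "text" in region:
            # Region without nested lines: transcription goes directly on the region.
            add_text_equiv(region_node, region["text"])
    return root
```

Calling ET.tostring(build_page(page, regions), encoding="unicode") would then give the document as a string, ready to be written to a file named after page.id.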

readingDirection is the direction in which text within a line should be read. textLineOrder is the order of text lines within a block.
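
As an illustration of how the two attributes might be derived, the sketch below maps a transcription orientation to a (readingDirection, textLineOrder) pair. The orientation strings and the mapping itself are assumptions and should be checked against the Arkindex documentation; ORIENTATION_MAP and region_direction_attrs are hypothetical names. The target values are taken from the PAGE schema enumerations.

```python
# Assumed orientation values mapped to (readingDirection, textLineOrder).
ORIENTATION_MAP = {
    "horizontal-lr": ("left-to-right", "top-to-bottom"),
    "horizontal-rl": ("right-to-left", "top-to-bottom"),
    "vertical-lr": ("top-to-bottom", "left-to-right"),
    "vertical-rl": ("top-to-bottom", "right-to-left"),
}


def region_direction_attrs(orientation):
    """Return the TextRegion attributes deduced from the first line's orientation."""
    # An unknown orientation raises KeyError rather than being guessed.
    reading_direction, text_line_order = ORIENTATION_MAP[orientation]
    return {
        "readingDirection": reading_direction,
        "textLineOrder": text_line_order,
    }
```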

Refs https://redmine.teklia.com/issues/7549

Output with only text lines
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15
http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
    <Metadata>
        <Creator>Arkindex CLI</Creator>
        <Created>2017-05-03T10:20:47</Created>
        <LastChange>2017-05-03T10:27:21</LastChange>
    </Metadata>
    <Page imageFilename="1258612" imageWidth="700" imageHeight="1066">
        <TextRegion id="line-1" type="paragraph">
            <Coords points="25,310 25,430 400,430 400,310"/>
            <TextEquiv conf="1.0">
                <Unicode>Paradis le 29 novembre 1919</Unicode>
            </TextEquiv>
        </TextRegion>
        <TextRegion id="line-2" type="paragraph">
            <Coords points="25,310 25,430 400,430 400,310"/>
            <TextEquiv conf="1.0">
                <Unicode>Cher Marius</Unicode>
            </TextEquiv>
        </TextRegion>
        ...
    </Page>
</PcGts>
Output with a paragraph
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15
http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
    <Metadata>
        <Creator>Arkindex CLI</Creator>
        <Created>2017-05-03T10:20:47</Created>
        <LastChange>2017-05-03T10:27:21</LastChange>
    </Metadata>
    <Page imageFilename="1258612" imageWidth="700" imageHeight="1066">
        <TextRegion id="paragraph-1" type="paragraph">
            <Coords points="25,310 25,430 400,430 400,310"/>
            <TextLine id="line-1">
                <Coords points="25,310 25,430 400,430 400,310"/>
                <TextEquiv conf="1.0">
                    <Unicode>Paradis le 29 novembre 1919</Unicode>
                </TextEquiv>
            </TextLine>
            <TextLine id="line-2">
                <Coords points="25,310 25,430 400,430 400,310"/>
                <TextEquiv conf="1.0">
                    <Unicode>Cher Marius</Unicode>
                </TextEquiv>
            </TextLine>
        </TextRegion>
        ...
    </Page>
</PcGts>
Edited by Yoann Schneider