PageXML export
We will implement a new export mode in arkindex_cli.commands.export.pagexml. We are targeting the XML schema from https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd.
This is a first version of the export. We only want to support pages, paragraphs and text lines with their transcription.
CLI Args
- --line-type: defaults to text_line
- --paragraph-type: if unset, defaults to --line-type
- --page-type: defaults to page
- --parent: limit search to an element
- --transcription-source: WK run source to filter transcriptions
- --output: where files are generated
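The arguments above could be declared roughly as follows. This is a minimal standalone sketch using argparse, not the actual Arkindex CLI wiring: the function names build_parser and parse_args, the string type for --parent, and the current-directory default for --output are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical standalone parser; the real command would be registered
    # as a subcommand of the existing CLI.
    parser = argparse.ArgumentParser(prog="pagexml")
    parser.add_argument("--line-type", default="text_line")
    parser.add_argument("--paragraph-type", default=None)
    parser.add_argument("--page-type", default="page")
    # Assumption: the parent element is passed as a plain string identifier.
    parser.add_argument("--parent", default=None)
    parser.add_argument("--transcription-source", default=None)
    # Assumption: files go to the current directory when --output is omitted.
    parser.add_argument("--output", type=Path, default=Path.cwd())
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # --paragraph-type falls back to --line-type when unset.
    if args.paragraph_type is None:
        args.paragraph_type = args.line_type
    return args
```

Resolving the --paragraph-type fallback right after parsing keeps the rest of the export code free of None checks.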
Workflow
- 1 XML file per page element (CLI argument --page-type), named after page.id
- For each page:
  - iterate over direct children
  - create <TextRegion> with element coordinates
    - if child is of paragraph-type
      - iterate over direct children
      - look for lines there instead
    - TextRegion.readingDirection is deduced from the orientation of the transcription of the first line
    - export each line as <TextLine>
    - write text in <Unicode> child node
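The traversal above can be sketched on plain in-memory objects. Everything here is hypothetical: the Element dataclass stands in for whatever the CLI reads from Arkindex, the .xml extension is assumed, and the sketch assumes that children matching neither --line-type nor --paragraph-type are skipped and that a child only counts as a paragraph when --paragraph-type differs from --line-type.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Element:
    # Hypothetical stand-in for an Arkindex element.
    id: str
    type: str
    name: str
    text: str = ""
    children: List["Element"] = field(default_factory=list)


def plan_page(
    page: Element,
    line_type: str = "text_line",
    paragraph_type: Optional[str] = None,
) -> List[Tuple[Element, List[Element]]]:
    """Return one (region, lines) pair per <TextRegion> to write for a page."""
    paragraph_type = paragraph_type or line_type
    regions = []
    for child in page.children:  # direct children only
        if child.type == paragraph_type and paragraph_type != line_type:
            # Paragraph: look for the lines among its own direct children.
            lines = [c for c in child.children if c.type == line_type]
            regions.append((child, lines))
        elif child.type == line_type:
            # Standalone line: the region carries the text itself.
            regions.append((child, []))
    return regions


def output_name(page: Element) -> str:
    # One file per page, named after the page ID (extension is assumed).
    return f"{page.id}.xml"
```

Separating the traversal from the XML serialization makes it possible to test the paragraph and line handling without building any XML.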
Node description
- root node is <PcGts>
- <Metadata> children
  - <Creator>: Arkindex CLI
  - <Created>: datetime.now().isoformat(timespec='seconds')
  - <LastChange>: datetime.now().isoformat(timespec='seconds')
- <Page> attributes
  - imageFilename: element.name
  - imageWidth: element.image.width
  - imageHeight: element.image.height
- <TextRegion>
  - id attribute: element.type + "-" + element.name
  - type attribute: paragraph
  - orientation attribute: parsed from element.orientation ([+90, -89])
  - readingDirection and textLineOrder attributes: parsed from transcription.orientation (see documentation)
  - <Coords>: <Coords points="x1,y1 x2,y2 ..."/> from element.polygon[:4]
  - list of <TextLine> if element.type == paragraph_type, else list of <TextEquiv>
- <TextLine>
  - id attribute: element.type + "-" + element.name
  - <Coords> child
  - <TextEquiv> child node with a <Unicode> node holding the transcription
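As an illustration of the node layout, here is a minimal sketch built with the standard library's xml.etree.ElementTree. It covers only the case without paragraphs. The function names and the dictionary keys (type, name, polygon, text) are hypothetical, and the namespace is set as a plain xmlns attribute for brevity rather than through namespace-qualified tags.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def format_points(polygon):
    # Coords/@points holds "x1,y1 x2,y2 ..."; only the first 4 points are kept.
    return " ".join(f"{x},{y}" for x, y in polygon[:4])


def build_page_xml(name, width, height, regions):
    """regions: list of dicts with hypothetical keys type, name, polygon, text."""
    now = datetime.now().isoformat(timespec="seconds")
    root = ET.Element("PcGts", {"xmlns": PAGE_NS})

    metadata = ET.SubElement(root, "Metadata")
    ET.SubElement(metadata, "Creator").text = "Arkindex CLI"
    ET.SubElement(metadata, "Created").text = now
    ET.SubElement(metadata, "LastChange").text = now

    page = ET.SubElement(root, "Page", {
        "imageFilename": name,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })
    for region in regions:
        node = ET.SubElement(page, "TextRegion", {
            "id": f"{region['type']}-{region['name']}",
            "type": "paragraph",
        })
        ET.SubElement(node, "Coords", {"points": format_points(region["polygon"])})
        equiv = ET.SubElement(node, "TextEquiv")
        ET.SubElement(equiv, "Unicode").text = region["text"]
    return root
```

The resulting tree can be serialized with ET.tostring(root, encoding="unicode") or written to the per-page file with ET.ElementTree(root).write(path).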
readingDirection is the direction in which text within lines should be read.
textLineOrder is the order of text lines within a block.
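One possible mapping from a transcription orientation to these two attributes is sketched below. The orientation strings on the left are assumed to be the values Arkindex stores on transcriptions, and the fallback to horizontal left-to-right is also an assumption. The right-hand values come from the PAGE enumerations for readingDirection and textLineOrder. The function name is hypothetical.

```python
# Assumed Arkindex orientation value -> (readingDirection, textLineOrder)
ORIENTATION_TO_PAGE = {
    "horizontal-lr": ("left-to-right", "top-to-bottom"),
    "horizontal-rl": ("right-to-left", "top-to-bottom"),
    "vertical-lr": ("top-to-bottom", "left-to-right"),
    "vertical-rl": ("top-to-bottom", "right-to-left"),
}


def region_text_attributes(first_line_orientation):
    """Attributes for a <TextRegion>, from its first line's transcription."""
    # Assumption: unknown or missing orientations fall back to horizontal-lr.
    reading_direction, text_line_order = ORIENTATION_TO_PAGE.get(
        first_line_orientation, ORIENTATION_TO_PAGE["horizontal-lr"]
    )
    return {
        "readingDirection": reading_direction,
        "textLineOrder": text_line_order,
    }
```

The returned dictionary can be merged directly into the attributes of the <TextRegion> node.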
Refs https://redmine.teklia.com/issues/7549
Output with only text lines
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
  <Metadata>
    <Creator>Arkindex CLI</Creator>
    <Created>2017-05-03T10:20:47</Created>
    <LastChange>2017-05-03T10:27:21</LastChange>
  </Metadata>
  <Page imageFilename="1258612" imageWidth="700" imageHeight="1066">
    <TextRegion id="line-1" type="paragraph">
      <Coords points="25,310 25,430 400,430 400,310"/>
      <TextEquiv conf="1.0">
        <Unicode>Paradis le 29 novembre 1919</Unicode>
      </TextEquiv>
    </TextRegion>
    <TextRegion id="line-2" type="paragraph">
      <Coords points="25,310 25,430 400,430 400,310"/>
      <TextEquiv conf="1.0">
        <Unicode>Cher Marius</Unicode>
      </TextEquiv>
    </TextRegion>
    ...
  </Page>
</PcGts>
Output with a paragraph
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
  <Metadata>
    <Creator>Arkindex CLI</Creator>
    <Created>2017-05-03T10:20:47</Created>
    <LastChange>2017-05-03T10:27:21</LastChange>
  </Metadata>
  <Page imageFilename="1258612" imageWidth="700" imageHeight="1066">
    <TextRegion id="paragraph-1" type="paragraph">
      <Coords points="25,310 25,430 400,430 400,310"/>
      <TextLine id="line-1">
        <Coords points="25,310 25,430 400,430 400,310"/>
        <TextEquiv conf="1.0">
          <Unicode>Paradis le 29 novembre 1919</Unicode>
        </TextEquiv>
      </TextLine>
      <TextLine id="line-2">
        <Coords points="25,310 25,430 400,430 400,310"/>
        <TextEquiv conf="1.0">
          <Unicode>Cher Marius</Unicode>
        </TextEquiv>
      </TextLine>
    </TextRegion>
    ...
  </Page>
</PcGts>
Edited by Yoann Schneider