PageXML export
We will implement a new export mode in arkindex_cli.commands.export.pagexml. We are targeting the XML schema from https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd.
This is a first version of the export. We only want to support pages, paragraphs and text lines with their transcription.
CLI Args
- --line-type: defaults to text_line
- --paragraph-type: if unset, defaults to the value of --line-type
- --page-type: defaults to page
- --parent: limit search to an element
- --transcription-source: WK run source to filter transcriptions
- --output: where files are generated
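The options above could be declared with argparse roughly as follows. This is only a sketch: the names build_parser and parse_args, the program name, and the default for --output (the current directory) are assumptions, not the actual arkindex_cli code.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: the real CLI may register these options differently.
    parser = argparse.ArgumentParser(prog="pagexml-export")
    parser.add_argument("--line-type", default="text_line")
    # No default here: the fallback to --line-type is resolved after parsing.
    parser.add_argument("--paragraph-type", default=None)
    parser.add_argument("--page-type", default="page")
    parser.add_argument("--parent", default=None, help="limit search to an element")
    parser.add_argument("--transcription-source", default=None)
    # Assumed default: the issue does not say where files go when --output is unset.
    parser.add_argument("--output", type=Path, default=Path.cwd())
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.paragraph_type is None:
        # --paragraph-type falls back to the value of --line-type.
        args.paragraph_type = args.line_type
    return args
```

Resolving the --paragraph-type fallback after parsing keeps it correct even when --line-type is itself overridden on the command line.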
Workflow
- 1 XML file per page element (CLI argument --page-type), named after page.id
- For each page:
  - iterate over direct children
  - create <TextRegion> with element coordinates
    - if child is of paragraph-type
      - iterate over direct children
      - look for lines there instead
    - TextRegion.readingDirection is deduced from orientation of transcription of first line
  - export each line as <TextLine>
    - write text in <Unicode> child node
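The traversal described in this workflow can be sketched with xml.etree.ElementTree. Plain dicts stand in for Arkindex elements here, and the keys type, name, text and children are hypothetical. Coordinates and reading direction are left out for brevity. The sketch also assumes that a child is treated as a paragraph only when its type matches --paragraph-type and differs from --line-type.

```python
import xml.etree.ElementTree as ET


def add_text(parent: ET.Element, text: str) -> None:
    # Writes <TextEquiv><Unicode>text</Unicode></TextEquiv> under parent.
    equiv = ET.SubElement(parent, "TextEquiv")
    ET.SubElement(equiv, "Unicode").text = text


def export_page(page: dict, line_type: str, paragraph_type: str) -> ET.Element:
    page_node = ET.Element("Page")
    for child in page["children"]:
        # Assumption: with identical types, every child is handled as a line.
        is_paragraph = child["type"] == paragraph_type and paragraph_type != line_type
        if not is_paragraph and child["type"] != line_type:
            continue  # only paragraphs and lines are supported in this first version
        region = ET.SubElement(
            page_node,
            "TextRegion",
            id=f"{child['type']}-{child['name']}",
            type="paragraph",
        )
        if is_paragraph:
            # Lines are looked up among the paragraph's direct children.
            for line in child["children"]:
                if line["type"] != line_type:
                    continue
                line_node = ET.SubElement(
                    region, "TextLine", id=f"{line['type']}-{line['name']}"
                )
                add_text(line_node, line["text"])
        else:
            # A line directly under the page: the text goes on the region itself.
            add_text(region, child["text"])
    return page_node
```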
Node description
- root node is <PcGts>
  - <Metadata> children
    - <Creator>: Arkindex CLI
    - <Created>: datetime.now().isoformat(timespec='seconds')
    - <LastChange>: datetime.now().isoformat(timespec='seconds')
  - <Page> attributes
    - imageFilename: element.name
    - imageWidth: element.image.width
    - imageHeight: element.image.height
  - <TextRegion>
    - id attribute: element.type + "-" + element.name
    - type attribute: paragraph
    - orientation attribute: parsed from element.orientation ([+90, -89])
    - readingDirection and textLineOrder attributes: parsed from transcription.orientation (see documentation)
    - <Coords>: <Coords points="x1,y1 x2,y2 ..."/> from element.polygon[:4]
    - list of <TextLine> if element.type == paragraph_type, else list of <TextEquiv>
  - <TextLine>
    - id attribute: element.type + "-" + element.name
    - <Coords> child
    - <TextEquiv> child node with a <Unicode> node holding the transcription
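As an illustration of the node description, the root, metadata and page nodes could be built as follows. The helper names build_root and polygon_to_points are hypothetical. The attribute names and the points format follow the example outputs in this issue, and setting xmlns as a plain attribute is a shortcut chosen for this sketch.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def build_root(name: str, width: int, height: int) -> ET.Element:
    # xmlns is written as a plain attribute, which is enough for serialization.
    root = ET.Element("PcGts", xmlns=PAGE_NS)
    metadata = ET.SubElement(root, "Metadata")
    ET.SubElement(metadata, "Creator").text = "Arkindex CLI"
    # Both timestamps come from the same call, so they match at creation time.
    now = datetime.now().isoformat(timespec="seconds")
    ET.SubElement(metadata, "Created").text = now
    ET.SubElement(metadata, "LastChange").text = now
    ET.SubElement(
        root,
        "Page",
        imageFilename=name,
        imageWidth=str(width),
        imageHeight=str(height),
    )
    return root


def polygon_to_points(polygon: list) -> str:
    # Formats the first four points as "x1,y1 x2,y2 ...", as in element.polygon[:4].
    return " ".join(f"{x},{y}" for x, y in polygon[:4])
```

A <Coords> node can then be added with ET.SubElement(region, "Coords", points=polygon_to_points(element_polygon)), where region and element_polygon come from the caller.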
readingDirection stands for the direction in which text within lines should be read.
textLineOrder stands for the order of text lines within a block.
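One possible way to derive both attributes from a transcription orientation is a lookup table. The orientation names on the left are assumed Arkindex values and the pairing is a guess at the intended mapping, so it should be checked against the documentation mentioned above. The values on the right are taken from the PAGE readingDirection and textLineOrder enumerations.

```python
# Assumed mapping: orientation -> (readingDirection, textLineOrder).
ORIENTATIONS = {
    "horizontal-lr": ("left-to-right", "top-to-bottom"),
    "horizontal-rl": ("right-to-left", "top-to-bottom"),
    "vertical-lr": ("top-to-bottom", "left-to-right"),
    "vertical-rl": ("top-to-bottom", "right-to-left"),
}


def text_direction(orientation: str) -> dict:
    # Raises KeyError on an unknown orientation rather than guessing a default.
    reading_direction, line_order = ORIENTATIONS[orientation]
    return {"readingDirection": reading_direction, "textLineOrder": line_order}
```

The returned dict can be passed directly as attributes when creating a <TextRegion> node.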
Refs https://redmine.teklia.com/issues/7549
Output with only text lines
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15
       http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
  <Metadata>
    <Creator>Arkindex CLI</Creator>
    <Created>2017-05-03T10:20:47</Created>
    <LastChange>2017-05-03T10:27:21</LastChange>
  </Metadata>
  <Page imageFilename="1258612" imageWidth="700" imageHeight="1066">
    <TextRegion id="line-1" type="paragraph">
      <Coords points="25,310 25,430 400,430 400,310"/>
      <TextEquiv conf="1.0">
        <Unicode>Paradis le 29 novembre 1919</Unicode>
      </TextEquiv>
    </TextRegion>
    <TextRegion id="line-2" type="paragraph">
      <Coords points="25,310 25,430 400,430 400,310"/>
      <TextEquiv conf="1.0">
        <Unicode>Cher Marius</Unicode>
      </TextEquiv>
    </TextRegion>
    ...
  </Page>
</PcGts>
Output with a paragraph
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15
       http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
  <Metadata>
    <Creator>Arkindex CLI</Creator>
    <Created>2017-05-03T10:20:47</Created>
    <LastChange>2017-05-03T10:27:21</LastChange>
  </Metadata>
  <Page imageFilename="1258612" imageWidth="700" imageHeight="1066">
    <TextRegion id="paragraph-1" type="paragraph">
      <Coords points="25,310 25,430 400,430 400,310"/>
      <TextLine id="line-1">
        <Coords points="25,310 25,430 400,430 400,310"/>
        <TextEquiv conf="1.0">
          <Unicode>Paradis le 29 novembre 1919</Unicode>
        </TextEquiv>
      </TextLine>
      <TextLine id="line-2">
        <Coords points="25,310 25,430 400,430 400,310"/>
        <TextEquiv conf="1.0">
          <Unicode>Cher Marius</Unicode>
        </TextEquiv>
      </TextLine>
    </TextRegion>
    ...
  </Page>
</PcGts>
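To check an exported file, the transcriptions can be read back with xml.etree.ElementTree. This is a small sketch, with read_texts as a hypothetical helper name. It works for both output shapes above because it searches for every <Unicode> node under a <TextEquiv>, whatever the parent.

```python
import xml.etree.ElementTree as ET

# Namespace used by the example outputs; "pc" is an arbitrary local prefix.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def read_texts(xml_string: str) -> list:
    # Returns the text of every TextEquiv/Unicode node, in document order.
    root = ET.fromstring(xml_string)
    return [node.text for node in root.iterfind(".//pc:TextEquiv/pc:Unicode", NS)]
```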
Edited by Yoann Schneider