Better polygon validation

Possible follow-up to #163 (closed): we could use ST_IsValid or ST_IsRing to further validate zone polygons and reject malformed ones, such as self-intersecting polygons, that have no reason to exist. The following polygons are considered invalid:

[image: examples of invalid polygons]
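
To make the check concrete, here is a minimal pure-Python sketch of what a ring test verifies: the line is closed and no two non-adjacent segments touch or cross. It only approximates ST_IsRing and is not the PostGIS implementation. The name `is_ring` and the helpers are hypothetical, and the sketch assumes points are `(x, y)` tuples with no repeated consecutive points.

```python
def _orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)


def _in_bbox(p, q, r):
    """True if r lies inside the bounding box of segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def _segments_intersect(a, b, c, d):
    """True if segment ab and segment cd share at least one point."""
    o1, o2 = _orientation(a, b, c), _orientation(a, b, d)
    o3, o4 = _orientation(c, d, a), _orientation(c, d, b)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint of one segment lies on the other segment.
    return ((o1 == 0 and _in_bbox(a, b, c)) or (o2 == 0 and _in_bbox(a, b, d))
            or (o3 == 0 and _in_bbox(c, d, a)) or (o4 == 0 and _in_bbox(c, d, b)))


def is_ring(points):
    """Hypothetical helper: True if the line is closed and does not self-intersect."""
    if len(points) < 4 or points[0] != points[-1]:
        return False
    segments = list(zip(points, points[1:]))
    n = len(segments)
    for i in range(n):
        for j in range(i + 1, n):
            # Neighbouring segments always share one endpoint, so skip them.
            if j == i + 1 or (i == 0 and j == n - 1):
                continue
            if _segments_intersect(*segments[i], *segments[j]):
                return False
    return True


rectangle = [(0, 0), (2, 0), (2, 1), (0, 1), (0, 0)]
bowtie = [(0, 0), (2, 1), (2, 0), (0, 1), (0, 0)]
print(is_ring(rectangle), is_ring(bowtie))
```

Because neighbouring segments are skipped, this sketch does not catch a line that doubles back on itself along the same edge. Doing the validation in the database would avoid having to handle such cases by hand.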

Invalid polygons in preprod

At the time of writing this issue, there are 2838 invalid polygons in preprod, out of 10554485 zones:

  • 999 polygons from Tobacco800 are supposed to be rectangles, like these, but are actually self-intersecting "bowties".
    They were fixed manually with a Python script that moves one point of each polygon. The same issue occurred in production and has already been fixed there.
  • One polygon is a leftover test of the annotation interface that I made a while ago.
  • 1838 polygons are related to transcriptions from Valencià and Transkribus.
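
For reference, a bowtie that should have been a rectangle can be repaired along these lines. This is not the script mentioned above, which moved a single point. This sketch uses a different technique and rebuilds the rectangle from the bounding box of its points. It assumes the intended shape is axis-aligned and that points are `(x, y)` tuples. The function name `rebuild_rectangle` is hypothetical.

```python
def rebuild_rectangle(points):
    """Return a closed, axis-aligned rectangle covering all the given points.

    Assumption: the original zone was meant to be an axis-aligned rectangle,
    so its bounding box is the intended shape.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    # Walk the four corners in order and close the ring on the first point.
    return [(left, top), (right, top), (right, bottom), (left, bottom), (left, top)]


bowtie = [(0, 0), (2, 1), (2, 0), (0, 1), (0, 0)]
print(rebuild_rectangle(bowtie))
```

Rebuilding from the bounding box would silently change any zone that was intentionally rotated or non-rectangular, so it only makes sense for corpora like Tobacco800 where all zones are expected to be rectangles.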

Invalid polygons in prod

Elements

Corpus                         Count
Total                          796
PRImA                          427
RDCL2019                       131
cBAD 2019                      130
HORAE | DLA Annotations        70
Bozen                          15
BNPP-archives                  8
AN-Index                       6
READ-BAD2017                   5
BALSAC | Balsac annotations    2
Test Martin                    1
CaptchAN                       1
SQL query
select coalesce(c.name, 'Total'), count(*)
from images_zone z
inner join documents_element e on (e.zone_id = z.id)
inner join documents_corpus c on (c.id = e.corpus_id)
where not ST_IsRing(z.polygon)
group by rollup(c.name)
order by count(*) desc;

Transcriptions

Source               Count
Total                1305563
litis_lexical_ocr    653072
litis_raw_ocr        649828
manual               2550
transkribus          108
kaldi_bnpp           2
kaldi_bnpp_03        1
kaldi_bnpp_04        1
kaldi_bnpp_oscar     1
SQL query
select coalesce(ds.slug, 'Total'), count(*)
from images_zone z
inner join documents_transcription t on (t.zone_id = z.id)
inner join documents_datasource ds on (ds.id = t.source_id)
where not ST_IsRing(z.polygon)
group by rollup(ds.slug)
order by count(*) desc;
Edited by Erwan Rouchet