Better polygon validation
Possible follow-up to #163 (closed): we could use ST_IsValid or ST_IsRing to further validate zone polygons and avoid weird polygons that simply have no reason to exist. The following polygons are considered invalid:
Invalid polygons in preprod
At the time of writing this issue, there are 2838 invalid polygons in preprod, out of 10554485 zones:
- 999 polygons from Tobacco800 are supposed to be rectangles, like those ones, but are actually "bowties". Those have been fixed manually in a Python script, by moving one point in the polygon. The same issue occurred in production and was fixed in the past.
- One polygon is some dumb test of the annotation interface I did a while ago.
- 1838 polygons are related to transcriptions from Valencià and Transkribus.
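
For illustration, here is a minimal sketch of how a "bowtie" rectangle could be untangled. This is not the script mentioned above: instead of moving one point, it reorders the four corners by angle around their centroid, which is a different technique that gives a non-crossing outline for any convex quadrilateral. The function name `untangle_quad` and the representation of a polygon as four `(x, y)` tuples without a repeated closing point are assumptions made for this example.

```python
import math


def untangle_quad(points):
    """Reorder four corners so they trace a simple (non-crossing) outline.

    Hypothetical helper: `points` holds the four corners as (x, y) tuples,
    without a repeated closing point. Corners are sorted by angle around
    their centroid, which works for convex quadrilaterals such as rectangles.
    """
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))


# A bowtie: the last two corners are in the wrong order, so two edges cross.
bowtie = [(0, 0), (10, 0), (0, 5), (10, 5)]
print(untangle_quad(bowtie))  # → [(0, 0), (10, 0), (10, 5), (0, 5)]
```

Sorting by angle does not preserve the original starting corner in general, so it would only be suitable where the vertex order itself carries no meaning.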
Invalid polygons in prod
Elements
Corpus | Count |
---|---|
Total | 796 |
PRImA | 427 |
RDCL2019 | 131 |
cBAD 2019 | 130 |
HORAE \| DLA Annotations | 70 |
Bozen | 15 |
BNPP-archives | 8 |
AN-Index | 6 |
READ-BAD2017 | 5 |
BALSAC \| Balsac annotations | 2 |
Test Martin | 1 |
CaptchAN | 1 |
SQL query
select coalesce(c.name, 'Total'), count(*)
from images_zone z
inner join documents_element e on (e.zone_id = z.id)
inner join documents_corpus c on (c.id = e.corpus_id)
where not ST_IsRing(z.polygon)
group by rollup(c.name)
order by count(*) desc;
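
To make the filter in this query more concrete: PostGIS documents ST_IsRing as true when a linestring is both closed and simple (no self-intersection). Below is a rough pure-Python sketch of that idea, assuming a polygon is a list of `(x, y)` tuples whose last point repeats the first. All helper names are hypothetical, and the check is simplified: it only compares non-adjacent edges, so some degenerate cases that PostGIS handles (zero-length edges, an edge backtracking over its neighbour) are not covered.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _on_segment(a, b, p):
    """True if p, already known to be collinear with a-b, lies within a-b."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def _segments_touch(p1, p2, p3, p4):
    """True if segment p1-p2 and segment p3-p4 share at least one point."""
    o1, o2 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    o3, o4 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and _on_segment(p1, p2, p3))
            or (o2 == 0 and _on_segment(p1, p2, p4))
            or (o3 == 0 and _on_segment(p3, p4, p1))
            or (o4 == 0 and _on_segment(p3, p4, p2)))


def is_ring(points):
    """Simplified ring check: closed, and no two non-adjacent edges touch."""
    if len(points) < 4 or points[0] != points[-1]:
        return False
    edges = list(zip(points, points[1:]))
    n = len(edges)
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edges legitimately share the closing point
            if _segments_touch(*edges[i], *edges[j]):
                return False
    return True


print(is_ring([(0, 0), (10, 0), (10, 5), (0, 5), (0, 0)]))  # rectangle → True
print(is_ring([(0, 0), (10, 0), (0, 5), (10, 5), (0, 0)]))  # bowtie → False
```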
Transcriptions
Source | Count |
---|---|
Total | 1305563 |
litis_lexical_ocr | 653072 |
litis_raw_ocr | 649828 |
manual | 2550 |
transkribus | 108 |
kaldi_bnpp | 2 |
kaldi_bnpp_03 | 1 |
kaldi_bnpp_04 | 1 |
kaldi_bnpp_oscar | 1 |
SQL query
select coalesce(ds.slug, 'Total'), count(*)
from images_zone z
inner join documents_transcription t on (t.zone_id = z.id)
inner join documents_datasource ds on (ds.id = t.source_id)
where not ST_IsRing(z.polygon)
group by rollup(ds.slug)
order by count(*) desc;
Edited by Erwan Rouchet