Better polygon validation
Possible follow-up to #163 (closed): we could use ST_IsValid or ST_IsRing to further validate zone polygons and avoid weird polygons that simply have no reason to exist. The following polygons are considered invalid:
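To illustrate the kind of shape such a check rejects, here is a minimal pure-Python sketch of the "closed and simple" test that `ST_IsRing` is documented to perform. It is not how PostGIS implements it; the function names are hypothetical, and the sketch ignores degenerate cases such as adjacent edges folding back on each other.

```python
from itertools import combinations


def orientation(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (cross > 0) - (cross < 0)


def within_box(a, b, p):
    """True if p lies inside the bounding box of segment a-b."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def segments_intersect(p1, p2, p3, p4):
    """True if segment p1-p2 shares at least one point with segment p3-p4."""
    o1, o2 = orientation(p1, p2, p3), orientation(p1, p2, p4)
    o3, o4 = orientation(p3, p4, p1), orientation(p3, p4, p2)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear endpoints: check whether they fall on the other segment.
    return ((o1 == 0 and within_box(p1, p2, p3))
            or (o2 == 0 and within_box(p1, p2, p4))
            or (o3 == 0 and within_box(p3, p4, p1))
            or (o4 == 0 and within_box(p3, p4, p2)))


def is_ring(points):
    """Approximate ring test: closed outline whose non-adjacent edges never meet."""
    if len(points) < 4 or points[0] != points[-1]:
        return False
    edges = list(zip(points, points[1:]))
    last = len(edges) - 1
    for i, j in combinations(range(len(edges)), 2):
        # Neighbouring edges legitimately share an endpoint, so skip them.
        if j == i + 1 or (i == 0 and j == last):
            continue
        if segments_intersect(*edges[i], *edges[j]):
            return False
    return True


rectangle = [(0, 0), (10, 0), (10, 5), (0, 5), (0, 0)]
bowtie = [(0, 0), (10, 0), (0, 5), (10, 5), (0, 0)]
print(is_ring(rectangle), is_ring(bowtie))
```

The rectangle passes and the bowtie fails because two of its edges cross. An unclosed outline fails as well.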
Invalid polygons in preprod
At the time of writing this issue, there are 2838 invalid polygons in preprod, out of 10554485 zones:
- 999 polygons from Tobacco800 are supposed to be rectangles, like those ones, but are actually "bowties". Those have been fixed manually in a Python script, by moving one point in the polygon. The same issue occurred in production and was fixed in the past.
- One polygon is some dumb test of the annotation interface I did a while ago.
- 1838 polygons are related to transcriptions from Valencià and Transkribus.
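The bowtie case in the list above can be repaired automatically when the four corners are known to form a rectangle. The script used for the fix is not shown in this issue, so the sketch below is only one possible approach and not necessarily what was done: instead of moving a single point, it re-sorts the corners by angle around their centroid. The function name `untangle_quad` is hypothetical, and the technique assumes a convex quadrilateral stored as a closed ring (first point repeated at the end).

```python
import math


def untangle_quad(ring):
    """Reorder the four corners of a quadrilateral around their centroid.

    Assumes the corners form a convex quadrilateral (e.g. a rectangle);
    the result is a closed ring whose edges do not cross.
    """
    # Drop the repeated closing point, if any.
    corners = list(ring[:-1]) if ring[0] == ring[-1] else list(ring)
    if len(corners) != 4:
        raise ValueError("expected exactly four corners")
    cx = sum(x for x, _ in corners) / 4
    cy = sum(y for _, y in corners) / 4
    # Sorting by angle around the centroid visits the corners in a
    # consistent rotational order, which removes the crossing.
    ordered = sorted(corners, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))
    return ordered + [ordered[0]]


bowtie = [(0, 0), (10, 0), (0, 5), (10, 5), (0, 0)]
print(untangle_quad(bowtie))
```

Because the sort only depends on the corner positions, a rectangle that is already well-formed is mapped to an equivalent ring, so the function could be run over every affected zone without checking each one first.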
Invalid polygons in prod
Elements
| Corpus | Count |
|---|---|
| Total | 796 |
| PRImA | 427 |
| RDCL2019 | 131 |
| cBAD 2019 | 130 |
| HORAE \| DLA Annotations | 70 |
| Bozen | 15 |
| BNPP-archives | 8 |
| AN-Index | 6 |
| READ-BAD2017 | 5 |
| BALSAC \| Balsac annotations | 2 |
| Test Martin | 1 |
| CaptchAN | 1 |
SQL query
select coalesce(c.name, 'Total'), count(*)
from images_zone z
inner join documents_element e on (e.zone_id = z.id)
inner join documents_corpus c on (c.id = e.corpus_id)
where not ST_IsRing(z.polygon)
group by rollup(c.name)
order by count(*) desc;
Transcriptions
| Source | Count |
|---|---|
| Total | 1305563 |
| litis_lexical_ocr | 653072 |
| litis_raw_ocr | 649828 |
| manual | 2550 |
| transkribus | 108 |
| kaldi_bnpp | 2 |
| kaldi_bnpp_03 | 1 |
| kaldi_bnpp_04 | 1 |
| kaldi_bnpp_oscar | 1 |
SQL query
select coalesce(ds.slug, 'Total'), count(*)
from images_zone z
inner join documents_transcription t on (t.zone_id = z.id)
inner join documents_datasource ds on (ds.id = t.source_id)
where not ST_IsRing(z.polygon)
group by rollup(ds.slug)
order by count(*) desc;
Edited by Erwan Rouchet
