Add a unique constraint on ElementPath
While it is possible to have the element have multiple paths with the same direct parent:
X Y
\ /
A
|
B
Element | Path | Ordering |
---|---|---|
A | X | 0 |
A | Y | 0 |
B | X, A | 0 |
B | Y, A | 0 |
It should not be possible to have the exact same path, including the other ancestors ; B in X,A should not happen twice.
To check for the amount of duplicates:
SELECT COUNT(*) - COUNT(DISTINCT (element_id, path)) FROM documents_elementpath
To find the top-level elements that hold some duplicated paths (useful to check the impact of the deduplication):
SELECT DISTINCT corpus.name, element.*
FROM documents_element element
INNER JOIN documents_corpus corpus ON (element.corpus_id = corpus.id)
INNER JOIN documents_elementpath path ON (element.id = path[1])
WHERE path.id IN (
SELECT a.id
FROM (
SELECT id, ROW_NUMBER() OVER (PARTITION element_id, path) > 1 AS is_duplicate
FROM documents_elementpath
) AS a WHERE is_duplicate IS TRUE
);
-
60 048 duplicated paths in preprod, all belonging to three out of four sets in GoLFor-DEEPN named
nep_docs
-
128 148 duplicated paths in prod:
- 4 sets named
committee_members_set[0-3]
inGoLFor-DEEPN | Full
-
BNF Zoologie
in CaptchAN -
HWR_annotation_set_2020_12
inMCIT | Arabic OCR
-
The_Originals
inBNPP | Archives
-
Letters
andLetters_split
inPublic | RIMES
-
Single page
inHORAE | Complete
Interesting note: all of those belong to set/volume set/dataset types.
- 4 sets named
This does not fix any of the underlying issues that may have caused those duplicate paths. I could not find any obvious case that would lead to the apparition of such duplicated paths, I guess Sentry will tell us all about it with this constraint.
Deleting the 60K paths in preprod should take ~20 seconds, I do not know about the prod.
Judging from some extremely complex analysis of the element paths found in the HORAE folder mentioned in #769 (aka just looking at the result of some SELECT queries), this unique constraint should solve the issues we noticed. The exclusion constraints in #762 (closed) are still important because they solve different issues that will not be covered by this unique constraint.