Skip to content

Add a unique constraint on ElementPath

Erwan Rouchet requested to merge unique-paths into master

While it is possible to have the element have multiple paths with the same direct parent:

X Y
\ /
 A
 |
 B
Element Path Ordering
A X 0
A Y 0
B X, A 0
B Y, A 0

It should not be possible to have the exact same path, including the other ancestors ; B in X,A should not happen twice.

To check for the amount of duplicates:

SELECT COUNT(*) - COUNT(DISTINCT (element_id, path)) FROM documents_elementpath

To find the top-level elements that hold some duplicated paths (useful to check the impact of the deduplication):

SELECT DISTINCT corpus.name, element.*
FROM documents_element element
INNER JOIN documents_corpus corpus ON (element.corpus_id = corpus.id)
INNER JOIN documents_elementpath path ON (element.id = path[1])
WHERE path.id IN (
    SELECT a.id
    FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION element_id, path) > 1 AS is_duplicate
        FROM documents_elementpath
    ) AS a WHERE is_duplicate IS TRUE
);
  • 60 048 duplicated paths in preprod, all belonging to three out of four sets in GoLFor-DEEPN named nep_docs

  • 128 148 duplicated paths in prod:

    • 4 sets named committee_members_set[0-3] in GoLFor-DEEPN | Full
    • BNF Zoologie in CaptchAN
    • HWR_annotation_set_2020_12 in MCIT | Arabic OCR
    • The_Originals in BNPP | Archives
    • Letters and Letters_split in Public | RIMES
    • Single page in HORAE | Complete

    Interesting note: all of those belong to set/volume set/dataset types.

This does not fix any of the underlying issues that may have caused those duplicate paths. I could not find any obvious case that would lead to the apparition of such duplicated paths, I guess Sentry will tell us all about it with this constraint.

Deleting the 60K paths in preprod should take ~20 seconds, I do not know about the prod.

Judging from some extremely complex analysis of the element paths found in the HORAE folder mentioned in #769 (aka just looking at the result of some SELECT queries), this unique constraint should solve the issues we noticed. The exclusion constraints in #762 (closed) are still important because they solve different issues that will not be covered by this unique constraint.

Edited by Erwan Rouchet

Merge request reports

Loading