Skip to content
Snippets Groups Projects

Add a unique constraint on ElementPath

Merged Erwan Rouchet requested to merge unique-paths into master

While it is possible to have the element have multiple paths with the same direct parent:

X Y
\ /
 A
 |
 B
Element Path Ordering
A X 0
A Y 0
B X, A 0
B Y, A 0

It should not be possible to have the exact same path, including the other ancestors ; B in X,A should not happen twice.

To check for the amount of duplicates:

SELECT COUNT(*) - COUNT(DISTINCT (element_id, path)) FROM documents_elementpath

To find the top-level elements that hold some duplicated paths (useful to check the impact of the deduplication):

SELECT DISTINCT corpus.name, element.*
FROM documents_element element
INNER JOIN documents_corpus corpus ON (element.corpus_id = corpus.id)
INNER JOIN documents_elementpath path ON (element.id = path[1])
WHERE path.id IN (
    SELECT a.id
    FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION element_id, path) > 1 AS is_duplicate
        FROM documents_elementpath
    ) AS a WHERE is_duplicate IS TRUE
);
  • 60 048 duplicated paths in preprod, all belonging to three out of four sets in GoLFor-DEEPN named nep_docs

  • 128 148 duplicated paths in prod:

    • 4 sets named committee_members_set[0-3] in GoLFor-DEEPN | Full
    • BNF Zoologie in CaptchAN
    • HWR_annotation_set_2020_12 in MCIT | Arabic OCR
    • The_Originals in BNPP | Archives
    • Letters and Letters_split in Public | RIMES
    • Single page in HORAE | Complete

    Interesting note: all of those belong to set/volume set/dataset types.

This does not fix any of the underlying issues that may have caused those duplicate paths. I could not find any obvious case that would lead to the apparition of such duplicated paths, I guess Sentry will tell us all about it with this constraint.

Deleting the 60K paths in preprod should take ~20 seconds, I do not know about the prod.

Judging from some extremely complex analysis of the element paths found in the HORAE folder mentioned in #769 (aka just looking at the result of some SELECT queries), this unique constraint should solve the issues we noticed. The exclusion constraints in #762 (closed) are still important because they solve different issues that will not be covered by this unique constraint.

Edited by Erwan Rouchet

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
Please register or sign in to reply
Loading