
Unlink all children in DestroyElement

Erwan Rouchet requested to merge unlink-children into master

Closes #1530 (closed), but not #795, as this can only unlink the child elements of one parent at a time.

This adds an element_delete task that just calls element_trash; when delete_children is not set, it instead removes the paths to the child elements. This task reuses the timeout setting of element_trash, as it is more of an implementation detail than a completely separate task.


For #795, we will need to be able to unlink on any queryset of parents. We cannot take shortcuts by relying on any assumption other than all elements being in the same corpus, because many combinations of filters on DestroyElement will break those assumptions. For example, nothing guarantees that all parent elements share the same grandparent elements, so this needs to work on any queryset.

The algorithm is as follows:

  1. Recursively delete, on the child elements, all paths going through the deleted parents, but only for elements that have at least one path that does not include any of the parent elements to unlink. Nothing else needs to be done for those elements, since their other paths maintain the hierarchy between the child elements and their grandchildren.

  2. On all the other descendants, delete all paths other than the first one. We will be updating just that one path; not deleting the others would cause duplicates to appear.

  3. Update the remaining paths to remove the prefixes of their parents.
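As a rough SQL sketch, the three steps above might look like this. The to_unlink table name is mine, not the actual implementation, and this simplified version ignores paths that go through several unlinked parents at once:

```sql
-- Step 1: delete the paths going through the unlinked parents, but only
-- for elements that keep at least one path avoiding all of them.
DELETE FROM documents_elementpath p
WHERE p.path && ARRAY(SELECT id FROM to_unlink)
  AND EXISTS (
    SELECT 1 FROM documents_elementpath other
    WHERE other.element_id = p.element_id
      AND NOT other.path && ARRAY(SELECT id FROM to_unlink)
  );

-- Step 2: on the remaining descendants, keep only one path per element,
-- since updating multiple paths in place would create duplicates.
DELETE FROM documents_elementpath p
WHERE p.path && ARRAY(SELECT id FROM to_unlink)
  AND p.id NOT IN (
    SELECT DISTINCT ON (element_id) id
    FROM documents_elementpath
    WHERE path && ARRAY(SELECT id FROM to_unlink)
    ORDER BY element_id, id
  );

-- Step 3: strip everything up to and including the unlinked parent
-- from the remaining paths.
UPDATE documents_elementpath p
SET path = p.path[array_position(p.path, u.id) + 1:]
FROM to_unlink u
WHERE p.path && ARRAY[u.id];
```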

Those queries are not that difficult to write. The main issue is listing all child elements of any number of parents. The most sensitive part, performance-wise, is filtering on ElementPath.path, and I could not find any way to filter on it efficiently with an SQL query.

The GIN index on documents_elementpath.path can only be used with the =, @>, <@ and && filters; anything else is likely to stop using indexes entirely, or try to use the B-tree indexes used by unique constraints, which are rarely, if ever, efficient.
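As an illustration (the UUID is just a placeholder), a whole-array containment filter can hit the GIN index, while anything positional cannot:

```sql
-- @>, <@, && and = match the GIN operator class, so the index is usable:
EXPLAIN SELECT element_id FROM documents_elementpath
WHERE path @> ARRAY['00000000-0000-0000-0000-000000000000'::uuid];

-- A positional filter is invisible to the GIN index and falls back
-- to a Seq Scan or an ill-suited B-tree index:
EXPLAIN SELECT element_id FROM documents_elementpath
WHERE path[1] = '00000000-0000-0000-0000-000000000000';
```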

For my testing, I used a to_delete temporary table containing a sample of approximately 0.01% of my element paths, so a few thousand paths. See the FROM docs to learn more about TABLESAMPLE, which is far faster than an ORDER BY RANDOM().

```sql
CREATE TEMPORARY TABLE to_delete AS
SELECT DISTINCT path[array_length(path, 1)] AS id
FROM documents_elementpath TABLESAMPLE SYSTEM (0.01);
```

I will then use SELECT id FROM to_delete as a generic query that returns a bunch of parents whose children should be unlinked. In reality, this query would be the element QuerySet that we use in ElementManager.trash(). Let's try a basic query to find all descendants recursively:

```sql
SELECT id
FROM documents_elementpath
WHERE path && ARRAY(SELECT id FROM to_delete);
```

This would be a bad idea in practice, because assembling an array of hundreds of thousands of IDs would use a lot of RAM, but we can try it on this smaller sample. The query ran for over an hour before I cancelled it. The estimated cost was 882825.28, which is not great but far from the worst I've ever seen, and I wouldn't have expected this query to take so long.

Let's try without using an array. We want to find the ElementPath where the path contains any of the IDs in the query, so we could filter with UNNEST:

```sql
SELECT *
FROM (
    SELECT id, element_id, UNNEST(path) AS ancestor_id
    FROM documents_elementpath
) a
WHERE ancestor_id IN (SELECT id FROM to_delete);
```

Table-returning functions cannot be used directly in a WHERE clause, so we use a subquery and filter on the column it returns. Here, the estimated cost balloons to 11060563.67. This time the query did complete, but it took minutes. Since it relies on a Seq Scan, the runtime depends on how big the whole instance is (how many paths there are in total). That could cause significant issues with transaction management, or with replication.

Instead of using one very large array, I tried matching against the paths of the parent elements. To find child elements recursively, we look for paths that contain the parent path. But since the @> operator cares about neither ordering nor duplicates, we need an extra check to make sure the child path actually starts with the parent path.

```sql
SELECT child_paths.*
FROM
    documents_elementpath parent_paths,
    documents_elementpath child_paths
WHERE
    parent_paths.element_id IN (SELECT id FROM to_delete)
    AND child_paths.path && parent_paths.path
    AND child_paths.path[:array_length(parent_paths.path, 1) + 1] = parent_paths.path || parent_paths.element_id;
```

This was the worst query, with an estimated cost of 8106336462.31. PostgreSQL just gives up: it only uses an index to find the parent paths and does nothing for the child paths. I think part of the reason is that it expects to retrieve such a large number of rows that scanning the whole table looks worthwhile.

I also tried an optimization that would only really help when deleting a lot of parent elements that all share the same grandparents, by combining the previous queries. First get the parent paths just like above, but then combine all of those paths into one array, deduplicating the IDs. If the parents have common paths, this reduces the size of the array significantly. PostgreSQL understands this and does use indices all the way this time, but it still gets excruciatingly slow, and I cancelled the query again after an hour. This query does not even include the extra check to match the exact parent paths, which would slow it down further.

```sql
SELECT *
FROM documents_elementpath
WHERE path && (
    SELECT array_agg(DISTINCT ancestor_id)
    FROM
        (
            SELECT path
            FROM documents_elementpath
            WHERE element_id IN (SELECT id FROM to_delete)
        ) paths,
        LATERAL UNNEST(path) ancestor_id
);
```
