
Optimize SQL queries for Solr reindexation

Erwan Rouchet requested to merge optimize-solr-sql into master

Closes #777 (closed) 🎰

The query was extremely slow due to many nested loops; the query planner estimated 112 billion rows for HORAE in prod.

A first optimization brought the single SQL query from 51 seconds down to 500 ms on a large corpus locally:

SELECT COALESCE(parent.id, element.id) as parent_id,
    COALESCE(parent.name, element.name) as parent_name,
    COALESCE(parenttype.display_name, elementtype.display_name) as parent_type,
    element.id as id,
    element.name as name,
    elementtype.display_name as type_name
FROM documents_element as element
LEFT OUTER JOIN documents_elementpath as elementpath ON (element.id = elementpath.element_id)
LEFT OUTER JOIN documents_element as parent ON (elementpath.path @> ARRAY[parent.id])
LEFT OUTER JOIN documents_elementtype parenttype ON (parent.type_id = parenttype.id)
INNER JOIN documents_elementtype elementtype ON (element.type_id = elementtype.id)
WHERE parent.corpus_id = 'b90500b8-3203-4aa1-9295-d96f396cd282'
AND element.corpus_id = 'b90500b8-3203-4aa1-9295-d96f396cd282'
AND parenttype.indexable

This changed the ON clause of the parent left join to use only the @> operator, which can benefit from the GIN index on ElementPath; the previous @> … OR id = … caused Postgres to simply bypass the index. Additionally, the corpus ID filter is now applied to both element tables, speeding up the sequential scans Postgres insisted on taking. COALESCE makes it possible to fall back to returning the element itself as its own parent, the edge case that allows the Solr indexation to index the indexable elements themselves and not just their children.

I then rewrote the query to use a CTE and a UNION ALL to first fetch only the indexable elements, then fetch their children, replacing the LEFT JOINs everywhere with INNER JOINs. This lets Postgres filter more aggressively and rely more on indexes, but it could still take excessively long in prod. Splitting the union into two queries allowed the first part (indexable elements only) to run fairly quickly, taking one minute to extract about 10% of everything we would be indexing in HORAE (~280K rows), but the child part (26M rows) was still too slow.
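For reference, the rewritten query would look roughly like this. This is a hypothetical reconstruction of the CTE + UNION ALL shape described above, reusing the tables and columns from the first query; it is not the actual backend query.

```python
# Rough sketch (NOT the actual backend query) of the CTE + UNION ALL shape
# described above, built from the tables shown in the first query.
CTE_UNION_SQL = """
WITH indexable AS (
    SELECT element.id, element.name, elementtype.display_name AS type_name
    FROM documents_element element
    INNER JOIN documents_elementtype elementtype
        ON (element.type_id = elementtype.id)
    WHERE element.corpus_id = %(corpus)s
      AND elementtype.indexable
)
-- First part: indexable elements, acting as their own parents
SELECT id AS parent_id, name AS parent_name, type_name AS parent_type,
       id, name, type_name
FROM indexable
UNION ALL
-- Second part: children of the indexable elements, found through the
-- @> operator so the GIN index on ElementPath can be used
SELECT indexable.id, indexable.name, indexable.type_name,
       element.id, element.name, elementtype.display_name
FROM indexable
INNER JOIN documents_elementpath elementpath
    ON (elementpath.path @> ARRAY[indexable.id])
INNER JOIN documents_element element
    ON (element.id = elementpath.element_id)
INNER JOIN documents_elementtype elementtype
    ON (element.type_id = elementtype.id)
WHERE element.corpus_id = %(corpus)s
"""
```

The two branches of the union correspond to the two queries mentioned above once the union is split.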

I ended up keeping the first query, but split the second so that it processes batches of indexable elements. Because sending batches of UUIDs in SQL queries can be quite slow, mostly due to the SQL generation in Python and the later re-parsing by Postgres, I put the first query in a CTE and used LIMIT and OFFSET to tell Postgres to grab the batches by itself. To know when to stop chunking, I implemented a small iteration helper in CounterIterator, which keeps track of the number of elements returned by a QuerySet.iterator() without having to run a COUNT(*). I ran both queries directly in prod to check: with chunks of 10,000 parents, each query takes about a minute, which sounds acceptable to me as it amounts to roughly 30 minutes of SQL query execution. The Solr indexation part takes much longer than that!
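A minimal sketch of what such a counting helper and chunking loop could look like. The actual CounterIterator lives in the backend; the names and the run_query callable here are illustrative assumptions.

```python
class CounterIterator:
    """Wraps any iterator and counts the items it yields,
    without consuming it upfront or running a COUNT(*)."""

    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        # Propagates StopIteration without incrementing the count
        value = next(self.iterator)
        self.count += 1
        return value


def iter_chunks(run_query, chunk_size=10000):
    """Fetch batches of parents until a batch comes back short,
    meaning the CTE has been exhausted.

    run_query is assumed to execute the CTE query with the given
    LIMIT/OFFSET and return an iterator over rows."""
    offset = 0
    while True:
        rows = CounterIterator(run_query(limit=chunk_size, offset=offset))
        yield from rows
        if rows.count < chunk_size:
            break
        offset += chunk_size
```

The short final batch doubles as the stop condition, so no extra counting query is ever needed.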

As an aside, assuming you set every element type in a corpus to be indexable, this query shows the exact number of rows you will have to deal with:

SELECT SUM(array_length(path, 1))
FROM documents_elementpath p
INNER JOIN documents_element e ON (e.id = p.element_id)
WHERE e.corpus_id = '...'::uuid;
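To see why summing path lengths gives the row count: each element contributes one row per ancestor in its path, so with every type indexable the total is just the sum of all path lengths. A toy illustration in Python (the data is made up):

```python
# Toy data: each element's path is the list of its ancestor IDs.
# (Made-up values, only to illustrate the counting.)
paths = {
    "page1": ["book"],            # 1 ancestor -> 1 indexed row
    "line1": ["book", "page1"],   # 2 ancestors -> 2 indexed rows
    "line2": ["book", "page1"],   # 2 ancestors -> 2 indexed rows
}

# Equivalent of SUM(array_length(path, 1)):
# one row per (ancestor, element) pair.
rows = sum(len(path) for path in paths.values())  # rows == 5
```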

This query showed 35 million elements would be indexed in Solr, not the 112 billion Postgres was expecting.

Edited by Erwan Rouchet
