Filter on the corpus ID of ElementTypes
After a lot of poking and prodding in pgAdmin, I found out that when using the &folder=
query parameter can cause a significant slow down when there are a lot of projects at once. The current query plans work like this:
- Bitmap index scan + heap scan — use
corpus_id
index on elements to filter them by corpus ID, and retrieve all the rows that match. - Sequential scan on the entire ElementType table to filter by
folder
. - Hash the ElementTypes (814 folder types in preprod).
- Hash Join between all those types and all the elements.
- Nested Loop join with the
unique_element_paths
index to find elements without paths. - Sort
- Gather from the two worker processes that were used to perform the sort, as Postgres estimates it to be quite large.
The bottleneck seemed to come from this final Gather/Merge, which caused the entire query to be split into two. This seemed quite unnecessary, especially for the top_level
queries since those usually return a pretty low amount of elements.
After adding an extra WHERE documents_elementtype.corpus_id = '…'
, the query plan changes completely:
- Bitmap index scan + heap scan — use the
corpus_id
index on types to find the types of the current corpus and retrieve them - Bitmap AND + heap scan (use indexes then retrieve all the rows that match) between two indexes:
- Bitmap index scan using the
type_id
index on elements to find elements that have the few types (only one on most corpora) - Bitmap index scan using the
corpus_id
index on elements to find elements in this corpus
- Bitmap index scan using the
- Nested loop to perform the actual join between elements and types and ensure integrity
- Nested Loop join with the
unique_element_paths
index to find elements without paths. - Sort.
No more multi-processing and no more hashing! Postgres estimates the cost on arkindex-dev to be of 6000 instead of 197000 for Horae.
Benchmark results using The bench script:
Value | Local 1.2.0-rc1 | Local+filter | Dev 1.2.0-rc1 | Dev+filter |
---|---|---|---|---|
Total Horae | 17.00 | 9.10 | 30.95 | 22.20 |
Average Horae | 3.40 | 1.82 | 6.19 | 4.44 |
Total Balsac | 6.39 | 6.26 | 13.16 | 13.17 |
Average Balsac | 1.28 | 1.25 | 2.63 | 2.63 |
This makes no change for Balsac since it uses a top-level type.