Skip to content

Filter on the corpus ID of ElementTypes

Erwan Rouchet requested to merge type-corpus-filter into master

After a lot of poking and prodding in pgAdmin, I found out that when using the &folder= query parameter can cause a significant slow down when there are a lot of projects at once. The current query plans work like this:

  1. Bitmap index scan + heap scan — use corpus_id index on elements to filter them by corpus ID, and retrieve all the rows that match.
  2. Sequential scan on the entire ElementType table to filter by folder.
  3. Hash the ElementTypes (814 folder types in preprod).
  4. Hash Join between all those types and all the elements.
  5. Nested Loop join with the unique_element_paths index to find elements without paths.
  6. Sort
  7. Gather from the two worker processes that were used to perform the sort, as Postgres estimates it to be quite large.

The bottleneck seemed to come from this final Gather/Merge, which caused the entire query to be split into two. This seemed quite unnecessary, especially for the top_level queries since those usually return a pretty low amount of elements.

After adding an extra WHERE documents_elementtype.corpus_id = '…', the query plan changes completely:

  1. Bitmap index scan + heap scan — use the corpus_id index on types to find the types of the current corpus and retrieve them
  2. Bitmap AND + heap scan (use indexes then retrieve all the rows that match) between two indexes:
    • Bitmap index scan using the type_id index on elements to find elements that have the few types (only one on most corpora)
    • Bitmap index scan using the corpus_id index on elements to find elements in this corpus
  3. Nested loop to perform the actual join between elements and types and ensure integrity
  4. Nested Loop join with the unique_element_paths index to find elements without paths.
  5. Sort.

No more multi-processing and no more hashing! Postgres estimates the cost on arkindex-dev to be of 6000 instead of 197000 for Horae.

Benchmark results using The bench script:

Value Local 1.2.0-rc1 Local+filter Dev 1.2.0-rc1 Dev+filter
Total Horae 17.00 9.10 30.95 22.20
Average Horae 3.40 1.82 6.19 4.44
Total Balsac 6.39 6.26 13.16 13.17
Average Balsac 1.28 1.25 2.63 2.63

This makes no change for Balsac since it uses a top-level type.

Edited by Erwan Rouchet

Merge request reports

Loading