Improve ListElementNeighbors performances

requested review from @erwanrouchet

assigned to @vrigal

added 3 commits

9f9a48f0 - Suggestions
e472b1b1 - Return the read ordering via the API
33db76e7 - Update tests

I compared performances between https://gitlab.com/teklia/arkindex/backend/-/merge_requests/2038/diffs?commit_id=33db76e71d3982c926bf8b397a514eaccf107cc9 and master, on the three examples from https://gitlab.com/teklia/arkindex/backend/-/issues/1011#note_1456641279 (debug toolbar info against the preprod dump hosted on xenarque) :

1st elt with 5k + neighbors

branch	SQL query ms	CPU ms
33db7	722	383 (1382)
master	6367	272 (6682)

nth elt with 5k + neighbors

branch	SQL query ms	CPU ms
33db7	695	184 (1130)
master	684	83 (999)

elt with 11 paths

branch	SQL query ms	CPU ms
33db7	597	159 (1062)
master	615	125 (980)

Not much improvement though, except for the first case that no longer explodes in terms of performance.

Using the real row position does not alter performances very much.
Dropping the element table join allows to gain up to 10% of perfs on those cases (requires to separate the top level case).

changed the description

mentioned in issue #1545 (closed)

DRF's HTML view tries to find a filter form, but we do not have any filter backends on this endpoint so it's useless. To do so, it calls get_queryset, which in this endpoint triggers SQL queries, which violates the best practices since it is known to cause problems. This causes all timings found by the Django debug toolbar to be duplicated, because suddenly there is twice as much work to do. I did some other timings with time curl and with the dev tools opened in the frontend, and consistently got 50% lower execution times.

By dropping the kernel pagecache, which Postgres uses to keep indices, tables or query results in memory, some ListElementNeighbors queries could take up to 14.6 seconds. They took at most 15.6 seconds on master. It is likely that we were struggling to reproduce some of the performance issues because of this; once everything is cached, it all appears to be running fine. The cache does have a size limit, and will never get filled if the endpoint times out, so on very large corpora that are accessed less often, or after a database restart, we might still see some lag.

I tested this problem in production by running the query against a corpus I often use for path performance tests, GoLForDEEPN | Full. This corpus has often caused us trouble. This element showed some similar execution times of 3 to 4 seconds without the cache, and 150ms with cache.

Removing the join on documents_element does not improve this issue, even though the query plans seem to show that this join is taking up most of the time. The window aggregation takes most of that time instead because it still computes all neighbors over all the elements with the same path before we can filter. I don't think this can be avoided without already knowing which elements are the neighbors of our element!

So I guess a quick way to fix the neighbors issue if we find it again is to download more RAM, which should increase the amount of caching we can do. We could also work on frontend#1102 (closed) to make the performance issue less visible — you don't get to see the neighbors, but you at least get access to the Actions and Display menus for example.

I might end up creating a follow-up issue to simplify the serializer; we don't care at all about compatibility, because only the frontend uses this endpoint, so we can always update the frontend. The current API actually makes it really hard to handle in the frontend, and we can easily simplify it:

The pagination is completely unnecessary, since get_neighbors will make an API request before DRF paginates, so we could remove it, which also means the frontend can now handle more than 6 parents at once (at the seventh parent, if you have a previous and a next element for each parent, then you get 21 elements which goes beyond the first page of results).
We can rename position to ordering, since we want to return the raw ordering and not a ROW_NUMBER().
We can switch from building three different objects, which then have to be re-combined by the frontend to detect which elements are the previous and next ones, into a single object with a previous and a next attribute, which also simplifies the backend implementation.

resolved all threads

added 8 commits

3132138d - 1 commit from branch master
d7046f88 - Improve neighbor list endpoint
de5c9e05 - Restore element_type filter
ee9bad16 - Update tests
17619a95 - Suggestions
98e56487 - Return the read ordering via the API
2a32084d - Update tests
176a371b - Suggestions

Compare with previous version

added 1 commit

746c6daf - Suggestions

Compare with previous version

mentioned in issue #1553 (closed)

added 9 commits

746c6daf...50de85c8 - 2 commits from branch master
d2426143 - Improve neighbor list endpoint
e2066cf6 - Restore element_type filter
7d12869c - Update tests
54c68e07 - Suggestions
4cd2d284 - Return the read ordering via the API
ac3fad1f - Update tests
59193dc8 - Suggestions

Compare with previous version

enabled an automatic merge when the pipeline for 59193dc8 succeeds

merged

changed milestone to %Arkindex 1.5.0

mentioned in merge request frontend!1527 (merged)

Improve ListElementNeighbors performances

Merge request reports

Activity

1st elt with 5k + neighbors

nth elt with 5k + neighbors

elt with 11 paths