Avoid a DISTINCT in SelectProcessFailures
https://redmine.teklia.com/issues/10859
SelectProcessFailures uses a query that retrieves all distinct element IDs from the failed worker activities. It first calls .exists() to verify whether any elements do exist, then uses .iterator() to list them all and select the elements. The .exists() call results in a query of this type:
SELECT DISTINCT ON (element_id) 1 AS a
FROM process_workeractivity
WHERE process_id = '...'::uuid AND state = 'error'
LIMIT 1;
This query gets cancelled automatically in production after about 20 seconds, without any explicit reason. The query plan is as follows:
Limit  (cost=1000.59..13892.96 rows=1 width=20)
  ->  Unique  (cost=1000.59..3907389.36 rows=303 width=20)
        ->  Gather Merge  (cost=1000.59..3907388.61 rows=303 width=20)
              Workers Planned: 2
              ->  Parallel Index Scan using process_workeractivity_element_id_9a98b12a on process_workeractivity  (cost=0.56..3906353.61 rows=126 width=20)
                    Filter: ((process_id = '...'::uuid) AND ((state)::text = 'error'::text))
Using two separate processes and gathering their results into one is generally slow, and the index scan used here is useless. It just scans through all element_id on the table, and the filtering happens by retrieving each row from the table (this is an index scan and not an index-only scan, so the table data is also fetched).
We do need to make element IDs distinct when selecting them, since we could have the same element duplicated for multiple worker versions, but we don't need that to just check that they exist. Using DISTINCT instead of DISTINCT ON (element_id) already improves the plan significantly:
Limit  (cost=0.56..211.63 rows=1 width=4)
  ->  Limit  (cost=0.56..211.63 rows=1 width=4)
        ->  Index Scan using process_workeractivity_process_id_b86dbdd6 on process_workeractivity  (cost=0.56..63954.58 rows=303 width=4)
              Index Cond: (process_id = 'ebc56551-43e8-4b82-9de7-d6658a989551'::uuid)
              Filter: ((state)::text = 'error'::text)
And not using any DISTINCT at all is ideal:
Limit  (cost=0.56..211.63 rows=1 width=4)
  ->  Index Scan using process_workeractivity_process_id_b86dbdd6 on process_workeractivity  (cost=0.56..63954.58 rows=303 width=4)
        Index Cond: (process_id = 'ebc56551-43e8-4b82-9de7-d6658a989551'::uuid)
        Filter: ((state)::text = 'error'::text)
This uses an index on process_id, which means we only fetch table rows for the activities on this specific process. We could go even further by adding a new index on (process_id, state) INCLUDE (element_id), but that's overkill here, the query without a DISTINCT already takes less than 100ms.
For the query that retrieves all element IDs, the query plan is already good and no change is needed for now:
SELECT DISTINCT element_id
FROM process_workeractivity
WHERE process_id = '...'::uuid AND state = 'error';
Unique  (cost=57228.39..57229.90 rows=303 width=16)
  ->  Sort  (cost=57228.39..57229.15 rows=303 width=16)
        Sort Key: element_id
        ->  Bitmap Heap Scan on process_workeractivity  (cost=181.19..57215.90 rows=303 width=16)
              Recheck Cond: (process_id = 'ebc56551-43e8-4b82-9de7-d6658a989551'::uuid)
              Filter: ((state)::text = 'error'::text)
              ->  Bitmap Index Scan on process_workeractivity_process_id_b86dbdd6  (cost=0.00..181.11 rows=16073 width=0)
                    Index Cond: (process_id = 'ebc56551-43e8-4b82-9de7-d6658a989551'::uuid)