Optimize queries in DataImport list API

Erwan Rouchet requested to merge optimize-dataimport-list into master

Requires ponos!17

A GET on the DataImport list API ran 104 Django queries, 97 of which were duplicates. Removing half of them was easy, using just prefetch_related('workflow__tasks', 'failures'), but the rest required more work.

The DataImport.state property returns Unscheduled if there is no workflow associated with the import; if there is one, it returns the value of the Workflow.state property. That property made two SQL queries: one to fetch the last run number, and one to fetch the state of all tasks in that last run. It would be possible to merge these into a single query, but there would still be 20 duplicate queries.

Since the second query relied on a .filter() (a WHERE clause), Django ignored the existing prefetch_related cache and re-ran the query. To prevent this, some ugly code was added to Ponos (see ponos!17) to detect the presence of a prefetch cache on tasks; if there is one, a Python generator filters the cached tasks instead of running a query.
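As an illustration only, the prefetch check could look roughly like the sketch below. This is not the code from ponos!17: tasks_for_run is a hypothetical helper, the run field name is assumed, and _prefetched_objects_cache is a private Django attribute whose use here is an assumption about how such detection might be done. The demo at the end uses plain stand-in objects rather than real models.

```python
from types import SimpleNamespace


def tasks_for_run(workflow, run):
    """Return the tasks of `workflow` that belong to `run`.

    Hypothetical helper: if `tasks` was prefetched, filter in Python
    instead of calling .filter(), which would issue a new SQL query.
    """
    # Django keeps prefetch_related() results in this private dict, keyed
    # by relation name (an implementation detail, assumed here).
    cache = getattr(workflow, "_prefetched_objects_cache", {})
    if "tasks" in cache:
        # Generator over the already-loaded tasks: no SQL query.
        return (task for task in cache["tasks"] if task.run == run)
    # No prefetch cache: fall back to a regular filtered query.
    return workflow.tasks.filter(run=run)


# Stand-in objects for a workflow whose tasks were prefetched.
prefetched_workflow = SimpleNamespace(
    _prefetched_objects_cache={
        "tasks": [
            SimpleNamespace(run=0, state="failed"),
            SimpleNamespace(run=1, state="completed"),
        ]
    }
)
last_run_states = [task.state for task in tasks_for_run(prefetched_workflow, 1)]
```

The trade-off is that the filtering logic now exists twice, once in SQL and once in Python, which is why the description calls it ugly.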

For the first query, which fetches the last run number, merely annotating the DataImport queryset is not enough, because Workflow would not be able to access the annotation. A Workflow.get_state(run) method is therefore added to let a consumer supply the run number itself. Workflow.state calls get_state(get_last_run()) so that it behaves just as before, but DataImport.state is modified to use its last_run annotation if it is available.
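The control flow described above can be sketched with plain Python classes standing in for the Django models. Everything beyond the names state, get_state, get_last_run, last_run and Unscheduled is an assumption: tasks are modeled as (run, state) tuples, last_run_lookups is a hypothetical counter standing in for the SQL query, and the aggregation in get_state is a placeholder, not Ponos's actual rules.

```python
class Workflow:
    def __init__(self, tasks):
        self.tasks = tasks  # list of (run, state) tuples, a stand-in for Task rows
        self.last_run_lookups = 0  # hypothetical counter standing in for SQL queries

    def get_last_run(self):
        # In the real model this is the query that fetches the last run number.
        self.last_run_lookups += 1
        return max((run for run, _ in self.tasks), default=0)

    def get_state(self, run):
        # The caller supplies the run number, so no lookup happens here.
        states = [state for task_run, state in self.tasks if task_run == run]
        # Placeholder aggregation, NOT the rules used by Ponos.
        if states and all(state == "Completed" for state in states):
            return "Completed"
        return "Running"

    @property
    def state(self):
        # Behaves as before: look up the last run, then compute its state.
        return self.get_state(self.get_last_run())


class DataImport:
    def __init__(self, workflow=None):
        self.workflow = workflow

    @property
    def state(self):
        if self.workflow is None:
            return "Unscheduled"
        # A queryset annotation would appear as a plain attribute on the instance.
        last_run = getattr(self, "last_run", None)
        if last_run is not None:
            return self.workflow.get_state(last_run)
        return self.workflow.state
```

With this shape, an instance that carries a last_run attribute never triggers get_last_run(), while instances loaded without the annotation keep the old behaviour.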
