Skip to content

Draft: Postgres full text search test

Erwan Rouchet requested to merge pgfts into master

Closes #651 (closed)

Performance

Quick local tests show that this endpoint can take, in the worst case, one second to perform a search on 4.8 million transcriptions of the GoLFoR corpus from a preprod database dump. It usually takes 100 to 200ms, but the more specific the query gets, the faster it goes. If we run into large performance issues in prod, using the same custom paginator as for ListProcessElements, which removes the count, could remove most of the query time.

Transcription creation via CreateElementTranscriptions is largely unaffected; the changes are not really noticeable in my benchmarks. The migration that created both the generated column and the GIN index took 40 seconds (8µs per transcription).

Sorting

It is possible to order by result relevance using SearchRank, and it is possible to take the transcription score into account, for example using score + rank. Any ORDER BY clause added up to 200ms to query time in my tests, including the optimized ordering by string UUID used in ListElementTranscriptions.

Text parsing

This hardcodes the simple search configuration as it is the most basic one and the only default one that does not depend on a particular language. A language column could be added to a transcription to allow specifying the language on each transcription, with a generated column using to_tsvector(language, text); however this requires us to handle the possible custom search configurations that a database might provide, and prevent setting transcriptions to unknown configurations.

Django integration

This uses a column that Django does not know about to generate the tsvector and allow it to be indexed. To force Django to understand the field exists only in search querysets, a RawSQL expression is used. Adding the field to the model causes much larger issues with insertions and updates as Django tries to insert into the generated column, and the very few possible hacks are complex (see django-readonly-field for an example).

Using the generated column alone, or the index alone, caused barely any changes in execution time from using the basic text__search lookup, whereas using both the index and the generated column made searches up to 80 times faster.

As Postgres will simply do a NOTICE for queries that would return absolutely everything (like a or -a) or queries that are not indexable (-a alone), this adds querytree as a QueryTree database function to check whether or not the query should be actually searched.

Can this replace ES?

This endpoint can be quickly adapted to replace the IIIF transcription search and already remove one ES index, since this is the simplest search. Even though this is not really the main goal, we could technically quickly replace the existing entity search with a similar system too, only leaving the element search.

For the element search, reaching exact feature parity with the existing API would be a mess, mostly due to results being grouped by element. The date filters would require much more work as InterpretedDates are not stored in Postgres and can come from metadata, entities on metadata or entities on transcription, which will quickly become nightmarish to handle in SQL. However, the element type and minimum score filters are easy.

Rewriting the search to be able to somehow display the page, not the text_line that has a transcription, would have multiple issues:

  • How do we determine what element should be displayed? This could require some kind of 'edge detection' between folders and regular elements or elements with/without zones in an element's path, which would probably be highly complex in SQL;
  • How do we handle pagination? Either we group everything then paginate, which on the database's side makes pagination completely useless, or we return transcriptions and have the frontend deal with the grouping, which would cause display issues as clicking Next could display another portion of the same element for example.
Edited by Erwan Rouchet

Merge request reports

Loading