[ckan-changes] [ckan/ckan] c033ef: [#1792] Optimize FTS queries on non-indexed fields

GitHub noreply at github.com
Thu Aug 7 19:59:07 UTC 2014

  Branch: refs/heads/1792-filterable-resource-views
  Home:   https://github.com/ckan/ckan
  Commit: c033efbb7f660a0ac03eee731a68439580a2d4b9
  Author: Vitor Baptista <vitor at vitorbaptista.com>
  Date:   2014-08-07 (Thu, 07 Aug 2014)

  Changed paths:
    M ckanext/datastore/db.py
    M ckanext/datastore/helpers.py
    M ckanext/datastore/plugin.py
    M ckanext/datastore/tests/test_db.py
    M ckanext/datastore/tests/test_helpers.py
    M ckanext/datastore/tests/test_plugin.py
    M ckanext/datastore/tests/test_search.py

  Log Message:
  [#1792] Optimize FTS queries on non-indexed fields

As per the commit aa51630, we can't create full-text search indexes on all
field types. Specifically, we're only indexing textual and numeric fields. Even
though we don't index other fields, the users are still able to do FTS queries
on them, and we also use them on the autocomplete, so we'd like them to be
relatively fast.

When creating a datastore table, we add a "_full_text" column with the content
of all the other columns. This is used when the user wants to do a FTS on the
entire resource, regardless of columns.

The idea behind the optimization in this patch is to use that "_full_text"
column to limit the number of rows Postgres will search. While testing, I've
used a sample table with 1 million rows and 540 MB. Among others, it has a
column named "timestamp" with (obviously) a timestamp. There're no indexes on
this column. I've done 2 tests: (a) do a FTS only on the timestamp column, and;
(b) do a FTS on both the timestamp and _full_text columns.

The results below are the average of running the tests 5 times.

(a) Querying for '06' just on the timestamp field took 5682ms;
(b) Querying for '06' both on the _full_text and timestamp fields took 90ms

For this test, querying for both _full_text and timestamp made the query ~63
times faster. Unfortunately, the result depends a lot on the query terms and
the data. On our test data, every timestamp is from 2014. If instead of
searching for '06' we search for '2014', this is what we get (again, repeated 5
times and averaged out):

(a) Querying for '2014' just on the timestamp field took 12455ms;
(a) Querying for '2014' just on the _full_text and timestamp fields took 14534ms;

Using the _full_text field made this query ~17% slower. The problem here is
that this optimization only makes sense if quering the _full_text reduces the
number of rows that we have to query on the timestamp field. When that's not
the case, it makes the query slower.

In my opinion, this patch is still worthy.

More information about the ckan-changes mailing list