Getting Started with Django and PostgreSQL Full-Text Search

Monday, September 1, 2014

After hearing a few months back that PostgreSQL had built-in full-text search support, I had been continually searching for reasons to take advantage of it. A few months went by, and when one of our clients at Lionheart started running into issues with slowness using the Django admin's built-in search support, I knew I had a solid candidate. After all was said and done, searches returned results approximately 20x faster than vanilla Django. Here's how you can do it too.

The Basics ¶

One of the great things about PostgreSQL full-text search is that there isn't too much to learn if you're already pretty familiar with search engines and how they work. The main idea is that you generate an index using a document containing terms you'd like to search on. PostgreSQL provides a lot of flexibility with regards to how you choose to do this, but I opted to store the documents as a field called fts_document on the table we wanted to search on.

PostgreSQL stores search data in a data type called a tsvector. Here's how the PostgreSQL documentation describes it:

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word. Sorting and duplicate-elimination are done automatically during input.

Here's an example of a sentence converted to a tsvector:

=# SELECT to_tsvector('simple', 'The quick brown fox jumps over the lazy dog.');
                                to_tsvector
---------------------------------------------------------------------------
 'brown':3 'dog':9 'fox':4 'jumps':5 'lazy':8 'over':6 'quick':2 'the':1,7
(1 row)

When you want to search for something, you create what's called a tsquery.

A tsquery value stores lexemes that are to be searched for, and combines them honoring the Boolean operators & (AND), | (OR), and ! (NOT). Parentheses can be used to enforce grouping of the operators

Here's an example:

dan=# SELECT to_tsquery('simple', 'fox | jumps');
   to_tsquery
-----------------
 'fox' | 'jumps'
(1 row)

When you want to test for a match, you use the @@ operator.

dan=# SELECT to_tsvector('simple', 'The quick brown fox jumps over the lazy dog.') @@ to_tsquery('simple', 'fox');
 ?column?
----------
 t

It's a match!

The @@ operator also provides a shorthand whereby you don't need to use the to_tsvector / to_tsquery functions explicitly. E.g.,

dan=# SELECT 'The quick brown fox jumps over the lazy dog.' @@ 'fox';
 ?column?
----------
 t

Sweet! However, you might be confused by the following:

dan=# SELECT 'The quick brown fox jumps over the lazy dog.' @@ 'fox | random';
 ?column?
----------
 f

Huh? We're searching for either the term "fox" or "random". This doesn't seem right...

Well, it turns out that when you specify plain text as an argument to the @@ operator, it wraps it in plainto_tsquery, which functions a little bit differently than the to_tsquery function that we've been playing around with above. Notably, it ignores every logical operator except &. So |'s, parentheses, etc., will all be ignored. Personally, I don't find it to be all that useful.

Also, as you might have noticed above, I provided an additional argument to both the to_tsvector and to_tsquery functions called 'simple'. This represents the simple text search configuration. Other text search configurations are language-based and strip out stopwards and other lexemes based on what language you might be indexing. Here's a few other options that I have available on my machine.

dan=# \dF
               List of text search configurations
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | turkish    | configuration for turkish language
(16 rows)

Play around with a few and see which one suits you best.

Django ¶

By now you know how to create search vectors and queries, search on some data, and specify a text search configuration. Let's get started on connecting these tools with your Django project.

Since tsvector and tsquery are both standard PostgreSQL data types, we can use them just as we might any other data type like int or varchar in a column definition. This is very handy. Otherwise, we would have to regenerate vectors on-the-fly, which might offset any speedup we get from using FTS in the first place. :)

For the sake of moving forward, let's say we have a model called Employee in our Django project. We're trying to speed up searches on anything related to an employee. Assume we're storing their first name, last name, and full address.

class Employee(db.Model):
    first_name = models.CharField(max_length=64)
    last_name = models.CharField(max_length=64)
    address_1 = models.TextField()
    address_2 = models.TextField()
    city = models.CharField(max_length=64)
    state = models.CharField(max_length=2)
    zip = models.CharField(max_length=9)

We're now going to create an tsvector based on these fields. To do this, we'll concatenate all of the fields into a list and then flatten the list into a string to feed into the to_tsvector function.

dan=# SELECT to_tsvector('simple', concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip)) FROM employee;
                               to_tsvector
-------------------------------------------------------------------------
 '123':3 '12345':8 'in':7 'main':4 'pawnee':6 'ron':1 'st':5 'swanson':2
(1 row)

If we wanted to stop here, we could just do FTS matches on the fly, but this would be quite slow.

dan=# SELECT COUNT(*) FROM employee WHERE to_tsvector('simple', concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip)) @@ 'swanson';
 count
-------
     1
(1 row)

The next step is to store this tsvector in a field and keep it updated when the data changes. First step seems easy enough:

dan=# ALTER TABLE employee ADD COLUMN fts_document tsvector;

Hmm, but how to we keep this updated? There are a bunch of options, but according to the PostgreSQL documention, the best one is to add a trigger on document updates and additions. To make things easy, we're going to create a function to generate the fts document. E.g.,

CREATE FUNCTION employee_fts_document(e employee) RETURNS tsvector AS $$
DECLARE
    employee_document TEXT;
BEGIN
    SELECT concat_ws(' ', e.first_name, e.last_name, e.address_1, e.address_2, e.city, e.state, e.zip) INTO employee_document;
    RETURN to_tsvector('pg_catalog.simple', employee_document);
END;
$$ LANGUAGE plpgsql;

We can then reference this function in the trigger.

CREATE FUNCTION employee_fts_document_trigger() RETURNS TRIGGER AS $$
BEGIN
    NEW.fts_document=employee_fts_document(NEW);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

Finally, we tell PostgreSQL to execute the trigger on employee updates and inserts.

CREATE TRIGGER employee_fts_update_trigger BEFORE UPDATE ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger();
CREATE TRIGGER employee_fts_insert_trigger BEFORE INSERT ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger();

We're almost done. The last thing we might want to do is to tell PostgreSQL that we'd like to index the fts_document column which contains our full-text search vectors. We're going to add a GIN (Generalized Inverted Index)-based index to our column. You can read about the GIN-based and GiST-based indices in the PostgreSQL documentation. Here's how we add our index.

CREATE INDEX employee_fts_index ON employee USING gin(fts_document);

That's it. Our employee table now has an always up-to-date search index that we can use. Before we continue, let's instantiate the search vectors. We'll then test everything out to make sure things are working a-ok.

dan=# UPDATE employee SET fts_document=employee_fts_document(id);
dan=# SELECT COUNT(*) FROM employee WHERE fts_document @@ 'ron';
 count
-------
     1
(1 row)

Success!

South ¶

Now that we understand the internals, wouldn't it be nice if we could keep our databases up-to-date without having to manually copy and paste commands on our PostgreSQL shell? South to the rescue!

We'll need to generate a blank migration.

$ ./manage.py schemamigration app create_employee_fts
Created 0001_create_employee_fts.py. You must now edit this migration and add the code for each direction.

Next, open up the migration, and copy in the SQL above in the forwards and backwards methods, as illustrated below.

# ... imports, etc ...

from django.db import connection

class Migration(SchemaMigration):

    def forwards(self, orm):
        sql = """
CREATE FUNCTION employee_fts_document(integer) RETURNS tsvector AS $$
DECLARE
    employee_document TEXT;
BEGIN
    SELECT concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip) INTO employee_document FROM employee WHERE id=$1;
    RETURN to_tsvector('pg_catalog.simple', employee_document);
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION employee_fts_document_trigger() RETURNS TRIGGER AS $$
BEGIN
    NEW.fts_document=employee_fts_document(NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""

        cursor = connection.cursor()
        cursor.execute(sql);
        cursor.execute("ALTER TABLE employee ADD COLUMN fts_document tsvector")
        cursor.execute("UPDATE employee SET fts_document=employee_fts_document(id)");
        cursor.execute("CREATE TRIGGER employee_fts_update_trigger BEFORE UPDATE ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger()")
        cursor.execute("CREATE TRIGGER employee_fts_insert_trigger BEFORE INSERT ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger()")
        cursor.execute("CREATE INDEX employee_fts_index ON employee USING gin(fts_document)")

    def backwards(self, orm):
        cursor = connection.cursor()
        cursor.execute("DROP INDEX employee_fts_index")
        cursor.execute("ALTER TABLE employee DROP COLUMN fts_document")
        cursor.execute("DROP TRIGGER employee_fts_update_trigger ON employee")
        cursor.execute("DROP TRIGGER employee_fts_insert_trigger ON employee")
        cursor.execute("DROP FUNCTION employee_fts_document (integer)")
        cursor.execute("DROP FUNCTION employee_fts_document_trigger ()")

    models = {
        """
        App-specific model definitions here...
        """
    }

Now, just run this migration...

$ ./manage.py migrate app

...and watch as your FTS column, indices, and triggers are automagically created.

The Django Admin ¶

OK, so we have our FTS document created, have generated a migration to keep things in sync across machines, and are indexing everything we might possibly want to search on to find an employee. The next step is to use what we have and integrate it into the Django Admin. This assumes that you have an EmployeeAdmin object similar to the below:

class EmployeeAdmin(admin.ModelAdmin):
    search_fields = ("first_name", "last_name", "address_1", "address_2")
    # ... other admin configuration here ...

admin.site.register(models.Employee, EmployeeAdmin)

Our goal is to make searching with the Django Admin use our shiny new FTS index. To do this, we're going to need to override the get_search_results method in EmployeeAdmin to something like this:

class EmployeeAdmin(admin.ModelAdmin):
    # ...

    def get_search_results(self, request, queryset, search_term):
        if search_term == '':
            queryset, distinct = super(EmployeeAdmin, self).get_search_results(request, queryset, search_term)
        else:
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, None)
            queryset = queryset.extra(
                where=["employee.fts_document @@ to_tsquery('simple', %s)"],
                params=[search_term]
            )

        return queryset, distinct

Looks close, but we're missing something. to_tsquery will expect a "properly-formed" query, and we can't always trust or depend on users of the admin to know PostgreSQL search syntax (don't blame them). We need to somehow sanitize the input before feeding it into PostgreSQL.

Here are some examples of things that we might want to handle:

Strip padded whitespace.
Remove duplicated whitespace.
Add &'s between words that are only separated by spaces.
Replace double quotes (which PostgreSQL doesn't like) to single quotes.
Clear up situations where search operators are used ambiguously (i.e., "ron & | swanson")
Add a prefix wildcard to every search term, so that a search for "swan" will return "Ron Swanson" in the search results.

Luckily for you, I've already done all of the heavy lifting. Below is a function called sanitize_search_term which will take a string and form it into a proper PostgreSQL search query.

import re
import string

def sanitize_search_term(term):
    # Replace all puncuation with spaces.
    allowed_punctuation = set(['&', '|', '"', "'"])
    all_punctuation = set(string.punctuation)
    punctuation = "".join(all_punctuation - allowed_punctuation)
    term = re.sub(r"[{}]+".format(re.escape(punctuation)), " ", \
            term)

    # Substitute all double quotes to single quotes.
    term = term.replace('"', "'")
    term = re.sub(r"[']+", "'", term)

    # Create regex to find strings within quotes.
    quoted_strings_re = re.compile(r"('[^']*')")
    space_between_words_re = re.compile(r'([^ &|])[ ]+([^ &|])')
    spaces_surrounding_letter_re = re.compile(r'[ ]+([^ &|])[ ]+')
    multiple_operator_re = re.compile(r"[ &]+(&|\|)[ &]+")

    tokens = quoted_strings_re.split(term)
    processed_tokens = []
    for token in tokens:
        # Remove all surrounding whitespace.
        token = token.strip()

        if token in ['', "'"]:
            continue

        if token[0] != "'":
            # Surround single letters with &'s
            token = spaces_surrounding_letter_re.sub(r' & \1 & ', token)

            # Specify '&' between words that have neither | or & specified.
            token = space_between_words_re.sub(r'\1 & \2', token)

            # Add a prefix wildcard to every search term.
            token = re.sub(r'([^ &|]+)', r'\1:*', token)

        processed_tokens.append(token)

    term = " & ".join(processed_tokens)

    # Replace ampersands or pipes surrounded by ampersands.
    term = multiple_operator_re.sub(r" \1 ", term)

    # Escape single quotes
    return term.replace("'", "''")

Combining everything together, we get the following:

class EmployeeAdmin(admin.ModelAdmin):
    # ...

    def get_search_results(self, request, queryset, search_term):
        if search_term == '':
            queryset, distinct = super(EmployeeAdmin, self).get_search_results(request, queryset, search_term)
        else:
            search_term = sanitize_search_term(search_term)
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, None)
            queryset = queryset.extra(
                where=["employee.fts_document @@ to_tsquery('simple', %s)"],
                params=[search_term]
            )

        return queryset, distinct

And...we're done! Searches in the admin will now use PostgreSQL FTS.

Here's to speedy searching!

EDIT 12/15/2014: In a previous version of this post, employee_fts_document would pull stale values from the DB while generating the FTS document and performed an extraneous SELECT. Thanks, Roger!