dlo.me

Making a time-lapse on the command line using FFmpeg and ImageMagick

We’re vacationing in Whistler, BC right now as “endurance spectators” to my father-in-law’s 3rd Ironman triathlon. Expecting some beautiful landscapes and weather, I brought my newly acquired X100T to take some nice photos.

Yesterday, I set it up on an interval timer and pointed it right towards Rainbow Mountain, which faces the patio in the kitchen of the little condo unit we’re renting. After all was said and done, I ended up with 400 images depicting clouds moving over a mountain peak and not much idea of what to do with them. So, as any self-respecting engineer would, I set out to create a time-lapse using only my trusty command-line tools: FFmpeg and ImageMagick.

Let’s get down to it.

Note: Everything in this tutorial assumes that you have a current copy of ImageMagick and FFmpeg installed on your machine.

Resizing

Even though I turned off RAW on the X100T, the images were still pretty huge (4896x3264). During my first tests, making movies from images this large gave really inconsistent results and took a long time to create, with not much extra benefit.

Therefore, the first thing you should probably do is check the size of your images and, if necessary, resize them to be a bit smaller so they will play more nicely with FFmpeg and any other image manipulation that you’re going to do.

Since I planned to upload my video to YouTube, I referenced a handy page they have that lists out their preferred resolutions, codecs, and formats for upload (https://support.google.com/youtube/answer/1722171?hl=en). If you’re like me, and you don’t care too much about maintaining the current aspect ratio, here’s what you can do. This will resize your images to a preferred resolution (in this case, 1280x720), and will potentially crop off the sides or top in the process. To start, make sure you’re in the directory with all of your photos.

$ for FILE in *.JPG; do \
  mogrify -resize 1280x720^ -gravity center -crop 1280x720+0+0 +repage -write RESIZED_PHOTO_DIRECTORY/$FILE $FILE; \
done

In detail, this command resizes each photo so that it completely fills a 1280x720 box (that’s what the ^ modifier does: the geometry is treated as a minimum, so the image is scaled until both dimensions are at least as large as the target), then uses -gravity center and -crop 1280x720+0+0 to trim the overflow down to exactly 1280x720, +repage to reset the canvas offset, and -write to save the result to RESIZED_PHOTO_DIRECTORY/$FILE. Phew, that was a mouthful.
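If the geometry math feels opaque, here’s a quick sketch of what the fill-then-crop step works out to for these images (source dimensions are the X100T’s 4896x3264 from above; rounding is approximate):

```python
# Compute the intermediate size produced by "-resize 1280x720^":
# scale the image until it fully covers the target box, then
# -gravity center -crop trims the overflow.
src_w, src_h = 4896, 3264   # X100T JPEG dimensions
dst_w, dst_h = 1280, 720    # target resolution

scale = max(dst_w / src_w, dst_h / src_h)
resized = (round(src_w * scale), round(src_h * scale))
print(resized)  # intermediate size; cropping then yields exactly 1280x720
```

For a 3:2 source going to 16:9, the width hits 1280 first and the extra height gets cropped away.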

If you just want to resize to a certain width/height and want to maintain the original aspect ratio, just do this:

$ for FILE in *.JPG; do \
    mogrify -resize 600x -write RESIZED_PHOTO_DIRECTORY/$FILE $FILE; \
done

Maintaining Color Distribution

Note: this step might not be necessary in your situation, but it greatly improved the quality of the final product for me. YMMV.

Sometimes images captured in a time lapse have very different histograms (especially if you have auto-aperture / shutter-speed enabled), and this can make things look “jumpy” from frame to frame. Obviously, this won’t look great in your final video, so we’re going to normalize the colors to a set distribution.

For an example, just compare the following two images (especially notice the trees, which are much lighter in the first example than the second):

/images/time-lapse/example2.jpg

/images/time-lapse/example1.jpg

Not ideal, right?

To help achieve this end, I used an ImageMagick script called histmatch, generously provided by Fred Weinhaus (link: http://www.fmwconcepts.com/imagemagick/histmatch/index.php). The idea is to use a reference image to generate a histogram that we want all of the other images to match. Once you’ve decided on your reference image, run the following on every image except the reference image (otherwise the universe will explode).

histmatch -c gray REFERENCE.JPG TARGET_IMAGE.JPG NORMALIZED_IMAGE_DIRECTORY/FILE.JPG

(I just piped the output of ls *.JPG into a file called normalize.sh and used some of my Vim-fu to do this. Your process might be different.)

12/27/2017 Update:

Re-reading this post, I’ve found it’s much easier to just move the target image to the normalized directory, and then run this using find/exec.

find . -depth 1 -name "*.JPG" -exec histmatch -c gray normalized/TARGET_IMAGE.JPG {} normalized/{} \;
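As an aside, the core idea behind histogram matching is simple: map each pixel value in the target image to the reference value sitting at the same position in the cumulative distribution. A toy grayscale sketch of the concept (not Fred’s script, just an illustration on flat lists of pixel values):

```python
from bisect import bisect_left

def match_histogram(source, reference):
    # Build a (cumulative_fraction, value) list for a pixel sequence.
    def cdf(pixels):
        values = sorted(set(pixels))
        total, running, out = len(pixels), 0, []
        for v in values:
            running += pixels.count(v)
            out.append((running / total, v))
        return out

    ref_cdf = cdf(reference)
    ref_fracs = [frac for frac, _ in ref_cdf]
    src_frac = {v: frac for frac, v in cdf(source)}

    # Map each source pixel to the reference value at the same
    # cumulative frequency.
    result = []
    for p in source:
        idx = min(bisect_left(ref_fracs, src_frac[p]), len(ref_cdf) - 1)
        result.append(ref_cdf[idx][1])
    return result

print(match_histogram([0, 64, 128, 255], [10, 10, 200, 200]))
```

After matching, every frame’s tonal distribution follows the reference frame’s, which is what kills the flicker between frames.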

Finally, make the darned movie

This is the fun part. Just send the files through to FFmpeg and have it do its magic. If your files are named sequentially, you’ll want to adjust the parameters below (like -start_number and the _DSF%04d.JPG pattern) so they match your filenames.

ffmpeg -start_number 1 -i _DSF%04d.JPG -c:v libx264 -pix_fmt yuv420p timelapse.mp4

This tells FFmpeg to take all of the JPEGs in the directory starting with _DSF and ending with four digits, and to output an H.264 video in the yuv420p colorspace to timelapse.mp4. You now have a beautiful time-lapse!
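One thing worth knowing: when you don’t pass -framerate before -i, FFmpeg reads image sequences at 25 fps by default, which determines how long your clip ends up. A quick sanity check for the 400 frames from this shoot:

```python
# Estimate the time-lapse duration from frame count and playback rate.
frames = 400  # photos captured
fps = 25      # FFmpeg's default input rate for image sequences
print(frames / fps, "seconds")
```

If that’s too fast or too slow for your taste, add something like `-framerate 24` (or lower) before the -i flag and recompute.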

If you’re interested in the final product, you can check it out on YouTube. Enjoy!

WWDC 2015 Wishlist

Here’s my list of things I’d be overjoyed to see fixed / announced / released at WWDC 2015 next week.

iOS

  • Deep linking support. Let apps define associated URLs (https://twitter.com/* –> open Twitter.app).
  • Really want to hide my carrier and clean up that status bar. Such a mess right now.
  • It’s time to modernize the UI for Notes.app and Reminders.app. iOS 6 was years ago.
  • Save alarm and world clock data in iCloud. Re-adding this on every device is really tedious.
  • Export Health.app data or save it in iCloud with a password. At the moment, it’s not shared between devices and will be lost permanently if you don’t have an encrypted local iTunes backup of your device.
  • Needs to be a way to close all tabs in Safari without tapping the screen 50 times. Absurd.

Mail.app

  • Search that actually works. Right now this is the only reason I still have the Gmail app on my phone. I open it up whenever I want to search for a message.
  • Send from an alias like Gmail web client lets you.
  • Better UX for labeling threads.

Maps

  • (I’ve mentioned this on Twitter) Fix the UX with reviews of physical locations.
  • Public transit support.
  • Improved traffic data.

App Store

  • Search. Need I say more?
  • Direct refunds without iTunes support involvement.
  • Responding to reviews would be nice.
  • Expire old reviews (1yr+) when new versions of an app have since been released.

Safari (OS X)

  • Reopen more than the most recent closed tab in Safari.
  • Paste images into web pages.

Xcode

  • Stop beachballing so darn much.
  • It would be nice to stop an archive action without needing to restart the app.
  • Debugging extensions is currently a huge pain. It would be nice if it were easier.

Added 6/7/2015

Photos on OS X

  • “Show in Finder”

The Free, Easy, and Secure Way to Encrypt Shared Dropbox Folders on Your Mac

My wife and I have been searching for months for a service to allow us to store our sensitive files in Dropbox securely. There are a lot of services out there that promise to do this well, but frankly, it’s hard to trust that our data would be safe if they happened to be hacked or compromised. I therefore went looking for a piece of software that was:

  1. Vetted as being secure.
  2. Ubiquitous. Something I wouldn’t need to worry about installing or finding on a future machine.
  3. Compatible with Mac OS X.

TrueCrypt, an open-source file encryption product, seemed like it might fit the bill. However, I must have had my head in the sand about a year ago (sure enough, I was off the grid when it happened), because TrueCrypt had been abandoned by its developers, who left a very vague / scary message on their website:

WARNING: Using TrueCrypt is not secure as it may contain unfixed security issues

Ok then. So that in itself wasn’t reason enough for me to give up. I went ahead and downloaded TrueCrypt from some random site and verified the checksum (remind me which of my family is going to know how to do this?) and installed it (in order to do this, I had to temporarily disable OS X’s protection against unsigned software in “Security & Privacy” in System Preferences).

That’s a lot of steps for someone who just wants to share encrypted files. And more steps are bad—I didn’t want people I’m sharing stuff with to just give up. We’d be back to square one.

Not to mention, TrueCrypt doesn’t offer dynamically sized volumes. You have to specify the size of the volume at the time of creation. This means that if the contents of a volume you’re sharing hit the size limit—too bad. You now need to create a new, larger volume and move all of the existing files from the previous volume into it. That’s a drag. And sort of dumb.

Off I went looking for an alternative. “Wait”, I thought, “doesn’t OS X provide full-disk encryption via FileVault?”. Well, I didn’t actually ask myself that, but I concluded that there must be a way to do this just using a stock Mac OS X installation.

Sure enough, after some digging, I found this on Apple’s support site: How to create a password-protected (encrypted) disk image. Boom! There we have it.

If you follow the instructions (which are quite simple, really), you’ll have a “file” (which is actually a “sparse image”—sort of like a Volume) which you can drop into a shared Dropbox folder and share with anyone you want. To add or remove files to the encrypted folder, you need to double click it, type in the password, and then open it like you would any other volume (like your Hard Drive or a USB stick). Once you’re done modifying files, just “eject” it. Easy peasy. Dropbox syncs it immediately.
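For the terminal-inclined, the same sparse image can be created without the Disk Utility GUI. The snippet below just assembles a plausible hdiutil invocation (flags taken from hdiutil’s man page; the path and volume name are made up for illustration):

```python
# Hypothetical hdiutil command mirroring the steps in Apple's
# support article. Printing rather than executing, since running
# it prompts interactively for the encryption password.
image_path = "Secure.sparseimage"
cmd = [
    "hdiutil", "create",
    "-type", "SPARSE",          # sparse image: grows on demand
    "-encryption", "AES-256",   # password-protected
    "-fs", "HFS+",
    "-volname", "Secure",
    "-size", "1g",              # an upper bound, not preallocated space
    image_path,
]
print(" ".join(cmd))
```

Drop the resulting .sparseimage file into your shared Dropbox folder and you’re done.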

Oh—and the folder resizes automatically! Another huge plus over TrueCrypt.

So, in short, you can create an AES-encrypted folder, inside your Dropbox, and anyone else who uses a Mac will be able to add or remove files from it without needing to install any third-party software. And, it’s free.

Go forth and encrypt!

P.S. Windows users might have similar luck with BitLocker.

P.P.S. Jamie Phelps pointed me to this AgileBits article regarding a downside of using sparse bundles and Dropbox (note–a sparse image and a sparse bundle are not the same thing, so the two may not have the same downsides). Knox stores its vaults with OS X sparse bundles, and those have been shown to have issues syncing over Dropbox when simultaneous edits are made. I’ve yet to see this be a problem with using sparse images (since OS X—and Dropbox—treats a sparse image as a single file), but it might be an issue. Jamie recommends SafeMonk as an alternative.

Supercharge Your Python Shell

I’ve been using and working with Python in a professional context for around 8 years. So it came as something of a shock that something so useful, so obvious, was unfamiliar to me until around a week and a half ago.

It involves a little file called .pythonrc that lives in your home directory. To get it to work, you’ll need to define an environment variable called PYTHONSTARTUP, like so:

export PYTHONSTARTUP="$HOME/.pythonrc"

(Major hat tip to kasnalin on Reddit for pointing out that I’d forgotten to include this important piece of info in an earlier version of the post.)

If you’re a Python developer, no doubt you open up a Python shell countless times per day. Whether you use the regular Python shell, or another one such as IPython, it’s all the same. Open up the shell, import a few useful modules, and start playing around.

Maybe you find yourself importing the same modules over and over again. Maybe you leave a single shell open all the time since it’s such a pain to re-type those commands.

I spend a lot of time working with Django, so I’d gotten pretty used to typing ./manage.py shell and then from app import models to start playing around with objects in my database.

Well, I got to wondering what this .pythonrc could do to help me out with this, so I dumped in the following code:

try:
    from app import models
    from django.conf import settings
except:
    print("\nCould not import Django modules.")
else:
    print("\nImported Django modules.")

Not expecting this to work at all, I ran my trusty Django shell, and voilĂ :

$ ./manage.py shell

Imported Django modules.
Python 2.7.6 (default, Jan  6 2014, 13:22:56)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>>

It’s like magic. No more reverse searching my shell history to import commonly-used modules. If I find myself using one often enough, I just add it into my .pythonrc and it’s there for me every time.

I play around with Redis a bunch, so here’s another chunk of code I plopped in there.

from redis import StrictRedis as Redis

r = Redis()

Now, I can just open up the shell and use r to play with Redis.

Here’s another one I’m pretty fond of. I’m a big fan of tab-completion and readline support. So I put this in my .pythonrc too:

try:
    import readline
except ImportError:
    print("Module readline not available.")
else:
    import rlcompleter
    if 'libedit' in readline.__doc__:
        readline.parse_and_bind("bind ^I rl_complete")
    else:
        readline.parse_and_bind("tab: complete")

Works like a charm.

Below is my full .pythonrc, in all of its glory (also available as a Gist).

# vim: set ft=python :

from __future__ import print_function

import json
import sys
import datetime

from redis import StrictRedis as Redis

r = Redis()

try:
    import readline
except ImportError:
    print("Module readline not available.")
else:
    import rlcompleter
    if 'libedit' in readline.__doc__:
        readline.parse_and_bind("bind ^I rl_complete")
    else:
        readline.parse_and_bind("tab: complete")

try:
    from app import models
    from django.conf import settings
except:
    print("\nCould not import Django modules.")
else:
    print("\nImported Django modules.")

try:
    from dateutil.parser import parse as parse_date
except ImportError:
    print("\nCould not import dateutil.")

Getting Started with Django and PostgreSQL Full-Text Search

After hearing a few months back that PostgreSQL had built-in full-text search support, I had been continually searching for reasons to take advantage of it. A few months went by, and when one of our clients at Lionheart started running into issues with slowness using the Django admin’s built-in search support, I knew I had a solid candidate. After all was said and done, searches returned results approximately 20x faster than vanilla Django. Here’s how you can do it too.

The Basics

One of the great things about PostgreSQL full-text search is that there isn’t too much to learn if you’re already pretty familiar with search engines and how they work. The main idea is that you generate an index using a document containing terms you’d like to search on. PostgreSQL provides a lot of flexibility with regards to how you choose to do this, but I opted to store the documents as a field called fts_document on the table we wanted to search on.

PostgreSQL stores search data in a data type called a tsvector. Here’s how the PostgreSQL documentation describes it:

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word. Sorting and duplicate-elimination are done automatically during input.

Here’s an example of a sentence converted to a tsvector:

{% highlight psql %}
=# SELECT to_tsvector('simple', 'The quick brown fox jumps over the lazy dog.');
                                to_tsvector
----------------------------------------------------------------------------
 'brown':3 'dog':9 'fox':4 'jumps':5 'lazy':8 'over':6 'quick':2 'the':1,7
(1 row)
{% endhighlight %}
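The sorted, position-annotated structure of a tsvector can be mimicked in a few lines of Python, which may help it click:

```python
# A tsvector is essentially each distinct lexeme mapped to its
# positions in the document, with the lexemes sorted.
sentence = "the quick brown fox jumps over the lazy dog"
positions = {}
for i, word in enumerate(sentence.split(), start=1):
    positions.setdefault(word, []).append(i)
print(sorted(positions.items()))  # matches the tsvector output above
```

(A real tsvector also normalizes each word to a lexeme first, so "jumps" and "jumping" can collapse to the same entry under a language configuration.)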

When you want to search for something, you create what’s called a tsquery.

A tsquery value stores lexemes that are to be searched for, and combines them honoring the Boolean operators & (AND), | (OR), and ! (NOT). Parentheses can be used to enforce grouping of the operators.

Here’s an example:

{% highlight psql %}
dan=# SELECT to_tsquery('simple', 'fox | jumps');
   to_tsquery
-----------------
 'fox' | 'jumps'
(1 row)
{% endhighlight %}

When you want to test for a match, you use the @@ operator.

{% highlight psql %}
dan=# SELECT to_tsvector('simple', 'The quick brown fox jumps over the lazy dog.') @@ to_tsquery('simple', 'fox');
 ?column?
----------
 t
(1 row)
{% endhighlight %}

It’s a match!

The @@ operator also provides a shorthand whereby you don’t need to use the to_tsvector / to_tsquery functions explicitly. E.g.,

{% highlight psql %}
dan=# SELECT 'The quick brown fox jumps over the lazy dog.' @@ 'fox';
 ?column?
----------
 t
(1 row)
{% endhighlight %}

Sweet! However, you might be confused by the following:

{% highlight psql %}
dan=# SELECT 'The quick brown fox jumps over the lazy dog.' @@ 'fox | random';
 ?column?
----------
 f
(1 row)
{% endhighlight %}

Huh? We’re searching for either the term “fox” or “random”. This doesn’t seem right…

Well, it turns out that when you specify plain text as an argument to the @@ operator, it wraps it in plainto_tsquery, which functions a little bit differently than the to_tsquery function that we’ve been playing around with above. Notably, it ignores every logical operator except &. So |’s, parentheses, etc., will all be ignored. Personally, I don’t find it to be all that useful.

Also, as you might have noticed above, I provided an additional argument to both the to_tsvector and to_tsquery functions called ‘simple’. This represents the simple text search configuration. Other text search configurations are language-based and strip out stopwords and other lexemes based on the language you’re indexing. Here are a few of the options I have available on my machine.

{% highlight psql %}
dan=# \dF
               List of text search configurations
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | turkish    | configuration for turkish language
(16 rows)
{% endhighlight %}

Play around with a few and see which one suits you best.

Django

By now you know how to create search vectors and queries, search on some data, and specify a text search configuration. Let’s get started on connecting these tools with your Django project.

Since tsvector and tsquery are both standard PostgreSQL data types, we can use them just as we might any other data type like int or varchar in a column definition. This is very handy. Otherwise, we would have to regenerate vectors on-the-fly, which might offset any speedup we get from using FTS in the first place. :)

For the sake of moving forward, let’s say we have a model called Employee in our Django project. We’re trying to speed up searches on anything related to an employee. Assume we’re storing their first name, last name, and full address.

{% highlight python linenos %}
class Employee(models.Model):
    first_name = models.CharField(max_length=64)
    last_name = models.CharField(max_length=64)
    address_1 = models.TextField()
    address_2 = models.TextField()
    city = models.CharField(max_length=64)
    state = models.CharField(max_length=2)
    zip = models.CharField(max_length=9)
{% endhighlight %}

We’re now going to create a tsvector based on these fields. To do this, we’ll concatenate all of the fields into a list and then flatten the list into a string to feed into the to_tsvector function.

{% highlight psql %}
dan=# SELECT to_tsvector('simple', concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip)) FROM employee;
                               to_tsvector
--------------------------------------------------------------------------
 '123':3 '12345':8 'in':7 'main':4 'pawnee':6 'ron':1 'st':5 'swanson':2
(1 row)
{% endhighlight %}

If we wanted to stop here, we could just do FTS matches on the fly, but this would be quite slow.

{% highlight psql %}
dan=# SELECT COUNT(*) FROM employee WHERE to_tsvector('simple', concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip)) @@ 'swanson';
 count
-------
     1
(1 row)
{% endhighlight %}

The next step is to store this tsvector in a field and keep it updated when the data changes. First step seems easy enough:

{% highlight psql %}
dan=# ALTER TABLE employee ADD COLUMN fts_document tsvector;
{% endhighlight %}

Hmm, but how do we keep this updated? There are a bunch of options, but according to the PostgreSQL documentation, the best one is to add a trigger on document updates and additions. To make things easy, we’re going to create a function to generate the fts document. E.g.,

{% highlight plpgsql linenos %}
CREATE FUNCTION employee_fts_document(e employee) RETURNS tsvector AS $$
DECLARE
    employee_document TEXT;
BEGIN
    SELECT concat_ws(' ', e.first_name, e.last_name, e.address_1,
                     e.address_2, e.city, e.state, e.zip)
        INTO employee_document;
    RETURN to_tsvector('pg_catalog.simple', employee_document);
END;
$$ LANGUAGE plpgsql;
{% endhighlight %}

We can then reference this function in the trigger.

{% highlight plpgsql linenos %}
CREATE FUNCTION employee_fts_document_trigger() RETURNS TRIGGER AS $$
BEGIN
    NEW.fts_document = employee_fts_document(NEW);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
{% endhighlight %}

Finally, we tell PostgreSQL to execute the trigger on employee updates and inserts.

{% highlight postgresql %}
CREATE TRIGGER employee_fts_update_trigger BEFORE UPDATE ON employee
    FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger();

CREATE TRIGGER employee_fts_insert_trigger BEFORE INSERT ON employee
    FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger();
{% endhighlight %}

We’re almost done. The last thing we might want to do is to tell PostgreSQL that we’d like to index the fts_document column which contains our full-text search vectors. We’re going to add a GIN (Generalized Inverted Index)-based index to our column. You can read about the GIN-based and GiST-based indices in the PostgreSQL documentation. Here’s how we add our index.

{% highlight postgresql %}
CREATE INDEX employee_fts_index ON employee USING gin(fts_document);
{% endhighlight %}

That’s it. Our employee table now has an always up-to-date search index that we can use. Before we continue, let’s instantiate the search vectors. We’ll then test everything out to make sure things are working a-ok.

{% highlight psql %}
dan=# UPDATE employee SET fts_document=employee_fts_document(employee);
dan=# SELECT COUNT(*) FROM employee WHERE fts_document @@ 'ron';
 count
-------
     1
(1 row)
{% endhighlight %}

Success!

South

Now that we understand the internals, wouldn’t it be nice if we could keep our databases up-to-date without having to manually copy and paste commands on our PostgreSQL shell? South to the rescue!

We’ll need to generate a blank migration.

{% highlight bash %}
$ ./manage.py schemamigration app create_employee_fts
Created 0001_create_employee_fts.py. You must now edit this migration and add the code for each direction.
{% endhighlight %}

Next, open up the migration, and copy in the SQL above in the forwards and backwards methods, as illustrated below.

{% highlight python linenos %}
# ... imports, etc ...

from django.db import connection


class Migration(SchemaMigration):

    def forwards(self, orm):
        sql = """
CREATE FUNCTION employee_fts_document(integer) RETURNS tsvector AS $$
DECLARE
    employee_document TEXT;
BEGIN
    SELECT concat_ws(' ', first_name, last_name, address_1, address_2,
                     city, state, zip)
        INTO employee_document
        FROM employee WHERE id=$1;
    RETURN to_tsvector('pg_catalog.simple', employee_document);
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION employee_fts_document_trigger() RETURNS TRIGGER AS $$
BEGIN
    NEW.fts_document = employee_fts_document(NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""

        cursor = connection.cursor()
        cursor.execute(sql)
        cursor.execute("ALTER TABLE employee ADD COLUMN fts_document tsvector")
        cursor.execute("UPDATE employee SET fts_document=employee_fts_document(id)")
        cursor.execute("CREATE TRIGGER employee_fts_update_trigger BEFORE UPDATE ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger()")
        cursor.execute("CREATE TRIGGER employee_fts_insert_trigger BEFORE INSERT ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger()")
        cursor.execute("CREATE INDEX employee_fts_index ON employee USING gin(fts_document)")

    def backwards(self, orm):
        cursor = connection.cursor()
        cursor.execute("DROP INDEX employee_fts_index")
        cursor.execute("DROP TRIGGER employee_fts_update_trigger ON employee")
        cursor.execute("DROP TRIGGER employee_fts_insert_trigger ON employee")
        cursor.execute("ALTER TABLE employee DROP COLUMN fts_document")
        cursor.execute("DROP FUNCTION employee_fts_document (integer)")
        cursor.execute("DROP FUNCTION employee_fts_document_trigger ()")

    models = {
        # App-specific model definitions here...
    }
{% endhighlight %}

Now, just run this migration…

{% highlight bash %} $ ./manage.py migrate app {% endhighlight %}

…and watch as your FTS column, indices, and triggers are automagically created.

The Django Admin

OK, so we have our FTS document created, have generated a migration to keep things in sync across machines, and are indexing everything we might possibly want to search on to find an employee. The next step is to use what we have and integrate it into the Django Admin. This assumes that you have an EmployeeAdmin object similar to the below:

{% highlight python %}
class EmployeeAdmin(admin.ModelAdmin):
    search_fields = ("first_name", "last_name", "address_1", "address_2")
    # ... other admin configuration here ...

admin.site.register(models.Employee, EmployeeAdmin)
{% endhighlight %}

Our goal is to make searching with the Django Admin use our shiny new FTS index. To do this, we’re going to need to override the get_search_results method in EmployeeAdmin to something like this:

{% highlight python linenos %}
class EmployeeAdmin(admin.ModelAdmin):
    # ...

    def get_search_results(self, request, queryset, search_term):
        if search_term == '':
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, search_term)
        else:
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, None)
            queryset = queryset.extra(
                where=["employee.fts_document @@ to_tsquery('simple', %s)"],
                params=[search_term]
            )

        return queryset, distinct
{% endhighlight %}

Looks close, but we’re missing something. to_tsquery will expect a “properly-formed” query, and we can’t always trust or depend on users of the admin to know PostgreSQL search syntax (don’t blame them). We need to somehow sanitize the input before feeding it into PostgreSQL.

Here are some examples of things that we might want to handle:

  • Strip padded whitespace.
  • Remove duplicated whitespace.
  • Add &’s between words that are only separated by spaces.
  • Replace double quotes (which PostgreSQL doesn’t like) with single quotes.
  • Clear up situations where search operators are used ambiguously (e.g., “ron & | swanson”).
  • Add a prefix wildcard to every search term, so that a search for “swan” will return “Ron Swanson” in the search results.

Luckily for you, I’ve already done all of the heavy lifting. Below is a function called sanitize_search_term which will take a string and form it into a proper PostgreSQL search query.

{% highlight python linenos %}
import re
import string


def sanitize_search_term(term):
    # Replace all punctuation with spaces.
    allowed_punctuation = set(['&', '|', '"', "'"])
    all_punctuation = set(string.punctuation)
    punctuation = "".join(all_punctuation - allowed_punctuation)
    term = re.sub(r"[{}]+".format(re.escape(punctuation)), " ", term)

    # Substitute all double quotes with single quotes.
    term = term.replace('"', "'")
    term = re.sub(r"[']+", "'", term)

    # Create regex to find strings within quotes.
    quoted_strings_re = re.compile(r"('[^']*')")
    space_between_words_re = re.compile(r'([^ &|])[ ]+([^ &|])')
    spaces_surrounding_letter_re = re.compile(r'[ ]+([^ &|])[ ]+')
    multiple_operator_re = re.compile(r"[ &]+(&|\|)[ &]+")

    tokens = quoted_strings_re.split(term)
    processed_tokens = []
    for token in tokens:
        # Remove all surrounding whitespace.
        token = token.strip()

        if token in ['', "'"]:
            continue

        if token[0] != "'":
            # Surround single letters with &'s.
            token = spaces_surrounding_letter_re.sub(r' & \1 & ', token)

            # Specify '&' between words that have neither | nor & specified.
            token = space_between_words_re.sub(r'\1 & \2', token)

            # Add a prefix wildcard to every search term.
            token = re.sub(r'([^ &|]+)', r'\1:*', token)

        processed_tokens.append(token)

    term = " & ".join(processed_tokens)

    # Collapse runs of ampersands or pipes surrounded by ampersands.
    term = multiple_operator_re.sub(r" \1 ", term)

    # Escape single quotes.
    return term.replace("'", "''")
{% endhighlight %}
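If you just want a feel for the output without tracing every regex, a stripped-down version covering the two most important transformations (AND-joining words and adding prefix wildcards; quoting, punctuation, and operator cleanup are omitted here) behaves like this:

```python
import re

def simple_sanitize(term):
    # Split on whitespace and operators, AND-join the words,
    # and add a prefix wildcard to each.
    words = re.findall(r"[^\s&|]+", term)
    return " & ".join(word + ":*" for word in words)

print(simple_sanitize("ron   swanson"))  # ron:* & swanson:*
```

A query like that is what finally reaches to_tsquery, so a search for “swan” still matches “Ron Swanson”.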

Combining everything together, we get the following:

{% highlight python linenos %}
class EmployeeAdmin(admin.ModelAdmin):
    # ...

    def get_search_results(self, request, queryset, search_term):
        if search_term == '':
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, search_term)
        else:
            search_term = sanitize_search_term(search_term)
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, None)
            queryset = queryset.extra(
                where=["employee.fts_document @@ to_tsquery('simple', %s)"],
                params=[search_term]
            )

        return queryset, distinct
{% endhighlight %}

And…we’re done! Searches in the admin will now use PostgreSQL FTS.

Here’s to speedy searching!


EDIT 12/15/2014: In a previous version of this post, employee_fts_document would pull stale values from the DB while generating the FTS document and performed an extraneous SELECT. Thanks, Roger!