Monday, September 8, 2014

Supercharge Your Python Shell

I’ve been using and working with Python in a professional context for around 8 years. So it came as something of a shock that something so useful, so obvious, was unfamiliar to me until about a week and a half ago.

It involves a little file called .pythonrc that lives in your home directory. To get it to work, you’ll need to define an environment variable called PYTHONSTARTUP, like so:

export PYTHONSTARTUP="$HOME/.pythonrc"

(Major hat tip to kasnalin on Reddit for pointing out that I’d forgotten to include this important piece of info in an earlier version of the post.)

If you’re a Python developer, no doubt you open up a Python shell countless times per day. Whether you use the regular Python shell, or another one such as IPython, it’s all the same. Open up the shell, import a few useful modules, and start playing around.

Maybe you find yourself importing the same modules over and over again. Maybe you leave a single shell open all the time since it’s such a pain to re-type those commands.

I spend a lot of time working with Django, so I’d gotten pretty used to typing ./manage.py shell and then from app import models to start playing around with objects in my database.

Well, I got to wondering what this .pythonrc could do to help me out with this, so I dumped in the following code:

try:
    from app import models
    from django.conf import settings
except Exception:
    print("\nCould not import Django modules.")
else:
    print("\nImported Django modules.")

Not expecting this to work at all, I ran my trusty Django shell, and voilà:

$ ./manage.py shell

Imported Django modules.
Python 2.7.6 (default, Jan  6 2014, 13:22:56)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>>

It’s like magic. No more reverse searching my shell history to import commonly-used modules. If I find myself using one often enough, I just add it into my .pythonrc and it’s there for me every time.
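
If you accumulate a lot of these, you can wrap the pattern in a small helper. Here’s a sketch (the `try_import` name and behavior are my own invention, not a standard API):

```python
import importlib

def try_import(module_name, alias=None):
    """Import a module into the interactive namespace, or warn on failure."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        print("Could not import {}.".format(module_name))
    else:
        globals()[alias or module_name] = module

# Modules you reach for constantly in the shell.
try_import("json")
try_import("datetime")
```

Now adding a new default import to your .pythonrc is a one-liner, and a missing package just prints a warning instead of breaking every shell launch.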

I play around with Redis a bunch, so here’s another chunk of code I plopped in there:

from redis import StrictRedis as Redis

r = Redis()

Now, I can just open up the shell and use r to play with Redis.
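
One caveat: if the Redis server isn’t running (or the redis package isn’t installed), instantiating the client at startup can make every shell launch noisy or broken. Here’s a slightly more defensive version (a sketch of my own; `ping()` is redis-py’s real health check, the surrounding structure is just one way to do it):

```python
r = None
try:
    from redis import StrictRedis as Redis
    client = Redis()
    client.ping()  # raises if the server isn't reachable
    r = client
except Exception as exc:
    print("Redis not available: {}".format(exc))
```

With this, `r` is either a working client or `None`, and the shell always starts cleanly.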

Here’s another one I’m pretty fond of. I’m a big fan of tab-completion and readline support. So I put this in my .pythonrc too:

try:
    import readline
except ImportError:
    print("Module readline not available.")
else:
    import rlcompleter
    if 'libedit' in readline.__doc__:
        readline.parse_and_bind("bind ^I rl_complete")
    else:
        readline.parse_and_bind("tab: complete")

Works like a charm.

Below is my full .pythonrc, in all of its glory (also available as a Gist).

# vim: set ft=python :

from __future__ import print_function

import json
import sys
import datetime

from redis import StrictRedis as Redis

r = Redis()

try:
    import readline
except ImportError:
    print("Module readline not available.")
else:
    import rlcompleter
    if 'libedit' in readline.__doc__:
        readline.parse_and_bind("bind ^I rl_complete")
    else:
        readline.parse_and_bind("tab: complete")

try:
    from app import models
    from django.conf import settings
except Exception:
    print("\nCould not import Django modules.")
else:
    print("\nImported Django modules.")

try:
    from dateutil.parser import parse as parse_date
except ImportError:
    print("\nCould not import dateutil.")

Monday, September 1, 2014

Getting Started with Django and PostgreSQL Full-Text Search

After hearing a few months back that PostgreSQL had built-in full-text search support, I had been continually searching for reasons to take advantage of it. A few months went by, and when one of our clients at Lionheart started running into issues with slowness using the Django admin’s built-in search support, I knew I had a solid candidate. After all was said and done, searches returned results approximately 20x faster than vanilla Django. Here’s how you can do it too.

The Basics

One of the great things about PostgreSQL full-text search is that there isn’t too much to learn if you’re already familiar with search engines and how they work. The main idea is that you generate an index over a document containing the terms you’d like to search on. PostgreSQL provides a lot of flexibility in how you do this, but I opted to store each row’s document, as a search vector, in a field called fts_document on the table we wanted to search.

PostgreSQL stores search data in a data type called a tsvector. Here’s how the PostgreSQL documentation describes it:

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word. Sorting and duplicate-elimination are done automatically during input.

Here’s an example of a sentence converted to a tsvector:

=# SELECT to_tsvector('simple', 'The quick brown fox jumps over the lazy dog.');
                                to_tsvector
---------------------------------------------------------------------------
 'brown':3 'dog':9 'fox':4 'jumps':5 'lazy':8 'over':6 'quick':2 'the':1,7
(1 row)
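
If it helps to see the idea outside of SQL, here’s a rough pure-Python imitation of what the simple configuration does: lowercase the words, strip punctuation, and record each lexeme’s 1-based position. (Purely illustrative; this is nowhere near Postgres’s actual parser.)

```python
from collections import defaultdict

def simple_tsvector(text):
    """Very rough sketch of to_tsvector('simple', text)."""
    positions = defaultdict(list)
    for i, word in enumerate(text.lower().split(), start=1):
        word = word.strip(".,;:!?")
        if word:
            positions[word].append(i)
    # tsvectors are sorted, duplicate-free lists of lexemes.
    return dict(sorted(positions.items()))

simple_tsvector("The quick brown fox jumps over the lazy dog.")
# e.g. 'the' maps to positions [1, 7], and 'dog' to [9]
```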

When you want to search for something, you create what’s called a tsquery.

A tsquery value stores lexemes that are to be searched for, and combines them honoring the Boolean operators & (AND), | (OR), and ! (NOT). Parentheses can be used to enforce grouping of the operators.

Here’s an example:

dan=# SELECT to_tsquery('simple', 'fox | jumps');
   to_tsquery
-----------------
 'fox' | 'jumps'
(1 row)

When you want to test for a match, you use the @@ operator.

dan=# SELECT to_tsvector('simple', 'The quick brown fox jumps over the lazy dog.') @@ to_tsquery('simple', 'fox');
 ?column?
----------
 t

It’s a match!
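
Conceptually, a single-term @@ test is just asking whether the term’s lexeme appears in the document’s lexeme list. A hand-rolled Python illustration (not Postgres’s algorithm, just the idea):

```python
def simple_match(document, term):
    """Sketch of to_tsvector('simple', doc) @@ to_tsquery('simple', term)
    for a single search term."""
    lexemes = {w.strip(".,;:!?") for w in document.lower().split()}
    return term.lower() in lexemes

simple_match("The quick brown fox jumps over the lazy dog.", "fox")  # True
simple_match("The quick brown fox jumps over the lazy dog.", "cat")  # False
```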

The @@ operator also provides a shorthand whereby you don’t need to use the to_tsvector / to_tsquery functions explicitly. E.g.,

dan=# SELECT 'The quick brown fox jumps over the lazy dog.' @@ 'fox';
 ?column?
----------
 t

Sweet! However, you might be confused by the following:

dan=# SELECT 'The quick brown fox jumps over the lazy dog.' @@ 'fox | random';
 ?column?
----------
 f

Huh? We’re searching for either the term “fox” or “random”. This doesn’t seem right…

Well, it turns out that when you specify plain text as an argument to the @@ operator, it wraps it in plainto_tsquery, which functions a little bit differently than the to_tsquery function we’ve been playing around with above. Notably, it treats any logical operators in the input as noise and simply ANDs all of the terms together, so |’s, parentheses, etc., will all be discarded. Personally, I don’t find it to be all that useful.
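
To make the difference concrete, here’s a tiny Python sketch of how plainto_tsquery('simple', ...) treats its input: strip the operators, keep the words, and AND everything together. (Illustrative only; the real function also normalizes terms through the text search configuration.)

```python
def plainto_tsquery_simple(text):
    """Sketch: operators in the input are treated as noise, and the
    remaining words are joined with '&'."""
    words = [w.strip("&|!()'\"") for w in text.lower().split()]
    return " & ".join("'{}'".format(w) for w in words if w)

plainto_tsquery_simple("fox | random")  # "'fox' & 'random'"
```

The | never reaches the query, which is exactly why the match above returned f.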

Also, as you might have noticed above, I provided an additional argument, ‘simple’, to both the to_tsvector and to_tsquery functions. This specifies the simple text search configuration. Other text search configurations are language-based, stemming words and stripping out stopwords appropriate to the language you’re indexing. Here are a few of the other options available on my machine.

dan=# \dF
               List of text search configurations
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | turkish    | configuration for turkish language
(16 rows)

Play around with a few and see which one suits you best.

Django

By now you know how to create search vectors and queries, search on some data, and specify a text search configuration. Let’s get started on connecting these tools with your Django project.

Since tsvector and tsquery are both standard PostgreSQL data types, we can use them just as we might any other data type like int or varchar in a column definition. This is very handy. Otherwise, we would have to regenerate vectors on-the-fly, which might offset any speedup we get from using FTS in the first place. :)

For the sake of moving forward, let’s say we have a model called Employee in our Django project. We’re trying to speed up searches on anything related to an employee. Assume we’re storing their first name, last name, and full address.

class Employee(models.Model):
    first_name = models.CharField(max_length=64)
    last_name = models.CharField(max_length=64)
    address_1 = models.TextField()
    address_2 = models.TextField()
    city = models.CharField(max_length=64)
    state = models.CharField(max_length=2)
    zip = models.CharField(max_length=9)

We’re now going to create a tsvector based on these fields. To do this, we’ll concatenate all of the fields into a single space-separated string and feed it into the to_tsvector function.

dan=# SELECT to_tsvector('simple', concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip)) FROM employee;
                               to_tsvector
-------------------------------------------------------------------------
 '123':3 '12345':8 'in':7 'main':4 'pawnee':6 'ron':1 'st':5 'swanson':2
(1 row)

If we wanted to stop here, we could just do FTS matches on the fly, but this would be quite slow.

dan=# SELECT COUNT(*) FROM employee WHERE to_tsvector('simple', concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip)) @@ 'swanson';
 count
-------
     1
(1 row)

The next step is to store this tsvector in a field and keep it updated when the data changes. The first step seems easy enough:

dan=# ALTER TABLE employee ADD COLUMN fts_document tsvector;

Hmm, but how do we keep this updated? There are a bunch of options, but according to the PostgreSQL documentation, the best one is to add a trigger on row updates and inserts. To make things easy, we’re going to create a function to generate the fts document. E.g.,

CREATE FUNCTION employee_fts_document(integer) RETURNS tsvector AS $$
DECLARE
    employee_document TEXT;
BEGIN
    SELECT concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip) INTO employee_document FROM employee WHERE id=$1;
    RETURN to_tsvector('pg_catalog.simple', employee_document);
END;
$$ LANGUAGE plpgsql;

We can then reference this function in the trigger.

CREATE FUNCTION employee_fts_document_trigger() RETURNS TRIGGER AS $$
BEGIN
    NEW.fts_document=employee_fts_document(NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

Finally, we tell PostgreSQL to execute the trigger on employee updates and inserts.

CREATE TRIGGER employee_fts_update_trigger BEFORE UPDATE ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger();
CREATE TRIGGER employee_fts_insert_trigger BEFORE INSERT ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger();

We’re almost done. The last thing we might want to do is to tell PostgreSQL that we’d like to index the fts_document column which contains our full-text search vectors. We’re going to add a GIN (Generalized Inverted Index)-based index to our column. You can read about the GIN-based and GiST-based indices in the PostgreSQL documentation. Here’s how we add our index.

CREATE INDEX employee_fts_index ON employee USING gin(fts_document);

That’s it. Our employee table now has an always up-to-date search index that we can use. Before we continue, let’s populate the search vectors for the existing rows. We’ll then test everything out to make sure things are working a-ok.

dan=# UPDATE employee SET fts_document=employee_fts_document(id);
dan=# SELECT COUNT(*) FROM employee WHERE fts_document @@ 'ron';
 count
-------
     1
(1 row)

Success!

South

Now that we understand the internals, wouldn’t it be nice if we could keep our databases up-to-date without having to manually copy and paste commands on our PostgreSQL shell? South to the rescue!

We’ll need to generate a blank migration.

$ ./manage.py schemamigration app create_employee_fts --empty
Created 0001_create_employee_fts.py. You must now edit this migration and add the code for each direction.

Next, open up the migration and copy the SQL above into the forwards and backwards methods, as illustrated below.

# ... imports, etc ...

from django.db import connection

class Migration(SchemaMigration):

    def forwards(self, orm):
        sql = """
CREATE FUNCTION employee_fts_document(integer) RETURNS tsvector AS $$
DECLARE
    employee_document TEXT;
BEGIN
    SELECT concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip) INTO employee_document FROM employee WHERE id=$1;
    RETURN to_tsvector('pg_catalog.simple', employee_document);
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION employee_fts_document_trigger() RETURNS TRIGGER AS $$
BEGIN
    NEW.fts_document=employee_fts_document(NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""

        cursor = connection.cursor()
        cursor.execute(sql)
        cursor.execute("ALTER TABLE employee ADD COLUMN fts_document tsvector")
        cursor.execute("UPDATE employee SET fts_document=employee_fts_document(id)")
        cursor.execute("CREATE TRIGGER employee_fts_update_trigger BEFORE UPDATE ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger()")
        cursor.execute("CREATE TRIGGER employee_fts_insert_trigger BEFORE INSERT ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger()")
        cursor.execute("CREATE INDEX employee_fts_index ON employee USING gin(fts_document)")

    def backwards(self, orm):
        cursor = connection.cursor()
        cursor.execute("DROP INDEX employee_fts_index")
        cursor.execute("ALTER TABLE employee DROP COLUMN fts_document")
        cursor.execute("DROP TRIGGER employee_fts_update_trigger ON employee")
        cursor.execute("DROP TRIGGER employee_fts_insert_trigger ON employee")
        cursor.execute("DROP FUNCTION employee_fts_document (integer)")
        cursor.execute("DROP FUNCTION employee_fts_document_trigger ()")

    models = {
        """
        App-specific model definitions here...
        """
    }

Now, just run this migration…

$ ./manage.py migrate app

…and watch as your FTS column, indices, and triggers are automagically created.

The Django Admin

OK, so we have our FTS document created, have generated a migration to keep things in sync across machines, and are indexing everything we might possibly want to search on to find an employee. The next step is to use what we have and integrate it into the Django Admin. This assumes that you have an EmployeeAdmin object similar to the below:

class EmployeeAdmin(admin.ModelAdmin):
    search_fields = ("first_name", "last_name", "address_1", "address_2")
    # ... other admin configuration here ...

admin.site.register(models.Employee, EmployeeAdmin)

Our goal is to make searching with the Django Admin use our shiny new FTS index. To do this, we’re going to need to override the get_search_results method in EmployeeAdmin to something like this:

class EmployeeAdmin(admin.ModelAdmin):
    # ...

    def get_search_results(self, request, queryset, search_term):
        if search_term == '':
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, search_term)
        else:
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, None)
            queryset = queryset.extra(
                where=["employee.fts_document @@ to_tsquery('simple', %s)"],
                params=[search_term]
            )

        return queryset, distinct

Looks close, but we’re missing something. to_tsquery will expect a “properly-formed” query, and we can’t always trust or depend on users of the admin to know PostgreSQL search syntax (don’t blame them). We need to somehow sanitize the input before feeding it into PostgreSQL.

Here are some examples of things that we might want to handle:

  • Strip padded whitespace.
  • Remove duplicated whitespace.
  • Add &’s between words that are only separated by spaces.
  • Replace double quotes (which PostgreSQL doesn’t like) with single quotes.
  • Clear up situations where search operators are used ambiguously (e.g., “ron & | swanson”).
  • Add a prefix wildcard to every search term, so that a search for “swan” will return “Ron Swanson” in the search results.

Luckily for you, I’ve already done all of the heavy lifting. Below is a function called sanitize_search_term which takes a string and forms it into a proper PostgreSQL search query. For example, it turns “ron swanson” into “ron:* & swanson:*”, and “ron & | swanson” into “ron:* | swanson:*”.

import re
import string

def sanitize_search_term(term):
    # Replace all punctuation (except the operators and quotes we allow)
    # with spaces.
    allowed_punctuation = set(['&', '|', '"', "'"])
    all_punctuation = set(string.punctuation)
    punctuation = "".join(all_punctuation - allowed_punctuation)
    term = re.sub(r"[{}]+".format(re.escape(punctuation)), " ", term)

    # Substitute all double quotes with single quotes.
    term = term.replace('"', "'")
    term = re.sub(r"[']+", "'", term)

    # Create regexes to find quoted strings, bare words, and operator runs.
    quoted_strings_re = re.compile(r"('[^']*')")
    space_between_words_re = re.compile(r'([^ &|])[ ]+([^ &|])')
    spaces_surrounding_letter_re = re.compile(r'[ ]+([^ &|])[ ]+')
    multiple_operator_re = re.compile(r"[ &]+(&|\|)[ &]+")

    tokens = quoted_strings_re.split(term)
    processed_tokens = []
    for token in tokens:
        # Remove all surrounding whitespace.
        token = token.strip()

        if token in ['', "'"]:
            continue

        if token[0] != "'":
            # Surround single characters with &'s.
            token = spaces_surrounding_letter_re.sub(r' & \1 & ', token)

            # Specify '&' between words that have neither | nor & specified.
            token = space_between_words_re.sub(r'\1 & \2', token)

            # Add a prefix wildcard to every search term.
            token = re.sub(r'([^ &|]+)', r'\1:*', token)

        processed_tokens.append(token)

    term = " & ".join(processed_tokens)

    # Collapse runs of mixed operators (e.g., " & | ") into a single operator.
    term = multiple_operator_re.sub(r" \1 ", term)

    # Escape single quotes.
    return term.replace("'", "''")

Combining everything together, we get the following:

class EmployeeAdmin(admin.ModelAdmin):
    # ...

    def get_search_results(self, request, queryset, search_term):
        if search_term == '':
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, search_term)
        else:
            search_term = sanitize_search_term(search_term)
            queryset, distinct = super(EmployeeAdmin, self) \
                    .get_search_results(request, queryset, None)
            queryset = queryset.extra(
                where=["employee.fts_document @@ to_tsquery('simple', %s)"],
                params=[search_term]
            )

        return queryset, distinct

And…we’re done! Searches in the admin will now use PostgreSQL FTS.

Here’s to speedy searching!

Friday, November 8, 2013

Review: iPhone 5s Case

I normally don’t write reviews, but I figured it’s never a bad time to start.

I’ll jump right into it.

I have never before gotten a case for my iPhone. Cases add bulk, take away from the aesthetic of the phone, and are just another “thing” to keep track of. But when I heard about the leather case that Apple created specifically for the iPhone 5s, I thought that it would be the end of my no-case journey.

It came a couple of days ago in the mail. I ordered from Best Buy online. The cases come in 6 color choices—brown, beige, black, yellow, blue, and red. I went with black, since it was the only one that could really go with space gray. Not to mention, I’d read a bunch of reviews that said the other colors (especially the brighter ones) were prone to picking up dirt quickly.

The case looks and feels great in the hand. The leather is no frills and simple. It’s not the sort of leather you’d see on a cowboy boot, but supposedly it’s real. There is a stamped Apple logo on the back. It looks pretty cool, and gives it more authenticity, I guess.

Putting the case on is quite easy. It really is fitted quite well to the form of the phone. Sort of expected for a case, especially if you’re also the manufacturer of the phone itself, but I suppose it’s still a good thing.

As for bulkiness, it didn’t add much, but it was enough to be noticeable. In terms of weight and form factor, the iPhone 5s with the leather case “felt” like my 4S in the hand. This isn’t necessarily a bad thing (some people might like that), but it’s not totally for me.

The biggest problems I have with the case are more functional and less visual.

1. The leather cover over the power button is a bit too stiff. One should not have to struggle when pressing the power button.
2. The volume buttons are not covered like the power button is, but they’re still quite hard to reach. It’s not impossible by any means, but it’s just hard enough to be annoying.

Given that I wasn’t blown away by the look and feel, and given the usability problems above, I decided to return it. I think I’m still in the market for a case, just not the Apple one.

Monday, July 8, 2013

So long, California, and thanks for all the fish

It’s been almost four years since my wife and I made the decision to move to Los Angeles after college. We came here for a lot of reasons, but the biggest ones were family, the lively tech scene, and the practically perfect weather.

All in all, I really do love this place.

But after nearly four years of calling it home, the time has come to make a change.

Real estate prices are through the roof and state income taxes are amongst the highest in the country. From a business perspective, franchise taxes (which you have to pay even if your business loses money) sting and processing times for even the most routine paperwork take months.

Even outside of ritzy areas like Beverly Hills, houses advertised as “teardowns” by brokers are going for prices in the neighborhood of $1.2MM and up. Homes at every price range are getting multiple all-cash offers within days of going on the market.

Renters are fairly shielded from this reality, but if you’re thinking about becoming a homeowner here, it can be a struggle.

We decided that we didn’t want to struggle. And why should we? More so than ever before, where you live matters less than what you can do. And that’s an amazing thing.

So, where to? The answer was clear to both my wife and me from the beginning: Austin, Texas.

Austin boasts no state income tax, real estate a young couple can actually afford, an exploding tech scene, and beautiful landscapes. Throw in great food, music, and culture, with a taste of Fiber Internet, and you’ve essentially described my perfect city.

As a part of my move, my company, Aurora, has become Lionheart Software, a Texas LLC. I’ll still be doing everything I’m doing now: helping small teams build great software. And there’s a good chance Pushpin will be getting some company.

I’ll be moving out later this month and am excited beyond words. If you’re an Austinite, send me an email. I’d love to meet up.


