amqplib and bpgsql on Google Code

I've created Google Code projects for my amqplib and bpgsql packages, to take advantage of their nice infrastructure including issue tracking.

bpgsql 2.0 alpha 1

For many years I've been using bpgsql, my own pure-Python PostgreSQL client, and I've finally sat down and got things somewhat polished up enough to put together as a real package.

One thing that motivated the work was the desire to use in with Django - after seeing psycopg2 do some funny things when used under mod_wsgi. There's no doubt it's slower, but it's much easier to hack on, and might be of interest to people running Djano under other Pythons such as PyPy or Jython. Getting it to pass all the Django unittests really ironed out a lot of bugs, so I think it's in fairly decent shape now.

PostgreSQL full text search with Django

PostgreSQL 8.3 is coming out soon with full text search integrated into the core database system. It's pretty well documented in chapter 12 of the PostgreSQL docs. The docs are a bit intimidating, but it turns out to be pretty easy to use with Django.

Let's say you're doing a stereotypical blog application, named 'blog', and have a model for entries such as:

from django.db import models

class Entry(models.Model):
    title = models.CharField(max_length=128)
    body = models.TextField()

after adding your app, and doing a syncdb, you should have a PostgreSQL table named blog_entry. You'll need to connect to the database with psql, or probably more conveniently with manage.py dbshell to execute a few SQL commands to setup full text searching.

PostgreSQL Setup

First, you'll want to add a column to hold a tsvector for the blog entry, which is a preprocessed version of text optimized for searching.

ALTER TABLE blog_entry ADD COLUMN body_tsv tsvector;

Next, you'll want to add a trigger to update the body_tsv column whenever a record is inserted or updated:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON blog_entry 
    FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(body_tsv, 'pg_catalog.english', body);

You'll probably also want an index on that field, to make searches more efficient:

CREATE INDEX blog_entry_tsv ON blog_entry USING gin(body_tsv);

Lastly, if you already have records in the blog_entry table, you'll want to update the body_tsv column for those records (any new records will be automatically taken care of by the trigger).

UPDATE blog_entry SET body_tsv=to_tsvector(body);

As a test, to search for the words 'hello world' for example, you could try:

SELECT title FROM blog_entry WHERE body_tsv @@ plainto_tsquery('hello world');

and get back a list of entries with those words in the body.

Full text searching in Django

Full text searching through the Django ORM is possible using the .extra() queryset modifier. To fetch the same entries with 'hello world' we searched for in raw SQL, you'd do something like:

q = 'hello world'
queryset = Entry.objects.extra(
    where=['body_tsv @@ plainto_tsquery(%s)'], 
    params=[q])
for entry in queryset:
    print entry.title

The full text features in PostgreSQL also allow for ranking the search results using the PostgreSQL ts_rank_cd() function, and generating fragments of documents with search terms highlighted using the ts_headline() function. An example of raw SQL for doing this might be:

SELECT title, ts_headline(body, query) AS snippet, ts_rank_cd(body_tsv, query, 32) AS rank
  FROM blog_entry, plainto_tsquery('hello world') AS query
  WHERE body_tsv @@ query
  ORDER BY rank DESC;

The Django ORM can generate an equivalent query with

q = 'hello world'
queryset = Entry.objects.extra(
    select={
        'snippet': "ts_headline(body, query)",
        'rank': "ts_rank_cd(body_tsv, query, 32)",
        },
    tables=["plainto_tsquery(%s) as query"],
    where=["body_tsv @@ query"],
    params=[q]
    ).order_by('-rank')

for entry in queryset:
    print entry.title, entry.snippet, entry.rank

Addendum, 2008-04-30

With the new Queryset Refactor (QS-RF) branch of Django, one of the few backwards-incompatible changes is that you can't use an order_by method with a field generated by an extra method. Instead, ordering by that field should be specified using an order_by parameter in the extra method, as in (the change is only in the last two lines)

Also, having a function call in the 'tables' parameter no longer seems to work, so as a workaround for now I've been repeating plainto_tsquery() 3 times:

q = 'hello world'
queryset = Entry.objects.extra(
    select={
        'snippet': "ts_headline(body, plainto_tsquery(%s))",
        'rank': "ts_rank_cd(body_tsv, plainto_tsquery(%s), 32)",
        },
    where=["body_tsv @@ plainto_tsquery(%s)"],
    params=[q],
    select_params=[q, q],
    order_by=('-rank',)
    )

There is a lot more that can be done with the PostgreSQL full text search feature, check the docs for more info. But hopefully this is enough to get started with.

Addendum, 2009-01-28

Ross Poulton has a writup on Full-text searching in Django with PostgreSQL and tsearch2 - using PostgreSQL 8.1 (on a Debian system) where tsearch2 was a contibuted addon module.