bpgsql 2.0 alpha 1

For many years I've been using bpgsql, my own pure-Python PostgreSQL client, and I've finally sat down and got things somewhat polished up enough to put together as a real package.

One thing that motivated the work was the desire to use in with Django - after seeing psycopg2 do some funny things when used under mod_wsgi. There's no doubt it's slower, but it's much easier to hack on, and might be of interest to people running Djano under other Pythons such as PyPy or Jython. Getting it to pass all the Django unittests really ironed out a lot of bugs, so I think it's in fairly decent shape now.

Django comments

I've tried adding a comments feature to this site, using the stock django.contrib.comments application, mostly following the instructions from the Django Wiki, and also using James Bennet's comment-utils described here.

The Wiki is pretty thorough, but one thing I think it missed was that in the template for displaying comments, you should have a

<a name="c{{ comment.id }}" />

Near the beginning of where your comment starts being output, so the standard get_absolute_url() method for the FreeComment model has a place to point to.

A couple things that were not really talked about in the comment-utils docs were where to get the Python Akismet library, and the need for a comment_utils/comment_notification_email.txt template if you want e-mail notifications. I ended up with something simple like:

Comment: http://127.0.0.1{{ comment.get_absolute_url }}
From: {{ comment.person_name }}

-----
{{ comment.comment }}
-----

Admin: http://127.0.0.1/admin/comments/freecomment/{{comment.id}}/

I'm a bit skeptical about how well the Akismet spam-filtering service will work, but we'll see how it goes.

Migrating Django databases

I started out a few Django projects using SQLite as the DB backend a while back, and decided to upgrade them to PostgreSQL. Turns out using the manage.py dumpdata and loaddata, the switch can be fairly smooth. The steps turned out to be something like:

  • Create empty database in PostgreSQL

  • Copy the existing project's settings.py to a new_settings.py

  • Edit new_settings.py to replace the SQLite connection info with PgSQL info.

  • Run ./manage.py syncdb --settings=new_settings (note that there's not a '.py' at the end of that parameter) to create the PgSQL tables (and also test the connection info). Say 'no' when asked if you want to create a user.

  • Run ./manage.py dbshell --settings=new_settings to get into the psql shell, to clean out a few tables populated by syncdb

    • delete from auth_permission;
    • delete from django_content_type;
  • Dump the data from the old DB backend with: ./manage.py dumpdata >olddata.json

  • Load the data into the new DB backend with: ./manage.py loaddata --settings=new_settings olddata.json

  • Swap the settings files: mv settings.py old_settings.py; mv new_settings.py settings.py

  • Restart the server

Barring any complications, the entire switch can take just a few minutes. Of course, for anything you really care about you'd do some more testing before going live, but that's up to you.

PostgreSQL full text search with Django

PostgreSQL 8.3 is coming out soon with full text search integrated into the core database system. It's pretty well documented in chapter 12 of the PostgreSQL docs. The docs are a bit intimidating, but it turns out to be pretty easy to use with Django.

Let's say you're doing a stereotypical blog application, named 'blog', and have a model for entries such as:

from django.db import models

class Entry(models.Model):
    title = models.CharField(max_length=128)
    body = models.TextField()

after adding your app, and doing a syncdb, you should have a PostgreSQL table named blog_entry. You'll need to connect to the database with psql, or probably more conveniently with manage.py dbshell to execute a few SQL commands to setup full text searching.

PostgreSQL Setup

First, you'll want to add a column to hold a tsvector for the blog entry, which is a preprocessed version of text optimized for searching.

ALTER TABLE blog_entry ADD COLUMN body_tsv tsvector;

Next, you'll want to add a trigger to update the body_tsv column whenever a record is inserted or updated:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON blog_entry 
    FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(body_tsv, 'pg_catalog.english', body);

You'll probably also want an index on that field, to make searches more efficient:

CREATE INDEX blog_entry_tsv ON blog_entry USING gin(body_tsv);

Lastly, if you already have records in the blog_entry table, you'll want to update the body_tsv column for those records (any new records will be automatically taken care of by the trigger).

UPDATE blog_entry SET body_tsv=to_tsvector(body);

As a test, to search for the words 'hello world' for example, you could try:

SELECT title FROM blog_entry WHERE body_tsv @@ plainto_tsquery('hello world');

and get back a list of entries with those words in the body.

Full text searching in Django

Full text searching through the Django ORM is possible using the .extra() queryset modifier. To fetch the same entries with 'hello world' we searched for in raw SQL, you'd do something like:

q = 'hello world'
queryset = Entry.objects.extra(
    where=['body_tsv @@ plainto_tsquery(%s)'], 
    params=[q])
for entry in queryset:
    print entry.title

The full text features in PostgreSQL also allow for ranking the search results using the PostgreSQL ts_rank_cd() function, and generating fragments of documents with search terms highlighted using the ts_headline() function. An example of raw SQL for doing this might be:

SELECT title, ts_headline(body, query) AS snippet, ts_rank_cd(body_tsv, query, 32) AS rank
  FROM blog_entry, plainto_tsquery('hello world') AS query
  WHERE body_tsv @@ query
  ORDER BY rank DESC;

The Django ORM can generate an equivalent query with

q = 'hello world'
queryset = Entry.objects.extra(
    select={
        'snippet': "ts_headline(body, query)",
        'rank': "ts_rank_cd(body_tsv, query, 32)",
        },
    tables=["plainto_tsquery(%s) as query"],
    where=["body_tsv @@ query"],
    params=[q]
    ).order_by('-rank')

for entry in queryset:
    print entry.title, entry.snippet, entry.rank

Addendum, 2008-04-30

With the new Queryset Refactor (QS-RF) branch of Django, one of the few backwards-incompatible changes is that you can't use an order_by method with a field generated by an extra method. Instead, ordering by that field should be specified using an order_by parameter in the extra method, as in (the change is only in the last two lines)

Also, having a function call in the 'tables' parameter no longer seems to work, so as a workaround for now I've been repeating plainto_tsquery() 3 times:

q = 'hello world'
queryset = Entry.objects.extra(
    select={
        'snippet': "ts_headline(body, plainto_tsquery(%s))",
        'rank': "ts_rank_cd(body_tsv, plainto_tsquery(%s), 32)",
        },
    where=["body_tsv @@ plainto_tsquery(%s)"],
    params=[q],
    select_params=[q, q],
    order_by=('-rank',)
    )

There is a lot more that can be done with the PostgreSQL full text search feature, check the docs for more info. But hopefully this is enough to get started with.

Addendum, 2009-01-28

Ross Poulton has a writup on Full-text searching in Django with PostgreSQL and tsearch2 - using PostgreSQL 8.1 (on a Debian system) where tsearch2 was a contibuted addon module.

mod_scgi redirection

While working on a new Django project, I noticed something odd about running it under mod_scgi: if you were POSTing to a URL, /foo for example, and the view for that URL did a relative redirect, as in django.http.HttpResponseRedirect('/bar'), the 302 redirect wasn't making it back to the browser. Instead, the browser was acting like the result of POST /foo was a 200 OK followed by the data you'd receive from GET /bar, without the browser knowing that it coming from a new location. The big drawback to this is that if you do a reload, the browser tries to POST to /foo again, instead of just GET /bar. The Django docs recommend always responding to POSTs with redirects, just for this reason.

Strictly speaking, redirects should be absolute URLs (see section 14.30 in the HTTP specs), and if you use one of those, it acts as expected. Django is full of relative redirects, but the framework at this time doesn't seem to try and convert them to absolute. There is ticket #987 in the Django Trac that talks about this a bit.

Browsers seem to handle relative redirects OK through, and that behavior doesn't occur with the Django test http server. Having mod_scgi conceal what Django is doing is not so good.

Digging into the mod_scgi sourcecode apache2/mod_scgi.c reveals a section of code that's causing this change:

location = apr_table_get(r->headers_out, "Location");

if (location && location[0] == '/' &&
    ((r->status == HTTP_OK) || ap_is_HTTP_REDIRECT(r->status))) {

    apr_brigade_destroy(bb);

    /* Internal redirect -- fake-up a pseudo-request */
    r->status = HTTP_OK;

    /* This redirect needs to be a GET no matter what the original
    * method was.
    */
    r->method = apr_pstrdup(r->pool, "GET");
    r->method_number = M_GET;

    ap_internal_redirect_handler(location, r);
    return OK;
}

Tossing that section of code causes mod_scgi to leave the relative redirects alone.

Django, SCGI, and AJP

I've been doing a lot with Django lately, and initially set it up using mod_python as the Django docs recommend, but still have some reservations about that kind of arrangement. I'd like to go back to running it under SCGI or something similar.

Django has support builtin for FastCGI, but after trying to install mod_fastcgi in my Apache 2.0.x setup, decided it was a PITA. mod_scgi is quite easy to setup in Apache (even though the documentation is mostly nonexistent). After finding where Django implements its FastCGI support using the flup module, I saw that with just a few minor tweaks Django could be made to support all of flup's protocols, including SCGI and AJP (Apache Jserv Protocol).

AJP turns out to be very interesting because it's included standard with Apache 2.2 as mod_proxy_ajp, and can work with mod_proxy_balancer - meaning you could setup multiple Django instances and have Apache share the load between them.

After testing a bit, I submitted a patch, and will probably switch to running my Django sites as AJP servers managed by daemontools, and frontended by Apache 2.2

Django-powered

Haven't posted anything in a while, because I've been redoing this site in Django. Previously I had a photo-gallery written as a direct mod_python app, the software part was Zope 2.x, and this blog was in PyBlosxom.

mod_python is pretty bare-bones (as it should be), and I've been down on Zope for some time now. PyBlosxom was nice, but I've become quite a Django fan, and felt I could do much more with that framework. So I figured it would be good to do a kind of unification - and learn some more Django at the same time.

I'm using Markdown for editing the bodies of blog entries now, and found it was pretty easy to transfer the old PyBlosxom files into Django database records, with Markdown mostly able to handle the HTML I had entered for those old entries with just a few minor tweaks.

The Django URLs were planned so that Apache would be able to rewrite the old PyBlosxom URLs into the new format - so hopefully existing links will still work. URLs for the old feeds should be handled transparently, but I'm omitting the old entries from the feeds because their links had changed, and didn't want them to reappear as new entries for whoever's subscribed to them.

Returned from PyCon

Got back from PyCon 2006, in mostly one piece. Picked up a terrible cold at the conference, I suppose scrounging food off the same buffet tables as 400 other people wasn't the most hygenic thing in the world.

Attended the mainly web-oriented sessions, came away very impressed with Django. I had sort of blown it off before because I didn't like the look of the templating language, and the ORM seemed weird. But after seeing what's coming in the Removing the Magic branch, I think it will be much much nicer. Was even inspired to spend the little time I had there Monday morning and afternoon sprinting with the Django guys, but I don't see how one can sprint effectively in such a short time with the limited knowledge of the codebase I had. Maybe if I go next year ... and I know more Django ... and can spend more than a day there, then I could accomplish something useful during the time.

The TurboGears guys demonstrated some nice things with AJAX widgets, but the SQLObject part of TG has given me trouble in the past when working with an existing DB, and seems to get in the way more than it helps. Even so, the TG guys, and Ian Bicking seemed pretty cool, so I hope they polish things up a bit more. Maybe SQLObject 2 will be the answer, or maybe a switch to SQLAlchemy (which wasn't represented at the conference), would make TG a nicer environment to work in.

I've been struggling with Zope for some years now. From a user standpoint I guess it's OK, from a programmer standpoint it's a nightmare, both 2.x and 3.x. The documentation and community attitude have rubbed me wrong for a long time. I attended a couple Zope sessions at the conference, but didn't hear anything to inspire me to keep up with it. I'll probably switch what little Zope things I have going to Django/TurboGears/CherryPy/whatever.

The PyParsing presentation on writing an adventure game was interesting, wish I could have attended the more in-depth one but it conflicted with a Django session. PyParsing looks to make a hard job pretty easy, and I'd love to play with it somewhere.

The party at NerdBooks had some decent food, they had a pretty deep selection of books, and the prices on some of the things I looked up were much better than Amazon. Will definitely look there next time I need something.

Lastly, I hope Django or someone who was at the sprint uses the codename "Vacuum Assassin" somewhere. That would just be too cool.