Migrating Django databases

Monday, January 7, 2008

I started out a few Django projects using SQLite as the DB backend a while back, and decided to upgrade them to PostgreSQL. Turns out using the manage.py dumpdata and loaddata, the switch can be fairly smooth. The steps turned out to be something like:

Create empty database in PostgreSQL
Copy the existing project's settings.py to a new_settings.py
Edit new_settings.py to replace the SQLite connection info with PgSQL info.
Run ./manage.py syncdb --settings=new_settings (note that there's not a '.py' at the end of that parameter) to create the PgSQL tables (and also test the connection info). Say 'no' when asked if you want to create a user.
Run ./manage.py dbshell --settings=new_settings to get into the psql shell, to clean out a few tables populated by syncdb
- delete from auth_permission;
- delete from django_content_type;
Dump the data from the old DB backend with: ./manage.py dumpdata >olddata.json
Load the data into the new DB backend with: ./manage.py loaddata --settings=new_settings olddata.json
Swap the settings files: mv settings.py old_settings.py; mv new_settings.py settings.py
Restart the server

Barring any complications, the entire switch can take just a few minutes. Of course, for anything you really care about you'd do some more testing before going live, but that's up to you.

PostgreSQL full text search with Django

Wednesday, January 2, 2008

PostgreSQL 8.3 is coming out soon with full text search integrated into the core database system. It's pretty well documented in chapter 12 of the PostgreSQL docs. The docs are a bit intimidating, but it turns out to be pretty easy to use with Django.

Let's say you're doing a stereotypical blog application, named 'blog', and have a model for entries such as:

from django.db import models

class Entry(models.Model):
    title = models.CharField(max_length=128)
    body = models.TextField()

after adding your app, and doing a syncdb, you should have a PostgreSQL table named blog_entry. You'll need to connect to the database with psql, or probably more conveniently with manage.py dbshell to execute a few SQL commands to setup full text searching.

PostgreSQL Setup

First, you'll want to add a column to hold a tsvector for the blog entry, which is a preprocessed version of text optimized for searching.

ALTER TABLE blog_entry ADD COLUMN body_tsv tsvector;

Next, you'll want to add a trigger to update the body_tsv column whenever a record is inserted or updated:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON blog_entry 
    FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(body_tsv, 'pg_catalog.english', body);

You'll probably also want an index on that field, to make searches more efficient:

CREATE INDEX blog_entry_tsv ON blog_entry USING gin(body_tsv);

Lastly, if you already have records in the blog_entry table, you'll want to update the body_tsv column for those records (any new records will be automatically taken care of by the trigger).

UPDATE blog_entry SET body_tsv=to_tsvector(body);

As a test, to search for the words 'hello world' for example, you could try:

SELECT title FROM blog_entry WHERE body_tsv @@ plainto_tsquery('hello world');

and get back a list of entries with those words in the body.

Full text searching in Django

Full text searching through the Django ORM is possible using the .extra() queryset modifier. To fetch the same entries with 'hello world' we searched for in raw SQL, you'd do something like:

q = 'hello world'
queryset = Entry.objects.extra(
    where=['body_tsv @@ plainto_tsquery(%s)'], 
    params=[q])
for entry in queryset:
    print entry.title

The full text features in PostgreSQL also allow for ranking the search results using the PostgreSQL ts_rank_cd() function, and generating fragments of documents with search terms highlighted using the ts_headline() function. An example of raw SQL for doing this might be:

SELECT title, ts_headline(body, query) AS snippet, ts_rank_cd(body_tsv, query, 32) AS rank
  FROM blog_entry, plainto_tsquery('hello world') AS query
  WHERE body_tsv @@ query
  ORDER BY rank DESC;

The Django ORM can generate an equivalent query with

q = 'hello world'
queryset = Entry.objects.extra(
    select={
        'snippet': "ts_headline(body, query)",
        'rank': "ts_rank_cd(body_tsv, query, 32)",
        },
    tables=["plainto_tsquery(%s) as query"],
    where=["body_tsv @@ query"],
    params=[q]
    ).order_by('-rank')

for entry in queryset:
    print entry.title, entry.snippet, entry.rank

Addendum, 2008-04-30

With the new Queryset Refactor (QS-RF) branch of Django, one of the few backwards-incompatible changes is that you can't use an order_by method with a field generated by an extra method. Instead, ordering by that field should be specified using an order_by parameter in the extra method, as in (the change is only in the last two lines)

Also, having a function call in the 'tables' parameter no longer seems to work, so as a workaround for now I've been repeating plainto_tsquery() 3 times:

q = 'hello world'
queryset = Entry.objects.extra(
    select={
        'snippet': "ts_headline(body, plainto_tsquery(%s))",
        'rank': "ts_rank_cd(body_tsv, plainto_tsquery(%s), 32)",
        },
    where=["body_tsv @@ plainto_tsquery(%s)"],
    params=[q],
    select_params=[q, q],
    order_by=('-rank',)
    )

There is a lot more that can be done with the PostgreSQL full text search feature, check the docs for more info. But hopefully this is enough to get started with.

Addendum, 2009-01-28

Ross Poulton has a writup on Full-text searching in Django with PostgreSQL and tsearch2 - using PostgreSQL 8.1 (on a Debian system) where tsearch2 was a contibuted addon module.

DHCP Failover

Tuesday, January 1, 2008

I've been setting up DHCP servers at work to use the failover feature available in ISC-DHCP (the net/isc-dhcp3-server port in FreeBSD). That allows for two servers to work together, sharing a pool of addresses and keeping track of leases handed out by both servers. The dhcpd.conf(5) manpage discusses this feature somewhat. I'll jot down some notes here that are a bit more specific about what all had to be done.

Let's assume the two DHCP servers will have the IP addresses 10.0.0.10 and 10.0.0.20 - with .10 being the 'primary' and .20 being the 'secondary' server. It doesn't really matter which is which - although the logs on the 'primary' server seem a bit more complete. We'll also assume these servers are giving out addresses from a pool of numbers 10.0.0.100-10.0.0.200, and are running DNS caches - so that the DHCP clients should be told to use them for DNS servers.

We'll also use TCP port 520 for communications between the DHCP servers, so be sure to allow for that through any firewalls.

Configuration

On the 10.0.0.10 'primary' machine, the /usr/local/etc/dhcpd.conf file might look like:

failover peer "foo" 
    {
    primary;
    mclt 1800;  # only specified in the primary
    split 128;  # only specified in the primary

    address 10.0.0.10;
    port 520;

    peer address 10.0.0.20;
    peer port 520;

    max-response-delay 30;
    max-unacked-updates 10;
    load balance max seconds 3;                
    }

option domain-name-servers 10.0.0.10, 10.0.0.20;

include "/usr/local/etc/dhcp/master.conf";

and the same file on the 'secondary' 10.0.0.20 machine is very similar:

failover peer "foo" 
    {
    secondary;

    address 10.0.0.20;
    port 520;

    peer address 10.0.0.10;
    peer port 520;

    max-response-delay 30;
    max-unacked-updates 10;
    load balance max seconds 3;                
    }

option domain-name-servers 10.0.0.20, 10.0.0.10;

include "/usr/local/etc/dhcp/master.conf";

The failover peer name, "foo" in this example, will also appear in the DHCP pool configuration, and will be used in a script change the failover state later on.

I created a directory /usr/local/etc/dhcp/ to hold the DHCP config files that will be common to both DHCP servers. That way, it's just a matter of copying the entire directory between servers when a change is made. The /usr/local/etc/dhcp/master.conf file I included from the main server config might look something like:

omapi-port 7911;

default-lease-time 16200;  # 4.5 hours
max-lease-time 16200;

subnet 10.0.0.0 netmask 255.255.255.0
        {
        option routers 10.0.0.1;

        pool
                {
                failover peer "foo";
                deny dynamic bootp clients;

                range 10.0.0.100  10.0.0.200;
                }
        }

The deny dynamic bootp clients;directive is required for any failover pool. The omapi-port 7911; directive will be useful later on for when a server needs to be put into the 'partner-down' state because the other server will be off for a while.

To sync and restart the two servers whenever there's a change to the DHCP configuration, I setup the 10.0.0.20 server to allow root logins through SSH from the root account of 10.0.0.10 using public/private keys, and then put a script named restart_dhcp on the 10.0.0.10 server that looks like:

#!/bin/sh
/usr/local/etc/rc.d/isc-dhcpd restart
scp -pr /usr/local/etc/dhcp 10.0.0.20:/usr/local/etc
ssh 10.0.0.20 /usr/local/etc/rc.d/isc-dhcpd restart

That copies the entire /usr/local/etc/dhcp directory, so if you need to break up your config into more files that get included, they'll all be copied over when you do a restart.

Failover

When one server stops unexpectedly, the remaining server will go into a communications-interrupted state, and continue offering up addresses from its half of the DHCP pools, and will renew leases it knows were given out by the other server.

If the downed server will be out for longer than the mclt value from the server config (1800 seconds (30 minutes) in the examples above). You may want to let the surviving server know that it's on its own so that it can use the entire pool of available addresses. This is done by putting the surviving server into partner-down state.

This has to be done after the other server is really down. Doing it before shutting down the other server doesn't work, because the two servers will get themselves back into a normal state very quickly, probably before you get a chance to shut the 2nd server down.

The omshell program can be used to communicate with a running DHCP server daemon to control it in various ways, including changing the failover state. I put this partner-down script on both the primary and secondary servers:

#!/bin/sh
omshell << EOF
connect
new failover-state
set name = "foo"
open
set local-state = 1
update
EOF

so when one server is going to be down for a while, I can connect to the other server and just run that script.

When the downed server comes back up, the two servers automatically start communicating and eventually get themselves back into a normal state. But only after the recovering server has spent mclt time in recover-wait state, where it renews existing leases but won't offer up new ones. So you probably wouldn't want to go into a partner-down state if the other server will be down for less than that amount of time.

Running the partner-down script when both servers are really up and running doesn't seem to do any harm, as mentioned above the two servers will quickly move back into a normal state. This can be seen by watching the DHCP logs.

Clean Failover

It's possible using OMAPI to shut down a server and have the remaining server automatically switch to "partner-down" mode in a clean way, so that when the downed server comes back up both servers quickly move to "normal" mode, without spending the mclt time in recover-wait state. This script does the trick:

#!/bin/sh
omshell << EOF
connect
new control
open
set state = 2
update
EOF

When run, it causes the dhcpd daemon on the current server to shutdown, and the dhcpd daemon on the other server takes over completely the DHCP pools.

Update: I wrote a bit more about DHCP failover, talking about how to deal with a clock sync problem when the machine boots by scheduling a dhcpd restart a few minutes after boot time.

amqplib 0.2

Saturday, December 8, 2007

I noticed the other day that my two RabbitMQ servers were consuming more and more memory - one had gone from an initial 22mb size to over 600mb. As I sat and watched it would grow by 4k or so at regular intervals.

I think what had happened is that I had created an exchange which received lots of messages, and then ran scripts that created automatically-named queues bound to that exchange, but defaulted to not auto-deleting them. I ran these scripts many many times, which left many many queues on the server, all swelling up with lots of messages that would never be consumed. Good thing I caught it, it might have eventually killed my server.

This message in the rabbitmq-discuss list gives useful info on how to get in and see what queues exist on a RabbitMQ server, and how big they are.

It seems to me that having the auto_delete parameter of Channel.queue_declare() default to False is a really bad idea. If you want to keep a queue around after your program exits, I think you should explicitly say so, so I changed the default to True. The Channel.exchange_declare() also has a auto_delete parameter, which I also change the default to True for consistency.

I also did some work on supporting the redirect feature of AMQP, where a server you connect to can tell you to go somewhere else, useful for balancing a cluster. I don't actually have a RabbitMQ cluster, so I put together a utility to fake an AMQP server that tells you to redirect. It works well enough to run the uniitests unchanged against it, each test case being redirected from the fake server to the real server.

With those two changes, I put out a 0.2 release, on my software page and on the Cheeseshop.

amqplib 0.1

Monday, December 3, 2007

I broke down and put together a tarball of my Python AMQP library, and stuck it up as a release 0.1 on the software section of this website, under the section py-amqplib.

Interestingly, someone hit the page and downloaded the tarball less than 3 minutes after I dropped a note about it to the RabbitMQ discussion list - so I guess there's at least some interest out there in this sort of thing :)

AMQP

Wednesday, November 14, 2007

For some time I've been using Spread as a messaging bus between processes and machines using Python bindings, but there are a few things that make it not quite ideal for what I've been trying to do.

There's no access control
Messages are non-persistent - so if a receiver daemon is down and some important message comes through, it's SOL
The wire protocol is not documented, the docs basically just say use the C client library.
The Python bindings to the C library have a glitch of some sort when used in py-exim-localscan, I had to resort to a small ctypes wrapper to get around this.

I ran across the Advanced Message Queuing Protocol(AMQP), with RabbitMQ as one implementation of the protocol, that looks like a better fit for my needs.

There's a Python client library available named QPID, but there are a few issues with that:

Relies on threading, which is trouble when Python is embedded in something else, or if you want to try using it in Stackless Python
Lacking documentation
Has to load a big AMQP XML Spec file, which takes a few seconds.

I decided to take a whack at my own AMQP client, partially as a learning excercise to learn more about the protocol. I wrote a program to take the AMQP 0-8 spec file and statically generate the skeleton of a Python module, and then fleshed it out by hand. The generator is able to put lots of documentation from the spec file into Python docstrings, so the pydoc of this module is fairly decent. Because the module is statically generated, it should be easier to debug than QPID which generates lots of stuff on-the-fly. It's also much faster at making the first connection because it's not parsing the spec file. I also thew in SSL support in since it wasn't too difficult.

It has a ways to go, and some parts are probably naively conceived, but it does seem to work.

The first thing I've used it for is a syslog->AMQP bridge. I've setup my FreeBSD syslogd to feed all info or higher events to a Python daemon, which extracts the date, time, program name, host name, etc and reformats as an AMQP message and published to a 'syslog' topic exchange with the program name as the routing key.

My plan is then to write other daemons that subscribe to the 'sshd' topic for example, and then generate higher-level messages that say things like: 'block IP address xx.xx.xx.xx' in case of failed login attempts. Then i just need one daemon to listen for these firewall control message and make changes to the PF tables.

It's fun stuff. The only weak part is that there's no way to tell if the original syslog message was spoofed, but after that point, AMQP access controls should keep things trustworthy.

See py-amqplib for a Mercurial repository and eventual downloads.

Markdown and Pygments

Sunday, October 7, 2007

This blog is mainly being written as Markdown text stored in a database, and I thought it would be nice to add the ability to use Pygments to add syntax highlighting to various bits of code within the entries.

There are some DjangoSnippets entries on how to do this, notably #360 which first runs text through Markdown to generate HTML and then BeautifulSoup to extract parts marked up in the original pre-Markdown text as <pre class="foo">...</pre> to be run through Pygments and then re-inserted back into the overall Markdown-generated HTML.

The problem with this is that the text within <pre>...</pre> needs to valid HTML with things like: e_mail='<foo@bar.edu>' escaped as e_mail='<foo@bar.edu>', otherwise BeautifulSoup thinks in that example that you have a screwed up <foo> tag and tries to fix that up.

Making sure all the <, &, and other characters special to HTML are escaped within a large chunk of code misses out on the convenience of using Markdown. I decided to go with an arrangement in which regular Markdown code blocks are used, but if the first line begins with pygments:<lexer>, then that block is pygmentized.

So if I enter something like:

Here is some code

    pygments:python
    if a < b:
        print a

It ends up as:

Here is some code

if a < b:
    print a

What I came up with is this derivative of Snippet #360

from htmlentitydefs import name2codepoint
from HTMLParser import HTMLParser
from markdown import markdown
from BeautifulSoup import BeautifulSoup
from pygments.lexers import LEXERS, get_lexer_by_name
from pygments import highlight
from pygments.formatters import HtmlFormatter

# a tuple of known lexer names
_lexer_names = reduce(lambda a,b: a + b[2], LEXERS.itervalues(), ())

# default formatter
_formatter = HtmlFormatter(cssclass='source')    

class _MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.text = []
    def handle_data(self, data):
        self.text.append(data)
    def handle_entityref(self, name):
        self.text.append(unichr(name2codepoint[name]))

def _replace_html_entities(s):
    """
    Replace HTML entities in a string
    with their unicode equivalents.  For
    example, '&amp;' is replaced with just '&'

    """
    mp = _MyParser()
    mp.feed(s)
    mp.close()
    return u''.join(mp.text)  

def markdown_pygment(txt):
    """
    Convert Markdown text to Pygmentized HTML

    """
    html = markdown(txt)
    soup = BeautifulSoup(html)
    dirty = False
    for tag in soup.findAll('pre'):
        if tag.code:
            txt = tag.code.renderContents()
            if txt.startswith('pygments:'):
                lexer_name, txt = txt.split('\n', 1)
                lexer_name = lexer_name.split(':')[1]
                txt = _replace_html_entities(txt)
                if lexer_name in _lexer_names:
                    lexer = get_lexer_by_name(lexer_name, stripnl=True, encoding='UTF-8')
                    tag.replaceWith(highlight(txt, lexer, _formatter))
                    dirty = True
    if dirty:
        html = unicode(soup)

    return html

Stackless Python and Sockets

Monday, September 24, 2007

I've been intrigued by Stackless Python for a while, and finally got around to installing it one one of my machines. FreeBSD doesn't have a port available, so after creating an ezjail to isolate the installation, it was just a matter of fetching and extracting stackless-251-export.tar.bz2 and doing a standard ./configure && make && make install

The installation looks pretty much like a normal Python installation on FreeBSD, with a /usr/local/bin/python binary and libraries in /usr/local/lib/python2.5

Networking is something I especially wanted to check out with Stackless, and the examples on the Stackless website mostly make use of a stacklesssocket.py module which is a separate download. That module has unittests built in as the module's main function, but when running it on my FreeBSD 7.0-CURRENT box, it died with an exception ending in:

File "stacklesssocket.py.ok", line 286, in handle_connect
  self.connectChannel.send(None)
AttributeError: 'NoneType' object has no attribute 'send'

after doing some digging, I found that stacklesssocket.py has a dispatcher class which is a subclass of a class by the same name in Python's asyncore.py module. stacklesssocket.dispatcher.connect() calls asyncore.dispatcher.connect() which may directly call the object's handle_connect() method before returning back to stacklesssocket.dispatcher.connect(). However stacklesssocket.dispatcher.connect() doesn't setup that channel until after the call to asyncore.dispatcher.connect() returns. So when handle_connect() tries to send a message over a channel that doesn't exist yet, an exception is raised.

This trivial patch seems to fix the problem - only sending a message over the channel if it exists (which should only happen if there's another tasklet waiting on it back in a stacklesssocket.dispatcher.connect() method).

--- stacklesssocket.py.orig       2007-09-18 20:58:02.000835000 -0500
+++ stacklesssocket.py  2007-09-18 22:03:13.370709131 -0500
@@ -282,7 +282,7 @@

    # Inform the blocked connect call that the connection has been made.
    def handle_connect(self):
-        if self.socket.type != SOCK_DGRAM:
+        if (self.socket.type != SOCK_DGRAM) and self.connectChannel:
            self.connectChannel.send(None)

    # Asyncore says its done but self.readBuffer may be non-empty

With that patch, the unittests run successfully - at least on my box.

Building ports on old FreeBSDs - revised

Sunday, July 22, 2007

This is a revision of an earlier post which has instructions that no longer work.

Are you running an older version of FreeBSD, and getting errors like this when you try to build a port?

"/usr/ports/Mk/bsd.port.mk", line 2416: warning: String comparison operator should be either == or !=
"/usr/ports/Mk/bsd.port.mk", line 2416: warning: String comparison operator should be either == or !=
"/usr/ports/Mk/bsd.port.mk", line 2416: Malformed conditional (((${OSVERSION} < 504105 || (${OSVERSION} 
    >= 600000 && ${OSVERSION} < 600103) || (${OSVERSION} >= 700000 && ${OSVERSION} < 700012)) && 
    ${PKGORIGIN} != "ports-mgmt/pkg_install") || exists(${LOCALBASE}/sbin/pkg_info))

If so, it's because the ports maintainers have started using expressions in the ports Makefiles which are not understood by the versions of make that come with old FreeBSDs.

The official recommended fix would be to upgrade your FreeBSD, but if that's not practical you can at least install a newer version of make to get by for a bit longer. This can be done in just a few minutes with two main steps: temporarily bring back an older /usr/ports/Mk which is compatible with FreeBSD 4.x - and then build and install the devel/make port which used to be present in the ports tree.

The ports tree is in CVS, so it's possible to checkout older revisions of selected directories. This FreeBSD Handbook page lists the anonymous CVS repositories available. For this example I'm going to use anoncvs@anoncvs1.FreeBSD.org:/home/ncvs We only need two small directories, so it probably doesn't really matter which one you use.

It seems like the first commit to the ports infrastructure which broke compatibility happened around Feb 5th, 2007 - so let's backup the current /usr/ports/Mk and check out one from Feb 4th:

cd /usr/ports
mv Mk Mk.original
cvs -d anoncvs@anoncvs1.FreeBSD.org:/home/ncvs co -D "04 Feb 2007" -d Mk ports/Mk

You should now be able to build some ports, at least those that don't use incompatible syntax in their individual Makefiles and don't require a newer ports infrastructure. Now, if you don't already have a devel/make port, use CVS to bring that back too, then build and install it:

cd /usr/ports/devel
cvs -d anoncvs@anoncvs1.FreeBSD.org:/home/ncvs co -D "04 Feb 2007" -d make ports/devel/make
cd make
make install
make clean

Lastly, set your system to use the new ports make in place of the old system make, and do some cleanup:

cd /usr/bin
mv make make.old
ln -s /usr/local/bin/make .

cd /usr/ports
rm -rf Mk
mv Mk.original Mk
rm -rf /usr/ports/devel/make

You should now be in better shape for trying to build new ports. 4.x isn't officially supported anymore by the ports maintainers, so there may be some individual port breakage - but at least you're over the first hurdle.

Logitech QuickCam Pro 5000 on a Mac Mini

Thursday, June 28, 2007

I was recently working on finding a reasonable webcam that would work with an Intel Mac Mini, since Apple no longer sells the iSight and the prices for them on eBay are outrageous.

After reading reports that newer versions of the Logitech QuickCam Pro 5000 (labeled as "Vista Ready") worked on Macs, I picked one up and tried it out on my Intel MacBook laptop, running OSX 10.4.8 at the time I believe.

Despite what I had read on various postings, the camera didn't show up at all, although the built-in microphone worked OK, appearing as something like "Unknown USB Audio Device". The camera worked OK in Windows, so I figured the hardware was OK - and returned it to the store and started looking again.

After a while, 10.4.9 came out, and supposedly included updates that supported more webcams. I bought another QuickCam Pro 5000, and this time found that the camera and microphone worked in iChat, but the version of Skype I had at the time (2.6.0.137) only saw the microphone.

I figured this was good enough and took it out to the person with the Mini. While there, I updated OSX to 10.4.10. When I plugged in the QuickCam, I found that the camera worked, but now the microphone didn't show up at all. When plugged into my 10.4.9 laptop, the camera and mic worked fine. Apparently Apple broke something in the 10.4.10 update. (there's a discussion of it here.)

After poking around, I found the AppleUSBAudio system extension, which seemed like a likely suspect. By replacing it with the same extension from 10.4.9, I was able to get the mic working - it went something like:

sudo -s
(type in password)

cd /System/Library/Extensions
kextunload AppleUSBAudio.kext

(backup the AppleUSBAudio.kext directory somewhere else)
(copy the 10.4.9 AppleUSBAudio.kext directory to this directory)

(permissions got changed moving between machines, fix that up)
chown -R root:wheel AppleUSBAudio.kext  

kextload AppleUSBAudio.kext

Plugged in the webcam, and now both camera and mic work. Tried a newer Skype - 2.6.0.148, and that works too.

So I think we'll be able to chat between the Mac Mini and the MacBook for a while - at least until Apple updates the OS again.

In a followup entry, I replaced this camera with a Logitech QuickCam Communicate STX.