"man" annoyance

I work on several different Unix-type machines, some FreeBSD, some Linux, and there are always small differences between them that can be very annoying.

One that's driven me crazy many times is how on some machines, when using man to read a manpage, after getting to the part I'm interested in, and hitting 'q' to quit - the contents of the manpage completely disappear and the screen is restored to what was shown before I ran man. Other machines I work on don't do that - the contents of the manpage remain on the screen so you can see them as you type the next command.

After finally getting fed up, I looked into this and it turns out the machines that were acting the way I like have the PAGER environment variable set to more, so man uses that instead of less which has the screen-restore-on-quit behavior. Adding:

PAGER=more; export PAGER

to .bashrc seems to have done the trick on my Ubuntu box. Seems the FreeBSDs already had this by default.
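If you'd rather keep less as the pager, another option that should give the same result is its -X option, which tells less to skip the terminal init/deinit sequences. This is only a sketch and not something from the setup above; it assumes your man consults the MANPAGER variable before PAGER, which is worth checking in man(1) on your system.

```shell
# Keep less for man pages, but skip the terminal init/deinit
# strings that make the screen get restored on quit.
# Assumption: this man consults MANPAGER before PAGER.
MANPAGER="less -X"; export MANPAGER
```

Setting only MANPAGER leaves PAGER alone, so other programs that use a pager keep whatever behavior they had.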

It's not a big deal, but I'm just jotting this down in case anyone else has the same annoyance. Good luck finding this in Google though, if you're just using words like 'more', 'less', and 'man' ;)

More DHCP Failover

Earlier I wrote about DHCP failover, but there's another thing I thought I might mention that could be useful to others....

I had a problem in that one of my servers' CMOS clocks tends to be a bit off, maybe 90 seconds. When dhcpd starts up, it is unable to enter a normal failover state because of the time difference between it and the other dhcpd server.

I have

ntpdate_enable="YES"
ntpdate_flags="-b x.x.x.x"

in my /etc/rc.conf, along with running openntpd, but for some reason ntpdate wasn't setting the clock at boot time, and by the time openntpd got the clock tuned up, dhcpd had given up on trying to re-establish failover. Restarting dhcpd by hand later on always worked OK.

I think what was happening was that the network jack this server was plugged into wasn't coming alive quick enough to be up and running when ntpdate tried to do its thing. Something to do with the Cisco switch not having portfast enabled.

I don't have access to do anything about the switches, so I came up with the workaround of adding a simple script /usr/local/etc/rc.d/000.afterboot.sh to schedule a job to run a few minutes after the machine boots - to adjust the clock and restart dhcpd. It looks something like:

#!/bin/sh
at now + 5 minutes <<EOF
/etc/rc.d/ntpdate restart
/usr/local/etc/rc.d/isc-dhcpd restart
EOF

It's a bit of a kludge, but seems to do the trick.
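A possible refinement, instead of guessing at a fixed delay, would be to keep trying until the network is actually up. The helper below is a minimal sketch of that idea; the name retry and the numbers in the usage comment are made up for illustration, and the usage assumes that ping -c 1 succeeding is a good enough sign that the switch port has come alive.

```shell
#!/bin/sh
# retry TRIES DELAY COMMAND [ARGS...]
# Run COMMAND until it succeeds, at most TRIES times, sleeping
# DELAY seconds between attempts. Returns 0 on success, 1 if
# every attempt failed.
retry() {
    tries=$1
    delay=$2
    shift 2
    attempt=1
    until "$@"
    do
        if [ "$attempt" -ge "$tries" ]
        then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Hypothetical use in place of the fixed 5 minute delay
# (x.x.x.x is the NTP server from rc.conf):
#
#   retry 30 10 ping -c 1 x.x.x.x >/dev/null &&
#       /etc/rc.d/ntpdate restart &&
#       /usr/local/etc/rc.d/isc-dhcpd restart
```

If this were called straight from an rc.d script it would hold up the boot while it waits, so it would need to run in the background or from a scheduled job.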

DHCP Failover

I've been setting up DHCP servers at work to use the failover feature available in ISC-DHCP (the net/isc-dhcp3-server port in FreeBSD). That allows for two servers to work together, sharing a pool of addresses and keeping track of leases handed out by both servers. The dhcpd.conf(5) manpage discusses this feature somewhat. I'll jot down some notes here that are a bit more specific about what all had to be done.

Let's assume the two DHCP servers will have the IP addresses 10.0.0.10 and 10.0.0.20 - with .10 being the 'primary' and .20 being the 'secondary' server. It doesn't really matter which is which - although the logs on the 'primary' server seem a bit more complete. We'll also assume these servers are giving out addresses from a pool of numbers 10.0.0.100-10.0.0.200, and are running DNS caches - so that the DHCP clients should be told to use them for DNS servers.

We'll also use TCP port 520 for communications between the DHCP servers, so be sure to allow for that through any firewalls.

Configuration

On the 10.0.0.10 'primary' machine, the /usr/local/etc/dhcpd.conf file might look like:

failover peer "foo" 
    {
    primary;
    mclt 1800;  # only specified in the primary
    split 128;  # only specified in the primary

    address 10.0.0.10;
    port 520;

    peer address 10.0.0.20;
    peer port 520;

    max-response-delay 30;
    max-unacked-updates 10;
    load balance max seconds 3;                
    }

option domain-name-servers 10.0.0.10, 10.0.0.20;

include "/usr/local/etc/dhcp/master.conf"; 

and the same file on the 'secondary' 10.0.0.20 machine is very similar:

failover peer "foo" 
    {
    secondary;

    address 10.0.0.20;
    port 520;

    peer address 10.0.0.10;
    peer port 520;

    max-response-delay 30;
    max-unacked-updates 10;
    load balance max seconds 3;                
    }

option domain-name-servers 10.0.0.20, 10.0.0.10;

include "/usr/local/etc/dhcp/master.conf"; 

The failover peer name, "foo" in this example, will also appear in the DHCP pool configuration, and will be used in a script to change the failover state later on.

I created a directory /usr/local/etc/dhcp/ to hold the DHCP config files that will be common to both DHCP servers. That way, it's just a matter of copying the entire directory between servers when a change is made. The /usr/local/etc/dhcp/master.conf file I included from the main server config might look something like:

omapi-port 7911;

default-lease-time 16200;  # 4.5 hours
max-lease-time 16200;

subnet 10.0.0.0 netmask 255.255.255.0
        {
        option routers 10.0.0.1;

        pool
                {
                failover peer "foo";
                deny dynamic bootp clients;

                range 10.0.0.100  10.0.0.200;
                }
        }

The deny dynamic bootp clients; directive is required for any failover pool. The omapi-port 7911; directive will be useful later on, when a server needs to be put into the 'partner-down' state because the other server will be off for a while.

To sync and restart the two servers whenever there's a change to the DHCP configuration, I set up the 10.0.0.20 server to allow root logins through SSH from the root account of 10.0.0.10 using public/private keys, and then put a script named restart_dhcp on the 10.0.0.10 server that looks like:

#!/bin/sh
/usr/local/etc/rc.d/isc-dhcpd restart
scp -pr /usr/local/etc/dhcp 10.0.0.20:/usr/local/etc
ssh 10.0.0.20 /usr/local/etc/rc.d/isc-dhcpd restart

That copies the entire /usr/local/etc/dhcp directory, so if you need to break up your config into more files that get included, they'll all be copied over when you do a restart.
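Since a typo in the config would get pushed to both servers, it might be worth having dhcpd check the file before restarting anything. Here's a minimal sketch using dhcpd's -t (test config) and -cf (config file) options; the config_ok name and the dhcpd binary path are assumptions for illustration, so adjust them for your install.

```shell
#!/bin/sh
# Assumed locations; override through the environment if yours differ.
DHCPD=${DHCPD:-/usr/local/sbin/dhcpd}
CONF=${CONF:-/usr/local/etc/dhcpd.conf}

# config_ok: succeed only if dhcpd can parse the config file.
# -t makes dhcpd check the syntax and exit without serving.
config_ok() {
    "$DHCPD" -t -cf "$CONF" >/dev/null 2>&1
}

# Hypothetical use at the top of restart_dhcp:
#
#   config_ok || { echo "dhcpd.conf has errors, not restarting" >&2; exit 1; }
```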

Failover

When one server stops unexpectedly, the remaining server will go into a communications-interrupted state, and continue offering up addresses from its half of the DHCP pools, and will renew leases it knows were given out by the other server.

If the downed server will be out for longer than the mclt value from the server config (1800 seconds, or 30 minutes, in the examples above), you may want to let the surviving server know that it's on its own so that it can use the entire pool of available addresses. This is done by putting the surviving server into partner-down state.

This has to be done after the other server is really down. Doing it before shutting down the other server doesn't work, because the two servers will get themselves back into a normal state very quickly, probably before you get a chance to shut the 2nd server down.

The omshell program can be used to communicate with a running DHCP server daemon to control it in various ways, including changing the failover state. I put this partner-down script on both the primary and secondary servers:

#!/bin/sh
omshell << EOF
connect
new failover-state
set name = "foo"
open
set local-state = 1
update
EOF

so when one server is going to be down for a while, I can connect to the other server and just run that script.
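If you have more than one failover peer, the same commands could be generated by a small function rather than hardcoded. This is just a sketch: the function name is made up, and the numeric state is passed in because it's worth checking the value against the dhcpd(8) manpage for the version you're running.

```shell
#!/bin/sh
# Print the omshell commands that set a failover peer's local state.
#   $1 = failover peer name ("foo" in the examples above)
#   $2 = numeric state (1 in the script above)
failover_state_commands() {
    cat <<EOF
connect
new failover-state
set name = "$1"
open
set local-state = $2
update
EOF
}

# Hypothetical usage:
#
#   failover_state_commands foo 1 | omshell
```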

When the downed server comes back up, the two servers automatically start communicating and eventually get themselves back into a normal state. But only after the recovering server has spent mclt time in recover-wait state, where it renews existing leases but won't offer up new ones. So you probably wouldn't want to go into a partner-down state if the other server will be down for less than that amount of time.

Running the partner-down script when both servers are really up and running doesn't seem to do any harm; as mentioned above, the two servers will quickly move back into a normal state. This can be seen by watching the DHCP logs.
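To pick the state changes out of the logs, a simple filter along these lines may help. It assumes dhcpd is logging through syslog to /var/log/messages and that its failover messages contain the words "failover peer"; both are assumptions to check against your own syslog setup.

```shell
# failover_log [FILE]: show failover-related lines from a log file.
# Assumptions: dhcpd logs via syslog into /var/log/messages and its
# failover messages contain the text "failover peer".
failover_log() {
    grep 'failover peer' "${1:-/var/log/messages}"
}
```

Running tail -f on the same file piped through the same grep would show the transitions as they happen.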

Clean Failover

It's possible using OMAPI to shut down a server and have the remaining server automatically switch to "partner-down" mode in a clean way, so that when the downed server comes back up both servers quickly move to "normal" mode, without spending the mclt time in recover-wait state. This script does the trick:

#!/bin/sh
omshell << EOF
connect
new control
open
set state = 2
update
EOF

When run, it causes the dhcpd daemon on the current server to shut down, and the dhcpd daemon on the other server takes over the DHCP pools completely.


Update: I wrote a bit more about DHCP failover, talking about how to deal with a clock sync problem when the machine boots by scheduling a dhcpd restart a few minutes after boot time.

Viewing a man file

This is one of those little things that I just want to jot down for myself so I have it written down until I learn it for good. To view a man file that's not installed in the regular man file locations, just run

nroff -man filename | more

Stupidly simple, but unfortunately not mentioned in the manpage for man.
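To save retyping it, the one-liner could be wrapped in a shell function in .bashrc. This is only a sketch; the name viewman is made up, and it uses $PAGER when that's set, falling back to more.

```shell
# viewman FILE: format a man page source file and page it.
# Uses $PAGER if set, otherwise more, as in the one-liner above.
viewman() {
    nroff -man "$1" | ${PAGER:-more}
}
```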

Restoring Boot Sectors in FreeBSD

At work the other day, we had a long power outage, and afterwards one of our FreeBSD 5.2.1 boxes refused to come back up. It'd power up, go through the BIOS stuff, show the FreeBSD boot manager that lets you select which slice to boot, but when you hit F1, the screen would go black and the machine would reset.

Booted off the 5.2.1 install CD, and after entering fixit mode, was able to mount the disk and see that the files seemed to be intact. Couldn't run fsck though, the 5.2.1 CD seemed to be missing fsck_4.2bsd.

FreeSBIE 1.1 on the other hand, was able to fsck the disk, but that didn't solve the problem. Next guess was that something in the /boot directory was hosed. I'd set up the machine to do weekly dumps of the root partition to another machine, and was able to extract /boot from a few days before and pull it back onto this machine over the network using FreeSBIE, but it still wouldn't boot.

Next theory was that something in the boot sectors was bad. First tried restoring the MBR (Master Boot Record) from the copy that's kept in /boot - even though it was working well enough to show the F1 prompt to select the slice. Wanted to keep what 5.2.1 had been using, so mounted the non-booting disk readonly and made sure to have boot0cfg use the copy there instead of anything that might have been on the FreeSBIE disc.

mkdir /foo
mount -r /dev/twed0s1a /foo
boot0cfg -B -b /foo/boot/boot0 /dev/twed0
reboot

Unfortunately, that didn't help. Each slice (partition in non-BSD terminology) also has boot sectors, and to restore them, turns out you use the bsdlabel (a.k.a. disklabel) utility. Again from FreeSBIE:

mkdir /foo
mount -r /dev/twed0s1a /foo
bsdlabel -B -b /foo/boot/boot /dev/twed0s1
reboot

That did it. Apparently something in the slice's boot sectors was messed up.

Getting rid of ugly fonts in Firefox

Lately I've been using Firefox on DragonFlyBSD with xorg installed from pkgsrc, and one thing that bugged me was that when reading Advogato, the fonts on that page looked like crap. The CSS stylesheet shows "lucida" as the preferred font, and my machine evidently was using a bitmap font for that.

At first I thought, just get rid of the bitmapped fonts from the FontPaths listed in /etc/X11/xorg.conf, but surprisingly that didn't seem to have any effect, at least on Firefox.

Secondly, I tried just removing those bitmap font directories completely, such as /usr/pkg/xorg/lib/X11/fonts/75dpi/ and that did work, but seemed a little clumsy in that an update to xorg would probably replace them.

Finally, stumbled across Fontconfig's files, and saw that there is a whole separate configuration of font paths and such, starting in /usr/pkg/etc/fontconfig/, which explains why changing the xorg.conf FontPath didn't work. Turns out there are even some optional configs in /usr/pkg/etc/fontconfig/conf.d/ including a no-bitmaps.conf which will cause fontconfig to "blacklist" the bitmap fonts.

The Fontconfig user manual mentions that things in conf.d/ are processed if they begin with decimal digits. So to enable that no-bitmaps.conf, I just made a symlink.

cd /usr/pkg/etc/fontconfig/conf.d
ln -s no-bitmaps.conf 10barryp-no-bitmaps.conf

Then, just had to stop/restart Firefox to see the results.

It would be nice to be a bit more selective about what gets blacklisted, so that non-Roman characters not supported in the scalable fonts on my machine would have some chance of displaying. I'll have to work on that.

Automatically backup installed FreeBSD packages

A while ago I threw together this script to automatically create package files for all installed ports on a FreeBSD box. That way, if a portupgrade doesn't work out, you can delete the broken package, and pkg_add the backup.

Stick this in /usr/local/etc/periodic/daily, and the system will automatically bundle up copies of the installed software and stick them in /usr/local/packages if they don't already exist in there.

#!/bin/sh
#
# Make sure backups exist of all installed FreeBSD packages
#
# 2005-03-20 Barry Pederson <bp@barryp.org>
#

ARCHIVE="/usr/local/packages"

#
# Figure out which pkg_tools binaries to use
#
if [ -f /usr/local/sbin/pkg_info ]
then
    PKG_TOOLS="/usr/local/sbin"
else
    PKG_TOOLS="/usr/sbin"
fi

#
# Make sure backup directory exists
#
if [ ! -d "$ARCHIVE" ]
then
    mkdir -p "$ARCHIVE"
fi

# Bail out rather than create package files in the wrong directory
cd "$ARCHIVE" || exit 1

for p in `${PKG_TOOLS}/pkg_info -E "*"`
do
    if [ ! -f ${p}.tgz ]
    then
        ${PKG_TOOLS}/pkg_create -b ${p}
    fi
done
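Going the other way, restoring from one of those backups could be wrapped up roughly like this. It's a sketch with a made-up function name; it takes the name of the broken installed package and the name of the backup to put back, and it won't delete anything unless the backup file is really there. If other packages depend on the broken one, pkg_delete may need its -f option.

```shell
#!/bin/sh
ARCHIVE=${ARCHIVE:-/usr/local/packages}

# restore_pkg BROKEN BACKUP
#   BROKEN = name of the installed package to remove
#   BACKUP = name of the backed-up package (without .tgz) to reinstall
# Refuses to delete anything unless the backup file exists.
restore_pkg() {
    backup="$ARCHIVE/$2.tgz"
    if [ ! -f "$backup" ]
    then
        echo "no backup named $2 in $ARCHIVE" >&2
        return 1
    fi
    pkg_delete "$1" && pkg_add "$backup"
}

# Hypothetical usage (package names are placeholders):
#
#   restore_pkg foo-1.1 foo-1.0
```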

Getting PyBlosxom SCGI working under Lighttpd

Took another whack at getting PyBlosxom/SCGI working with Lighttpd, this time with better success. (I'm still getting up-to-speed with Lighttpd). This is working with the exact same SCGI setup I was working on the other day.

To elaborate a bit, the setup I'm trying to achieve is to:

  • Have the blog be completely under "/blog/" in the URL namespace
  • Not get it confused with anything else that begins with "/blog" such as "/blog2".
  • Use "/blog/static/" URLs for serving static resources like CSS stylesheets and images off the disk (instead of running those requests through PyBlosxom's CGI code).

This is what I ended up with. It seems to work fairly well, and I'm impressed with how Lighttpd makes it easy to put together an understandable configuration.

#
# External redirection to add a trailing "/" if exactly 
# "/blog" is requested
#
url.redirect = (
                "^/blog$" => "http://barryp.org:81/blog/",
               )

#
# The PyBlosxom Blog, lives under the "/blog/" url namespace
#
$HTTP["url"] =~ "^/blog/" {
    #
    # Static resources served from the disk
    #
    $HTTP["url"] =~ "^/blog/static/" {
        alias.url = ("/blog/static/" => "/data/blog/static/")
    }

    #
    # Everything non-static goes through SCGI
    #
    $HTTP["url"] !~ "^/blog/static/" {
        scgi.server = ( "/blog" => (
                                     (
                                     "host" => "127.0.0.1",
                                     "port" => 8040,
                                     "check-local" => "disable",
                                     )
                                   )
        )
    }
}
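To make the intended routing a bit more concrete, here's a rough model of it in shell. It's only an illustration: it uses shell glob patterns in place of Lighttpd's regular expressions, and the classify name and the sample URLs are made up.

```shell
# Rough model of the routing above, using shell glob patterns in
# place of Lighttpd's regular expressions.
classify() {
    case $1 in
        /blog)           echo redirect ;;
        /blog/static/*)  echo static ;;
        /blog/*)         echo scgi ;;
        *)               echo other ;;
    esac
}

for url in /blog /blog/ /blog/static/style.css /blog/2005/03/ /blog2
do
    printf '%s -> %s\n' "$url" "$(classify "$url")"
done
```

The important case is "/blog2", which falls through to "other" because every blog pattern except the exact "/blog" redirect requires the trailing slash.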

Doing things the DJB way

While doing a bit more searching for daemontools info, found the djb way website, which has some nice writeups on daemontools and djbdns (which I also use a fair amount).

mod_python segfault on FreeBSD

I've been testing mod_python 3.2.x betas as requested by the developers on their mailing list. Unfortunately there seems to be some subtle memory-related bug that only occurs on FreeBSD (or at least FreeBSD the way I normally install it along with Apache and Python).

Made some mention of it here and an almost identical problem is reported for MacOSX, even down to the value 0x58 being at the top of the backtrace.

Did a lot of poking around the core with gdb and browsing of the mod_python and Apache source code, but never quite saw where the problem could be. Took another approach and started stripping down the big mod_python testsuite, and found that the test that was failing ran fine by itself, but when it ran after another test for handling large file uploads, it would crash.

So I suspect there's a problem in a whole different area of mod_python, that's screwing something up in memory that doesn't trigger a segfault til later during the connectionhandler test. My latest post to the list covers some of that.