Posts by madssj.

Figuring out largest/smallest/median filesizes

I had to get some statistics about file sizes today, but couldn’t really find a tool for the job, so naturally, I wrote one.

import os, sys, re
from os.path import join, getsize, exists
 
def median(numbers):
    s = sorted(numbers)
    l = len(s)
    if l % 2 == 0:
        # even count: average the two middle values
        a, b = s[l // 2 - 1 : l // 2 + 1]
        if a != b:
            return (a + b) / 2.0
        else:
            return a
    else:
        # odd count: take the middle value
        return s[l // 2]
 
sizes = []
req_re = None
target = '.'
 
if len(sys.argv) > 1:
    target = sys.argv[1]
 
if len(sys.argv) == 3:
    req_re = re.compile(sys.argv[2])
 
for root, dirs, files in os.walk(target):
    for name in files:
        absp = join(root, name)
        if exists(absp):
            if not req_re or req_re.search(absp):
                sizes.append(getsize(absp))
 
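num = len(sizes)
total = sum(sizes)
 
# min() and max() fail on an empty list, so bail out early
if num == 0:
    sys.exit("No files found")
 
print "Num files: %d" % num
# float() avoids integer division truncating the average
print "Average  : %0.2f KB" % (total / float(num) / 1024.0)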
print "Median   : %0.2f KB" % (median(sizes) / 1024.0)
print "Min      : %0.2f KB" % (min(sizes) / 1024.0)
print "Max      : %0.2f KB" % (max(sizes) / 1024.0)

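Usage should be self-explanatory: the first argument is the directory to scan (default: the current directory), and the optional second argument is a regular expression that file paths must match.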
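
For comparison, here is a minimal Python 3 sketch of the same statistics using the standard-library statistics module (available since Python 3.4). The summarize name and the sample sizes are made up for illustration; the script above targets Python 2 and does not use this module.

```python
import statistics


def summarize(sizes):
    """Return (count, mean, median, min, max) for a list of byte sizes."""
    if not sizes:
        raise ValueError("no sizes given")
    return (len(sizes), statistics.mean(sizes), statistics.median(sizes),
            min(sizes), max(sizes))


# hypothetical file sizes in bytes
count, mean, med, smallest, largest = summarize([1024, 2048, 4096, 8192])
print("Num files: %d" % count)
print("Median   : %0.2f KB" % (med / 1024.0))
```

With an even number of values, statistics.median averages the two middle ones, which is what the median function above does by hand.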

AppleScript application to slow network

I keep typing the same ipfw commands over and over. Enough of that: here is my first AppleScript application ever. It is probably filled with bugs and other scary things, and I’m probably not the first one to do this, but I think I’m the first to put the source out there.

property FLUSH_TEXT : "Quit and flush"
property SET_TEXT : "Set speed"
 
-- be damn careful what you input here, it will run as root
on ipfwLimit(bandwidth)
	my ipfwFlush()
 
	do shell script "ipfw pipe 1 config bw " & bandwidth & "KB" with administrator privileges
	do shell script "ipfw add 10 pipe 1 tcp from any 80 to me" with administrator privileges
	do shell script "ipfw add 11 pipe 1 tcp from me to any 80" with administrator privileges
end ipfwLimit
 
-- flush any ipfw rules
on ipfwFlush()
	do shell script "ipfw -f flush" with administrator privileges
end ipfwFlush
 
on main()
	set question to display dialog "Control your http traffic speed" buttons {FLUSH_TEXT, SET_TEXT} default button 2
	set answer to button returned of question
 
	if answer is equal to FLUSH_TEXT then
		my ipfwFlush()
	end if
 
	if answer is equal to SET_TEXT then
		set bandwidth_question to display dialog "Enter bandwidth in KB/s (don't do something stupid like entering \"; rm -rf /)" default answer "56"
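		-- coerce to integer: non-numeric input raises an error instead of reaching the root shell
		set bandwidth to (text returned of bandwidth_question) as integer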
 
		my ipfwLimit(bandwidth)
 
		my main()
	end if
end main
 
my main()

Building postgresql8x and psycopg2 for x86_64 and i386 on Snow Leopard (OS X 10.6)

I’ve recently installed Apple’s new 64-bit OS, Snow Leopard, on my work computer. I use PostgreSQL extensively together with Python, and usually use Apple’s bundled Python 2.5 for working with Django.

As the daredevil I am, I wanted to recompile all my MacPorts ports for the new 64-bit system, so I deleted them all and made a fresh install of MacPorts. After building the postgresql81 port, I was about to build the psycopg2 Python PostgreSQL driver for Python 2.5 when it gave me a warning about not being able to find some symbols in the PostgreSQL library it had linked to. I quickly realized that this might be an architecture problem, and sure enough, it turns out that python2.5 is an i386/ppc binary while the default python (2.6) is an x86_64/i386/ppc binary, as can be seen here:

$ file `which python`
/usr/bin/python: Mach-O universal binary with 3 architectures
/usr/bin/python (for architecture x86_64):	Mach-O 64-bit executable x86_64
/usr/bin/python (for architecture i386):	Mach-O executable i386
/usr/bin/python (for architecture ppc7400):	Mach-O executable ppc
$ file `which python2.5`
/usr/bin/python2.5: Mach-O universal binary with 2 architectures
/usr/bin/python2.5 (for architecture i386):	Mach-O executable i386
/usr/bin/python2.5 (for architecture ppc7400):	Mach-O executable ppc
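
The same question can also be asked from inside a running interpreter. This is a minimal sketch for a modern Python (sys.maxsize does not exist on Python 2.5), and interpreter_bits is a name made up here. It reports only the architecture slice that is actually running, not every architecture in a universal binary.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Pointer size of the running interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


print("machine :", platform.machine())
print("bits    :", interpreter_bits())
print("64-bit  :", sys.maxsize > 2**32)
```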

The solution seemed so simple. Recompile postgresql81 for both architectures, and let the linker figure out the rest.

Building the postgresql81 port with the +universal variant does not work out of the box. It has something to do with the fact that the linker (ld) does not know how to produce a binary for multiple architectures. After a good night’s sleep, the solution was only a Trac ticket away.

So, to build an i386 and x86_64 version of postgresql8x via MacPorts, you have to patch the Portfile, which is located in /opt/local/var/macports/sources/rsync.macports.org/release/ports/databases/postgresql81.

That can be done like this. Notice that the patch seems to place the files in the wrong location, so we’re moving them as well:

$ cd /opt/local/var/macports/sources/rsync.macports.org/release/ports/databases/postgresql81
$ curl -s http://trac.macports.org/raw-attachment/ticket/14619/combined_updated_universal.patch | sudo patch
$ sudo mkdir files/
$ sudo mv ld.sh files/
$ sudo mv patch_pg_config_h files/

Now you can go ahead and build the postgresql81 port with both architectures, like so:

$ sudo port install postgresql81 +universal

And then, finally, we can build the psycopg2 extension for python:

$ wget http://initd.org/pub/software/psycopg/psycopg2-2.0.12.tar.gz
$ tar zxf psycopg2-2.0.12.tar.gz
$ cd psycopg2-2.0.12
$ sudo python2.5 setup.py install
$ sudo python2.5 setup.py clean
$ sudo python2.6 setup.py install

And you’re off.
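
To confirm that the freshly built driver actually loads, something along these lines can be run with each interpreter. It is a rough sketch for a modern Python (importlib is not available on 2.5), and driver_version is a made-up helper name. An extension built for the wrong architecture typically fails at import time with an ImportError, which is what this catches.

```python
import importlib


def driver_version(module_name="psycopg2"):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # not every module defines __version__, so fall back to a placeholder
    return getattr(module, "__version__", "unknown")


print("psycopg2:", driver_version() or "not importable")
```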

Managing multiple AWS identities

I’m running several different projects on AWS, and switching between them was a pain, as I often find myself having to use the identity of, say, project-a together with the official Amazon EC2 tools.

To help myself manage the multiple identities, I wrote a set of bash functions, called:

  • aws_load <config-name> – loads configuration from config-name
  • ec2ssh <instance-number-in-ec2din-list> – ssh’s into a given instance, with the root key
  • ec2scp – a shorthand for scp -i <keyfile>

I keep the configuration files in ~/amazon/conf/<config-name>.sh and the keypairs in ~/amazon/keypairs/, but both locations are easy to change at the top of aws_load.

To change or load an identity, one simply calls the function from a shell prompt like so:

mads@workmads ~ % aws_load some-identity
loaded certificate ...
loaded /Users/mads/amazon/conf/some-identity.sh (...)

I hope someone finds this as useful as I do.

Functions (could be placed in .bashrc or .zshrc).

function aws_load {
    if [ -n "$1" ]; then
		ec2_configurations="$HOME/amazon/conf"
		ec2_keys="$HOME/amazon/keypairs"
		conf="$ec2_configurations/$1.sh"
		if [ -x "$conf" ]; then
			unset AMAZON_ID AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_CERT EC2_PRIVATE_KEY EC2_CERT AWS_KEYPAIR_NAME
 
			source "$conf"
 
			if [ -n "$AWS_KEYPAIR_NAME" ]; then
				export AWS_SSH_KEY="$ec2_keys/id_rsa_${AWS_KEYPAIR_NAME}-keypair"
			fi
 
			if [ -n "$AWS_CERT" ]; then
				export EC2_PRIVATE_KEY=~/.ec2/pk-$AWS_CERT.pem
				export EC2_CERT=~/.ec2/cert-$AWS_CERT.pem
 
				echo "loaded certificate $AWS_CERT"
			fi
 
			echo "loaded $conf ($AMAZON_ID)"
		else
			echo "configuration $conf not found (or not executable)"
		fi
    else
        echo "usage: aws_load <configuration name>"
    fi
}
 
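function ec2ssh {
    if [ -n "$1" ]; then
        # pick the Nth hostname (4th column) from the ec2din instance list
        HOST="$(ec2din | awk '/i-/ {print $4}' | tail -n +"$1" | head -n 1)"
        ssh -i "$AWS_SSH_KEY" -l root "${HOST}"
    else
        echo "Please write a number"
    fi
}
 
function ec2scp {
    # quote "$@" so arguments containing spaces survive
    scp -i "$AWS_SSH_KEY" "$@"
}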
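
The pipeline inside ec2ssh is the least obvious part, so here is roughly what it does, written out as a Python sketch. The nth_host name and the sample listing are invented for illustration. Real ec2din output has many more columns, and only the position of the fourth one matters here.

```python
def nth_host(ec2din_output, n):
    """Roughly: awk '/i-/ {print $4}' | tail -n +N | head -n 1 (N is 1-based)."""
    hosts = []
    for line in ec2din_output.splitlines():
        if "i-" in line:
            fields = line.split()
            # awk prints an empty string when the 4th field is missing
            hosts.append(fields[3] if len(fields) > 3 else "")
    if n < 1 or n > len(hosts):
        return None
    return hosts[n - 1]


# made-up sample listing, only the column layout matters
SAMPLE = """RESERVATION r-0001 000000000000 default
INSTANCE i-0001 ami-0001 host-a.example.com
INSTANCE i-0002 ami-0001 host-b.example.com
"""
print(nth_host(SAMPLE, 2))
```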

Configuration “file” template to be placed in ~/amazon/conf/<config-name>.sh:

#!/bin/sh
 
export AMAZON_ID=""
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_CERT=""
export AWS_KEYPAIR_NAME=""

Happy identity switching.

Detaching a running process on *nix (or how to make a process continue to run after logging out)

Today, I had to copy 70 GiB of data from an ext3 filesystem to an XFS filesystem. This involved a lot of small files. After a couple of hours of waiting, I thought it’d be best to just leave it running, and resume my activities the day after. But oh nooo, I forgot to run it in a screen. More… »

Poor man’s CloudFront with EC2 and Varnish

Recently (10-20 minutes ago), Amazon CloudFront (a CDN) stopped sending DNS replies in Europe:

% dig -t ns cloudfront.net

; <<>> DiG 9.4.3-P1 <<>> -t ns cloudfront.net
;; global options:  printcmd
;; connection timed out; no servers could be reached

I was going to write a guide to setting up Varnish to replace CloudFront temporarily (and did actually set up the instance and the software; I might do the guide and an AMI anyway) when I realized that I (as well as most other people) can just change the relevant URL to point to the S3 bucket. Problem solved. That will, however, not be as fast as either CloudFront itself or a Varnish-cached backend.

Should anyone be interested in how Varnish is set up to handle failures from CloudFront, I’ll happily do an AMI.

Django – sharing a memcached instance

Until recently I’ve been using the file:// Django cache, but that has a “problem” when multiple users need to manipulate the cache (think uid 80 writes a key that uid 1000 wants to delete).

My problem with the memcached:// Django cache provider has been that it cannot safely be used on a shared memcached instance, because of the danger of key collisions.

More… »
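
The rest of the post is not shown on this page, so the following is not necessarily the approach it takes. It is only a sketch of one common way around key collisions: give every project its own key prefix. PrefixedCache and the site names are invented, and a plain dict stands in for the shared memcached instance.

```python
class PrefixedCache:
    """Wrap a cache-like mapping so every key gets a per-project prefix."""

    def __init__(self, backend, prefix):
        self.backend = backend
        self.prefix = prefix

    def _key(self, key):
        return "%s:%s" % (self.prefix, key)

    def set(self, key, value):
        self.backend[self._key(key)] = value

    def get(self, key, default=None):
        return self.backend.get(self._key(key), default)

    def delete(self, key):
        self.backend.pop(self._key(key), None)


shared = {}  # stand-in for one shared memcached instance
site_a = PrefixedCache(shared, "site-a")
site_b = PrefixedCache(shared, "site-b")
site_a.set("frontpage", "A")
site_b.set("frontpage", "B")
print(site_a.get("frontpage"), site_b.get("frontpage"))
```

Both projects use the key "frontpage", but the values never overwrite each other because the stored keys differ.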

Keeping ssh connections alive

I’ve got nothing more to say than:

mads@workmads ~ % cat .ssh/config 
ServerAliveInterval 60

Happy ssh’ing.

Python http_head method

Seeing as there is no really easy way to do an HTTP HEAD request from Python, I wrote up the following small function. In advance, I’d like to apologize for the way it assembles the request path.

Update: Added handling of redirects.

def http_head(url):
    import httplib
    import urlparse
 
    redirects = 0
 
    while redirects < 10:
        scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
 
        if scheme == 'https':
            conn = httplib.HTTPSConnection(netloc)
        else:
            conn = httplib.HTTPConnection(netloc)
 
        # an empty path must be sent as "/", and the fragment is never sent to the server
        conn.request("HEAD", "%s%s%s" % (path or "/", query and "?" or "", query))
 
        res = conn.getresponse()
 
        if res.status in (301, 302, 303, 307) and res.getheader('location'):
            # Location may be relative, so resolve it against the current URL
            url = urlparse.urljoin(url, res.getheader('location'))
            redirects += 1
        else:
            break
 
    return res.status, res.reason
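
On Python 3 the httplib and urlparse modules became http.client and urllib.parse. Below is a rough port under that assumption. The request_target helper and the max_redirects parameter are additions for illustration and are not part of the function above. The fragment is left out of the request, since it is not sent to the server.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def request_target(url):
    """Path plus query for the request line; the fragment is left out."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def http_head(url, max_redirects=10):
    """Send HEAD requests, following redirects; return (status, reason)."""
    for _ in range(max_redirects):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc)
        else:
            conn = http.client.HTTPConnection(parts.netloc)
        try:
            conn.request("HEAD", request_target(url))
            res = conn.getresponse()
            location = res.getheader("location")
        finally:
            conn.close()
        if res.status in (301, 302, 303, 307, 308) and location:
            url = urljoin(url, location)  # Location may be relative
        else:
            break
    return res.status, res.reason
```

Calling http_head with a URL returns the final status code and reason phrase as a tuple, just like the original.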

Formatting xml with xmllint

I keep forgetting how to format and indent xml from the command line. The tool xmllint does a fine job of doing just that, which has saved me numerous times whilst working with sports results. So. Much. Data.

Running

xmllint --format <file>

will re-format and re-indent the xml in the input file, and check it for various errors while doing it.
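
When xmllint is not at hand, Python’s standard library can do something similar. This is only an approximation and not how xmllint works internally: format_xml is a made-up name, the sample document is invented, and input that is already indented tends to come out with extra blank lines.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def format_xml(text, indent="  "):
    """Re-indent an XML string; raises ExpatError if it is not well-formed."""
    return minidom.parseString(text).toprettyxml(indent=indent)


# invented sample document
print(format_xml("<results><match><home>1</home><away>0</away></match></results>"))
```

As with xmllint, feeding it a document that is not well-formed fails loudly (here with an ExpatError) rather than producing output.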