On Linux systems it’s common to use ssh-agent (manpage) to store SSH keys. When you run the agent it prints a few shell commands, and evaluating them configures your shell. The variables they define allow ssh and ssh-add to find the agent once it’s started.

$ ssh-agent
SSH_AUTH_SOCK=/tmp/ssh-RRekSLl29890/agent.29890; export SSH_AUTH_SOCK;
SSH_AGENT_PID=29891; export SSH_AGENT_PID;
echo Agent pid 29891;

I use an Ubuntu VM, and regularly log in using PuTTY. It’s easy to connect to the agent on login, but I wanted a way to keep connecting back to the same one so I could add keys and leave them there. There’s loads of solutions available, but I wanted to try and write a simpler version.


  • Store the agent settings into a file under ~/.ssh
  • Load the file on login
  • Check whether the agent can be found, and if it can’t start it

This is slightly more subtle than it first appears: it’s necessary to cope with the file not existing or containing out-of-date data, and I wanted to do this as tidily as possible.

Here’s my solution:

[ -f ~/.ssh/agent_env ] && . ~/.ssh/agent_env
kill -0 $SSH_AGENT_PID 2> /dev/null || . <(ssh-agent | tee ~/.ssh/agent_env)

First we check if the file exists:

[ -f ~/.ssh/agent_env ]

If it does then we source it into the current shell using the ‘.’ command (a shorthand that bash also spells source):

. ~/.ssh/agent_env

If the file did exist then we should have a value for the SSH_AGENT_PID. We check whether it’s possible to send a signal to the process this identifies using kill -0:

kill -0 $SSH_AGENT_PID

The nice thing about this is that if the variable isn’t defined then kill is given no process ID at all, and it’ll return false in the same way as if the process didn’t exist. We add a redirect to /dev/null so we don’t see its helpful error message.

kill -0 $SSH_AGENT_PID 2> /dev/null

Now we just need to run the agent, store its output in the file, and use that output to configure the shell. We can use a bash trick to load the agent and configure the shell in one tidy step:

. <(ssh-agent)

The <(...) syntax here is bash’s process substitution: it attaches the ssh-agent output to a file descriptor, and then expands to the path of that file descriptor. It’s convenient when you want to pass the output of a command to one which is expecting a filename.
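
As a quick illustration of that trick, here is a minimal sketch (bash only; the GREETING variable is made up for the example) which loads a variable from a command’s output:

```shell
# Process substitution (bash only): <(command) expands to a path such as
# /dev/fd/63, so "." can read the command's output as if it were a file.
. <(echo 'GREETING=hello')
echo "$GREETING"    # prints: hello
```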

Finally we insert a call to tee which writes out our file.

. <(ssh-agent | tee ~/.ssh/agent_env)
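
For reuse in a login script the same idea can be wrapped in a function. This is a sketch rather than a drop-in replacement: the agent_connect name is invented here, it quotes the variables, keeps the saved file private with umask, and writes the file before sourcing it instead of using process substitution and tee.

```shell
# Reconnect to a saved ssh-agent, or start a new one and save its settings.
# The env file path is the first argument (defaults to ~/.ssh/agent_env).
agent_connect() {
    local env_file="${1:-$HOME/.ssh/agent_env}"

    # Load the saved settings if there are any; hide the "Agent pid" echo.
    [ -f "$env_file" ] && . "$env_file" > /dev/null

    # kill -0 sends no signal; it only reports whether the PID can be signalled.
    # With quoting, an unset variable becomes an empty argument, which also fails.
    if ! kill -0 "$SSH_AGENT_PID" 2> /dev/null; then
        # Start a fresh agent and record its settings in a private file.
        (umask 077; ssh-agent > "$env_file")
        . "$env_file" > /dev/null
    fi
}
```

Calling agent_connect from ~/.bashrc should then behave much like the two-line version. Bear in mind that kill -0 only shows that some process with that PID exists, so a recycled PID could in principle fool either version.
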
Posted in Computing | Leave a comment

Creating a VM to mimic EC2

I spend a lot of time working with Amazon’s EC2 (Elastic Compute Cloud) virtual servers. Configuring a new server takes up a surprisingly large amount of time when done manually, and so I use Puppet to automate the process. This works pretty well, but after creating lots of new configuration code I always think it’s worth testing the complete installation on a local Virtual Machine. I’ve built a VM which looks quite a lot like an empty Ubuntu 12.04 cloud image, but have always worried that the CD installation might have a few key differences. Here’s a way to work round some of these.

EC2 instances use a different set of sources for APT, so the first step is to sync these up. Wipe the current /etc/apt/sources.list, and replace with an example from the cloud image you’re trying to emulate. I used the following lines:

deb precise main
deb-src precise main
deb precise-updates main
deb-src precise-updates main

deb precise universe
deb-src precise universe
deb precise-updates universe
deb-src precise-updates universe

deb precise-security main
deb-src precise-security main
deb precise-security universe
deb-src precise-security universe

Now a quick apt-get update will get you installing the same packages as EC2. The next logical step is to reinstall all the current packages to match. I thought this would be pretty easy, and went for the following invocation:

dpkg --list |
grep ^ii |
awk '{ print($2) }' |
xargs apt-get install --reinstall

This works as follows:

  1. Get the current list of packages
  2. Return only those which are actually installed
  3. Drop everything in the output but the second column (the column containing the package name)
  4. Pipe into xargs which in turn calls APT.
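
Steps 2 and 3 can be tried safely without touching APT. The sketch below folds the grep and awk stages into a single awk call (a small variation on the pipeline above, not the original command), and the installed_names function and sample rows are invented for illustration:

```shell
# Keep only fully installed packages (status "ii") and print their names.
# Reads "dpkg --list"-style rows on stdin.
installed_names() {
    awk '$1 == "ii" { print $2 }'
}

# Made-up sample rows standing in for real dpkg output.
installed_names <<'EOF'
ii  bash      4.2-2ubuntu2  GNU Bourne Again SHell
rc  oldtool   1.0-1         removed, config files remain
ii  coreutils 8.13-3ubuntu3 GNU core utilities
EOF
# prints:
# bash
# coreutils
```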

This got things moving and I’d soon downloaded 397 fresh packages. Unfortunately things then stopped with the (un)helpful message xargs: apt-get: terminated by signal 11. I tried calling APT separately, and got slightly further: Segmentation fault (core dumped). Did I say this got me further?

Reinstalling packages individually seemed to work, so I developed the theory that APT chokes when fed too many packages in one go. This bug seemed to be related. I worked around it with a simple modification to the xargs command to take ten packages at a time:

dpkg --list |
grep ^ii |
awk '{ print($2) }' |
xargs -L 10 apt-get install --reinstall
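
To see how xargs splits the work before letting it loose on APT, echo can stand in for the real command. This is only a dry-run sketch; preview_batches and the package names are hypothetical:

```shell
# Print the apt-get commands that xargs would run, without executing them.
# $1 is the batch size; package names arrive on stdin, one per line.
preview_batches() {
    xargs -L "$1" echo apt-get install --reinstall
}

# Hypothetical package names, purely for illustration.
printf '%s\n' pkg-a pkg-b pkg-c pkg-d pkg-e | preview_batches 2
# prints:
# apt-get install --reinstall pkg-a pkg-b
# apt-get install --reinstall pkg-c pkg-d
# apt-get install --reinstall pkg-e
```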

I decided to take another look at the Launchpad bug report linked earlier. This states:

When trying to --reinstall multiple packages with apt-get, it crashes with a segmentation fault. The command line: apt-get --reinstall install <2-or-more-packages...>

Having used Ubuntu for free for many years I thought it might be about time I gave something back to the community, so I thought I’d take another look. I’ve posted a comment with some more information after getting rather more involved with GDB and apport than I would have liked!

I genuinely don’t know if there is any difference between the packages for EC2 and those in the standard repositories (I rather suspect there isn’t), but if there is then I definitely don’t want to find out about it when trying to fix an urgent problem.


Exotic Data Structures

I came across this recent comment on Hacker News today, and thought the data structures therein worth advertising further:

Smushed list

Size O(1). The smushed list is a list of variables (of the same type), stored in a single variable of that type. To produce the smushed list, simply XOR all the elements of the list together, then store. To get a value back, simply XOR the smushed list by all the elements other than the one you want. Smushing is also embarrassingly parallel (you can smush two halves separately and then smush the results) so producing smushed lists is blazingly fast.

Unlinked list

O(n). This is slightly faster than a linked list, and acts as a “black box”. Simply allocate nodes that are not linked to each other in any way. The data normally stays out of the way of your program, but in case of a core dump you can find it again. NOTE: If your language does reference-counting this will not work. Get a real language that does what you say.


Search trees are good at bisecting data, but they are not really conducive to a random walk for inspiration. Begin by constructing a binary search tree, keeping track of all the nodes you’ve added, and simply add a third, random, pointer to each node – have it point at a random node somewhere in the tree. In the search algorithm, either follow the left, right, or random node, depending on how much meandering you are interested in doing. The journey is the destination.

I’m reasonably confident I’ve seen these before somewhere but can’t figure out where.
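
For the curious, the smushed list really does work as described, provided you still have every other element to hand. A minimal sketch using shell arithmetic (the smush and unsmush names are my own):

```shell
# Smush a list of integers into one value by XOR-ing them together.
smush() {
    local acc=0 x
    for x in "$@"; do
        acc=$(( acc ^ x ))
    done
    echo "$acc"
}

# Recover one element: XOR the smushed value with every *other* element.
unsmush() {
    local smushed=$1
    shift
    smush "$smushed" "$@"
}

s=$(smush 3 5 9)    # 3 ^ 5 ^ 9 = 15
unsmush "$s" 3 9    # prints: 5
```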


PHP documentation and Sphinx

Documentation is important, but commonly forgotten. Writing any code requires filling your brain with information – what the language is doing – what the API you’re calling is doing – what you want the code to do – and most of this information will fall out again a few days later. A month later a bug is found, and this information needs to be acquired all over again. Luckily you thoroughly documented everything that the code was doing, allowing you to pick up where you left off and fix the bug in seconds! Hurrah! Well, maybe.

To make this process easier, a variety of systems have been developed to build documentation automatically. In general you simply include some extra comments in the source code explaining what you’re up to, and then something else extracts this information and formats it elegantly for the web. Javadoc is a standard tool for Java developers, and the syntax has been recycled for other C-style languages, giving us JSDoc (for JavaScript) and PHPDoc. Perl’s POD and Python’s PyDoc seem to have slightly different solutions which emphasise writing a section of documentation at the start of a file rather than merging it into the source, although Python also uses docstrings to directly document parts of the code.

phpDocumentor was the first PHP documentation system I met. It worked pretty well, although the format of the documentation it output left a little to be desired. A good example is the PHP Client Library for Google’s Adwords API.

A rival soon came along in the form of DocBlox, currently used for the Zend Framework documentation. This created much more elegant output.

The two projects have now merged together under the name phpDocumentor 2, which delivers the best documentation yet (do try it).

This system is great, and I always include markup within code to allow simple documentation to be generated. Unfortunately it’s quite hard to write good documentation when it has to be wedded to the structure of a project’s source code. The Zend Framework folks provide automatically-generated documentation, but also write a more thorough reference guide. I wanted a more sophisticated documentation generator.

Sphinx was originally written to produce new documentation for the Python language. I’d always assumed it was only really suitable for Python projects, but soon started to see it in more places. The new phpDocumentor documentation is built with Sphinx (see the footer at the bottom), as is the current documentation for Zend Framework 2, and more PHP-centric projects seem to be making use of it.

I soon installed Sphinx and used its quickstart script to get going. Syntax highlighting of PHP source code is supported (Sphinx uses Pygments for this), but I was still lacking the ability to properly cross-reference code. Sphinx uses “domains” to allow you to document parts of the code such as classes or functions, and then allows easy cross-references between these. I found that the Python domain worked pretty well for PHP, but wondered if a better option was available.

Sphinx supports domains for Python, C/C++, JavaScript and “reStructuredText” (the Sphinx markup) out of the box, and more are available in the sphinx-contrib repository, including Ada, Erlang and Ruby. Sadly there was no PHP domain listed in the readme, but I managed to stumble on one by looking through the source (how ironic that a repository for a documentation generator is poorly documented…) The readme for the domain suggested installing it using Python’s easy_install, and I soon found the project itself in the Python Package Index.

Installing all this on Ubuntu is pretty simple:

sudo apt-get install python-sphinx python-setuptools
sudo easy_install pip
sudo pip install sphinxcontrib-phpdomain
cd /path/where/documentation/project/lives
sphinx-quickstart

After answering some questions about your project you’ll find a directory structure with an automatically generated conf.py. You’ll need to make a couple of changes to it as follows:

# Add any Sphinx extension module names here [...]
extensions = ['sphinxcontrib.phpdomain']

# The name of the default domain.
primary_domain = 'php'

# The default language to highlight source code in.
highlight_language = 'php'

Assuming you allowed the quickstart to create the makefile, you can now test everything works:

make html

… and point your browser at the html documentation. Take a look at the makefile itself to see what formats you can use for your documentation – there’s a lot of options. reStructuredText itself is pretty simple, and the Sphinx documentation itself is good. Documentation for the PHP domain is OK, but misses some bits out. It allows you to document the following objects:

  • ‘function’ cross-referenced with ‘func’
  • ‘global’ cross-referenced with ‘global’
  • ‘const’ cross-referenced with ‘const’
  • ‘method’ cross-referenced with ‘meth’
  • ‘class’ cross-referenced with ‘class’
  • ‘attr’ cross-referenced with ‘attr’
  • ‘exception’ cross-referenced with ‘exc’
  • ‘namespace’ cross-referenced with ‘ns’
  • ‘interface’ cross-referenced with ‘interface’

 Now all you need to do is write something…


Developing in a Linux Virtual Machine on Windows (3)

This post is part of a series about my development stack – I use an Ubuntu Server VM running on Windows.

  • Part 1 showed how to set up the VM
  • Part 2 demonstrated how we could share code on the host system with the guest in order to execute it in a Linux environment

There’s some sort of grim inevitability to progress with this series so far: I put together a post showing how to use VMWare Player, only for them to release a new version and render all my screenshots out-of-date. I thought it’d be worth adding some more to show some of the more obvious differences.

Taking a first look at things, the new version looks pretty familiar. It’s lost the thick blue borders and skin, and has a refreshing Aero feel.

Version 4

Version 5

Perhaps the most interesting change is the prominent new notifications about licensing. I’m not using the software commercially so it’s not a problem for me, but I was curious to see how much a licence would cost. Clicking through we reach an unhelpful page offering me “VMWare Fusion for Mac OSX”. Looks like VMWare need to figure things out…

I had no problems starting Virtual Machines created under version 4 of the player (I won’t go so far as to say this was a relief, but it’s certainly helpful!) Once running a VM, the interface looks much cleaner.

One change worth noting is that accessing the virtual machine settings now takes an extra click – it’s now found under the single “Player” menu:

While exploring around I also noticed something I’d not found before – the ability to download and install a “Virtual Appliance” directly. Clicking the menu option takes us to a marketplace for VM images.

There’s an appliance for Ubuntu Server 12.04, but looking at the comments it seems they charge for this, so I’ll stick to installing it manually.


Developing in a Linux Virtual Machine on Windows (2)

This is the continuation of a previous post where we set up an Ubuntu virtual machine in VMWare Player on a Windows host. To recap, what we’ve got so far is:

  • An easily-accessible Ubuntu server.
  • Folders on the host shared with the guest. We can edit code on the host machine and then test the changes in the guest directly, without having to upload / copy / publish / sync anything.


Running a VM in this way has one profound annoyance: once you’re using the server, you can’t get out! The player captures your mouse and keyboard, and all input is directed to the player. You can get them back again by pressing Ctrl+Alt, but this process keeps getting in the way – stopping easy Alt+Tabbing and disrupting my working rhythm. I also had an additional problem: I use a customised version of the Dvorak keyboard layout, and didn’t fancy the effort required to set this up on Linux. Luckily there’s an easy solution.

The de facto standard secure shell client on Windows is PuTTY, and this is great for connecting to remote servers. Just because the guest system is running on the same machine doesn’t mean I can’t pretend it’s miles away! Before you connect you’ll have to install an SSH daemon:

sudo apt-get install openssh-server

Now connecting is easy – just put the hostname in and click “Open”:

On some networks you may find that the hostname doesn’t resolve, or it may take a little time for your router to figure out what the name is. You can easily work around this by connecting directly to the IP address of your virtual machine – find it by entering the ifconfig command at the shell. Looking this up every time you boot up the VM rapidly becomes tedious and so you may want to assign a static IP. This is a pretty simple process from the shell, although you may have to persuade your router to play nicely with this process. Once you’ve fixed the IP you can save it in PuTTY as a session, or even add a meaningful name to your Windows hosts file to make it permanent.

Playing with the PuTTY settings is a worthwhile process. I usually set up a couple of changes in the “Default Settings” session to make my life better:

  • Window->Behaviour, set “Full screen on Alt-Enter” to on
  • Connection->Data, set “Terminal-type string” to “xterm-color” to enable the colourful shell prompt

We’ve done a lot but not made a lot of progress from where we started. We’ve got easier access to a Linux server than ever before, but it doesn’t yet do much. Here’s what I do to get things going:

locale-gen en_GB.UTF-8
update-locale LANG=en_GB.UTF-8
apt-get install apache2 php5 mysql-server libapache2-mod-php5 php5-cli php5-curl php5-gd php5-mysql
echo "AddDefaultCharset utf-8" > /etc/apache2/conf.d/charset
mv /etc/php5/apache2/php.ini{,.old}
cp /usr/share/doc/php5-common/examples/php.ini-development /etc/php5/apache2/php.ini
a2enmod rewrite
service apache2 restart

Once that’s all finished you can test by navigating to your server in a web browser.


The final step is to make the webserver serve up the shared files from the host. You can easily dig in and edit Apache’s configuration files to accomplish this, but an alternative scheme is just to add a symlink:

sudo ln -s /mnt/hgfs/www /var/www/

(This assumes you set up the shared drive as I did in part 1, i.e. under the name “www”). Now we can access the shared files in the webserver at http://leighserver/www/, and immediately preview changes made in the host.

In future posts I’ll take a look at how we make those changes in the host, and what development tools there are to make that easier.


Database design 101: Is your field big enough?

When designing a database schema, it’s worth trying to choose sensible sizes for each column. It’s obviously crucial to make sure that your field gives enough space to accommodate the data you want to store in it, but it’s good to keep the sizes down as far as possible in order to improve performance. Increasing the size of each row is going to come with a cost in storage and processing time. As might be obvious from the title, I learnt a lesson about this a couple of days ago.

Some months ago I built a new API, and we were particularly keen to be able to analyse its performance and the number of queries. Given we had a database connection handy, it seemed most convenient to log requests directly into a table. With hindsight this was probably a bad idea:

  • Ultimately it’s going to be less performant than just writing to a log file
  • As the requests pile up the table gets bigger, making backups huge and soon necessitating a regular cronjob to purge out older entries

On the other hand, logging to the database does give a massive advantage in terms of the ease of querying the data. With a few SQL statements it’s trivial to analyse load per hour, see how it varies over time, and group by other logged parameters (such as the service queried). We also built a plugin for Munin to allow us to produce realtime graphs of API load. While a log file can be queried in a similar way using the Linux shell, that’s a rather less trivial process.
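
To give a flavour of the log-file alternative, here is a rough sketch of a per-hour count in the shell. The log format (an ISO-style timestamp at the start of each line) and the requests_per_hour name are assumptions for the example, not what our API actually wrote:

```shell
# Count requests per hour from a log on stdin.
# Assumes each line starts with a timestamp such as 2012-07-01T13:05:22,
# so the first 13 characters identify the date and hour.
requests_per_hour() {
    cut -c1-13 | sort | uniq -c | awk '{ print $2, $1 }'
}

# Made-up sample log lines.
requests_per_hour <<'EOF'
2012-07-01T13:05:22 search q=foo
2012-07-01T13:48:10 lookup id=7
2012-07-01T14:02:51 search q=bar
EOF
# prints:
# 2012-07-01T13 2
# 2012-07-01T14 1
```

Grouping by a second field as well (the service, say) already needs more thought here than the equivalent GROUP BY would.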

I built up the table as follows:

CREATE TABLE `api_requests` (
    `id` MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `service` VARCHAR(10) DEFAULT NULL,
    `query_string` VARCHAR(1024) NOT NULL DEFAULT '',
    `time_taken` FLOAT(9,6) NULL DEFAULT NULL,
    PRIMARY KEY (`id`)
);

This all seems OK so far. So which field is too small?

  • The service field looks suspicious – there’s only 10 characters allowed. This turns out not to be a problem as the only services we’ve built have short names.
  • What about the IP address? 20 characters is more than enough for IPv4, and we’re not being queried over IPv6.
  • I guess the query string could be too short…?

If it’s not yet obvious, the answer was the primary key, the ID. A MySQL MEDIUMINT UNSIGNED allows values up to 16,777,215, and after a few months the id grew to this level. At this point the server refused to insert any more rows with a cryptic error message about a unique key violation (the AUTO_INCREMENT value can’t grow any higher, and it tries to insert the new row with the same ID as an old one). A few minutes later (after a lucky catch by our API validation system) the field size was big enough to last the next 300 years.
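
The limit follows directly from the storage size: MEDIUMINT uses 3 bytes and INT uses 4. A quick sketch of the arithmetic (max_unsigned is a made-up helper, and INT UNSIGNED is shown only for comparison):

```shell
# Largest value an unsigned integer column of $1 bytes can hold: 2^(8*bytes) - 1.
max_unsigned() {
    echo $(( (1 << ($1 * 8)) - 1 ))
}

max_unsigned 3    # MEDIUMINT UNSIGNED: prints 16777215
max_unsigned 4    # INT UNSIGNED:       prints 4294967295
```

One extra byte per row buys roughly 256 times as many IDs.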

It’s obviously impossible to predict the future, but I think I’ll be erring a little more on the side of caution when designing future schemas.


Virtualisation and web hosting – update

I wrote recently about my efforts to find a cheap VPS host. Recently I’ve come across which aims to make this process easier. It looks pretty good, but doesn’t seem to help much with the really small case – there’s no AWS Micro EC2 instances, for example.


Developing in a Linux Virtual Machine on Windows

Sadly for this post, but luckily for the rest of us, VMWare have now released version 5 of the Player software. Everything seems to work as before and most functionality appears in the same place, but the screenshots are now out-of-date.

I like Windows. I know my way around it, I’ve tried the alternatives and I still like it.

The most important part of writing any code is testing it, and it’s hard to test properly on a remote machine. Every change has to be uploaded (over a slow connection) before it can be run. It’s obviously a bad idea to test on a live site, so now we need two remote machines. Now the cost is mounting along with the frustration.

I tend to work in PHP, with sites hosted on Apache and using MySQL databases. As it turns out, these can all be installed on a Windows system, and there are easy packages available to do that (like WAMP or XAMPP). I’ve tried one of these (can’t remember which), and despite a clunky interface everything did seem to work. I soon ran into problems, however, when maintaining and developing PHP command-line scripts built for a UNIX machine, finding that:

  1. Everything’s in the wrong place. The current scripts expected to look in /usr/bin, but this directory didn’t exist. The closest match might be C:\Program Files\, which isn’t the same at all.
  2. Shell scripts written for BASH don’t work on cmd.exe. Nothing I could do was going to make them work.

These problems aside, testing code in one environment and then deploying to another doesn’t seem like a good idea. I needed access to a UNIX box.

The most obvious option is to install a UNIX system locally and work on that. This sounds fantastic – you edit code and it’s already uploaded! Unfortunately for me this isn’t an option ’cause I like Windows. Luckily I’d heard of virtualisation. It didn’t take long to start up a Linux virtual machine which ran cleanly within Windows, and here’s how you can do it too.

There’s a wide variety of host software for running virtual machines within Windows, and I’d already heard of VirtualBox, VMWare Player and Virtual PC. The latter doesn’t officially support Linux guests, and a brief comparison led me to pick VMWare’s Player. Installing this was very easy, and I soon found myself ready to start creating a virtual machine.

Choosing the option to create the VM starts a wizard to guide you through the settings. To install an operating system you’d commonly need an installation CD, but the player can cope with .iso image files directly, so burning the physical CD isn’t necessary. I grabbed the latest version of Ubuntu server from their download site and was ready to go.

I’m probably over-suspicious, but I tend not to like letting applications help me out with operating system installations, and VMWare Player will jump in and ask you if you want help with the process. The best way to get it out of the way is to avoid telling it what you’re up to. I chose the option to install the operating system later. (I have had experience of things not working right after using the guided installer).

Again we keep our intentions secret, avoiding telling the player that we’re installing Linux.

At the end of the wizard, click the “Customize Hardware” button to tweak the VM’s hardware settings.

Again I’m probably being over-suspicious, but I think operating systems like an easy life and so my first step was to remove any unnecessary hardware (e.g. USB controllers / printers). Once you’ve done that it’s worth bumping up the available memory to a sensible level, and it’s finally time to tell the player what CD image we want to put in the drive.

I want to set up a webserver and this will expect to be connected to a network in a normal manner. I therefore chose to bridge the host network adapter, allowing the guest operating system to connect directly on to the network.

Once you’ve done this you’re ready to boot up. The CD image will be used directly, and the virtual machine will boot into the Ubuntu setup program. This is fairly easy to work through, and there’s lots of help and support on the Ubuntu website. Towards the end of the process you’ll find an option to automatically install a load of software. My suspicious nature kicked in again and I decided not to bother, opting to install what I wanted manually later.

VMWare Player comes with a suite of software (the VMWare Tools) which we’ll need. Hiding the operating system we’re installing kept the process smooth, but prevented the tools installing automatically. This is easily rectified: first shut down the VM, then run VMWare Player again. You’ll have an option to edit the virtual machine settings.

Choose “Linux” and “Ubuntu 64-bit” (I installed the 64-bit variety of Ubuntu).

Now the player knows what operating system is running it will offer to install the tools.

Before we do this we need to get the operating system ready. Run the following as root:

apt-get update
apt-get upgrade
apt-get dist-upgrade
apt-get autoremove
apt-get install build-essential
shutdown -r now

When the machine reboots you’re ready to install the tools. You may need to change the filenames as version numbers change – use the shell’s tab autocompletion to help with this.

mount /dev/cdrom /media/cdrom
cp /media/cdrom/VMwareTools-8.8.4-743747.tar.gz .
umount /dev/cdrom
tar -xzf VMwareTools-8.8.4-743747.tar.gz
cd vmware-tools-distrib
./vmware-install.pl

You should be able to get by just by selecting all the default options. I’ve had a few problems getting this to work recently, possibly because the latest Ubuntu comes with version 3 of the Linux kernel whereas the tools build “Using kernel 2.6 build system”. Making sure your system is totally up-to-date usually does the trick.

We’ve now got an easily-accessible shell, but nothing else. We’ve not got the file sharing which had been one of our original aims. This is easy to fix now the tools are installed. You can get to the Virtual Machine Settings dialog from the Virtual Machine menu even though the VM is still running, and set up a file share.

Create a shared folder, and remember its name. I’ve got some web code in E:\Documents\Code\www, and I shared this with the VM. Click OK until you’re back at the shell. VMWare calls this file sharing the “Host Guest File System”, and puts all shares in /mnt/hgfs. A simple ls /mnt/hgfs/www showed that everything was working properly and the host’s files were showing through.

Now we’ve got a Linux server running locally, and I’ve shared files from my host system. I can edit any code from within Windows, and the files will appear immediately on the guest, making it dead easy to debug code. There’s just a few more steps to cover to complete the installation, and we’ll take a look at these next time.


Migration – Further Progress

Now my diary is settling in on its new server, I thought I’d add a couple of thoughts:

  • It now runs far faster, feeling much more snappy and responsive. No other users are competing for server resources, and the server’s a lot nearer (Ireland rather than the US).
  • NameCheap have spoilt themselves, and their server has been listed on a spam blacklist at least four times in the past month. Not being able to send email is really irritating.
  • Microsoft are so pleased I’ve signed up for Azure that they’re trying to ‘phone me “to find out about your plans with regards to Windows Azure platform and welcome you as a new user.” Although this is probably an attempt to sell me more, it feels a lot more welcoming than, well, every other online service I’ve ever used.

Migration: Aches and Pains

I recently migrated my diary to a new server, and was amused by some of the problems I faced doing so. Deploying PHP applications should be a simple affair – set up the database, upload the code – but moving to a newer version of the language brought many problems. PHP now bothers to check for errors in a lot of circumstances where it previously didn’t, causing warnings and notifications to pop up all over the place (and filling my logs with nonsense). PHP has always fought hard to try and be the least secure computing platform in existence, but even I was amazed to see how many SQL injection vulnerabilities I’d managed to include in the code when I originally wrote it a few years ago. While it’s pleasing that I’ve learnt so much in a short space of time (I now know what an “SQL injection vulnerability” is, for example), I can’t help feeling a little bit frustrated about the situation. Just a couple of examples:

  • While other languages provide proper abstractions to access databases (e.g. the fantastic SQLAlchemy library for Python), PHP positively encourages you to just mysql_connect() and start firing queries into the database (at least that page now carries a warning).
  • Some languages give you a framework to access data passed in by the user. At least PHP doesn’t just spread them around anymore, but it does make them easily available – so easily that it’s natural to grab hold and start using them, even if that’s the security equivalent of leaving the front door wide open.

Apologies for what’s turned into a rather snarky rant, but I’m in good company. It turns out there’s a whole community of PHP-haters: you can find out more via,,, and PHP Turtles for examples. I’ve recently really enjoyed reading this post:

I can’t even say what’s wrong with PHP, because— okay. Imagine you have uh, a toolbox. A set of tools. Looks okay, standard stuff in there.

You pull out a screwdriver, and you see it’s one of those weird tri-headed things. Okay, well, that’s not very useful to you, but you guess it comes in handy sometimes.

You pull out the hammer, but to your dismay, it has the claw part on both sides. Still serviceable though, I mean, you can hit nails with the middle of the head holding it sideways.

You pull out the pliers, but they don’t have those serrated surfaces; it’s flat and smooth. That’s less useful, but it still turns bolts well enough, so whatever.

And on you go. Everything in the box is kind of weird and quirky, but maybe not enough to make it completely worthless. And there’s no clear problem with the set as a whole; it still has all the tools.

Now imagine you meet millions of carpenters using this toolbox who tell you “well hey what’s the problem with these tools? They’re all I’ve ever used and they work fine!” And the carpenters show you the houses they’ve built, where every room is a pentagon and the roof is upside-down. And you knock on the front door and it just collapses inwards and they all yell at you for breaking their door.

That’s what’s wrong with PHP.

Web development, especially secure web development, isn’t easy. Even helpful articles are full of errors: this page recommends SHA1 for password hashing (and assumes that the only problem with the algorithm is hash collisions), even though that technique went out with the ark. What’s irritating is that PHP makes it incredibly easy: easy enough to shoot yourself in the foot again and again.

Posted in Computing | Leave a comment


A link recently bubbled up the front page of Hacker News, and I found myself rather curious to find out more. The headline rather proudly proclaimed: “RJSON: compress JSON to JSON”. JSON is a useful and simple way of transmitting data between machines on the Internet, and is commonly used by many dynamic websites today (see Wiki for more details). Surely a way of compressing these transmissions would speed up the world? Hurrah!

Let’s take a look at the page itself. Basically, it’s a library which lets you get from here:

{
	"id": 7,
	"tags": ["programming", "javascript"],
	"users": [
		{"first": "Homer", "last": "Simpson"},
		{"first": "Hank", "last": "Hill"},
		{"first": "Peter", "last": "Griffin"}
	],
	"books": [
		{"title": "JavaScript", "author": "Flanagan", "year": 2006},
		{"title": "Cascading Style Sheets", "author": "Meyer", "year": 2004}
	]
}

To here:

{
	"id": 7,
	"tags": ["programming", "javascript"],
	"users": [
		{"first": "Homer", "last": "Simpson"},
		[2, "Hank", "Hill", "Peter", "Griffin"]
	],
	"books": [
		{"title": "JavaScript", "author": "Flanagan", "year": 2006},
		[3, "Cascading Style Sheets", "Meyer", 2004]
	]
}

The page also helpfully links to alternative compression schemes called JSON DB and CJSON, which each produce slightly different compressed output.

Let’s consider a number of questions:

  • Does this really compress the file?
  • Is it useful?
  • Should I use it?

Does this really compress the file?

When comparing the two compressed versions I was curious to note that they use a slightly different example for compression. The RJSON site used an abridged version of the example on the JSON DB page. Let’s take a look at these examples and how they compress. For the purposes of the analysis I’ve used UNIX line-endings and not minified the examples in any way. To produce the data I used the online demos for RJSON and CJSON.

  • Example 1 (Short version from RJSON page)
    • Original: 340 bytes
    • RJSON (claimed): 279 bytes
    • RJSON (via demo): er, 549 bytes
    • CJSON (via demo): 308 bytes
  • Example 2 (Longer version from JSON DB page)
    • Original: 711 bytes
    • RJSON: 725 bytes
    • JSON DB: 548 bytes
    • CJSON: 429 bytes

Let’s be charitable and assume that RJSON really works fine, but the demo is broken. If we take the claimed compression then we save an underwhelming (340 – 279) / 340 = 17.9% of the file size. JSON DB manages 22.9%, although I suspect much of this improvement is caused by the larger example input, which contains more repeated strings. These algorithms target these repeated strings (the keys for the JSON objects) and so perform better with a larger input. The compression performance will tend to a limit as the size increases, and a back-of-the-envelope calculation shows that this could be quite a large limit for RJSON if we keep adding more “users” to their example (around 50%). JSON DB and CJSON work similarly, and I’m not going to discuss them further.

I can’t help feeling we’re missing something crucial here – with such large potential gains why is the filesize increasing? Clearly the examples are far too short, but they also have a crucial flaw: the whitespace. Indentation and newlines make it clear what’s going on, but don’t represent the theoretical compression we could have by removing those as well. JSMin is a tool for stripping such spaces, and there’s an online version too to make things easy.

  • Example 1
    • Original: 340 bytes
    • Original minified: 284 bytes
    • RJSON minified: 232 bytes
  • Example 2
    • Original: 711 bytes
    • Original minified: 456 bytes
    • RJSON minified: 326 bytes
  • Example 3, made from the first example by retaining only the user section and repeating it 31 times:
    • Original: 4015 bytes
    • Original minified: 3243 bytes
    • RJSON minified: 1632 bytes

Here’s the start of Example 3:

"users": [
	{"first": "Homer", "last": "Simpson"},
	{"first": "Hank", "last": "Hill"},
	{"first": "Peter", "last": "Griffin"},
	{"first": "Homer", "last": "Simpson"},
	{"first": "Hank", "last": "Hill"},

Now we’re getting somewhere, with percentage savings of 18.3%, 28.5% and 49.7%. It’s easy to show that, as we repeat the user section more times, this ratio will tend to a maximum of around 50.5%.
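The Example 3 figures and the limiting ratio can be checked with a short script. This is only a sketch: it does not call the RJSON library, and `pack_users` is a hypothetical helper that imitates by hand the packed layout shown in the example output above (first record kept as an object, the rest flattened into one array led by the schema index 2).

```python
import json

# The three users from Example 1; Example 3 repeats this section.
USERS = [
    {"first": "Homer", "last": "Simpson"},
    {"first": "Hank", "last": "Hill"},
    {"first": "Peter", "last": "Griffin"},
]


def minified_size(value):
    """Byte length of value serialised as JSON with no optional whitespace."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


def pack_users(records):
    """Hand-built imitation of the packed layout (not the real RJSON code)."""
    packed = [records[0]]
    if len(records) > 1:
        flat = [2]  # schema index, copied from the example output above
        for record in records[1:]:
            flat.extend([record["first"], record["last"]])
        packed.append(flat)
    return packed


def saving(copies):
    """Return (original bytes, packed bytes, fractional saving)."""
    records = USERS * copies
    original = minified_size({"users": records})
    packed = minified_size({"users": pack_users(records)})
    return original, packed, 1 - packed / original


for copies in (1, 32, 1000):
    original, packed, ratio = saving(copies)
    print(copies, original, packed, f"{ratio:.1%}")
```

With 32 copies this should reproduce the 3243 and 1632 byte figures above. Each extra block of three users adds 101 bytes to the original and 50 bytes to the packed form, which is where the limit of 1 - 50/101, or about 50.5%, comes from.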

Finally, it’d be useful to compare to another compression algorithm to keep things in context, and gzip is a good example. Minifying and gzipping we get:

  • Example 1: 208 bytes
  • Example 2: 260 bytes
  • Example 3: 127 bytes

RJSON is beaten marginally in the first case, convincingly in the second, and resoundingly in the third. The astonishing compression for the third example is another artefact of the test data: gzip is really good at compressing repeated strings, and 32 copies of the same data compress really well.
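The gzip effect on repeated data is easy to try out. The sketch below rebuilds something like Example 3 (the three users repeated 32 times), minifies it and compresses it with Python's standard `gzip` module. The exact compressed size depends on the gzip implementation and settings, so it may not match the 127 bytes quoted above.

```python
import gzip
import json

# The three users from Example 1, repeated to imitate Example 3.
users = [
    {"first": "Homer", "last": "Simpson"},
    {"first": "Hank", "last": "Hill"},
    {"first": "Peter", "last": "Griffin"},
] * 32

# Minify by serialising without any optional whitespace.
minified = json.dumps({"users": users}, separators=(",", ":")).encode("utf-8")
gzipped = gzip.compress(minified)

print("minified:", len(minified), "bytes")
print("gzipped: ", len(gzipped), "bytes")
```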

Is it useful?

Compression Rate

Well, we’ve got some compression, but not as good as gzip. Clearly the system will be useful in situations where gzip isn’t available, but isn’t worthwhile otherwise.


To use the algorithm we need the code (which, ironically, needs to be sent to the client before it can be used). The code for gzip is already built into web browsers, meaning we get that for free when downloading data from websites. There’s still some use, however, because data sent from the browser to the server won’t be compressed.


The first question in my mind on seeing these schemes was: “Why?” The developer of the application has control over the data being transmitted, so here’s a quick way to compress the file: change the field names. Changing “first” to “f” and “last” to “l” for Example 3 above drops the filesize to 2571 bytes, a 20.7% saving for virtually no effort. This clearly isn’t as good, but has one crucial advantage: we’re still transmitting JSON, and don’t need any special code to decompress it. Restructuring the code further could have similar gains to the RJSON output without mangling the data too much.
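The short-key figure can be checked the same way. In this sketch the `SHORT_KEYS` mapping and the `shorten` helper are illustrative names of my own; the data is the Example 3 user section again.

```python
import json

# The three users from Example 1, repeated to imitate Example 3.
users = [
    {"first": "Homer", "last": "Simpson"},
    {"first": "Hank", "last": "Hill"},
    {"first": "Peter", "last": "Griffin"},
] * 32

# Illustrative mapping from the long field names to one-letter ones.
SHORT_KEYS = {"first": "f", "last": "l"}


def minified_size(value):
    """Byte length of value serialised as JSON with no optional whitespace."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


def shorten(record):
    """Return a copy of record with each key replaced by its short form."""
    return {SHORT_KEYS[key]: value for key, value in record.items()}


original = minified_size({"users": users})
shortened = minified_size({"users": [shorten(user) for user in users]})

print(original, shortened, f"{1 - shortened / original:.1%}")
```

Each of the 96 records loses seven bytes (four from “first”, three from “last”), which accounts for the drop from 3243 to 2571 bytes, and the result is still plain JSON that any parser can read.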

Should I use it?

JSON is already a pretty slender protocol (when compared, say, to XML), so does it really need much compression? Using short keys and only sending the minimum of data required will save a large amount of transmission with very little effort. I see two big disadvantages with any JSON compression scheme implemented in code:

  • It’s implemented in code. Code is really dangerous – all programmers should try really hard to write less code. Code needs to be maintained, adds to the complexity of a project, and contains bugs. Your transmission is failing: now you’ve got a whole extra chunk of code that could be causing the problem.
  • It’s not as transparent. At least the data is still JSON, but it’s not the clear, elegant JSON which is really easy to read. The first step in investigating the failing transmission would be to take a look at what’s being transmitted, but that is now a trickier proposition.

As far as I’m concerned, anything which introduces complexity at the same time as making your application harder to debug is a lose-lose proposition, especially when the gains are reasonably marginal compared with common-sense optimisations. So is this trade-off ever worth it? The answer, of course, depends on what you’re building.

Virtualisation and web hosting

I’ve been gradually starting to migrate bits of this site over to a VPS (Virtual Private Server) hosted on Amazon’s AWS. I’d originally come across the concept of a VPS at a previous job which had been offered a Slicehost instance for test purposes. While I’d experimented with Ubuntu on the desktop before, and had used a Linux shell routinely on shared hosting machines, I’d never really had a reason to learn what was going on behind the scenes. The combination of root access with a reason to use it provided a substantial motivation to learn more, and the fantastic Slicehost articles soon had me running GeoServer on Tomcat behind an Apache reverse proxy. No matter what it all meant – an exciting new world beckoned, and I wanted my own part of it.

I’d heard of AWS, and investigating more turned up an unexpected bonus. Amazon’s masterplan (which is actually a really good deal) is to offer a “Free Usage Tier”, providing enough for a free VPS for a whole year. I signed up immediately and soon got things going, learning my way around both Amazon’s infrastructure and Ubuntu server.

Although I’ve made great use of the server for learning and testing, it is still looking pretty spartan, and this is in part due to my reluctance to go too far with AWS. While I could have migrated this whole site (including hosting, databases, DNS settings, and data storage) to AWS, the end of the free year would have resulted in either costly bills or a time-consuming migration. In addition, my current hosts (Namecheap) have always done a pretty good job for a pretty low price. I’ve installed Linux on the desktop again (Xubuntu this time), and use VMware’s Player to run a local copy of Ubuntu server from within Windows. Did I need to continue with a VPS: could I justify the cost?

A number of factors have pushed me further towards using a VPS. As I’ve gained knowledge of Linux, I’ve increasingly wanted to use servers set up my own way. If I need another binary or plugin then I can install it. If I need Apache reconfiguring, I can do it. Namecheap have moved to a new server and terminated my shell access (moving a big folder over FTP takes ages), and their server has started to return 503 responses occasionally when it’s overloaded. I’ve also recently discovered an annoying problem where removing a subdomain doesn’t quite work (they’ve failed to configure a default virtual host, meaning that random sites get returned under my subdomains until DNS entries expire). On my own VPS I don’t have those problems. I’ve therefore started to migrate more things over, with my diary having made the leap already, despite requiring substantial reworking (goodness knows how I managed to write in that many SQL injection vulnerabilities – a story for another day).

The question now becomes one of price – while I can afford £12.48 a month (at current exchange rates), I’d like to do better. I think some of the better options are:

  • AWS offers “Reserved Instances”. Basically I can pay £64.69 upfront to reduce the monthly cost to £4.59 (averaging out to £6.39 pm). This is a lot better, but at the expense of signing up for three years – if I want to stop the instance then a fraction of the upfront payment will be wasted.
  • Hetzner also offer a VPS for £6.39 pm (by coincidence). This is without the 3-year signup, but is slightly inferior in terms of RAM (and presumably processing power and network connectivity too).
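The averaged Reserved Instance figure comes from spreading the upfront payment over the three-year term and adding the reduced monthly charge. A quick sketch of that arithmetic, using only the prices quoted above (the variable names are mine):

```python
upfront = 64.69        # one-off Reserved Instance payment, in pounds
monthly = 4.59         # reduced monthly charge, in pounds
term_months = 3 * 12   # three-year commitment

# Spread the upfront payment evenly across the term.
effective_monthly = upfront / term_months + monthly

print(f"£{effective_monthly:.2f} per month")
```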

There’s recently been quite a bit of buzz around Windows Azure, which is now offering Linux VPSes. It also comes with a 3-month free trial, so seemed worth having a go. Perhaps most astonishingly, it looks like the price is currently the best – if I’m reading things right an “XS” non-Windows instance costs only £5.68 per month! (Even if I’m not reading things right there’s a cushion of nearly £1 left for extra costs like data storage.)

So I now have two VPSes (the Azure one is also pretty spartan), with still not much to do with them… I think the current plan is to get to know Azure’s infrastructure a bit better (the management console is incredibly slick) and then see which bill is the biggest.

Posted in Computing | 3 Responses

Dvorak – settled in

Current wpm: 98.

I’ve been enjoying trying to get my particular flavour of Dvorak working on Linux, and will post more details of how it’s going in the near future…

Dvorak Day n

Current wpm: 56 (up from 20).

Starting to get to grips with things now. I wrote out all the letters on sticky labels and affixed them to the keyboard, which really did make an enormous difference. I no longer had to feel around for a key, or look at a separate diagram, but simply needed to look down at the keys themselves.

Dvorak Day 4

Current wpm: 20 (stable from 20).

A weekend away from the computer has prevented any further speed improvements. Maybe tomorrow will see a better result. I’m starting to remember where all the keys are, but am not at all fluent and have to make heavy use of the “backspace” key.

On a side note, in these days of hackers and malware, Dvorak is noticeably less secure. This is because I can no longer be bothered to type in passwords, and therefore never log out of anything.

Dvorak Day 3

Current wpm: 20 (up from 12).

Did some serious practice today. Also made another keyboard layout so that I don’t have to re-learn all the punctuation. Will this process continue until all the keys are restored?

Using the computer is very frustrating. Typing long passages is tiring, and it’s really annoying when doing something as simple as renaming a file to have to stop and think about every keypress. Most infuriating is that something which used to be second nature is now really hard.

I probably won’t be able to get to the computer tomorrow, so will take a well-earned rest from typing.

Dvorak Day 2

Current wpm: 12 (down from 100 or so).

Using the MS Keyboard Layout Creator, I’ve put together a UK Dvorak Variant. Am considering using speech recognition instead of typing, which is now tedious.

The Dvorak Simplified Keyboard

I’m currently typing this very slowly having changed my keyboard layout to the Dvorak Simplified Keyboard. It’s proving quite slow-going, but will hopefully speed up my typing and reduce the strain on my wrists. Next report to follow.

It’s OK, the software knows best

I find the attitude of some software developers ridiculously arrogant. They all think that they know best. They know the best algorithms, they write in the best language, and above all, they make the best decisions.

Sometimes, however, I’m not so sure.

Today I decided to clear out the small toolbar which acts as a menu bar in Internet Explorer 7. This shares the same space as the tabs for each page I have open, so I like it kept small and tidy so it doesn’t infringe on my browsing. I don’t need the “Home Page” button as I can Google-search via the address bar. I certainly don’t need a Skype button; if I want to make a call I’ll start the application myself.

Next time I start IE, what do I find? Yes, that’s right, the Skype add-on has restored its toolbar button, despite me having just removed it seconds ago. It’s OK, the developers know best. They know that I couldn’t possibly do without their piece of software.

Well, tough luck, I know best. I’ve disabled all your browser plug-ins, rendering your button-replacing system impotent. Ha!
