Saving space in large git repositories

In mid-2017 I set up an automated scrape of a frequently-updated website. Every day my script would crawl the website, download its contents, and commit these to GitHub. This allowed me to back up not just the site contents but the complete history of changes.

This scrape did its thing for nearly 4 years until I came to decommission the server where it ran. I was a little surprised to find that the repository had grown to well over 1 GB despite the site containing only around 80 MB of data. The root volume of the server was only 8 GB so this scrape was using a pretty big proportion of disk!

I shouldn’t have been surprised: storing the entire history of a website will quickly add up, especially over such a long period. I wanted to keep the entire history of the site, but I realised I didn’t need to store it on the server itself (GitHub does a fine job of hosting repositories, after all). It was time to go digging for a better solution.

git shallow clone

I’d heard about the idea of a “shallow clone,” where one clones only recent commits from a repository rather than the whole thing. git clone supports the --depth option which allowed me to clone only the most recent commit from the repository.


Create a shallow clone with a history truncated to the specified number of commits.

Git – git-clone Documentation

Let’s take a look at how this works. First I’m going to create a source repository that has a couple of commits. You can skip this step if you’d prefer to experiment with a real repository.

# Make a bare "remote" repository that does an impression of GitHub

leigh:~$ mkdir remote
leigh:~$ cd remote
leigh:~/remote$ git init --bare
Initialized empty Git repository in /home/leigh/remote/
leigh:~/remote$ cd ..

# Clone the "remote" and add some commits

leigh:~$ git clone file:///home/leigh/remote local
Cloning into 'local'...
warning: You appear to have cloned an empty repository.
leigh:~$ cd local
leigh:~/local$ git commit --allow-empty -m 'First commit'
[master (root-commit) 3c315ce] First commit
leigh:~/local$ git commit --allow-empty -m 'Most recent commit'
[master dc999d5] Most recent commit
leigh:~/local$ git push
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 12 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 251 bytes | 251.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
To file:///home/leigh/remote
 * [new branch]      master -> master

# Take a look at the results

leigh:~/local$ git log
commit dc999d56edcb14345da39ea25799879dadc406c7 (HEAD -> master)
Author: Leigh Simpson <>
Date:   Sat Feb 13 17:26:26 2021 +0000

    Most recent commit

commit 3c315ceef3d3b5da0e02b0ea0249dfd2052175b3
Author: Leigh Simpson <>
Date:   Sat Feb 13 17:26:14 2021 +0000

    First commit

Now let’s clone this repository again, but only capture the most recent commit:

# Clear out the original copy

leigh:~/local$ cd ..
leigh:~$ rm -rf local

# Clone again, passing --depth

leigh:~$ git clone --depth 1 file:///home/leigh/remote local
Cloning into 'local'...
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (2/2), done.
remote: Total 2 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (2/2), done.

# Take a look at the results

leigh:~$ cd local
leigh:~/local$ git log
commit dc999d56edcb14345da39ea25799879dadc406c7 (grafted, HEAD -> master, origin/master, origin/HEAD)

Author: Leigh Simpson <>
Date:   Sat Feb 13 17:26:26 2021 +0000

    Most recent commit

This is useful: we can clone all the files in a repository but ignore all its history.

What next?

I now know how to clone only a single commit, making it much easier to migrate this script to a new server (I have to download only 80 MB rather than > 1 GB).

Unfortunately this doesn’t quite solve the entire problem. In another four years I’ll have accumulated another 1 GB of new commits.

# Add another commit

leigh:~/local$ git commit --allow-empty -m 'New commit'
[master 2555676] New commit
leigh:~/local$ git push
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 187 bytes | 187.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0)
To file:///home/leigh/remote
   dc999d5..2555676  master -> master

# See what we have

leigh:~/local$ git log
commit 2555676490bee5e32b109dc8653596b4bd0de206 (HEAD -> master, origin/master, origin/HEAD)
Author: Leigh Simpson <>
Date:   Sat Feb 13 17:37:57 2021 +0000

    New commit

commit dc999d56edcb14345da39ea25799879dadc406c7 (grafted)
Author: Leigh Simpson <>
Date:   Sat Feb 13 17:26:26 2021 +0000

    Most recent commit

New commits are added to the history and stored locally as usual. Usefully I discovered that git fetch also supports the --depth option:


Limit fetching to the specified number of commits from the tip of each remote branch history. If fetching to a shallow repository created by git clone with --depth=<depth> option (see git-clone[1]), deepen or shorten the history to the specified number of commits. Tags for the deepened commits are not fetched.

Git – git-fetch Documentation

Let’s try it!

leigh:~/local$ git fetch --depth 1
remote: Total 0 (delta 0), reused 0 (delta 0)
leigh:~/local$ git log
commit 2555676490bee5e32b109dc8653596b4bd0de206 (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Leigh Simpson <>
Date:   Sat Feb 13 17:37:57 2021 +0000

    New commit

This is perfect: we can create and push a new commit and then throw away previous revisions. GitHub retains the full history of the crawl and I use a lot less disk space on my server.

Putting it together

I ended up with a script that looks a little like this:

#!/usr/bin/env bash

# Run from the script's own directory; stop if that fails
cd "$(dirname "$0")" || exit 1

# Download content
wget --config wgetrc

# Craft a new commit and push
git add .
git commit -m "Update for $(date +%Y-%m-%d)"
git push

# Trim history to the specified number of commits and garbage-collect
git fetch --depth 1
git gc

I use wget to do the scrape, and configure it using a local file (called wgetrc). The git gc call at the end shouldn’t really be necessary but doesn’t hurt.

Other thoughts

Running out of disk is a pretty disastrous situation for a server, and I’m always keen to minimise this risk. The scrape job described above opens up an interesting attack vector: if the site owner were to upload a large file then my script would happily try to download it. In the process it would consume all available disk space and bring my server to a halt!

An easy way to resolve this is to move the scrape onto its own dedicated disk. The server will then carry on running even if that disk fills up. The server is an EC2 instance running on Amazon Web Services so this was trivially easy: I created a new volume, attached it to the instance, and mounted it within the operating system. This is a good pattern for any directory that may grow without bound: even logs can explode in volume during an incident.

If the disk does fill up then I still want to know about it so I can fix the scrape. This is also pretty simple using AWS. I use CloudWatch Logs and the agent process can monitor metrics such as disk space. I monitor all disk volumes within CloudWatch and trigger alerts when disks start to fill up.

I’ll follow up if this doesn’t work, but I hope that won’t be for a few more years.

Posted in Uncategorized

Mail blacklists and third-party dependencies

Mail blacklists are a sad fact of life. They exist because of spam and work like this:

  1. I send an email from my computer
  2. my computer connects to the server that hosts my email
  3. my server connects to the server that hosts the recipient’s mail
  4. the recipient server looks up my server’s details in some blacklists
  5. if my server is found in a blacklist then my mail is rejected

Disreputable servers that send a lot of spam end up on blacklists and find that they can’t send mail any more. Problem solved!

What about false positives?

A false positive occurs when a blacklist contains a server that it shouldn’t: a server is labelled as a bad actor when it’s actually polite and friendly. I’ve been lucky so far: I only send a small amount of personal email from my server so haven’t run into trouble.

I have seen this problem under different circumstances:

  • I used to send mail from a shared server used by many other people. This worked well for several years until the server started to be blacklisted every few days (almost certainly because another person using the server was sending spam). I eventually had to change my hosting arrangements to resolve the problem. This is a false positive because I wasn’t sending spam – a shared server was blacklisted because of one bad actor.
  • My server used to use a blacklist (NJABL) that was shut down. When that occurred my server started rejecting all mail sent to it: every message was marked as spam. I reconfigured the server to stop using the defunct blacklist.
  • By way of counterexample: an email account on a server owned by my then employer was compromised and used to send spam. The system worked as it should in this case and the server was blacklisted (taking down our corporate email with it). This was a true positive, but after resetting the password on the compromised account it still took a few days to get the server removed from all the blacklists that listed it.

The second issue described above happened last week to a mail provider used by many of my university friends.

What happened?

I tried to send an email to a contact and received the following bounce message:

Reporting-MTA: dns;
X-Postfix-Queue-ID: <redacted>
X-Postfix-Sender: rfc822; <redacted>
Arrival-Date: Sun, 31 Jan 2021 15:35:49 +0000 (UTC)

Final-Recipient: rfc822; <redacted>
Original-Recipient: rfc822;<redacted>
Action: failed
Status: 5.7.1
Remote-MTA: dns;
Diagnostic-Code: smtp; 554 5.7.1 Service unavailable; Client host
    [] blocked using

My server was trying to send a message to my friend. Their server rejected the message because it thinks my server is listed on a blacklist.

This was a serious worry: if my server is on a blacklist then I can’t send mail to anyone! I navigated to the blacklist’s website to see what was going on and found this:

Oops. Their domain name expired and the blacklist went offline. According to the domain’s WHOIS record this happened at 05:00 on 30th January at which point their site was replaced with their domain registrar’s holding page.

Why does an expiring domain break email?

Domain names expire all the time. Microsoft famously allowed one of its domains to expire in 2003 and Foursquare went completely offline in 2010 for the same reason. Why is this so catastrophic for email? To understand this we need to know more about how blacklists work.

Mail blacklists are also known as Domain Name System-based Blackhole Lists (DNSBL) because they use DNS – the system that we use every day to find sites on the web. DNS converts friendly domain names into server IP addresses.

Mail servers check blacklists by converting server IP addresses into domain names. Here’s what happened while I was trying to email my friend:

  • My server connected to the recipient server to send my message.
  • The recipient server reversed the sender address and appended the blacklist domain.
  • The recipient server made a DNS request for the resulting name and received a response.
  • The positive response indicated my server was in the blacklist and my message was rejected.
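The name construction in the second step can be sketched in a few lines of JavaScript. This is only an illustration: the function name dnsblQueryName, the address 192.0.2.25 (from a range reserved for documentation) and the zone dnsbl.example.org are all made up for the example, and it handles IPv4 addresses only.

```javascript
/**
 * Builds the DNS name a mail server would look up to check an IPv4
 * address against a DNSBL.
 * @param {string} ip IPv4 address of the sending server
 * @param {string} zone Blacklist domain (hypothetical in this sketch)
 * @returns {string} Name to query
 */
var dnsblQueryName = function (ip, zone) {
    'use strict';

    var octets = ip.split('.');
    if (octets.length !== 4) {
        throw new Error('Expected a dotted-quad IPv4 address: ' + ip);
    }

    // Reverse the octets, then append the blacklist domain
    return octets.reverse().join('.') + '.' + zone;
};

// Both the address and the zone are placeholders
console.log(dnsblQueryName('192.0.2.25', 'dnsbl.example.org'));
// prints "25.2.0.192.dnsbl.example.org"
```

If a lookup of that name returns an address then the server is treated as listed; if the name doesn’t exist then the server is not listed.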

What is this magic address? It turns out this is the registrar’s domain holding page! When the domain name expired the registrar started to return that page for any subdomains of the blacklist domain. This meant that every server on the internet was temporarily included in the blacklist!

How big was this issue?

Mail servers using the SpamCop blacklist rejected all messages during the outage. This lasted from the expiration of the domain name until its renewal later that day (potentially slightly longer because DNS responses are commonly cached). If their domain expired at 05:00 and was renewed at around 18:00 then the incident might have lasted for 13 hours.

It’s not really possible to determine how many servers are configured to use the SpamCop blacklist. Large mail providers such as Google and Microsoft maintain their own blacklists but many smaller providers will be using public lists provided for free. SpamCop is frequently recommended so is likely to be popular.

What can we do about it?

The uptime of a service can not exceed that of its critical dependencies.

If the above estimate of 13 hours is accurate then SpamCop’s uptime dropped to around 98% for the month. The email service’s uptime cannot exceed this limit: their service was down for the entirety of the SpamCop outage and may have been down at other times for their own maintenance work.
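The 98% figure is easy to check. Here is a minimal sketch, assuming a single 13-hour outage and no other downtime in the month; the function name uptimePercent is made up for the example.

```javascript
/**
 * Percentage uptime for a month containing a single outage
 * @param {number} outageHours Length of the outage in hours
 * @param {number} daysInMonth Number of days in the month
 * @returns {number} Uptime as a percentage
 */
var uptimePercent = function (outageHours, daysInMonth) {
    'use strict';

    var totalHours = daysInMonth * 24;
    return 100 * (totalHours - outageHours) / totalHours;
};

console.log(uptimePercent(13, 31).toFixed(2)); // prints "98.25"
console.log(uptimePercent(13, 28).toFixed(2)); // prints "98.07"
```

Whether the month has 28 or 31 days the result rounds to 98%.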

How can we manage this risk?


Companies manage risks posed by their suppliers by agreeing formal contracts with documented terms. If a mail provider lost money because a supplier was negligent then perhaps it could recover its losses. In practice this wouldn’t have helped: SpamCop doesn’t work like this.


If the risk is significant enough then no doubt commercial blacklists will exist that do provide such contractual guarantees.

Monitoring and alerting

If a server is rejecting 100% of the email it receives then there’s almost certainly something wrong! The rate at which mail is being rejected is an important metric that should be measured. Dispatching an alert to an administrator could allow any problem to be resolved quickly.
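As a rough sketch of the kind of check I have in mind: count accepted and rejected messages over some window and raise an alert when the rejected fraction crosses a threshold. The function names and the 90% threshold are made up for illustration, and a real check would also want a minimum sample size so that a quiet hour doesn’t trigger it.

```javascript
/**
 * Fraction of incoming messages that were rejected
 * @param {number} accepted Messages accepted in the window
 * @param {number} rejected Messages rejected in the window
 * @returns {number} Value between 0 and 1 (0 if there was no mail at all)
 */
var rejectionRate = function (accepted, rejected) {
    'use strict';

    var total = accepted + rejected;
    return total === 0 ? 0 : rejected / total;
};

/**
 * Decides whether the rejection rate is high enough to alert on
 * @param {number} accepted Messages accepted in the window
 * @param {number} rejected Messages rejected in the window
 * @param {number} threshold Alert at or above this fraction (hypothetical value)
 * @returns {boolean}
 */
var shouldAlert = function (accepted, rejected, threshold) {
    'use strict';

    return rejectionRate(accepted, rejected) >= threshold;
};

console.log(shouldAlert(40, 60, 0.9)); // prints false
console.log(shouldAlert(0, 250, 0.9)); // prints true
```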

Regardless of the technical aspect, it’s prudent for businesses to measure the performance of their suppliers (even free ones)!


If a blacklist is unreliable then we should stop using it. Unfortunately it’s difficult to determine the reliability (or otherwise) of a particular blacklist. MxToolBox lists 94 separate blacklists. How can we know which of these are safe to use? In practice administrators rely on recommendations from others (as described above). SpamCop isn’t a tiny independent company or even a free service but is owned by Cisco Systems: a multinational tech giant. As we’ve observed already this doesn’t guarantee the ability to renew domain registrations on time!

Companies can eliminate the risks posed by third-party providers by bringing work back in-house. Large email providers do exactly this, but the cost of establishing an independent email blacklist is likely to be significant.

Perhaps we should stop using all blacklists? Sadly this isn’t a great option: blacklists are an important weapon in the fight against spam.


The cost of SpamCop (free) is greatly outweighed by its benefit. Despite last week’s incident the risk of failure remains low. Alternatives are expensive or present the same risks.

Until last week I’d not heard of SpamCop, but based on their reputation I will soon be configuring my email server to use their blacklist. At least they’re unlikely to let their domain name expire again.

Posted in Uncategorized

Why I’m leaving WhatsApp

TLDR: privacy. I’m on Signal now.

I’ve had a love-hate relationship with Facebook for many years. I still have an account but rarely log in. It sometimes sends me emails and occasionally I open up a private browser window to take a look. I don’t want them following me around the Internet (their tracking tags are pervasive).

WhatsApp was acquired by Facebook in 2014 but any personal data shared with the service could be kept separately. Now that’s changing.

WhatsApp has long prided itself on its commitment to security and privacy, with encrypted conversations and other important technologies integrated into the app.

But the new announcement has sparked fears of the exact opposite: that people’s information is not being kept secret but instead shared with Facebook.

WhatsApp new privacy terms: What do new rules really mean for you? | The Independent

I’m lucky to live in the UK which (despite Brexit) still provides strong privacy protections based on the GDPR:

“There are no changes to WhatsApp’s data sharing practices in the European region (including UK) arising from the updated Terms of Service and Privacy Policy,” a spokesperson said.

…and indeed WhatsApp do advertise different terms for European users. Despite this I’m still worried. I used to trust WhatsApp; now I don’t. Some sources also suggest that UK data will begin to fall under WhatsApp’s US jurisdiction in the future (despite GDPR being part of UK law):

I don’t know how true that is, but still think it’s time to move. From now on I’ll be on Signal, and am pleased that many others are making the same choice:


I should make it clear that the content of WhatsApp messages remains confidential (and encrypted). The following data are still up for grabs:

  • [Other people’s] phone numbers in your mobile address book, including those of both the users of our Services and your other contacts. You confirm you are authorized to provide us such numbers. [emphasis mine]
  • a favorites list of your contacts
  • groups and broadcast lists
  • how you use our services, how you interact with others using our Services, and the like
  • whether you are online, when you last used our Services

These data are still important! If you don’t believe this then take a look at Why Metadata Matters from the EFF or the linked article ‘We Kill People Based on Metadata’.

Metadata — data about your data — is almost as powerful as the actual data.

Cyjax CISO Ian Thornton-Trump via WhatsApp Beaten By Apple’s New iMessage Privacy Update

Why Signal?

Signal is an independent nonprofit. We’re not tied to any major tech companies, and we can never be acquired by one either.

Signal >> Home

This was exposed by the new privacy labels that Apple have added to their app store. Here’s a comparison, with Signal on the left and WhatsApp on the right:

Here’s a more interesting reason, which luckily doesn’t apply to me!

Do I really expect everybody to follow me?

I’m not naïve enough to think that all my contacts will migrate to Signal. I know I’m taking the risk of missing out on news or other communications.

Migrations have to start somewhere, and if I can influence just one person to follow me then that’s still a victory.

More coverage

Shortly after Facebook acquired WhatsApp for $19 billion in 2014, its developers built state-of-the-art end-to-end encryption into the messaging app.

In 2016, WhatsApp gave users a one-time ability to opt out of having account data turned over to Facebook. Now, an updated privacy policy is changing that. Come next month, users will no longer have that choice.

Under the new terms, Facebook reserves the right to share collected data with its family of companies.

WhatsApp gives users an ultimatum: Share data with Facebook or stop using the app | Ars Technica

Posted in Uncategorized

New year, new blog

I know it doesn’t look very new, but this blog is now running on brand new hosting with brand new PHP. Hurrah!

I might even update it occasionally…

Posted in Uncategorized

Some handy Eclipse plugins (2)

This is an updated version of an earlier post, revised for Eclipse Mars.

One of the most exciting features of Eclipse is the plugin ecosystem. Eclipse has been around for ages, and there are plenty of fantastic plugins adding additional programming languages and features. Here’s a list of some I’ve found useful:


Puppet is a great way to automate the configuration of new servers. Geppetto (the puppet maker) is an IDE for this: providing syntax highlighting and integration with the Puppet forge.

Json Editor

Support for JSON.

Mars Updates

Eclipse provides much of its core functionality via plugins, and there’s plenty available to extend the default configuration.

  • Update site:
  • Documentation:
  • Tick:
    • C/C++ Autotools support
    • C/C++ Development Tools
    • C/C++ Unit Testing Support
    • Dynamic Languages Toolkit – ShellEd IDE
    • Eclipse Web Developer Tools
    • Eclipse XML Editors and Tools
    • Eclipse XSL Developer Tools
    • JavaScript Development Tools
    • PHP Development Tools (PDT)
    • Subversive SVN Team Provider [Subversion integration]
    • Web Page Editor
    • … anything else you fancy!

Markdown Editor

Simple syntax highlighting and document outline for Markdown.


An IDE for Python.

ReST Editor

As advertised in my last post, syntax highlighting makes editing ReStructured Text a lot easier…


Support for LaTeX.

I hope to update this list as I discover new plugins: feel free to add suggestions in the comments.

Posted in Computing

Myjson Crawler

What is Myjson?

Myjson describes itself as “a simple JSON store for your web or mobile app”. You can save JSON data using an interface on the site itself  (example) or programmatically via an API  (example). Each saved piece of JSON can be accessed via a randomly-generated URL, but unfortunately the random part of the URL is very short: it seems to be three or four characters drawn from a limited alphabet. This means that it’s easy to guess valid URLs, or even to enumerate all of them. OWASP describe this class of problem as an “insecure direct object reference”, and list this at fourth place in their 2013 Top 10 list of security risks. It’s not fair to criticise Myjson for this as they never advertised their system as secure, but I think it is fair to take a look at what people are storing.
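To get a feel for how small the space of names is, here is a quick calculation. It assumes an alphabet of 36 characters (lower-case letters plus digits), which is my guess from the names I’ve seen; keyspaceSize is just a name for this sketch.

```javascript
/**
 * Number of distinct names of a given length over a given alphabet
 * @param {number} alphabetLength Number of characters in the alphabet
 * @param {number} nameLength Number of characters in each name
 * @returns {number}
 */
var keyspaceSize = function (alphabetLength, nameLength) {
    'use strict';

    return Math.pow(alphabetLength, nameLength);
};

console.log(keyspaceSize(36, 3)); // prints 46656
console.log(keyspaceSize(36, 4)); // prints 1679616
```

Even the four-character space is under two million names, which is well within reach of a patient crawler.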

Crawler implementation

The most obvious way (to me, at least) to implement a crawler is as follows:

  1. Generate a list of URLs using scripting language du jour
  2. Use xargs and cURL to crawl.

This has a couple of advantages:

  • It’s really simple
  • xargs has a handy multi-threaded mode allowing us to crawl several pages in parallel.

Unfortunately that would be too easy, so I decided to use JavaScript.

How it works

Full implementation available on GitHub

We’re going to need a function to output results. I decided to output HTML of this form:

<dt>aaa</dt><dd>{ "json" : "that came from", "url" : "" }</dd>
<!-- ... -->

Here’s a tiny utility function to create this output:

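/**
 * Adds a row to the list of results with the bin name and its contents
 * @param {string} bin
 * @param {string} contents
 */
var outputResult = function (bin, contents) {
    'use strict';

    var binElement = document.createElement('dt'),
        contentsElement = document.createElement('dd'),
        // Assumes the <dl id="results"> element from the host page
        results = document.getElementById('results');

    binElement.textContent = bin || '';
    contentsElement.textContent = contents || '';

    results.appendChild(binElement);
    results.appendChild(contentsElement);
};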


We’ll also need a function to crawl the site. XMLHttpRequest is the obvious tool. We write output as each response comes back using an anonymous function which closes over the current bin name.

/**
 * Looks up the contents of a myjson bin and adds them to the list of results
 * @param {string} bin
 */
var lookupBin = function (bin) {
    'use strict';

    var xhr = new XMLHttpRequest();
    // The empty string stands in for the API's base URL
    xhr.open('GET', '' + bin);
    xhr.onload = function () {
        if (this.status === 200) {
            outputResult(bin, this.responseText);
        }
    };
    xhr.send();
};

Finally we need to iterate over the possible bin names. Some nested loops would handle this well enough, but it’s probably tidier to encapsulate this functionality. Here’s a function to iterate over an alphabet of characters:

/**
 * Iterates over a list of characters
 * @param {string} alphabet List to iterate across
 * @param {string} prefix String to prepend before calling CALLBACK
 * @param {function} callback Callback function, called with current string
 */
var iterateCharacters = function (alphabet, prefix, callback) {
    'use strict';
    var i;
    for (i = 0; i < alphabet.length; i = i + 1) {
        callback(prefix + alphabet[i]);
    }
};

For each character in the alphabet we prepend an existing string and then pass the result on to the defined callback. Iterating over all three-character bin names is then simple. This example calls our output function directly without crawling each bin:

var alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789';

var run = function () {
    'use strict';

    iterateCharacters(alphabet, '', function (string) {
        iterateCharacters(alphabet, string, function (string) {
            iterateCharacters(alphabet, string, outputResult);
        });
    });
};

run();


Finally we need an HTML document to host all this:

<!DOCTYPE html>
    <dl id="results"></dl>
    <script type="text/javascript" src="script.js"></script>

Browsers try hard to download data as quickly as possible, and our crawl runs in parallel across several threads without extra effort.



I restricted the crawl to include three-character names beginning with the letters ‘a’ to ‘d’. The above code will crawl all combinations of three-character bin names, and can also be easily extended to crawl four- and five-character names. It’s sensible to reduce the scope, however:

  • Crawling lots of bins takes lots of time.
  • It isn’t nice to load the site heavily.
  • Thoughts of Weev make me nervous.

Brief Analysis

The sample includes 1,637 rows. The top ten JSON strings are as follows:

String                                            Count
{}                                                226
{"key":"value"}                                   92
{"foo":"bar"}                                     42
{"hello":"world"}                                 34
{"key":"value","key2":"value2"}                   30
{"glossary":{"title":"example glossary",…         29
{"key_updated":"value_updated"}                   26
[]                                                23
{"test":"test"}                                   17
{"key":"updated value","key2":"updated value2"}   16

We can therefore estimate that around 14% of bins contain only the empty object. Many of the examples above seem likely to have been created to test the service, and 69% of the extracted strings contain 50 characters or fewer.

It will be interesting to run a similar scrape in the future and see if the distribution of data changes: how many people are using this service as intended?

Scope for Evil

The API supports updating JSON data by sending an HTTP PUT request. It would only take a few minutes to overwrite all data stored by Myjson. Myjson doesn’t advertise a secure service, and they obviously aren’t worried that data is disclosed. They ought to be worried that somebody might trash everything they have stored.

Posted in Computing

AWS CloudFront via CloudFormation

Amazon Web Services CloudFormation is a great way to define stacks of related resources. I don’t tend to find myself making more than one version of each stack, but have still seen some big advantages:

  • I no longer have to configure resources through the AWS management console, saving a heap of time.
  • Stack configuration is now in source control so all changes are logged.
  • I’ve learnt a great deal more about AWS and how its components interact.

Unfortunately I’ve had to pay for these with another heap of time: spent learning how to use CloudFormation and how to deal with it when things don’t quite work. I’ve wasted a lot of time trying to set up CloudFront CDN distributions, and thought I’d write up a couple of the gremlins I found in case this proves useful to anyone (including me).

S3 domain name isn’t a suitable domain name

Pointing a distribution at an S3 bucket is harder than you might think. All you need is the domain name, but CloudFormation won’t give it up easily.

According to the documentation:

Returns the DNS name of the specified bucket.

Unfortunately that’s not quite what happens. As described on the AWS forum, the domain name is slightly different and CloudFront won’t accept it. Unfortunately you have to build it yourself:

{ "Fn::Join" : [ "", [ { "Ref" : "BucketName" }, "" ] ] }

It sometimes doesn’t work

Unfortunately I can’t currently do any better than that. This forum post implies that adding aliases might break things but I’ve managed to define distributions with aliases.

A second or so after the distribution starts being created it decides to delete itself again, with the message “Resource creation cancelled”. One day I’ll try to put together a reduced testcase. Try defining a very small template, and adding in other resources by updating the stack after it’s worked for the first time. This is good general advice for CloudFormation: test parts of a large template in isolation to save time on each iteration.

Posted in Computing

Downgrading VMware Tools

I’ve run into an amusing issue with the tools that come packaged with VMware Player. If I removed content from the middle of a file on the host, this change wasn’t replicated on the guest. Here the file would be reduced in size, but by removing bytes from the end rather than the middle, which isn’t entirely helpful.

I’m experimenting with downgrading the version of VMware Tools, and found handy instructions on how to do this here:

  1. Go to and navigate to the version you want. The tools installation is bundled in a tar under the packages folder.
  2. Grab the file, extract it to get an executable, and run it.
  3. Go to VMware player and choose Player -> Manage -> Reinstall VMware Tools…
Posted in Computing

AWS Architecture Diagrams

AWS release amazing architecture diagrams. The best of these must surely be the diagram of the architecture used to host the Obama for America campaign. The diagrams look great, but aren’t accessible to others as there’s no publicly downloadable set of icons… until now. Someone has created their own for download:

Posted in Uncategorized

Eclipse Memory Limits

Update: I’ve had some instability issues since installing this fix, and have now reverted.

As I use Eclipse more and more, I’ve occasionally run into memory and garbage-collector limits when working on large projects or files. This page provides the solution: bump up the limits in eclipse.ini (found in the Eclipse programme directory):


Apart from this I’ve been pretty happy with Eclipse. I’m stuck on the Juno version at the moment as some plugins don’t seem to play nicely with Kepler, and haven’t found the time to work out how to upgrade cleanly. I’m certainly preferring it to Aptana at the moment, primarily due to the plugin support. The wider userbase also comes with a corresponding improvement in the quantity and quality of online resources and documentation.

Posted in Computing

Service Testing

Here’s a collection of services, and ways to go about testing them:

25: SMTP

Simple Mail Transfer Protocol.

Test with telnet:

$ telnet 25
Connected to
Escape character is '^]'.
220 ESMTP Postfix (Ubuntu)

Check it responds:

$ EHLO ...
250-SIZE 20480000
250 DSN

Try sending a message to a local user:

$ MAIL FROM: <>
250 2.1.0 Ok
$ RCPT TO: <>
250 2.1.5 Ok
$ DATA
354 End data with <CR><LF>.<CR><LF>
$ Email body goes here
$ .
250 2.0.0 Ok: queued as 18E73288

Try logging in:

$ AUTH LOGIN
334 VXNlcm5hbWU6
$ [ base64-encoded username ]
334 UGFzc3dvcmQ6
$ [ base64-encoded password ]
235 2.7.0 Authentication successful

Disconnect:

$ QUIT
221 2.0.0 Bye
Connection closed by foreign host.

You can test STARTTLS (where a secure channel is negotiated for an existing connection) functionality using OpenSSL‘s s_client:

$ openssl s_client -connect -crlf -starttls smtp
... loads of stuff
250 DSN
$ EHLO ...

80: HTTP

Hypertext Transfer Protocol.

Test with telnet:

$ telnet 80
Connected to
Escape character is '^]'.
$ GET / HTTP/1.1
$ Host:
HTTP/1.1 301 Moved Permanently
Date: Sun, 11 Aug 2013 19:54:30 GMT
Server: Apache
Vary: Accept-Encoding
Content-Length: 233
Connection: close
Content-Type: text/html; charset=iso-8859-1

<title>301 Moved Permanently</title>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="">here</a>.</p>
Connection closed by foreign host.

110: POP3

Post Office Protocol.

Test with telnet:

$ telnet 110
Connected to
Escape character is '^]'.
+OK Dovecot ready.

Try logging in:

$ USER [ username ]
$ PASS [ password ]
+OK Logged in.

List messages:

$ LIST
+OK 20 messages:
... loads of stuff



143: IMAP

Internet Message Access Protocol.

Test with telnet:

$ telnet 143
Connected to
Escape character is '^]'.

Try logging in:

$ a1 LOGIN [ username ] [ password ]

List folders:

$ a2 LIST "" "*"
... loads of stuff

Find out what’s in the Inbox:

$ a3 EXAMINE INBOX
* FLAGS (\Answered \Flagged \Deleted \Seen \Draft)
* OK [PERMANENTFLAGS ()] Read-only mailbox.
* OK [UNSEEN 10] First unseen.
* OK [UIDVALIDITY 1345668496] UIDs valid
* OK [UIDNEXT 2543] Predicted next UID
a3 OK [READ-ONLY] Select completed.


$ a5 LOGOUT
* BYE Logging out
a5 OK Logout completed.

443: HTTPS

Just like HTTP, but use s_client:

$ openssl s_client -connect
... loads of stuff

465: SMTPS

Check using s_client:

$ openssl s_client -connect -crlf
... loads of stuff
220 ESMTP Postfix (Ubuntu)

993: IMAPS

Check using s_client:

$ openssl s_client -connect
... loads of stuff

995: POP3S

Check using s_client:

$ openssl s_client -connect
... loads of stuff
+OK Dovecot ready.


Almost everything can be driven via Telnet. If you need TLS, use s_client.
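
For the services that greet you first (SMTP, POP3 and IMAP), the "connect and read the banner" step can be scripted too. Here's a rough Python sketch, with a function name of my own invention; the TLS branch plays the role of s_client and uses Python's default certificate checks, so it will refuse self-signed certificates.

```python
import socket
import ssl


def read_banner(host: str, port: int, use_tls: bool = False,
                timeout: float = 5.0) -> str:
    """Connect, optionally wrap the socket in TLS, and return the greeting."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        if use_tls:
            context = ssl.create_default_context()
            sock = context.wrap_socket(sock, server_hostname=host)
        return sock.recv(1024).decode("ascii", errors="replace").strip()
    finally:
        sock.close()


# Usage (mail.example.com is a placeholder host):
#   read_banner("mail.example.com", 25)                 # SMTP
#   read_banner("mail.example.com", 995, use_tls=True)  # POP3S
```

This won't work for HTTP, where the server waits for the client to speak first.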

There’s a really handy OpenSSL command-line summary at


Undocumented Features

I was amused to discover grumbling around the Internet about Google’s weather API, and their decision to remove it. This describes what happened:

Last month, Google announced plans to shutter iGoogle, among a bunch of other services. Many developers and users were (and still are) outraged, but at least they have some time to breathe: iGoogle isn’t going away until November 1, 2013. That means there are still 15 months left to adjust and export your data. Yet some changes are already starting to take effect: the company’s private Weather API, for example, appears to be dead in the water.

There’s some real gems further down:

Why should Google expect support tickets? The big clue is in the word “undocumented.” More from the article:

Web developer Jordan Stephens isn’t even bothering to look for alternatives. In fact, CurrentlyIn has been completely shut down as a result of the Google Weather API outage, according to an error message from the service.

Indeed, navigating to the CurrentlyIn site brings up the following:

currentlydown : (

Google has shut down its undocumented weather API (which was used by this site). will be down for the forseeable future.

The statement “let’s use this undocumented API from Google to get weather information for free” seems like a great idea, but undocumented features aren’t really features at all. Anything without documentation is an implementation detail, and subject to change without notice. In this case, change happened to include removal.

There’s some great examples at The Old New Thing:

Three examples off the top of my head of the consequences of grovelling into and relying on undocumented structures.

Defragmenting things that can’t be defragmented

In Windows 2000, there are several categories of things that cannot be defragmented. Directories, exclusively-opened files, the MFT, the pagefile… That didn’t stop a certain software company from doing it anyway in their defragmenting software. They went into kernel mode, reverse-engineered NTFS’s data structures, and modified them on the fly. Yee-haw cowboy! And then when the NTFS folks added support for defragmenting the MFT to Windows XP, these programs went in, modified NTFS’s data structures (which changed in the meanwhile), and corrupted your disk.
Of course there was no mention of this illicit behavior in the documentation. So when the background defragmenter corrupted their disks, Microsoft got the blame.


I hope you understand why I tend to go ballistic when people recommend relying on undocumented behavior. These weren’t hobbyists in their garage seeing what they could do. These were major companies writing commercial software.


Some handy eclipse plugins

Update: this is currently tested for Eclipse Juno, although I hope to update this for Kepler in the near future.

I’ve been making more use of Eclipse recently as an IDE. I’ve been using Aptana (which is based on Eclipse) for a few years now, but I’m spending less time coding in PHP so it now makes sense to graduate onto the bigger tool, especially as not all Eclipse plugins play nicely with Aptana. One of the most exciting features of Eclipse is the plugin ecosystem. Eclipse has been around for ages, and there are plenty of fantastic plugins adding additional programming languages and features. Here’s a list of some I’ve found useful:

AWS Toolkit

Would you like to manage your Amazon Web Services resources directly while writing code? This sounds a little excessive, but is actually quite useful: Amazon’s management console is slow and will only list a few items at a time. This plugin makes it easy to find particular details quickly among huge lists of items, and also comes with syntax highlighting for CloudFormation templates. It’s not entirely supported on the latest version of Eclipse (Juno), but the parts I need all work well.


Haskell support for Eclipse, including syntax highlighting and direct support for loads of Haskell tools and frameworks directly (cabal, Hoogle, HLint, HTF, Alex, Happy, UUAGC, Snap, Yesod, …). You can run code directly from the IDE, calling out to GHCi.


Puppet is a great way to automate the configuration of new servers. Geppetto (the puppet maker) is an IDE for this: providing syntax highlighting and integration with the Puppet forge.

Json Tools

Support for JSON.

  • Update site:
  • Documentation:
  • Tick:
    • Json Tools

Juno Updates

Eclipse provides much of its core functionality via plugins, and there’s plenty available to extend the default configuration.

  • Update site:
  • Documentation:
  • Tick:
    • C/C++ Development Tools
    • Eclipse Web Developer Tools
    • Eclipse XML Editors and Tools
    • Eclipse XSL Developer Tools
    • JavaScript Development Tools
    • PHP Development Tools (PDT)
    • Subversive SVN Team Provider [Subversion integration]
    • Web Page Editor
    • … anything else you fancy!

Markdown Editor

Simple syntax highlighting and document outline for Markdown.


An IDE for Python.

ReST Editor

As advertised in my last post, syntax highlighting makes editing ReStructured Text a lot easier…


BASH syntax highlighting.


Support for LaTeX.

I hope to update this list as I discover new plugins: feel free to add suggestions in the comments.


ReST Syntax Highlighting

I’ve talked before about writing documentation using Sphinx. Sphinx uses ReST (ReStructured Text) which is great, but sometimes a pain to edit without any form of syntax highlighting. Luckily it was pretty easy to track down an Eclipse plugin to do this.

First add a new software source URL like this:


Then tick the box to install the plugin:





I’ve recently started working with MongoDB at work: it forms a core part of our tracking and reporting infrastructure, and all events that we track are slotted tidily into a Mongo database. Mongo has been getting some criticism of late, and while most of this has been largely misdirected (not using the right tool for the job doesn’t mean the tool was the problem), this piece was particularly interesting.

MongoDB does not give us much control over where data is placed, so the frequently accessed data (or data that is scanned together) may be spread over a large area. When scanning data only once, there is no way to prevent that data evicting the more frequently accessed data from memory. Once the frequently accessed data is no longer in memory, MongoDB becomes IO bound and lock contention becomes an issue.

My initial introduction to MongoDB was MongoDB in Action. I remember being struck by this quote:

Database tuning, which in most RDBMSs means tinkering with a wide array of parameters controlling memory allocation and the like, has become something of a black art. MongoDB’s design philosophy dictates that memory management is better handled by the operating system than by a DBA or application developer. Thus, data files are mapped to a system’s virtual memory using the mmap() system call. This effectively offloads memory management responsibilities to the OS kernel.

This sounds great. Why bother configuring when the kernel will probably do a better job than you ever could? It turns out this may have been a poor design decision: the kernel does a good job at managing memory, but doesn’t do the best job because it doesn’t know enough about how that memory is used within MongoDB.
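
For anyone who hasn't met mmap() before, here's a small Python sketch of the general idea. It has nothing to do with MongoDB's actual code, and the 4096-byte "data file" is just a temporary file I made up for the demonstration: the file is mapped into memory, changed through the mapping, and the kernel is left to decide when pages move between memory and disk.

```python
import mmap
import os
import tempfile


def mapped_roundtrip() -> bytes:
    """Write to a file through a memory mapping, then read it back normally."""
    fd, path = tempfile.mkstemp()
    try:
        # A pretend data file, padded out to 4096 bytes
        os.write(fd, b"hello mongo" + b"\0" * 4085)
        with mmap.mmap(fd, 4096) as mapped:
            mapped[0:5] = b"HELLO"  # looks like a memory write...
            mapped.flush()          # ...and the OS carries it to the file
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 11)
    finally:
        os.close(fd)
        os.remove(path)


print(mapped_roundtrip())  # b'HELLO mongo'
```

Notice that nothing here says which pages should stay in memory: that's exactly the decision being handed to the kernel.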

It’s always difficult to separate the hype from the reality with any new technology, as this issue shows.


2. Assume basic engineering principles applied throughout due to HEAVY MARKETING SUGGESTING AWESOMENESS.

3. Spend 6 months fighting plebbery across the spectrum, mostly succeed.








AWS Summit

I’ve just been to this year’s AWS Summit in London. I arrived at the event to find a suspiciously large density of suit-wearing managers. After shaking off the initial fear that I might be at the wrong event, I forged on…

The Business Design Centre is an agreeable sort of place for a conference. There’s plenty of space for cows (it did start life as the Royal Agricultural Hall), and there’s probably a joke about conference delegates and sheep around here somewhere. Lots of people complained about the queue to get in (the organisers had the bright idea of printing name tags as people arrived rather than having them ready), but I’d arrived slightly early and didn’t have to wait for long. I soon tracked down some breakfast and loaded up a sandwich. Note to BDC: £1.10 is too expensive for a sausage (even if they were rather tasty). 60p is certainly too much for two slices of slightly-stale white bread.

Initial worries about the dress sense of those attending were unfortunately completely justified, as the first keynote speech turned out to be an extended sales pitch for cloud computing. I didn’t quite understand the point of this – surely lots of people attending had received the invitation, like me, because they were already AWS customers? A late night, early start, and 1½ hours of boredom made the decision to sit next to my manager rather risky, but I did manage to remain awake for the entire thing, mostly by reading Twitter.

The keynote was punctuated by testimonials from current customers. Interest was maintained by the dubious choice of guests – first a chap from News International and then one from an oil company. These interludes turned out to be the most irritating aspect of the day – most speakers took full advantage of the opportunity to pitch their product, and few said much more than “we like AWS ’cause it’s cheap and it scales.” Even the more-technical talks were hobbled by this requirement, with customer talks failing to fit in to the remainder of the content, and rarely adding any information of interest.

Twitter remained great fun, with the Chinese whispers soon getting out of control:

I’m not sure I was helping…

A lecture about Amazon’s information security was more interesting, and it was illuminating to hear of some of what they do:

  • Staff are only granted access to any system for 60 days until their rights must be renewed by their manager.
  • SSH access to production servers requires a change ticket or issue number, and all activity is logged.
  • Any hard disks leaving their facilities must be physically shredded or destroyed before they may do so.

This lecture was better – but still felt like it was for managers (“don’t worry you can trust us with your data”) rather than developers (“look at our security, it’s cool”).

The afternoon promised more interest, with deeper studies of particular AWS products. Some of these talks were great (presentations about DynamoDB and OpsWorks being highlights of the day), and delivered on the promise of a technology conference – with more detailed information (DynamoDB indexing) and a live demo (using OpsWorks to deploy a web stack during the lecture). Other talks weren’t, with an “Advanced Topics” lecture about “Architecting for High Availability” covering little more than what was in the product overview pages for Elastic Load Balancing and Auto Scaling.

Ultimately I was expecting a tech conference which gave some deeper insight into AWS products, and thought that exposure to the AWS team might well provide that. Unfortunately most of the content was pitched at a very low level. I don’t necessarily think this is Amazon’s fault: I evidently wasn’t the target audience, but I was a little bored. I can’t complain too much – the food was pretty good for a free conference!

The next day an email arrived inviting me to supply feedback on my experience, and I thought I might as well do so (the offer of a free Kindle didn’t sway my decision at all, *ahem*). Their survey was hosted by a third party site run by a company called “Qualtrics,” but quality was mostly lacking. For a start, radio buttons aren’t meant to do this:


Oh well, I could at least supply some feedback at the end:


… or not – as the input box was nowhere to be found.

I think it’s fair to say my appreciation of the day was fairly mixed. I got a day off work, and some free stuff (stickers, food and beer). I paid for it though, as I’m sure I’m stupider now.

There was one astonishingly cool feature – the presentations all used really cool little graphics (see for a great example). It’d be really cool if these were freely available.



On Linux systems it’s common to use ssh-agent (manpage) to store SSH keys. When you run the agent its output can be used to configure it within your shell. The variables defined allow ssh and ssh-add to find the agent once it’s started.

$ ssh-agent
SSH_AUTH_SOCK=/tmp/ssh-RRekSLl29890/agent.29890; export SSH_AUTH_SOCK;
SSH_AGENT_PID=29891; export SSH_AGENT_PID;
echo Agent pid 29891;

I use an Ubuntu VM, and regularly log in using PuTTY. It’s easy to connect to the agent on login, but I wanted a way to keep connecting back to the same one so I could add keys and leave them there. There’s loads of solutions available, but I wanted to try and write a simpler version.


  • Store the agent settings into a file under ~/.ssh
  • Load the file on login
  • Check whether the agent can be found, and if it can’t, start it

This is slightly more subtle than it first appears: it’s necessary to cope with the file not existing or containing out-of-date data, and I wanted to do this as tidily as possible.

Here’s my solution:

[ -f ~/.ssh/agent_env ] && . ~/.ssh/agent_env
kill -0 $SSH_AGENT_PID 2> /dev/null || . <(ssh-agent | tee ~/.ssh/agent_env)

First we check if the file exists:

[ -f ~/.ssh/agent_env ]

If it does then we execute its contents in the current shell using the ‘.’ command (also spelled source in bash):

. ~/.ssh/agent_env

If the file did exist then we should have a value for the SSH_AGENT_PID. We check whether it’s possible to send a signal to the process this identifies using kill -0:

kill -0 $SSH_AGENT_PID

The nice thing about this is that if the variable isn’t defined then we end up passing garbage on to kill, and it’ll return false in the same way as if the process didn’t exist. We add a redirect to /dev/null so we don’t see its helpful error message.

kill -0 $SSH_AGENT_PID 2> /dev/null
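
The kill -0 trick isn't specific to the shell. As an illustration only (this is Unix-only, and the function names are my own), the same check looks like this in Python: signal 0 is never delivered, the kernel just reports whether the process could have been signalled.

```python
import os


def process_exists(pid: int) -> bool:
    """Mimic `kill -0 PID`: send nothing, just ask whether we could."""
    try:
        os.kill(pid, 0)  # signal 0 runs the checks without delivering a signal
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # it exists, but belongs to someone else
    return True


def agent_is_running(environ=os.environ) -> bool:
    """True if SSH_AGENT_PID names a live process; False if unset or garbage."""
    pid = environ.get("SSH_AGENT_PID", "")
    return pid.isdecimal() and int(pid) > 0 and process_exists(int(pid))


print(process_exists(os.getpid()))  # True
```

As in the shell version, a missing or nonsensical SSH_AGENT_PID is treated the same as a dead agent.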

Now we just need to run the agent, store its output in the file, and use that output to configure the shell. We can use a bash trick to load the agent and configure the shell in one tidy step:

. <(ssh-agent)

The <(...) construct here is bash’s process substitution: it attaches the ssh-agent output to a file descriptor, and then expands to the path of that file descriptor. It’s convenient when you want to pass the output of a command to one which is expecting a filename.

Finally we insert a call to tee which writes out our file.

. <(ssh-agent | tee ~/.ssh/agent_env)

Creating a VM to mimic EC2

I spend a lot of time working with Amazon’s EC2 (Elastic Compute Cloud) virtual servers. Configuring a new server takes up a surprisingly large amount of time when done manually, and so I use Puppet to automate the process. This works pretty well, but after creating lots of new configuration code I always think it’s worth testing the complete installation on a local Virtual Machine. I’ve built a VM which looks quite a lot like an empty Ubuntu 12.04 cloud image, but have always worried that the CD installation might have a few key differences. Here’s a way to work round some of these.

EC2 instances use a different set of sources for APT, so the first step is to sync these up. Wipe the current /etc/apt/sources.list, and replace with an example from the cloud image you’re trying to emulate. I used the following lines:

deb precise main
deb-src precise main
deb precise-updates main
deb-src precise-updates main

deb precise universe
deb-src precise universe
deb precise-updates universe
deb-src precise-updates universe

deb precise-security main
deb-src precise-security main
deb precise-security universe
deb-src precise-security universe

Now a quick apt-get update will get you installing the same packages as EC2. The next logical step is to reinstall all the current packages to match. I thought this would be pretty easy, and went for the following invocation:

dpkg --list |
grep ^ii |
awk '{ print($2) }' |
xargs apt-get install --reinstall

This works as follows:

  1. Get the current list of packages
  2. Return only those which are actually installed
  3. Drop everything in the output but the second column (the column containing the package name)
  4. Pipe into xargs which in turn calls APT.
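
To make the grep and awk steps concrete, here's the same filtering written out as a Python sketch. The sample output is invented for illustration (the package versions are placeholders), but the layout follows what dpkg --list prints: a status column, then the package name.

```python
def installed_packages(dpkg_output: str) -> list[str]:
    """Return the names of packages whose status column is 'ii'."""
    names = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Equivalent of: grep ^ii | awk '{ print($2) }'
        if line.startswith("ii") and len(fields) >= 2:
            names.append(fields[1])
    return names


# Hypothetical sample of `dpkg --list` output
SAMPLE = """\
Desired=Unknown/Install/Remove/Purge/Hold
||/ Name           Version        Description
+++-==============-==============-================================
ii  adduser        3.113          add and remove users and groups
rc  oldpackage     1.0            removed, config files remain
ii  apt            0.8.16         commandline package manager
"""

print(installed_packages(SAMPLE))  # ['adduser', 'apt']
```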

This got things moving and I’d soon downloaded 397 fresh packages. Unfortunately things then stopped with the (un)helpful message xargs: apt-get: terminated by signal 11. I tried calling APT separately, and got slightly further: Segmentation fault (core dumped). Did I say this got me further?

Reinstalling packages individually seemed to work, so I developed the theory that APT chokes when fed too many packages in one go. This bug seemed to be related. I worked around it with a simple modification to the xargs command to take ten packages at a time:

dpkg --list |
grep ^ii |
awk '{ print($2) }' |
xargs -L 10 apt-get install --reinstall
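
What -L 10 does here is hand apt-get at most ten input lines (so ten packages) per invocation. A sketch of the same batching in Python, with made-up package names, just to show the shape of the commands that result:

```python
from itertools import islice


def batches(items, size=10):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def reinstall_commands(packages, size=10):
    """Build one apt-get command line per batch (nothing is executed)."""
    return [["apt-get", "install", "--reinstall", *chunk]
            for chunk in batches(packages, size)]


# 25 hypothetical package names give three commands: 10 + 10 + 5 packages
packages = [f"package{n}" for n in range(25)]
for command in reinstall_commands(packages):
    print(len(command) - 3, "packages:", " ".join(command[:4]), "...")
```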

I decided to take another look at the Launchpad bug report linked earlier. This states:

When trying to --reinstall multiple packages with apt-get, it crashes with a segmentation fault. The command line: apt-get --reinstall install <2-or-more-packages...>

Having used Ubuntu for free for many years I thought it might be about time I gave something back to the community, so I thought I’d take another look. I’ve posted a comment with some more information after getting rather more involved with GDB and apport than I would have liked!

I genuinely don’t know if there is any difference between the packages for EC2 and those in the standard repositories (I rather suspect there isn’t), but if there is then I definitely don’t want to find out about it when trying to fix an urgent problem.


Exotic Data Structures

I came across this recent comment on Hacker News today, and thought the data structures therein worth advertising further:

Smushed list

Size O(1). The smushed list is a list of variables (of the same type), stored in a single variable of that type. To produce the smushed list, simply XOR all the elements of the list together, then store. To get a value back, simply XOR the smushed list by all the elements other than the one you want. Smushing is also embarrassingly parallel (you can smush two halves separately and then smush the results) so producing smushed lists is blazingly fast.
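
For the avoidance of doubt, the smushed list really does work as described. A quick Python sketch of my own (the numbers are arbitrary) shows both the storage saving and the small catch: to read one element back you need to have kept all the others somewhere.

```python
from functools import reduce
from operator import xor


def smush(values):
    """XOR every element together into a single value."""
    return reduce(xor, values, 0)


def unsmush(smushed, others):
    """Recover one element by XORing out all of the other elements."""
    return reduce(xor, others, smushed)


values = [3, 14, 15, 92]
smushed = smush(values)

# Recover 15 by supplying everything except 15
print(unsmush(smushed, [3, 14, 92]))  # 15

# Embarrassingly parallel: smush two halves, then smush the results
print(smush([smush(values[:2]), smush(values[2:])]) == smushed)  # True
```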

Unlinked list

O(n). This is slightly faster than a linked list, and acts as a “black box”. Simply allocate nodes that are not linked to each other in any way. The data normally stays out of the way of your program, but in case of a core dump you can find it again. NOTE: If your language does reference-counting this will not work. Get a real language that does what you say.


Search trees are good at bisecting data, but they are not really conducive to a random walk for inspiration. Begin by constructing a binary search tree, keeping track of all the nodes you’ve added, and simply add a third, random, pointer to each node – have it point at a random node somewhere in the tree. In the search algorithm, either follow the left, right, or random node, depending on how much meandering you are interested in doing. The journey is the destination.

I’m reasonably confident I’ve seen these before somewhere but can’t figure out where.


PHP documentation and Sphinx

Documentation is important, but commonly forgotten. Writing any code requires filling your brain with information – what the language is doing – what the API you’re calling is doing – what you want the code to do – and most of this information will fall out again a few days later. A month later a bug is found, and this information needs to be acquired all over again. Luckily you thoroughly documented everything that the code was doing, allowing you to pick up where you left off and fix the bug in seconds! Hurrah! Well, maybe.

To make this process easier, a variety of systems have been developed to build documentation automatically. In general you simply include some extra comments in the sourcecode explaining what you’re up to, and then something else extracts this information and formats it elegantly for the web. Javadoc is a standard tool for Java developers, and the syntax has been recycled for other C-style languages, giving us JSDoc (for JavaScript) and PHPDoc. Perl’s POD and Python’s PyDoc seem to have slightly different solutions which emphasise writing a section of documentation at the start of a file rather than merging it into the source, although Python also uses docstrings to directly document parts of the code.

phpDocumentor was the first PHP documentation system I met. It worked pretty well, although the format of the documentation it output left a little to be desired. A good example is the PHP Client Library for Google’s Adwords API.

A rival soon came along in the form of DocBlox, currently used for the Zend Framework documentation. This created much more elegant output.

The two projects have now merged together under the name phpDocumentor 2, which delivers the best documentation yet (do try it).

This system is great, and I always include markup within code to allow simple documentation to be generated. Unfortunately it’s quite hard to write good documentation when it has to be wedded to the structure of a project’s sourcecode. The Zend Framework folks provide automatically-generated documentation, but also write a more thorough reference guide. I wanted a more sophisticated documentation generator.

Sphinx was originally created to produce new documentation for the Python language. I’d always assumed it was only really suitable for Python projects, but soon started to see it in more places. The new phpDocumentor documentation is built with Sphinx (see the footer at the bottom), as is the current documentation for Zend Framework 2, and more PHP-centric projects seem to be making use of it.

I soon installed Sphinx and used its quickstart script to get going. Syntax highlighting of PHP source code is supported – Sphinx uses Pygments for this, but I was still lacking the ability to properly cross-reference code. Sphinx uses “domains” to allow you to document parts of the code such as classes or functions, and then allows easy cross-references between these. I found that the Python domain worked pretty well for PHP, but wondered if a better option was available.

Sphinx supports domains for Python, C/C++, JavaScript and “reStructuredText” (the Sphinx markup) out of the box, and more are available in the sphinx-contrib repository, including Ada, Erlang and Ruby. Sadly there was no PHP domain listed in the readme, but I managed to stumble on one by looking through the source (how ironic that a repository for a documentation generator is poorly documented…) The readme for the domain suggested installing it using Python’s easy_install, and I soon found the project itself in the Python Package Index.

Installing all this on Ubuntu is pretty simple:

sudo apt-get install python-sphinx python-setuptools
sudo easy_install pip
sudo pip install sphinxcontrib-phpdomain
cd /path/where/documentation/project/lives
sphinx-quickstart

After answering some questions about your project you’ll find a directory structure with an automatically generated conf.py. You’ll need to make a couple of changes to it as follows:

# Add any Sphinx extension module names here [...]
extensions = ['sphinxcontrib.phpdomain']

# The name of the default domain.
primary_domain = 'php'

# The default language to highlight source code in.
highlight_language = 'php'

Assuming you allowed the quickstart to create the makefile, you can now test everything works:

make html

… and point your browser at the html documentation. Take a look at the makefile itself to see what formats you can use for your documentation – there’s a lot of options. reStructuredText itself is pretty simple, and the Sphinx documentation itself is good. Documentation for the PHP domain is OK, but misses some bits out. It allows you to document the following objects:

  • ‘function’ cross-referenced with ‘func’
  • ‘global’ cross-referenced with ‘global’
  • ‘const’ cross-referenced with ‘const’
  • ‘method’ cross-referenced with ‘meth’
  • ‘class’ cross-referenced with ‘class’
  • ‘attr’ cross-referenced with ‘attr’
  • ‘exception’ cross-referenced with ‘exc’
  • ‘namespace’ cross-referenced with ‘ns’
  • ‘interface’ cross-referenced with ‘interface’

 Now all you need to do is write something…
