Myjson Crawler

What is myjson.com

Myjson describes itself as “a simple JSON store for your web or mobile app”. You can save JSON data using an interface on the site itself  (example) or programmatically via an API  (example). Each saved piece of JSON can be accessed via a randomly-generated URL, but unfortunately the random part of the URL is very short: it seems to be three or four characters drawn from a limited alphabet. This means that it’s easy to guess valid URLs, or even to enumerate all of them. OWASP describe this class of problem as an “insecure direct object reference”, and list this at fourth place in their 2013 Top 10 list of security risks. It’s not fair to criticise Myjson for this as they never advertised their system as secure, but I think it is fair to take a look at what people are storing.

Crawler implementation

The most obvious way (to me, at least) to implement a crawler is as follows:

  1. Generate a list of URLs using scripting language du jour
  2. Use xargs and cURL to crawl.

This has a couple of advantages:

  • It’s really simple
  • xargs has a handy multi-threaded mode allowing us to crawl several pages in parallel.

Unfortunately that would be too easy, so I decided to use JavaScript.
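
For completeness, the boring version might look something like this (just a sketch; the file names and the concurrency level are arbitrary choices of mine):

# generate every three-character bin name (a-z, 0-9), one per line
printf '%s\n' {{a..z},{0..9}}{{a..z},{0..9}}{{a..z},{0..9}} > bins.txt

# crawl four bins at a time with cURL, saving each response to its own file
mkdir -p output
xargs -P 4 -I {} curl -s -o output/{} http://api.myjson.com/bins/{} < bins.txt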

How it works

Full implementation available on Github

We’re going to need a function to output results. I decided to output HTML of this form:

<dl>
<dt>aaa</dt><dd>{ "json" : "that came from", "url" : "myjson.com/aaa" }</dd>
<!-- ... -->
</dl>

Here’s a tiny utility function to create this output:

/**
* Adds a row to the list of results with the bin name and its contents
* @param {string} bin
* @param {string} contents
*/
var outputResult = function (bin, contents) {
    'use strict';

    var binElement = document.createElement('dt'),
        contentsElement = document.createElement('dd'),
        results = document.getElementById('results');

    binElement.textContent = bin || '';
    contentsElement.textContent = contents || '';

    results.appendChild(binElement);
    results.appendChild(contentsElement);
};

We’ll also need a function to crawl the site. XMLHttpRequest is the obvious tool. We write output as each response comes back using an anonymous function which
closes over the current bin name.

/**
* Looks up the contents of a myjson bin and adds them to the list of results
* @param {string} bin
*/
var lookupBin = function (bin) {
    'use strict';

    var xhr = new XMLHttpRequest();

    xhr.open('GET', 'http://api.myjson.com/bins/' + bin);
    xhr.onload = function () {
        if (this.status === 200) {
            outputResult(bin, this.responseText);
        }
    };
    xhr.send();
};

Finally we need to iterate over the possible bin names. Some nested loops would handle this well enough, but it’s probably tidier to encapsulate this functionality. Here’s a function to iterate over an alphabet of characters:

/**
* Iterates over a list of characters
* @param {string} alphabet List to iterate across
* @param {string} prefix String to prepend before calling CALLBACK
* @param {function} callback Callback function, called with current string
*/
var iterateCharacters = function (alphabet, prefix, callback) {
    'use strict';
    var i;
    for (i = 0; i < alphabet.length; i = i + 1) {
        callback(prefix + alphabet[i]);
    }
};

For each character in the alphabet we prepend an existing string and then pass the result on to the defined callback. Iterating over all three-character bin names is then simple. This example calls our output function directly without crawling each bin:

var alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789';

var run = function () {
    'use strict';

    iterateCharacters(alphabet, '', function (string) {
        iterateCharacters(alphabet, string, function (string) {
            iterateCharacters(alphabet, string, outputResult);
        });
    });
};

run();

Finally we need an HTML document to host all this:

<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <dl id="results"></dl>
    <script type="text/javascript" src="script.js"></script>
  </body>
</html>

Browsers try hard to download data as quickly as possible: the asynchronous requests overlap (up to the browser's per-host connection limit), so the crawl runs in parallel without any extra effort on our part.

Results

Scope

I restricted the crawl to three-character names beginning with the letters 'a' to 'd'. The code above will happily crawl all 46,656 combinations of three-character bin names, and can easily be extended to cover four- and five-character names too. It's sensible to reduce the scope, however:

  • Crawling lots of bins takes lots of time.
  • It isn’t nice to load myjson.com heavily.
  • Thoughts of Weev make me nervous.
Brief Analysis

The sample includes 1,637 rows. The top ten JSON strings are as follows:

String                                              Count
{}                                                    226
{"key":"value"}                                        92
{"foo":"bar"}                                          42
{"hello":"world"}                                      34
{"key":"value","key2":"value2"}                        30
{"glossary":{"title":"example glossary",…              29
{"key_updated":"value_updated"}                        26
[]                                                     23
{"test":"test"}                                        17
{"key":"updated value","key2":"updated value2"}        16

We can therefore estimate that around 14% of bins contain only the empty object. Many of the examples above seem likely to have been created to test the service, and 69% of the extracted strings contain 50 characters or fewer.

It will be interesting to run a similar scrape in the future and see if the distribution of data changes: how many people are using this service as intended?

Scope for Evil

The API supports updating JSON data by sending an HTTP PUT request. It would only take a few minutes to overwrite all data stored by Myjson. Myjson doesn't advertise a secure service, and they obviously aren't worried that data is disclosed. They ought to be worried that somebody might trash everything they have stored.
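
For reference, updating one of your own bins looks something like this (a sketch based on the same endpoint the crawler reads from; the bin name, header and payload here are my own invention):

# overwrite the contents of bin "aaa" with a new JSON document
curl -X PUT -H 'Content-Type: application/json' \
    -d '{"status": "overwritten"}' \
    http://api.myjson.com/bins/aaa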

Posted in Computing | Leave a comment

AWS CloudFront via CloudFormation

Amazon Web Services' CloudFormation is a great way to define stacks of related resources. I don't tend to find myself making more than one version of each stack, but have still seen some big advantages:

  • I no longer have to configure resources through the AWS management console, saving a heap of time.
  • Stack configuration is now in source control so all changes are logged.
  • I’ve learnt a great deal more about AWS and how its components interact.

Unfortunately I’ve had to pay for these with another heap of time: spent learning how to use CloudFormation and how to deal with it when things don’t quite work. I’ve wasted a lot of time trying to set up CloudFront CDN distributions, and thought I’d write up a couple of the gremlins I found in case this proves useful to anyone (including me).

S3 domain name isn’t a suitable domain name

Pointing a distribution at an S3 bucket is harder than you might think. All you need is the domain name, but CloudFormation won’t give it up easily.

According to the documentation:

DomainName
Returns the DNS name of the specified bucket.
Example: mystack-mybucket-kdwwxmddtr2g.s3.amazonaws.com

Unfortunately that's not quite what happens. As described on the AWS forum, the domain name is slightly different and CloudFront won't accept it. Instead you have to build the name yourself:

{ "Fn::Join" : [ "", [ { "Ref" : "BucketName" }, ".s3.amazonaws.com" ] ] }

It sometimes doesn’t work

Unfortunately I can’t currently do any better than that. This forum post implies that adding aliases might break things but I’ve managed to define distributions with aliases.

A second or so after the distribution starts being created it decides to delete itself again, with the message “Resource creation cancelled”. One day I’ll try to put together a reduced testcase. Try defining a very small template, and adding in other resources by updating the stack after it’s worked for the first time. This is good general advice for CloudFormation: test parts of a large template in isolation to save time on each iteration.

Posted in Computing | Leave a comment

Downgrading VMWare Tools

I've run into an amusing issue with the tools that come packaged with VMware Player. If I removed content from the middle of a file on the host, the change wasn't replicated correctly on the guest: the file was reduced in size, but by removing bytes from the end rather than the middle, which isn't entirely helpful.

I'm experimenting with downgrading the version of VMware Tools, and found handy instructions for how to do this here:

  1. Go to http://softwareupdate.vmware.com/cds/vmw-desktop/player/ and navigate to the version you want. The tools installation is bundled in a tar under the packages folder.
  2. Grab the file, extract it to get an executable, and run it.
  3. Go to VMware player and choose Player -> Manage -> Reinstall VMware Tools…
Posted in Computing | Leave a comment

AWS Architecture Diagrams

AWS release amazing architecture diagrams. The best of these must surely be the diagram of the architecture used to host the Obama for America campaign. The diagrams look great, but aren’t accessible to others as there’s no publicly downloadable set of icons… until now. Someone has created their own for download: http://blog.domenech.org/2013/04/aws-diagrams-adobe-illustrator-object-collection-first-release.html.

Posted in Uncategorized | Leave a comment

Eclipse Memory Limits

Update: I’ve had some instability issues since installing this fix, and have now reverted.

As I use Eclipse more and more, I've occasionally run into memory and garbage-collector limits when working on large projects or files. This page provides the solution: bump up the limits in eclipse.ini (found in the Eclipse program directory):

-Xms512m
-Xmx1024m
-XX:PermSize=64m
-XX:MaxPermSize=128m

Apart from this I’ve been pretty happy with Eclipse. I’m stuck on the Juno version at the moment as some plugins don’t seem to play nicely with Kepler, and haven’t found the time to work out how to upgrade cleanly. I’m certainly preferring it to Aptana at the moment, primarily due to the plugin support. The wider userbase also comes with a corresponding improvement in the quantity and quality of online resources and documentation.

Posted in Computing | Leave a comment

Service Testing

Here’s a collection of services, and ways to go about testing them:

25: SMTP

Simple Mail Transfer Protocol.

Test with telnet:

$ telnet simpleigh.com 25
Trying 46.137.167.228...
Connected to simpleigh.com.
Escape character is '^]'.
220 aws.simpleigh.com ESMTP Postfix (Ubuntu)

Check it responds:

$ EHLO example.com
250-aws.simpleigh.com
250-PIPELINING
250-SIZE 20480000
250-ETRN
250-STARTTLS
250-AUTH PLAIN LOGIN
250-AUTH=PLAIN LOGIN
250-ENHANCEDSTATUSCODES
250-8BITMIME
250 DSN

Try sending a message to a local user:

$ MAIL FROM: <nobody@example.com>
250 2.1.0 Ok
$ RCPT TO: <blog@simpleigh.com>
250 2.1.5 Ok
$ DATA
354 End data with <CR><LF>.<CR><LF>
$ Email body goes here
$ 
$ .
$ 
250 2.0.0 Ok: queued as 18E73288

Try logging in:

$ AUTH LOGIN
334 VXNlcm5hbWU6
$ [ base64-encoded username ]
334 UGFzc3dvcmQ6
$ [ base64-encoded password ]
235 2.7.0 Authentication successful

Exit:

$ QUIT
221 2.0.0 Bye
Connection closed by foreign host.

You can test STARTTLS functionality (where a secure channel is negotiated on an existing connection) using OpenSSL's s_client:

$ openssl s_client -connect simpleigh.com:25 -crlf -starttls smtp
... loads of stuff
250 DSN
$ EHLO ...

80: HTTP

Hypertext Transfer Protocol.

Test with telnet:

$ telnet simpleigh.com 80
Trying 46.137.167.228...
Connected to simpleigh.com.
Escape character is '^]'.
$ GET / HTTP/1.1
$ Host: simpleigh.com
$ 
HTTP/1.1 301 Moved Permanently
Date: Sun, 11 Aug 2013 19:54:30 GMT
Server: Apache
Location: http://www.simpleigh.com/
Vary: Accept-Encoding
Content-Length: 233
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.simpleigh.com/">here</a>.</p>
</body></html>
Connection closed by foreign host.

110: POP3

Post Office Protocol.

Test with telnet:

$ telnet simpleigh.com 110
Trying 46.137.167.228...
Connected to simpleigh.com.
Escape character is '^]'.
+OK Dovecot ready.

Try logging in:

$ USER [ username ]
+OK
$ PASS [ password ]
+OK Logged in.

List messages:

$ LIST
+OK 20 messages:
... loads of stuff

Exit:

$ QUIT
+OK Logging out.

143: IMAP

Internet Message Access Protocol.

Test with telnet:

$ telnet simpleigh.com 143
Trying 46.137.167.228...
Connected to simpleigh.com.
Escape character is '^]'.
* OK [CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE IDLE AUTH=PLAIN AUTH=LOGIN] Dovecot ready.

Try logging in:

$ a1 LOGIN [ username ] [ password ]
a1 OK [CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE IDLE SORT SORT=DISPLAY THREAD=REFERENCES THREAD=REFS MULTIAPPEND UNSELECT CHILDREN NAMESPACE UIDPLUS LIST-EXTENDED I18NLEVEL=1 CONDSTORE QRESYNC ESEARCH ESORT SEARCHRES WITHIN CONTEXT=SEARCH LIST-STATUS] Logged in

List folders:

$ a2 LIST "" "*"
... loads of stuff

Find out what’s in the Inbox:

$ a3 EXAMINE INBOX
* FLAGS (\Answered \Flagged \Deleted \Seen \Draft)
* OK [PERMANENTFLAGS ()] Read-only mailbox.
* 20 EXISTS
* 1 RECENT
* OK [UNSEEN 10] First unseen.
* OK [UIDVALIDITY 1345668496] UIDs valid
* OK [UIDNEXT 2543] Predicted next UID
* OK [HIGHESTMODSEQ 1] Highest
a3 OK [READ-ONLY] Select completed.

Exit:

$ a5 LOGOUT
* BYE Logging out
a5 OK Logout completed.
closed

443: HTTPS

Just like HTTP, but use s_client:

$ openssl s_client -connect simpleigh.com:443
... loads of stuff

465: SMTPS

Check using s_client:

$ openssl s_client -connect simpleigh.com:465 -crlf
... loads of stuff
220 aws.simpleigh.com ESMTP Postfix (Ubuntu)

993: IMAPS

Check using s_client:

$ openssl s_client -connect simpleigh.com:993
... loads of stuff
* OK [CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE IDLE AUTH=PLAIN AUTH=LOGIN] Dovecot ready.

995: POP3S

Check using s_client:

$ openssl s_client -connect simpleigh.com:995
... loads of stuff
+OK Dovecot ready.

Summary

Almost everything can be driven via Telnet. If you need TLS, use s_client.
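
One more s_client trick that comes in handy: pipe its output through openssl x509 to pull out certificate details (the flags below are just the ones I reach for most often):

$ openssl s_client -connect simpleigh.com:443 < /dev/null 2> /dev/null | openssl x509 -noout -subject -dates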

There’s a really handy OpenSSL command-line summary at http://www.madboa.com/geek/openssl/.

Posted in Computing | Leave a comment

Undocumented Features

I was amused to discover grumbling around the Internet about Google’s weather API, and their decision to remove it. This describes what happened:

Last month, Google announced plans to shutter iGoogle, among a bunch of other services. Many developers and users were (and still are) outraged, but at least they have some time to breathe: iGoogle isn’t going away until November 1, 2013. That means there are still 15 months left to adjust and export your data. Yet some changes are already starting to take effect: the company’s private Weather API, for example, appears to be dead in the water.

There’s some real gems further down:

Why should Google expect support tickets? The big clue is in the word “undocumented.” More from the article:

Web developer Jordan Stephens isn’t even bothering to look for alternatives. In fact, CurrentlyIn has been completely shut down as a result of the Google Weather API outage, according to an error message from the service.

Indeed, navigating to the CurrentlyIn site brings up the following:

currentlydown : (

Google has shut down its undocumented weather API (which was used by this site).

currentlyin.net will be down for the forseeable future.

The statement “let’s use this undocumented API from Google to get weather information for free” seems like a great idea, but undocumented features aren’t really features at all. Anything without documentation is an implementation detail, and subject to change without notice. In this case, change happened to include removal.

There’s some great examples at The Old New Thing:

Three examples off the top of my head of the consequences of grovelling into and relying on undocumented structures.

Defragmenting things that can’t be defragmented

In Windows 2000, there are several categories of things that cannot be defragmented. Directories, exclusively-opened files, the MFT, the pagefile… That didn’t stop a certain software company from doing it anyway in their defragmenting software. They went into kernel mode, reverse-engineered NTFS’s data structures, and modified them on the fly. Yee-haw cowboy! And then when the NTFS folks added support for defragmenting the MFT to Windows XP, these programs went in, modified NTFS’s data structures (which changed in the meanwhile), and corrupted your disk.
Of course there was no mention of this illicit behavior in the documentation. So when the background defragmenter corrupted their disks, Microsoft got the blame.

[...]

I hope you understand why I tend to go ballistic when people recommend relying on undocumented behavior. These weren’t hobbyists in their garage seeing what they could do. These were major companies writing commercial software.

Posted in Computing | Leave a comment

Some handy eclipse plugins

Update: this is currently tested for Eclipse Juno, although I hope to update this for Kepler in the near future.

I’ve been making more use of Eclipse recently as an IDE. I’ve been using Aptana (which is based on Eclipse) for a few years now, but I’m spending less time coding in PHP so it now makes sense to graduate onto the bigger tool, especially as not all Eclipse plugins play nicely with Aptana. One of the most exciting features of Eclipse is the plugin ecosystem. Eclipse has been around for ages, and there are plenty of fantastic plugins adding additional programming languages and features. Here’s a list of some I’ve found useful:

AWS Toolkit

Would you like to manage your Amazon Web Services resources directly while writing code? This sounds a little excessive, but is actually quite useful: Amazon's management console is slow and will only show a few items at a time. This plugin makes it easy to find particular details quickly among a huge list of items, and also comes with syntax highlighting for CloudFormation templates. It's not entirely supported on the latest version of Eclipse (Juno), but the parts I need all work well.

EclipseFP

Haskell support for Eclipse, including syntax highlighting and direct support for loads of Haskell tools and frameworks (cabal, Hoogle, HLint, HTF, Alex, Happy, UUAGC, Snap, Yesod, …). You can run code directly from the IDE, calling out to GHCi.

Geppetto

Puppet is a great way to automate the configuration of new servers. Geppetto (the puppet maker) is an IDE for this: providing syntax highlighting and integration with the Puppet forge.

Json Tools

Support for JSON.

  • Update site:
  • Documentation:
  • Tick:
    • Json Tools

Juno Updates

Eclipse provides much of its core functionality via plugins, and there’s plenty available to extend the default configuration.

  • Update site: http://download.eclipse.org/releases/juno/
  • Documentation: http://www.eclipse.org/
  • Tick:
    • C/C++ Development Tools
    • Eclipse Web Developer Tools
    • Eclipse XML Editors and Tools
    • Eclipse XSL Developer Tools
    • JavaScript Development Tools
    • PHP Development Tools (PDT)
    • Subversive SVN Team Provider [Subversion integration]
    • Web Page Editor
    • … anything else you fancy!

Markdown Editor

Simple syntax highlighting and document outline for Markdown.

ReST Editor

As advertised last post, syntax highlighting makes editing ReStructured Text a lot easier…

ShellEd

BASH syntax highlighting.

TeXlipse

Support for LaTeX.

I hope to update this list as I discover new plugins: feel free to add suggestions in the comments.

Posted in Computing | Leave a comment

ReST Syntax Highlighting

I’ve talked before about writing documentation using Sphinx. Sphinx uses ReST (ReStructured Text) which is great, but sometimes a pain to edit without any form of syntax highlighting. Luckily it was pretty easy to track down an Eclipse plugin to do this.

First add a new software source URL like this:

[Screenshot: adding the new software source URL]

Then tick the box to install the plugin:

[Screenshot: ticking the box to install the ReST Editor plugin]

Hurrah!

Posted in Computing | 1 Response

MongoDB

I’ve recently started working with MongoDB at work: it forms a core part of our tracking and reporting infrastructure, and all events that we track are slotted tidily into a Mongo database. Mongo has been getting some criticism of late, and while most of this has been largely misdirected (not using the right tool for the job doesn’t mean the tool was the problem), this piece was particularly interesting.

MongoDB does not give us much control over where data is placed, so the frequently accessed data (or data that is scanned together) may be spread over a large area. When scanning data only once, there is no way to prevent that data evicting the more frequently accessed data from memory. Once the frequently accessed data is no longer in memory, MongoDB becomes IO bound and lock contention becomes an issue.

My initial introduction to MongoDB was MongoDB in Action. I remember being struck by this quote:

Database tuning, which in most RDBMSs means tinkering with a wide array of parameters controlling memory allocation and the like, has become something of a black art. MongoDB’s design philosophy dictates that memory management is better handled by the operating system than by a DBA or application developer. Thus, data files are mapped to a system’s virtual memory using the mmap() system call. This effectively offloads memory management responsibilities to the OS kernel.

This sounds great. Why bother configuring when the kernel will probably do a better job than you ever could? It turns out this may have been a poor design decision: the kernel does a good job at managing memory, but doesn’t do the best job because it doesn’t know enough about how that memory is used within MongoDB.

It’s always difficult to separate the hype from the reality with any new technology, as this issue shows.

1. Use Mongo as WEB SCALE DOCUMENT STORE OF CHOICE LOL

2. Assume basic engineering principles applied throughout due to HEAVY MARKETING SUGGESTING AWESOMENESS.

3. Spend 6 months fighting plebbery across the spectrum, mostly succeed.

4. NIGHT BEFORE INVESTOR DEMO, TRY UPLOADING SOME DATA WITH "{$ref: '#/mongodb/plebtastic'"

5. LOL WTF?!?!? PYMONGO CRASH?? :OOO LOOOL WEBSCALE

6. It’s 4am now. STILL INVESTIGATING

7. DISCOVER PYMONGO DOES NOT CHECK RETURN VALUES IN MULTIPLE PLACES. DISCOVER ORIGINAL AUTHOR SHOULD NOT BE ALLOWED NEAR COMPUTER

8. REALIZE I CAN CRASH 99% OF ALL WEB 3.9 SHIT-TASTIC WEBSCALE MONGO-DEPLOYING SERVICES WITH 16 BYTE POST

9. REALIZE 10GEN ARE TOO WORTHLESSLY CLUELESS TO LICENCE A STATIC ANALYZER THAT WOULD HAVE NOTICED THIS PROBLEM IN 0.0000001 NANOSECONDS?!!?!?@#

Posted in Computing | Leave a comment

AWS Summit

I’ve just been to this year’s AWS Summit in London. I arrived at the event to find a suspiciously large density of suit-wearing managers. After shaking off the initial fear that I might be at the wrong event, I forged on…

The Business Design Centre is an agreeable sort of place for a conference. There’s plenty of space for cows (it did start life as the Royal Agricultural Hall), and there’s probably a joke about conference delegates and sheep around here somewhere. Lots of people complained about the queue to get in (the organisers had the bright idea of printing name tags as people arrived rather than having them ready), but I’d arrived slightly early and didn’t have to wait for long. I soon tracked down some breakfast and loaded up a sandwich. Note to BDC: £1.10 is too expensive for a sausage (even if they were rather tasty). 60p is certainly too much for two slices of slightly-stale white bread.

Initial worries about the dress sense of those attending were unfortunately completely justified, as the first keynote speech turned out to be an extended sales pitch for cloud computing. I didn’t quite understand the point of this – surely lots of people attending had received the invitation, like me, because they were already AWS customers? A late night, early start, and 1½ hours of boredom made the decision to sit next to my manager rather risky, but I did manage to remain awake for the entire thing, mostly by reading Twitter.

The keynote was punctuated by testimonials from current customers. Interest was maintained by the dubious choice of guests - first a chap from News International and then one from an oil company. These interludes turned out to be the most irritating aspect of the day – most speakers took full advantage of the opportunity to pitch their product, and few said much more than “we like AWS ’cause it’s cheap and it scales.” Even the more-technical talks were hobbled by this requirement, with customer talks failing to fit in to the remainder of the content, and rarely adding any information of interest.

Twitter remained great fun, with the Chinese whispers soon getting out of control:

I’m not sure I was helping…

A lecture about Amazon’s information security was more interesting, and it was illuminating to hear of some of what they do:

  • Staff are only granted access to any system for 60 days at a time, after which their rights must be renewed by their manager.
  • SSH access to production servers requires a change ticket or issue number, and all activity is logged.
  • Any hard disk leaving their facilities must be physically shredded or otherwise destroyed first.

This lecture was better – but still felt like it was for managers (“don’t worry you can trust us with your data”) rather than developers (“look at our security, it’s cool”).

The afternoon promised more interest, with deeper studies of particular AWS products. Some of these talks were great (presentations about DynamoDB and OpsWorks being highlights of the day), and delivered on the promise of a technology conference – with more detailed information (DynamoDB indexing) and a live demo (using OpsWorks to deploy a web stack during the lecture). Other talks weren’t, with an “Advanced Topics” lecture about “Architecting for High Availability” covering little more than what was in the product overview pages for Elastic Load Balancing and Auto Scaling.

Ultimately I was expecting a tech conference which gave some deeper insight into AWS products, and thought that exposure to the AWS team might well provide that. Unfortunately most of the content was pitched at a very low level. I don't necessarily think this is Amazon's fault: I evidently wasn't the target audience, but I was a little bored. I can't complain too much – the food was pretty good for a free conference!

The next day an email arrived inviting me to supply feedback on my experience, and I thought I might as well do so (the offer of a free Kindle didn't sway my decision at all, *ahem*). Their survey was hosted on a third-party site run by a company called "Qualtrics", but quality was mostly lacking. For a start, radio buttons aren't meant to do this:

[Screenshot: the survey's overlapping radio buttons]

Oh well, I could at least supply some feedback at the end:

[Screenshot: the final feedback question, with no input box]

… or not – as the input box was nowhere to be found.

I think it’s fair to say my appreciation of the day was fairly mixed. I got a day off work, and some free stuff (stickers, food and beer). I paid for it though, as I’m sure I’m stupider now.

There was one astonishingly cool feature – the presentations all used lovely little graphics (see http://awsofa.info/ for a great example). It'd be great if these were freely available.

Posted in Computing | Leave a comment

ssh-agent

On Linux systems it’s common to use ssh-agent (manpage) to store SSH keys. When you run the agent its output can be used to configure it within your shell. The variables defined allow ssh and ssh-add to find the agent once it’s started.

$ ssh-agent
SSH_AUTH_SOCK=/tmp/ssh-RRekSLl29890/agent.29890; export SSH_AUTH_SOCK;
SSH_AGENT_PID=29891; export SSH_AGENT_PID;
echo Agent pid 29891;

I use an Ubuntu VM, and regularly log in using PuTTY. It’s easy to connect to the agent on login, but I wanted a way to keep connecting back to the same one so I could add keys and leave them there. There’s loads of solutions available, but I wanted to try and write a simpler version.

Strategy:

  • Store the agent settings into a file under ~/.ssh
  • Load the file on login
  • Check whether the agent can be found, and if it can’t start it

This is slightly more subtle than it first appears: it’s necessary to cope with the file not existing and containing out-of-date data, and I wanted to do this as tidily as possible.

Here’s my solution:

[ -f ~/.ssh/agent_env ] && . ~/.ssh/agent_env
kill -0 $SSH_AGENT_PID 2> /dev/null || . <(ssh-agent | tee ~/.ssh/agent_env)

First we check if the file exists:

[ -f ~/.ssh/agent_env ]

If it does then we execute its contents using a bash shortcut (the ‘.‘):

. ~/.ssh/agent_env

If the file did exist then we should have a value for the SSH_AGENT_PID. We check whether it’s possible to send a signal to the process this identifies using kill -0:

kill -0 $SSH_AGENT_PID

The nice thing about this is that if the variable isn’t defined then we end up passing garbage on to kill, and it’ll return false in the same way as if the process didn’t exist. We add a redirect to /dev/null so we don’t see its helpful error message.

kill -0 $SSH_AGENT_PID 2> /dev/null

Now we just need to run the agent, store its output in the file, and use that output to configure the shell. We can use a bash trick to load the agent and configure the shell in one tidy step:

. <(ssh-agent)

The redirect here has the effect of attaching the ssh-agent output to a file descriptor, and then returning the path to that file descriptor. It’s convenient when you want to pass the output of a command to one which is expecting a filename.

Finally we insert a call to tee which writes out our file.

. <(ssh-agent | tee ~/.ssh/agent_env)
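
With that in place, keys added to the agent in one session are still there the next time you log in. For example (using the default key location; adjust the path to wherever your key lives):

$ ssh-add ~/.ssh/id_rsa    # add a key to the running agent
$ ssh-add -l               # list the keys the agent currently holds
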
Posted in Computing | Leave a comment

Creating a VM to mimic EC2

I spend a lot of time working with Amazon’s EC2 (Elastic Compute Cloud) virtual servers. Configuring a new server takes up a surprisingly large amount of time when done manually, and so I use Puppet to automate the process. This works pretty well, but after creating lots of new configuration code I always think it’s worth testing the complete installation on a local Virtual Machine. I’ve built a VM which looks quite a lot like an empty Ubuntu 12.04 cloud image, but have always worried that the CD installation might have a few key differences. Here’s a way to work round some of these.

EC2 instances use a different set of sources for APT, so the first step is to sync these up. Wipe the current /etc/apt/sources.list, and replace with an example from the cloud image you’re trying to emulate. I used the following lines:

deb http://eu-west-1.ec2.archive.ubuntu.com/ubuntu/ precise main
deb-src http://eu-west-1.ec2.archive.ubuntu.com/ubuntu/ precise main
deb http://eu-west-1.ec2.archive.ubuntu.com/ubuntu/ precise-updates main
deb-src http://eu-west-1.ec2.archive.ubuntu.com/ubuntu/ precise-updates main

deb http://eu-west-1.ec2.archive.ubuntu.com/ubuntu/ precise universe
deb-src http://eu-west-1.ec2.archive.ubuntu.com/ubuntu/ precise universe
deb http://eu-west-1.ec2.archive.ubuntu.com/ubuntu/ precise-updates universe
deb-src http://eu-west-1.ec2.archive.ubuntu.com/ubuntu/ precise-updates universe

deb http://security.ubuntu.com/ubuntu precise-security main
deb-src http://security.ubuntu.com/ubuntu precise-security main
deb http://security.ubuntu.com/ubuntu precise-security universe
deb-src http://security.ubuntu.com/ubuntu precise-security universe

Now a quick apt-get update will get you installing the same packages as EC2. The next logical step is to reinstall all the current packages to match. I thought this would be pretty easy, and went for the following invocation:

dpkg --list |
grep ^ii |
awk '{ print($2) }' |
xargs apt-get install --reinstall

This works as follows:

  1. Get the current list of packages
  2. Return only those which are actually installed
  3. Drop everything in the output but the second column (the column containing the package name)
  4. Pipe into xargs which in turn calls APT.

This got things moving and I’d soon downloaded 397 fresh packages. Unfortunately things then stopped with the (un)helpful message xargs: apt-get: terminated by signal 11. I tried calling APT separately, and got slightly further: Segmentation fault (core dumped). Did I say this got me further?

Reinstalling packages individually seemed to work, so I developed the theory that APT chokes when fed too many packages in one go. This bug seemed to be related. I worked around it with a simple modification to the xargs command to take ten packages at a time:

dpkg --list |
grep ^ii |
awk '{ print($2) }' |
xargs -L 10 apt-get install --reinstall

I decided to take another look at the Launchpad bug report linked earlier. This states:

When trying to --reinstall multiple packages with apt-get, it crashes with a segmentation fault. The command line: apt-get --reinstall install <2-or-more-packages...>

Having used Ubuntu for free for many years I figured it might be about time I gave something back to the community, so I took another look. I've posted a comment with some more information after getting rather more involved with GDB and apport than I would have liked!

I genuinely don’t know if there is any difference between the packages for EC2 and those in the standard repositories (I rather suspect there isn’t), but if there is then I definitely don’t want to find out about it when trying to fix an urgent problem.

Posted in Computing | Leave a comment

Exotic Data Structures

I came across this recent comment on Hacker News today, and thought the data structures therein worth advertising further:

Smushed list

Size O(1). The smushed list is a list of variables (of the same type), stored in a single variable of that type. To produce the smushed list, simply XOR all the elements of the list together, then store. To get a value back, simply XOR the smushed list by all the elements other than the one you want. Smushing is also embarrassingly parallel (you can smush two halves separately and then smush the results) so producing smushed lists is blazingly fast.

Unlinked list

O(n). This is slightly faster than a linked list, and acts as a “black box”. Simply allocate nodes that are not linked to each other in any way. The data normally stays out of the way of your program, but in case of a core dump you can find it again. NOTE: If your language does reference-counting this will not work. Get a real language that does what you say.

Search-bush

Search trees are good at bisecting data, but they are not really conducive to a random walk for inspiration. Begin by constructing a binary search tree, keeping track of all the nodes you’ve added, and simply add a third, random, pointer to each node – have it point at a random node somewhere in the tree. In the search algorithm, either follow the left, right, or random node, depending on how much meandering you are interested in doing. The journey is the destination.

I’m reasonably confident I’ve seen these before somewhere but can’t figure out where.

Posted in Computing | Leave a comment

PHP documentation and Sphinx

Documentation is important, but commonly forgotten. Writing any code requires filling your brain with information – what the language is doing – what the API you’re calling is doing – what you want the code to do – and most of this information will fall out again a few days later. A month later a bug is found, and this information needs to be acquired all over again. Luckily you thoroughly documented everything that the code was doing, allowing you to pick up where you left off and fix the bug in seconds! Hurrah! Well, maybe.

To make this process easier, a variety of systems have been developed to build documentation automatically. In general you simply include some extra comments in the sourcecode explaining what you're up to, and then something else extracts this information and formats it elegantly for the web. Javadoc is a standard tool for Java developers, and the syntax has been recycled for other C-style languages, giving us JSDoc (for JavaScript) and PHPDoc. Perl's POD and Python's PyDoc seem to have slightly different solutions which emphasise writing a section of documentation at the start of a file rather than merging it into the source, although Python also uses docstrings to directly document parts of the code.

phpDocumentor was the first PHP documentation system I met. It worked pretty well, although the format of the documentation it output left a little to be desired. A good example is the PHP Client Library for Google’s Adwords API.

A rival soon came along in the form of DocBlox, currently used for the Zend Framework documentation. This created much more elegant output.

The two projects have now merged together under the name phpDocumentor 2, which delivers the best documentation yet (do try it).

This system is great, and I always include markup within code to allow simple documentation to be generated. Unfortunately it’s quite hard to write good documentation when it has to be wedded to the structure of a project’s sourcecode. The Zend Framework folks provide automatically-generated documentation, but also write a more thorough reference guide. I wanted a more sophisticated documentation generator.

Sphinx was originally written to produce new documentation for the Python language. I'd always assumed it was only really suitable for Python projects, but soon started to see it in more places. The new phpDocumentor documentation is built with Sphinx (see the footer at the bottom), as is the current documentation for Zend Framework 2, and more PHP-centric projects seem to be making use of it.

I soon installed Sphinx and used its quickstart script to get going. Syntax highlighting of PHP source code is supported – Sphinx uses Pygments for this, but I was still lacking the ability to properly cross-reference code. Sphinx uses “domains” to allow you to document parts of the code such as classes or functions, and then allows easy cross-references between these. I found that the Python domain worked pretty well for PHP, but wondered if a better option was available.

Sphinx supports domains for Python, C/C++, JavaScript and “reStructuredText” (the Sphinx markup) out of the box, and more are available in the sphinx-contrib repository, including Ada, Erlang and Ruby. Sadly there was no PHP domain listed in the readme, but I managed to stumble on one by looking through the source (how ironic that a repository for a documentation generator is poorly documented…) The readme for the domain suggested to install it using Python’s easy_install, and I soon found the project itself in the Python Package Index.

Installing all this on Ubuntu is pretty simple:

sudo apt-get install python-sphinx python-setuptools
sudo easy_install pip
sudo pip install sphinxcontrib-phpdomain
cd /path/where/documentation/project/lives
sphinx-quickstart

After answering some questions about your project  you’ll find a directory structure with an automatically generated conf.py. You’ll need to make a couple of changes to it as follows:

# Add any Sphinx extension module names here [...]
extensions = ['sphinxcontrib.phpdomain']

# The name of the default domain.
primary_domain = 'php'

# The default language to highlight source code in.
highlight_language = 'php'

Assuming you allowed the quickstart to create the makefile, you can now test everything works:

make html

… and point your browser at the html documentation. Take a look at the makefile itself to see what formats you can use for your documentation – there are a lot of options. reStructuredText itself is pretty simple, and the Sphinx documentation is good. Documentation for the PHP domain is OK, but misses some bits out. It allows you to document the following objects (there's a short example after the list):

  • ‘function’ cross-referenced with ‘func’
  • ‘global’ cross-referenced with ‘global’
  • ‘const’ cross-referenced with ‘const’
  • ‘method’ cross-referenced with ‘meth’
  • ‘class’ cross-referenced with ‘class’
  • ‘attr’ cross-referenced with ‘attr’
  • ‘exception’ cross-referenced with ‘exc’
  • ‘namespace’ cross-referenced with ‘ns’
  • ‘interface’ cross-referenced with ‘interface’
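
As a rough sketch (the class and method here are invented for illustration), documenting and cross-referencing code with the PHP domain looks something like this in a .rst file:

.. php:class:: HttpClient

   A minimal HTTP client.

   .. php:method:: get($url)

      Fetch *$url* and return the response body.

You can then link back from anywhere else with :php:class:`HttpClient` or
:php:meth:`HttpClient::get`.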

 Now all you need to do is write something…

Posted in Computing | 1 Response

Developing in a Linux Virtual Machine on Windows (3)

This post is part of a series about my development stack – I use an Ubuntu Server VM running on Windows.

  • Part 1 showed how to set up the VM
  • Part 2 demonstrated how we could share code on the host system with the guest in order to execute it in a Linux environment

There's some sort of grim inevitability to progress with this series so far: I put together a post showing how to use VMWare Player, only for them to release a new version and render all my screenshots out-of-date. I thought it'd be worth adding some more to show some of the more obvious differences.

Taking a first look at things, the new version looks pretty familiar. It’s lost the thick blue borders and skin, and has a refreshing Aero feel.

Version 4

Version 5

Perhaps the most interesting change is the prominent new notifications about licensing. I'm not using the software commercially so it's not a problem for me, but I was curious to see how much a licence would cost. Clicking through we reach an unhelpful page offering me "VMWare Fusion for Mac OSX". Looks like VMWare need to figure things out…

I had no problems starting Virtual Machines created under version 4 of the player (I won't go so far as to say this was a relief, but it's certainly helpful!). Once a VM is running, the interface looks much cleaner.

One change worth noting is that accessing the virtual machine settings now takes an extra click – it’s now found under the single “Player” menu:

While exploring around I also noticed something I’d not found before – the ability to download and install a “Virtual Appliance” directly. Clicking the menu option takes us to a marketplace for VM images.

There’s an appliance for Ubuntu Server 12.04, but looking at the comments it seems they charge for this, so I’ll stick to installing it manually.

Posted in Computing | Leave a comment

Developing in a Linux Virtual Machine on Windows (2)

This is the continuation of a previous post where we set up an Ubuntu virtual machine in VMWare Player on a Windows host. To recap, what we’ve got so far is:

  • An easily-accessible Ubuntu server.
  • Folders on the host shared with the guest. We can edit code on the host machine and then test the changes in the guest directly, without having to upload / copy / publish / sync anything.

Hurrah!

Running a VM in this way has one profound annoyance: once you’re using the server, you can’t get out! The player captures your mouse and keyboard, and all input is directed to the player. You can get them back again by pressing Ctrl+Alt, but this process keeps getting in the way – stopping easy Alt+Tabbing and disrupting my working rhythm. I also had an additional problem: I use a customised version of the Dvorak keyboard layout, and didn’t fancy the effort required to set this up on Linux. Luckily there’s an easy solution.

The de facto standard secure shell client on Windows is PuTTY, and this is great for connecting to remote servers. Just because the guest system is running on the same machine doesn't mean I can't pretend it's miles away! Before you connect you'll have to install an SSH daemon:

sudo apt-get install openssh-server

Now connecting is easy – just put the hostname in and click “Open”:

On some networks you may find that the hostname doesn’t resolve, or it may take a little time for your router to figure out what the name is. You can easily work around this by connecting directly to the IP address of your virtual machine – find it by entering the ifconfig command at the shell. Looking this up every time you boot up the VM rapidly becomes tedious and so you may want to assign a static IP. This is a pretty simple process from the shell, although you may have to persuade your router to play nicely with this process. Once you’ve fixed the IP you can save it in PuTTY as a session, or even add a meaningful name to your Windows hosts file to make it permanent.
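
For the record, on Ubuntu 12.04 fixing the IP means editing /etc/network/interfaces; here's a sketch (the addresses are examples for a typical home network, and eth0 is an assumption about your interface name):

# replace the default "iface eth0 inet dhcp" stanza with a static one
auto eth0
iface eth0 inet static
    address 192.168.1.50
    netmask 255.255.255.0
    gateway 192.168.1.1

Then run sudo ifdown eth0 && sudo ifup eth0 (from the VM's console rather than over SSH), or just reboot the VM, to pick up the new address.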

Playing with the PuTTY settings is a worthwhile process. I usually set up a couple of changes in the “Default Settings” session to make my life better:

  • Window->Behaviour, set “Full screen on Alt-Enter” to on
  • Connection->Data, set “Terminal-type string” to “xterm-color” to enable the colourful shell prompt

We've done a lot but not made a lot of progress from where we started. We've got easier access to a Linux server than ever before, but it doesn't yet do much. Here's what I do to get things going:

locale-gen en_GB.UTF-8
update-locale LANG=en_GB.UTF-8
apt-get install apache2 php5 mysql-server libapache2-mod-php5 php5-cli php5-curl php5-gd php5-mysql
echo "AddDefaultCharset utf-8" > /etc/apache2/conf.d/charset
mv /etc/php5/apache2/php.ini{,.old}
cp /usr/share/doc/php5-common/examples/php.ini-development /etc/php5/apache2/php.ini
a2enmod rewrite
service apache2 restart
exit

Once that’s all finished you can test by navigating to your server in a web browser.
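
The default Apache page only proves that Apache itself is up; to confirm PHP is hooked in too, you could drop a throwaway phpinfo page into the document root (and delete it afterwards):

echo '<?php phpinfo();' | sudo tee /var/www/info.php

Then browse to http://leighserver/info.php (substituting your own hostname).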

Hurrah!

The final step is to make the webserver serve up the shared files from the host. You can easily dig in and edit Apache’s configuration files to accomplish this, but an alternative scheme is just to add a symlink:

sudo ln -s /mnt/hgfs/www /var/www/

(This assumes you set up the shared drive as I did in part 1, i.e. under the name "www"). Now we can access the shared files in the webserver at http://leighserver/www/, and immediately preview changes made in the host.

In future posts I’ll take a look at how we make those changes in the host, and what development tools there are to make that easier.

Posted in Computing | 1 Response

Database design 101: Is your field big enough?

When designing a database schema, it’s worth trying to choose sensible sizes for each column. It’s obviously crucial to make sure that your field gives enough space to accommodate the data you want to store in it, but it’s good to keep the sizes down as far as possible in order to improve performance. Increasing the size of each row is going to come with a cost in storage and processing time. As might be obvious from the title, I learnt a lesson about this a couple of days ago.

Some months ago I built a new API, and we were particularly keen to be able to analyse its performance and the number of queries. Given we had a database connection handy, it seemed most convenient to log requests directly into a table. With hindsight this was probably a bad idea:

  • Ultimately it’s going to be less performant than just writing to a log file
  • As the requests pile up the table gets bigger, making backups huge and soon necessitating a regular cronjob to purge out older entries

On the other hand, logging to the database does give a massive advantage in terms of the ease of querying the data. With a few SQL statements it’s trivial to analyse load per hour, see how it varies over time, and group by other logged parameters (such as the service queried). We also built a plugin for Munin to allow us to produce realtime graphs of API load. While a log file can be queried in a similar way using the linux shell, that’s a rather less trivial process.

I built up the table as follows:

CREATE TABLE `api_requests` (
    `id` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
    `service` VARCHAR(10) DEFAULT NULL,
    `ip` VARCHAR(20) NULL DEFAULT NULL,
    `query_string` VARCHAR(1024) NOT NULL DEFAULT '',
    `time_taken` FLOAT(9,6) NULL DEFAULT NULL,
    `timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`)
)
    COLLATE='utf8_general_ci'
    ENGINE=InnoDB
;

This all seems OK so far. So which field is too small?

  • The service field looks suspicious – there’s only 10 characters allowed. This turns out not to be a problem as the only services we’ve built have short names.
  • What about the IP address? 20 characters is more than enough for IPv4, and we’re not being queried over IPv6.
  • I guess the query string could be too short…?

If it's not yet obvious, the answer was the primary key: the ID. A MySQL MEDIUMINT UNSIGNED allows values up to 16,777,215, and after a few months the id grew to this level. At this point the server refused to insert any more rows, with a cryptic error message about a unique key violation (the AUTO_INCREMENT value can't grow any higher, so it tries to insert each new row with the same ID as an old one). A few minutes later (after a lucky catch by our API validation system) the field size was big enough to last the next 300 years.
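
The fix itself is a one-liner. Something like this (BIGINT UNSIGNED is my assumption; any sufficiently large integer type will do, and bear in mind that on a big table the ALTER can lock writes for a while):

# "api_db" is a placeholder for whichever database holds the table
mysql -u root -p api_db -e \
    "ALTER TABLE api_requests MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;"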

It’s obviously impossible to predict the future, but I think I’ll be erring a little more on the side of caution when designing future schemas.

Posted in Computing | Leave a comment

Virtualisation and web hosting – update

I wrote recently about my efforts to find a cheap VPS host. Since then I've come across http://www.cloudorado.com/, which aims to make this process easier. It looks pretty good, but doesn't seem to help much with the really small case – there are no AWS Micro EC2 instances, for example.

Posted in Computing | Leave a comment

Developing in a Linux Virtual Machine on Windows

Sadly for this post, but luckily for the rest of us, VMWare have now released version 5 of the Player software. Everything seems to work as before and most functionality appears in the same place, but the screenshots are now out-of-date.

I like Windows. I know my way around it, I’ve tried the alternatives and I still like it.

The most important part of writing any code is testing it, and it’s hard to test properly on a remote machine. Every change has to be uploaded (over a slow connection) before it can be run. It’s obviously a bad idea to test on a live site, so now we need two remote machines. Now the cost is mounting along with the frustration.

I tend to work in PHP, with sites hosted on Apache and using MySQL databases. As it turns out, these can all be installed on a Windows system, and there are easy packages available to do that (like WAMP or XAMPP). I've tried one of these (can't remember which), and despite a clunky interface everything did seem to work. I soon ran into problems, however, when maintaining and developing PHP command-line scripts built for a UNIX machine, finding that:

  1. Everything’s in the wrong place. The current scripts expected to look in /usr/bin, but this directory didn’t exist. The closest match might be C:\Program Files\, which isn’t the same at all.
  2. Shell scripts written for BASH don’t work on cmd.exe. Nothing I could do was going to make them work.

These problems aside, testing code in one environment and then deploying to another doesn’t seem like a good idea. I needed access to a UNIX box.

The most obvious option is to install a UNIX system locally and work on that. This sounds fantastic – you edit code and it's already uploaded! Unfortunately for me this isn't an option 'cause I like Windows. Luckily I'd heard of virtualisation. It didn't take long to start up a Linux virtual machine which ran cleanly within Windows, and here's how you can do it too.

There’s a wide variety of host software for running virtual machines within Windows, and I’d already heard of VirtualBox, VMWare Player and Virtual PC. The latter doesn’t officially support Linux guests, and a brief comparison led me to pick VMWare’s Player. Installing this was very easy, and I soon found myself ready to start creating a virtual machine.

Choosing the option to create the VM starts a wizard to guide you through the settings. To install an operating system you’d commonly need an installation CD, but the player can cope with .iso image files directly, so burning the physical CD isn’t necessary. I grabbed the latest version of Ubuntu server from their download site and was ready to go.

I’m probably over-suspicious, but I tend not to like letting applications help me out with operating system installations, and VMWare Player will jump in and ask you if you want help with the process. The best way to get it out of the way is to avoid telling it what you’re up to. I chose the option to install the operating system later. (I have had experience of things not working right after using the guided installer).

Again we keep our intentions secret, avoiding telling the player that we’re installing Linux.

At the end of the wizard, click the “Customize Hardware” button to tweak the VM’s hardware settings.

Again I’m probably being over-suspicious, but I think operating systems like an easy life and so my first step was to remove any unnecessary hardware (e.g. USB controllers / printers). Once you’ve done that it’s worth bumping up the available memory to a sensible level, and it’s finally time to tell the player what CD image we want to put in the drive (click to zoom in).

I want to set up a webserver and this will expect to be connected to a network in a normal manner. I therefore chose to bridge the host network adapter, allowing the guest operating system to connect directly on to the network.

Once you’ve done this you’re ready to boot up. The CD image will be used directly, and the virtual machine will boot into the Ubuntu setup program. This is fairly easy to work through, and there’s lots of help and support on the Ubuntu website. Towards the end of the process you’ll find an option to automatically install a load of software. My suspicious nature kicked in again and I decided not to bother, opting to install what I wanted manually later.

VMWare Player comes with a suite of software (the VMWare Tools) which we’ll need. Hiding the operating system we’re installing kept the process smooth, but prevented the tools installing automatically. This is easily rectified: first shut down the VM, then run VMWare Player again. You’ll have an option to edit the virtual machine settings.

Choose “Linux” and “Ubuntu 64-bit” (I installed the 64-bit variety of Ubuntu).

Now the player knows what operating system is running it will offer to install the tools.

Before we do this we need to get the operating system ready. Run the following as root:

apt-get update
apt-get upgrade
apt-get dist-upgrade
apt-get autoremove
apt-get install build-essential
shutdown -r now

When the machine reboots you’re ready to install the tools. You may need to change the filenames as version numbers change – use the shell’s tab autocompletion to help with this.

mount /dev/cdrom /media/cdrom
cp /media/cdrom/VMwareTools-8.8.4-743747.tar.gz .
umount /dev/cdrom
tar -xzf VMwareTools-8.8.4-743747.tar.gz
vmware-tools-distrib/vmware-install.pl

You should be able to get by just by selecting all the default options. I’ve had a few problems getting this to work recently, possibly because the latest Ubuntu comes with version 3 of the Linux kernel whereas the tools build “Using kernel 2.6 build system”. Making sure your system is totally up-to-date usually does the trick.

We’ve now got an easily-accessible shell, but nothing else. We’ve not got the file sharing which had been one of our original aims. This is easy to fix now the tools are installed. You can get to the Virtual Machine Settings dialog from the Virtual Machine menu even though the VM is still running, and set up a file share.

Create a shared folder, and remember its name. I’ve got some web code in E:\Documents\Code\www, and I shared this with the VM. Click OK until you’re back at the shell. VMWare calls this file sharing the “Host Guest File System”, and puts all shares in /mnt/hgfs. A simple ls /mnt/hgfs/www showed that everything was working properly and the host’s files were showing through.

Now we’ve got a Linux server running locally, and I’ve shared files from my host system. I can edit any code from within Windows, and the files will appear immediately on the guest, making it dead easy to debug code. There’s just a few more steps to cover to complete the installation, and we’ll take a look at these next time.

Posted in Computing | 2 Responses