Compression

A link recently bubbled up the frontpage of Hacker News, and I found myself rather curious to find out more. The headline rather proudly proclaimed: “RJSON: compress JSON to JSON”. JSON is a useful and simple way of transmitting data between machines on the Internet, and is commonly-used by many dynamic websites today (see Wiki for more details). Surely a way of compressing these transmissions would speed up the world? Hurrah!

Let’s take a look at the page itself.¬†Basically, it’s a library which lets you get from here:

{
	"id": 7,
	"tags": ["programming", "javascript"],
	"users": [
		{"first": "Homer", "last": "Simpson"},
		{"first": "Hank", "last": "Hill"},
		{"first": "Peter", "last": "Griffin"}
	],
	"books": [
		{"title": "JavaScript", "author": "Flanagan", "year": 2006},
		{"title": "Cascading Style Sheets", "author": "Meyer", "year": 2004}
	]
}

To here:

{
	"id": 7,
	"tags": ["programming", "javascript"],
	"users": [
		{"first": "Homer", "last": "Simpson"},
		[2, "Hank", "Hill", "Peter", "Griffin"]
	],
	"books": [
		{"title": "JavaScript", "author": "Flanagan", "year": 2006},
		[3, "Cascading Style Sheets", "Meyer", 2004]
	]
}

The page also helpfully links to alternative compression schemes called JSON DB and CJSON, which each produce slightly dXMLifferent compressed output.

Let’s consider a number of questions:

  • Does this really compress the file?
  • Is it useful?
  • Should I use it?

Does this really compress the file?

When comparing the two compressed versions I was curious to note that they use a slightly different example for compression. The RJSON site used an abridged version of the example on the JSON DB page. Let’s take a look at these examples and how they compress. For the purposes of the analysis I’ve used UNIX line-endings and not minified the examples in any way. To produce the data I used the online demos for RJSON and CJSON.

  • Example 1 (Short version from RJSON page)
    • Original: 340 bytes
    • RJSON (claimed): 279 bytes
    • RJSON (via demo): er, 549 bytes
    • CJSON (via demo): 308 bytes
  • Example 2 (Longer version from JSON DB page)
    • Original: 711 bytes
    • RJSON: 725 bytes
    • JSON DB: 548 bytes
    • CJSON: 429 bytes

Let’s be charitable and assume that RJSON really works fine, but the demo is broken. If we take the claimed compression then we save an underwhelming (340 – 279) / 340 = 17.9% of the file size. JSON DB manages 22.9%, although I suspect much of this improvement is caused by the larger example input which contains¬† more repeated strings. These algorithms target these repeated strings (the keys for the JSON objects) and so perform better with a larger input. The compression performance will tend to a limit as the size increases, and the back of an envelope shows that this could be quite a large limit for RJSON if we keep adding more “users” to their example (around 50%). JSON DB and CJSON work similarly, and I’m not going to discuss them further.

I can’t help feeling we’re missing something crucial here – with such large potential gains why is the filesize increasing? Clearly the examples are far too short, but they also have a crucial flaw: the whitespace. Indentation and newlines make it clear what’s going on, but don’t represent the theoretical compression we could have by removing those as well. JSMin is a tool for stripping such spaces, and there’s an online version too to make things easy.

  • Example 1
    • Original: 340 bytes
    • Original minified: 284 bytes
    • RJSON minified: 232 bytes
  • Example 2
    • Original: 711 bytes
    • Original minified: 456 bytes
    • RJSON minified: 326 bytes
  • Example 3, made from the first example by retaining only the user section and repeating it 31 times:
    • Original: 4015 bytes
    • Original minified: 3243 bytes
    • RJSON minified: 1632 bytes

Here’s the start of Example 3:

{
"users": [
	{"first": "Homer", "last": "Simpson"},
	{"first": "Hank", "last": "Hill"},
	{"first": "Peter", "last": "Griffin"},
	{"first": "Homer", "last": "Simpson"},
	{"first": "Hank", "last": "Hill"},
	...

Now we’re getting somewhere, with percentage savings of 18.3%, 28.5% and 49.7%. It’s easy to show that, as we repeat the user section more times, this ratio will tend to a maximum of around 50.5%.

Finally, it’d be useful to compare to another compression algorithm to keep things in context, and gzip is a good example. Minifying and gzipping we get:

  • Example 1: 208 bytes
  • Example 2: 260 bytes
  • Example 3: 127 bytes

RJSON is beaten marginally in the first case, convincingly in the second case, but resoundingly in the third. The astonishing compression for the third example is another measurement error: gzip is really good at compressing repeated strings, and 32 copies of the same data compress really well.

Is it useful?

Compression Rate

Well, we’ve got some compression, but not as good as gzip. Clearly the system will be useful in situations where gzip isn’t available, but isn’t worthwhile otherwise.

Support

To use the algorithm we need the code (which, ironically, needs to be sent to the client before it can be used). The code for gzip is already built in to web browsers, meaning we get that for free when downloading data from websites. There’s still some use, however, because data sent from the browser to the server won’t be compressed.

Alternatives

The first question in my mind on seeing these schemes was: “Why?” The developer of the application has control over the data being transmitted, so here’s a quick way to compress the file: change the field names. Changing “first” to “f” and “last” to “l” for Example 3 above drops the filesize to 2571 bytes, a 20.7% saving for virtually no effort. This clearly isn’t as good, but has one crucial advantage: we’re still transmitting JSON, and don’t need any special code to decompress it. Restructuring the code further could have similar gains to the RJSON output without mangling the data too much.

Should I use it?

JSON is already a pretty slender protocol (when compared, say, to XML), so does it really need much compression? Using short keys and only sending the minimum of data required will save a large amount of transmission with very little effort. I see two big disadvantages with any JSON compression scheme implemented in code:

  • It’s implemented in code. Code is really dangerous – all programmers should try really hard to write less code. Code needs to be maintained, adds to the complexity of a project, and contains bugs. Your transmission is failing: now you’ve got a whole extra chunk of code that could be causing the problem.
  • It’s not as transparent. At least the data is still JSON, but it’s not the clear, elegant JSON which is really easy to read. The first step in investigating the failing transmission would be to take a look at what’s being transmitted, but that is now a trickier proposition.

As far as I’m concerned, anything which introduces complexity at the same time as making your application harder to debug is a lose-lose proposition, especially when the gains are reasonably marginal compared with common-sense optimisations. So is this trade-off ever worth it? The answer, of course, depends on what you’re building.

This entry was posted in Computing. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>