When designing a database schema, it’s worth trying to choose sensible sizes for each column. It’s obviously crucial to make sure that your field gives enough space to accommodate the data you want to store in it, but it’s good to keep the sizes down as far as possible in order to improve performance. Increasing the size of each row is going to come with a cost in storage and processing time. As might be obvious from the title, I learnt a lesson about this a couple of days ago.
Some months ago I built a new API, and we were particularly keen to be able to analyse its performance and the number of queries. Given we had a database connection handy, it seemed most convenient to log requests directly into a table. With hindsight this was probably a bad idea:
- Ultimately it’s going to be less performant than just writing to a log file
- As the requests pile up the table gets bigger, making backups huge and soon necessitating a regular cronjob to purge out older entries
On the other hand, logging to the database does give a massive advantage in terms of the ease of querying the data. With a few SQL statements it’s trivial to analyse load per hour, see how it varies over time, and group by other logged parameters (such as the service queried). We also built a plugin for Munin to allow us to produce realtime graphs of API load. While a log file can be queried in a similar way using the linux shell, that’s a rather less trivial process.
I built up the table as follows:
CREATE TABLE `api_requests` ( `id` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT, `service` VARCHAR(10) DEFAULT NULL, `ip` VARCHAR(20) NULL DEFAULT NULL, `query_string` VARCHAR(1024) NOT NULL DEFAULT '', `time_taken` FLOAT(9,6) NULL DEFAULT NULL, `timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (`id`) ) COLLATE='utf8_general_ci' ENGINE=InnoDB ;
This all seems OK so far. So which field is too small?
- The service field looks suspicious – there’s only 10 characters allowed. This turns out not to be a problem as the only services we’ve built have short names.
- What about the IP address? 20 characters is more than enough for IPv4, and we’re not being queried over IPv6.
- I guess the query string could be too short…?
If it’s not yet obvious the answer was the primary key, the ID. A MySQL MEDIUMINT UNSIGNED allows values up to 16,777,215, and after a few months the id grew to this level. At this point the server refused to insert any more rows with a cryptic error message about a unique key violation (the AUTO_INCREMENT value can’t grow any higher, and it tries to insert the new row with the same ID as an old one). A few minutes later (after a lucky catch by our API validation system) and the field size was big enough to last the next 300 years.
It’s obviously impossible to predict the future, but I think I’ll be erring a little more on the side of caution when designing future schemas.