Can WordPress Developers survive without InnoDB? (with MyISAM vs InnoDB benchmarks)

Update: Thanks Matt for the mention and Joseph for the excellent point in the comments that WordPress in fact uses whatever MySQL’s default table handler is, and from 5.5 onwards, that’s InnoDB – and for his comments on InnoDB durability.

My development energy has been focused on WordPress.org a lot during the past few months for the simple reason that I love the publishing platform and it’s hard to resist getting my grubby paws stuck into the awesome plugin API.

To my horror I discovered that WordPress installs using the MyISAM table engine. [Update: via Joseph Scott on the Automattic Akismet team: WordPress actually specifies no table type in the create statements, so it uses MySQL's default table engine which is InnoDB from version 5.5 onwards. See the rest of his comment below.] I absolutely loathe MyISAM because it has burned me badly in the past when table locking killed performance in very high traffic apps I’ve built. Converting to InnoDB saved the day and so I have a lot of love for the InnoDB storage engine.

Many WordPress hosts like Hostgator don’t support anything but the default MyISAM table type that WordPress uses and they’ve made it clear they don’t plan to change.

While WordPress is a mostly read-only application that doesn’t suffer too badly with read/write locking issues, the plugin I’m developing will be doing a fair amount of writing, so the prospect of being stuck with MyISAM is a little horrifying.

So I set out to figure out exactly how bad my situation is being stuck with MyISAM.

I created a little PHP script (hey, it’s WordPress so gotta go with PHP) to benchmark MySQL with both table types. Here’s what the script does:

  • Create a table using the MyISAM or InnoDB table type (depending on what we’re benching)
  • The table has two integer columns (one is a primary key) and a 255 character varchar.
  • Insert 10,000 records with randomly generated strings of 255 chars in length.
  • Fork X number of processes (I start with one and gradually increase to 56 processes)
  • Each process has a loop that iterates a number of times so that we end up with 1000 iterations evenly spaced across all processes.
  • In each iteration we do an “insert IGNORE” that may or may not insert a record, a “select *” that selects a random record that may or may not exist, and then a “delete from table order by id asc limit 3″ and every second delete orders “desc” instead.
  • Because the delete is the most resource intensive operation I decided to bench it without the delete too.

The goal is not to figure out if WordPress needs InnoDB. The answer is that it doesn’t. The goal is to figure out if plugin developers like me should be frustrated or worried that we don’t have access to InnoDB on many WordPress hosting provider platforms.

 

NOTE: A lower graph is better in the benchmarks below.

 

MyISAM vs InnoDB doing Insert IGNORE and Selects up to 56 processes

The X axis shows number of threads and the Y axis shows the total time for each thread to complete it’s share of the loop iterations. Each loop does an “insert ignore” and a “select” from the primary key id.

The benchmark below is for a typical write intensive application where multiple threads (up to 56) are selecting and inserting into the same table. I was expecting InnoDB to murder MyISAM with this benchmark, but as you can see they are extremely close (look at the Y axis on the left) and are both very fast. Not only that but they are both very stable as concurrency increases.

 

 

MyISAM vs InnoDB doing Insert IGNORE, Selects and delete with order by

The X axis shows number of threads and the Y axis shows the total time for each thread to complete it’s share of the loop iterations. Each loop does an “insert ignore”, a “select” from the primary key id AND a “delete from table order by id desc limit 3 (or asc every second iteration).

This is a very intensive test in terms of locking because you have both the write operation of the insert combined with a small ordered range of records being deleted each iteration. I was expecting MyISAM to basically stall as threads increased. Instead I saw the strangest thing…..


Rather than MyISAM stalling and InnoDB getting better under a highly concurrent load, InnoDB gave me two spikes as concurrency increased. So I did the benchmark again because I thought maybe a cron job fired on my test machine…..

While doing the second test I kept a close eye on memory and the machine had plenty to spare. The only explanation I can come up with is that InnoDB did a buffer flush or log flush at the same point in each benchmark which killed performance.

 

Conclusions

Firstly, I’m blown away by the performance and level of concurrency that MyISAM delivers under heavy writes. It may be benefitting from concurrent inserts, but even so I would have expected it to get killed with my “delete order by” query.

I don’t think the test I threw at InnoDB gives a complete picture of how amazing the InnoDB storage engine actually is. I use it in extremely large scale and highly concurrent environments and I use features like clustered indexes and cascading deletes via relational constraints and the performance and reliability is spectacular. But as a basic “does MyISAM completely suck vs InnoDB?” test I think this is a useful comparison, if somewhat anecdotal.

Resources and Footnotes

You can find my benchmark script here. Rename it to a .php extension once you download it.

I’m running MySQL Server version: 5.1.49-1ubuntu8.1 (Ubuntu)

I’m running PHP:
PHP 5.3.3-1ubuntu9.5 with Suhosin-Patch
Zend Engine v2.3.0

Both InnoDB and MyISAM engines were tuned using the following parameters:

InnoDB:
innodb_flush_log_at_trx_commit = 0
innodb_buffer_pool_size = 256M
innodb_additional_mem_pool_size = 20M
innodb_log_buffer_size = 8M
innodb_max_dirty_pages_pct = 90
innodb_thread_concurrency = 4
innodb_commit_concurrency = 4
innodb_flush_method = O_DIRECT

MyISAM:
key_buffer = 100M
query_cache_limit = 1M
query_cache_size = 16M

Binary logging was disabled.
The Query log was disabled.

The machine I used is a stock Linode 512 instance with 512 megs of memory.
The virtual CPU shows up as four:
Intel(R) Xeon(R) CPU L5520 @ 2.27GHz
with bogomips : 4533.49

I’m running Ubuntu 10.10 Maverick Meerkat
I’m running: Linux dev 2.6.32.16-linode28 #1 SMP

MySQL GIS Extensions Quick Start

A friend is busy putting together a kick ass startup with a strong geographic component. He’s using Google Maps API version 3 which is a vast improvement (and total rewrite) from previous versions. But he needs to store and query his geographic data in a fast efficient way. There are many options out there, but if you want massive speed and scaleability you really want to use MySQL. So I’m writing this quickstart guide for his (and your) benefit.

Most websites don’t need to do complicated things like store polygon data. They just need to store points on a map and then retrieve those points. They also need to be able to ask the database for all points within a rectangle. So I’m going to run you through schema creation, inserting data, getting your lat/lon data out of the database again, and querying the database for all points within a rectangle. We’re also going to deal with the nasty little issue of asking the database for points in a rectangle that crosses the 180 degree boundary (or the International Date Line).

Why use MySQL’s GIS extensions?

The main, and possibly only reason is because you want speed. You could store lat/lon coordinates as decimal degrees in MySQL. Then when you query the database you’d say “Give me all records where the lat is > X and < Y and the lon is > A and < B. But MySQL (and many other databases) is slow when you’re doing range queries like that because it can’t use it’s regular B-Tree indexes effectively for a range query.

So instead you create a column called a geometry. Then you create an index on that column called a spatial index. This is really an R-Tree index that is very fast when you’re doing range queries.

It’s really that simple. You want to use MySQL’s GIS because spatial indexes are faster for lat/lon range queries than regular indexes. I honestly can’t think of another reason I’d go through the effort of storing/retrieving my data using GIS functions.

How do I create a table to store lat/lon points?

I’m assuming you know how to create a regular table in MySQL. You create a basic table containing coordinates like so:

CREATE TABLE geom (
  lat float(10,7) NOT NULL,
  lon float(10,7) NOT NULL,
  g GEOMETRY NOT NULL,
  SPATIAL INDEX(g)
) ENGINE=MyISAM;

Firstly note that I’m storing the coordinates as decimal degrees AND in a spatial column (g). I’m a little paranoid and I like to store the source data.

Note that I’ve specified the table type to be MyISAM. You currently can only create a spatial index on MyISAM tables. You CAN create geometry columns on other table types like InnoDB, but without a spatial index it’s going to be slow which defeates the whole point of using GIS extensions because the only reason you’re using them is SPEED. The down-side of using MyISAM is that if you’re going to be doing a lot of writes to your table (by a lot I mean more than 10 per second) then MyISAM is going to slow down because it doesn’t support row level locking. But if you’re just going to be adding a few hundred thousand records a day and doing a lot more reads than writes, then this will work just fine for you. And remember that you can always replicate this table to a small cluster of slaves and have your web servers query the slaves when you want to scale your website.

How do I insert data into my fancy new spatial table?

I’m going to assume you’ve figured out how to get the lat/lon coordinates you need from the Google Maps API or whatever your source is. Here’s how you insert the data:

INSERT INTO geom (lat, lon, g) VALUES
    (47.37, -122.21, GeomFromText('POINT(47.37 -122.21)'));

Some things to note here: The value inside the GeomFromText function is a string. So in your application you’re going to have to create that string by concatenating together ‘POINT(‘, your lat, a space, your lon and ‘)’. Then you’re probably going to prepare a statement that looks like:

insert into geom (lat, lon, g) values
   (?, ?, GeomFromText(?))

When you execute it you’ll pass in the decimal degrees and the string you created.

Great, so how do I get the data back out again?

MySQL will tell you to use the AsText function to turn that geometry point back into something your application can use or pass to the Google Maps API. But because you also stored it as decimal degrees you can just do:

select lat, lon from geom;

But what you really care about is getting it back out FAST! When you ask the database for all points inside a rectangle you need to define that rectangle. So if you imagine a map you need to give the database two points on that map. So we’ll use a point in the South West and a point in the North East.

Lets say you want all points inside the rectangle where the South West point is latitude 46, longitude -123 and the North East point is latitude 48 and longitude -121. NOTE: -121 is further east than -123 degrees of longitude. Just to make it clearer:

SW Lat: 46
SW Lon: -123
NE Lat: 48
NE Lon: -121

You’ll do the following query:

select lat, lon from geom where
   MBRContains(
    GeomFromText('Polygon((46 -123, 48 -123, 48 -121, 46 -121, 46 -123))'),
    p
   );

If all those numbers look a little confusing, what you’re actually doing is drawing a square (polygon) starting at the south west corner and ending back at the south west corner. A square has four corners, but you have to close the box for MySQL so you have to repeat the last coordinate. ‘p’ is the second parameter to MBRContains and it specifies which column in the table must be contained in the box you’ve created. Lets replace the coordinates with variables to make it easier to read:

select lat, lon from geom where
   MBRContains(
    GeomFromText('Polygon((swLat swLon, neLat swLon, neLat neLon,
      swLat neLon, swLat swLon))'),
   p );

NOTE: I had to break the above lines up for readability. You may want to have this all as a single line in your code.

Well that’s just peachy, but what if my rectangle is in the middle of the Pacific ocean and crosses the International Date Line?

The Earth is, unfortunately, round. If it were flat we could end the conversation here, I could get on with some work and you could do whatever it is you do on a Saturday at 11:46am mountain standard time. But the Earth is round, so you and I are stuck with each other for another few minutes.

The reason roundness matters is because if the rectangle you are painting on Earth crosses 180 (or -180) degrees of longitude, then you need to change your logic a little. Normally your south-west longitude will be less than your north-east longitude. West is less than east. But if your square crosses the dreaded 180 boundary, then your western longitude will be greater than your eastern longitude. For example you might have a western longitude of 170 and an eastern longitude of 10 degrees.

If you don’t deal with this little hiccup then when you ask the database for points that are inside a square that crosses the 180 boundary, then you’re going to get everything to the left and right of that square and nothing inside it. So you have to draw two squares on either side of the 180 boundary.

You do this in your application logic. I’m going to throw a little code at you. Here goes:

sub makeMBR {
 my ($swLat, $swLon, $neLat, $neLon) = @_;

 if($swLon > $neLon) {
 return (' (' .
 'MBRContains(GeomFromText(\'Polygon((' .
 $swLat . ' ' . $swLon . ',' . $neLat . ' ' . $swLon . ',' .
 $neLat . ' 180,' . $swLat . ' 180,' .
 $swLat . ' ' . $swLon .
 '))\'), g) ' .
 ' OR ' .
 ' MBRContains(GeomFromText(\'Polygon((' .
 $swLat . ' -180,' . $neLat . ' -180,' .
 $neLat . ' ' . $neLon . ',' . $swLat . ' ' . $neLon . ',' .
 $swLat . ' -180' .
 '))\'), g) ' .
 ') ', 1);

 } else {
 return (' MBRContains(GeomFromText(\'Polygon((' .
 $swLat . ' ' . $swLon . ',' . $neLat . ' ' . $swLon . ',' .
 $neLat . ' ' . $neLon . ',' . $swLat . ' ' . $neLon . ',' .
 $swLat . ' ' . $swLon .
 '))\'), g) ', 0);
 }
}

The code above is Perl. It looks horrendous but it’s actually quite simple. It creates the MBRContains() part of the SQL statement for you automatically. It simply says: If the swLon is greater than the neLon then create two boxes and ask the database for all points in both those boxes. Otherwise just create one box as per normal. The two boxes that area created are on either side of the dreaded 180 boundary.

The function actually returns two values. The first is the MBRContains string that you can combine with the SQL in your application and feed to the database. The second value is either a 1 or a zero. A 1 indicates that the box has crossed the dreaded 180 boundary and you’re actually asking for points in 2 boxes. A zero indicates that it’s a regular single box. You may want to use this value in your application to determine how things are displayed. Generally when a box crosses the 180 boundary I tend to zoom out a little more so the user can see what’s going on.

You’ll use this code like so:

my ($geoSQL, $crossesIDL) = makeMBR(46, -123, 48, -121);
#Then you'll run this query on the database:
$dbh->selectrow_array("select lat, lon from geom where $geoSQL");

Conclusion

If this helped you and you’ve discovered a tip that could help others or have something to add, please post a comment. Muchos gracias, baaie dankie, dis mos lekker by die see and have a spectacular day!

Mark.