Advanced WordPress: How to get Real WordPress Commenter IP Addresses behind your Nginx Proxy

News [April 24th, 2012]: I’ve launched Wordfence to permanently fix your WordPress site’s security issues. Click here to learn more.

If you run a reasonably high traffic blog on a small Linode server like this one, it’s a really good idea to set up an Nginx front-end proxy for your Apache server. It lets you handle relatively high traffic without running out of apache children while keeping keep-alive enabled.

You can read more about how to set up Nginx and other tips on my Basic WordPress Speedup page.

If you have set up Nginx, you’ll notice that your comments no longer have the real IP address of visitors to your site. They’re all 127.0.0.1 or something similar.

The way I solved this was to edit my php.ini file. On my Ubuntu server this lives in /etc/php5/apache2/php.ini

I modifed the auto_prepend_file variable to look like this:

auto_prepend_file = /etc/php5/apache2/mdm.php

Then in the mdm.php file I put this:


<?php
$mdm_headers = apache_request_headers();
$_SERVER['REMOTE_ADDR'] = $mdm_headers["X-Forwarded-For"];
?>

 

This assumes you have the following line in your Nginx.conf to forward the real IP address:

proxy_set_header X-Forwarded-For $remote_addr;

 

Which programming language should I learn?

I’ve been asked this question twice in the last 2 weeks by people wanting to write their first Web application. So I’m going to answer it here for anyone else interested:

If you want to write Web applications you need to learn the following languages: Javascript, PHP, HTML, CSS and SQL. It sounds like a lot, but it really is not. You can learn enough of each of these languages to write a basic Web application within a week. Trust me. It’s easy!

PHP is the guts of your Web application. It is the language that runs on your web server. It is also the only language where you have a choice about learning it or learning another language. You must learn HTML, CSS and Javascript and 99% of web programmers learn SQL to talk to a database. But there are many other languages to choose from that can do the same thing that PHP does.

However, if you are starting out writing web applications, PHP is the first server language you should learn and here is why:

  1. PHP is used by a huge number of websites, both big and small. Most of Facebook is written in PHP. Wikipedia powered by Mediawiki is written in PHP.
  2. WordPress, the worlds most popular open source blog platform is written in PHP. If you know PHP you can change it any way you like or even contribute to the community. WordPress is used by eBay, Yahoo, Digg, The Wall Street Journal, Techcrunch, TMZ, Mashable and of course the whole of WordPress.com is powered by WordPress written in PHP.
  3. Most of the worlds best content management systems are written in PHP.
  4. The PHP community is massive and supportive, unlike the Ruby on Rails community for example.
  5. 99% of web programmers can understand PHP, even though some don’t realize it. (like Perl Developers)
  6. PHP is a mature language which means the bugs have all been ironed out and it runs fast!
  7. If you Google a question you have about PHP, you have a much higher likelihood of finding an answer than any other server programming language.
  8. Don’t learn Perl because even though it’s a mature, fast and popular language, it’s harder to learn than PHP.
  9. Don’t learn Java because Java is better suited to launching spacecraft and running systems that control oil rigs or banking software than Web applications. It is strongly typed which means that you need to write more lines of code to get the same thing done. It’s also harder to learn because it’s a purer object oriented language . It also is owned by Oracle which means it’s a commercial language and that means Oracle will continually be trying to sell you stuff by making things seem harder than they are and claiming they have the answer to the problem they created in your mind.
  10. Don’t learn .NET because it’s also a commercial language and pretty much everything made by Microsoft either will cost you money or will break a lot.
  11. Don’t learn Ruby because the guys who run the community are total a-holes who will insult you for asking beginner questions. Ruby is also way less popular than PHP or Perl even though it’s used to power Twitter. It’s also the reason Twitter is down so often.
  12. Don’t learn Brainf*ck, Cobol, D, Erlang, Fortran, Go, Haskell, Lisp, OCaml, Python or Smalltalk because these are languages that people tell you they know to show off. Some of them have specific advantages like parallelism, being a pure object oriented language or being compact. But they are not for you if you are starting out. In fact, the combination of PHP and Javascript will give you 99.99999% of what all these languages offer.
You also need to learn two presentation languages: HTML and CSS. They are actually part of each other because HTML is not too useful without CSS and vice versa.

HTML tells the browser the structure and content of a page. e.g. Put a form after a paragraph and have one field for email and one for full name.

CSS tells the browser how to make that page look e.g. Which fonts to use, what size they should be, what colors, how wide or tall things on the page should be, how thick borders should be, how much padding to use and how thick to make margins.

Then you also need to learn a data storage language called SQL which lets you talke to a database to store things like visitor names, email addresses and so on. For example, using SQL you can tell a database to store an email address and full name by saying “INSERT INTO visitors (name, email) values (‘Mark Maunder’, ‘mark@example.com’);. There are other ways to store data and a popular terrorist movement calling itself NoSQL has formed in the last 4 years and they spend their time sowing fear and doubt about SQL and confusing beginners like you. The reality is that 99% of web applications use SQL and continue to use SQL. It works, it’s fast, it’s easy to learn and everyone understands it. It’s used by WordPress, Wikipedia, Facebook and everyone else who counts, whether they like it or not. Just learn SQL!! I also recommend you use MySQL to store your web application’s data (even though it’s owned by Oracle) because it’s the most popular open source database out there. PHP applications use MySQL more than any other database engine on the web.

To summarize, so far you need to learn:

  • Javascript (a programming language that runs on inside your visitor’s browser)
  • PHP (a programming language that runs on the server)
  • HTML (a presentation language that tells the browser the structure of a page)
  • CSS (a presentation language that tells the browser how to make a page look once it’ knows the structure)
  • SQL (a data access language that lets you store and retrieve data from a database)
Each of these languages runs or executes in a certain place or environment:
  • Javascript runs inside the browser of someone who has visited your website. What’s cool about this is that it uses your visitor’s CPU and memory instead of the resources on your server.
  • PHP runs on your own web server. Most websites use a kind of “container” or application server to run PHP called Apache Web Server with something called mod_php installed. Apache handles all the web server stuff like receiving the request for the document and making sure it’s formatted correctly. It then passes the request to mod_php which is executing your PHP code. This actually runs your web application written in PHP, your program sends the response back to Apache which sends it back to your visitor.
  • HTML is interpreted by a visitor’s browser and tells the browser how to structure the page as it loads.
  • CSS is also interpreted by a visitor’s browser and tells the browser how to make the HTML look.
  • SQL is a language that you use inside your PHP application to talk to a database like MySQL. You will actually write SQL in your PHP code but it will be sent to the database engine which is where it is interpreted and executed. The database then sends your PHP code whatever it asked for, if anything. (sometimes you’re just inserting data and not asking a question)
As you progress you will get familiar with the platforms you run each of these languages on. They include:
  • Linux is the operating system you will run on your server. Everything else on your server runs on top of Linux. Linux lets your web application talk to the server’s hardware.
  • Apache Web Server running mod_php. This is the application server you will use to run your PHP code. It will receive the web requests, forward them to your PHP code, and receive the response which is forwarded to your visitor.
  • MySQL database engine. You will talk to mysql using SQL which is written inside your PHP code.

One last note to help you in your language decision making. It’s important that you understand there are a few phenomenon that may confuse you in your language research:

The first is that some software developers have little life beyond writing software and have large egos. One of the few things they have to impress you with is their own intelligence. They will try to make programming sound harder than it actually is. It’s not hard. It’s easy.

Secondly, an arrogant programmer may regale you with a list of programming languages to choose from and tell you that he or she knows them all. They may make the choice sound complicated. It’s not. They’re just showing off. Choose PHP and the set of tools listed above and you’ll be fine.

Third, remember that there is always something new and shiny coming out that will get a lot of attention and is advertised to “change the way we…” or will “make everything you know about programming irrelevant”. Ignore the noise and stay focused on the basics. Until a new language, operating system, application or piece of hardware has been around for a while (usually at least 5 years), it’s going to be full of bugs, run slow, break often and it will be hard to get help by Googling because few people are using it and have had the problem you’re having.

Lastly, many companies like Google and Facebook spend a lot of time and energy trying to attract the best software engineers in the world. Google associated themselves with NASA purely for this reason, even though they’re in completely different businesses. To draw attention to themselves as thought leaders in software they talk a lot about languages like Erlang, Haskell and so on. The reality is that their bread and butter languages are pretty ordinary – languages like C++ and PHP. So don’t get confused when you see Facebook talking about using Erlang for real-time chat. They’re just showing off. Their bread and butter is PHP, HTML, CSS, SQL and Javascript, like most of the rest of the Web.

Who am I and how dare I express an opinion on this? I’ve been programming web applications since 2 years after the Web was invented. I’m the CEO and CTO of a company who’s web apps are seen by over 200 million unique people every month. I also own the company.  I’ve seen languages and platforms come and go including Netscape Commerce Server, Java Applets, Visual Basic, XML, NetWare, Windows NT, Microsoft IIS, thin clients, network computers, etc.

The Web is Simple. Programming is Easy. Now go have fun!!

 

Can WordPress Developers survive without InnoDB? (with MyISAM vs InnoDB benchmarks)

Update: Thanks Matt for the mention and Joseph for the excellent point in the comments that WordPress in fact uses whatever MySQL’s default table handler is, and from 5.5 onwards, that’s InnoDB – and for his comments on InnoDB durability.

My development energy has been focused on WordPress.org a lot during the past few months for the simple reason that I love the publishing platform and it’s hard to resist getting my grubby paws stuck into the awesome plugin API.

To my horror I discovered that WordPress installs using the MyISAM table engine. [Update: via Joseph Scott on the Automattic Akismet team: WordPress actually specifies no table type in the create statements, so it uses MySQL’s default table engine which is InnoDB from version 5.5 onwards. See the rest of his comment below.] I absolutely loathe MyISAM because it has burned me badly in the past when table locking killed performance in very high traffic apps I’ve built. Converting to InnoDB saved the day and so I have a lot of love for the InnoDB storage engine.

Many WordPress hosts like Hostgator don’t support anything but the default MyISAM table type that WordPress uses and they’ve made it clear they don’t plan to change.

While WordPress is a mostly read-only application that doesn’t suffer too badly with read/write locking issues, the plugin I’m developing will be doing a fair amount of writing, so the prospect of being stuck with MyISAM is a little horrifying.

So I set out to figure out exactly how bad my situation is being stuck with MyISAM.

I created a little PHP script (hey, it’s WordPress so gotta go with PHP) to benchmark MySQL with both table types. Here’s what the script does:

  • Create a table using the MyISAM or InnoDB table type (depending on what we’re benching)
  • The table has two integer columns (one is a primary key) and a 255 character varchar.
  • Insert 10,000 records with randomly generated strings of 255 chars in length.
  • Fork X number of processes (I start with one and gradually increase to 56 processes)
  • Each process has a loop that iterates a number of times so that we end up with 1000 iterations evenly spaced across all processes.
  • In each iteration we do an “insert IGNORE” that may or may not insert a record, a “select *” that selects a random record that may or may not exist, and then a “delete from table order by id asc limit 3″ and every second delete orders “desc” instead.
  • Because the delete is the most resource intensive operation I decided to bench it without the delete too.

The goal is not to figure out if WordPress needs InnoDB. The answer is that it doesn’t. The goal is to figure out if plugin developers like me should be frustrated or worried that we don’t have access to InnoDB on many WordPress hosting provider platforms.

 

NOTE: A lower graph is better in the benchmarks below.

 

MyISAM vs InnoDB doing Insert IGNORE and Selects up to 56 processes

The X axis shows number of threads and the Y axis shows the total time for each thread to complete it’s share of the loop iterations. Each loop does an “insert ignore” and a “select” from the primary key id.

The benchmark below is for a typical write intensive application where multiple threads (up to 56) are selecting and inserting into the same table. I was expecting InnoDB to murder MyISAM with this benchmark, but as you can see they are extremely close (look at the Y axis on the left) and are both very fast. Not only that but they are both very stable as concurrency increases.

 

 

MyISAM vs InnoDB doing Insert IGNORE, Selects and delete with order by

The X axis shows number of threads and the Y axis shows the total time for each thread to complete it’s share of the loop iterations. Each loop does an “insert ignore”, a “select” from the primary key id AND a “delete from table order by id desc limit 3 (or asc every second iteration).

This is a very intensive test in terms of locking because you have both the write operation of the insert combined with a small ordered range of records being deleted each iteration. I was expecting MyISAM to basically stall as threads increased. Instead I saw the strangest thing…..


Rather than MyISAM stalling and InnoDB getting better under a highly concurrent load, InnoDB gave me two spikes as concurrency increased. So I did the benchmark again because I thought maybe a cron job fired on my test machine…..

While doing the second test I kept a close eye on memory and the machine had plenty to spare. The only explanation I can come up with is that InnoDB did a buffer flush or log flush at the same point in each benchmark which killed performance.

 

Conclusions

Firstly, I’m blown away by the performance and level of concurrency that MyISAM delivers under heavy writes. It may be benefitting from concurrent inserts, but even so I would have expected it to get killed with my “delete order by” query.

I don’t think the test I threw at InnoDB gives a complete picture of how amazing the InnoDB storage engine actually is. I use it in extremely large scale and highly concurrent environments and I use features like clustered indexes and cascading deletes via relational constraints and the performance and reliability is spectacular. But as a basic “does MyISAM completely suck vs InnoDB?” test I think this is a useful comparison, if somewhat anecdotal.

Resources and Footnotes

You can find my benchmark script here. Rename it to a .php extension once you download it.

I’m running MySQL Server version: 5.1.49-1ubuntu8.1 (Ubuntu)

I’m running PHP:
PHP 5.3.3-1ubuntu9.5 with Suhosin-Patch
Zend Engine v2.3.0

Both InnoDB and MyISAM engines were tuned using the following parameters:

InnoDB:
innodb_flush_log_at_trx_commit = 0
innodb_buffer_pool_size = 256M
innodb_additional_mem_pool_size = 20M
innodb_log_buffer_size = 8M
innodb_max_dirty_pages_pct = 90
innodb_thread_concurrency = 4
innodb_commit_concurrency = 4
innodb_flush_method = O_DIRECT

MyISAM:
key_buffer = 100M
query_cache_limit = 1M
query_cache_size = 16M

Binary logging was disabled.
The Query log was disabled.

The machine I used is a stock Linode 512 instance with 512 megs of memory.
The virtual CPU shows up as four:
Intel(R) Xeon(R) CPU L5520 @ 2.27GHz
with bogomips : 4533.49

I’m running Ubuntu 10.10 Maverick Meerkat
I’m running: Linux dev 2.6.32.16-linode28 #1 SMP

How to integrate PHP, Perl and other languages on Apache

I have this module that a great group of guys in Malaysia have put together. But their language of choice is PHP and mine is Perl. I need to modify it slightly to integrate it. For example, I need to add my own session code so that their code knows if my user is logged in or not and who they are.

I started writing PHP but quickly started duplicating code I’d already written in Perl. Fetch the session from the database, de-serialize the session data, that sort of thing. I also ran into issues trying to recreate my Perl decryption routines in PHP. [I use non-mainstream ciphers]

Then I found ways to run Perl inside PHP and vice-versa. But I quickly realized that’s a very bad idea. Not only are you creating a new Perl or PHP interpreter for every request, but you’re still duplicating code, and you’re using a lot more memory to run interpreters in addition to what mod_php and mod_perl already run.

Eventually I settled on creating a very lightweight wrapper function in PHP called doPerl. It looks like this:

$associativeArrayResult = doPerl(functionName, associativeArrayWithParameters);

function doPerl($func, $arrayData){
 $ch = curl_init();
 $ip = '127.0.0.1';
 $postData = array(
 json => json_encode($arrayData),
 auth => 'myPassword',
 );
 curl_setopt($ch,CURLOPT_POST, TRUE);
 curl_setopt($ch,CURLOPT_POSTFIELDS, $postData);
 curl_setopt($ch, CURLOPT_URL, "http://" . $ip . "/webService/" . $func . "/");
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 $output = curl_exec($ch);
 curl_close($ch);
 $data = json_decode($output, TRUE);
 return $data;
}

On the other side I have a very fast mod_perl handler that only allows connections from 127.0.0.1 (the local machine). I deserialise the incoming JSON data using Perl’s JSON::from_json(). I use eval() to execute the function name that is, as you can see above, part of the URL. I reserialize the result using Perl’s JSON::to_json($result) and send it back to the PHP app as the HTML body.

This is very very fast because all PHP and Perl code that executes is already in memory under mod_perl or mod_php. The only overhead is the connection creation, sending of packet data across the network connection and connection breakdown. Some of this is handled by your server’s hardware. [And of course the serialization/deserialization of the JSON data on both ends.]

The connection creation is a three way handshake, but because there’s no latency on the link it’s almost instantaneous. The transferring of data is faster than a network because the MTU on your lo interface (the 127.0.0.1 interface) is 16436 bytes instead of the normal 1500 bytes. That means the entire request or response fits inside a single packet. And connection termination is again just two packets from each side and because of the zero latency it’s super fast.

I use JSON because it’s less bulky than XML and on average it’s faster to parse across all languages. Both PHP and Perl’s JSON routines are ridiculously fast.

My final implementation on the PHP side is a set of wrapper classes that use the doPerl() function to do their work. Inside the classes I use caching liberally, either in instance variables, or if the data needs to persist across requests I use PHP’s excellent APC cache to store the data in shared memory.

Update: On request I’ve posted the perl web service handler for this here. The Perl code allows you to send parameters via POST using either a query parameter called ‘json’ and including escaped JSON that will get deserialized and passed to your function, or you can just use regular post style name value pairs that will be sent as a hashref to your function. I’ve included one test function called hello() in the code. Please note this web service module lets you execute arbitrary perl code in the web service module’s namespace and doesn’t filter out double colon’s, so really you can just do whatever the hell you want. So I’ve included two very simple security mechanisms that I strongly recommend you don’t remove. It only allows requests from localhost, and you must include an ‘auth’ post parameter containing a password (currently set to ‘password’). You’re going to have to implement the MM::Util::getIP() routine to make this work and it’s really just a one liner:

sub getIP {
 my $r = shift @_;
 return $r->headers_in->{'X-Forwarded-For'} ? 
    $r->headers_in->{'X-Forwarded-For'} : 
    $r->connection->get_remote_host();
}