I’ve spent the last 5 days agonizing over a very hard problem on my network. Using curl, LWP::UserAgent, openssl, wget or any other SSL client, I’d see connections either timeout or hang halfway through the transfer. Everything else works fine including secure protocols like SSH and TLS. In fact inbound SSL connections work great too. It’s just when I connect to an external SSL host that it hiccups.
If you remember your OSI model, SSL is well above layer 3 (IP addresses and routers) and layer 2 (LAN traffic routed via MAC addresses). So the last place I planned to look was the network infrastructure.
I eliminated specific clients by trying others and I eliminated the OS by spinning up virtual machines running other versions of Linux. I elminated my physical hardware by reproducing it on a non Dell server and having one of the ops guys repro it on his OS X macbook.
And just to prove it was the network, which is all that was left, I set up a VPN from one of my machines that tunnelled all traffic over the VPN to a machine on an external network that acted as the router, thereby encapsulating the layer 2 and 3 traffic in a layer 4 and 5 VPN. And the problem went away. So I knew it was the network.
Tonight a few minutes ago my colo provider took down my local router and I gracefully failed over to the redundant router, and lo and behold the problem has gone away.
I still don’t know what it is, but what I do know is that a big chunk of layer 3 infrastructure has been changed and it’s fixed a layer 5 problem. What’s weird is that TCP connections (which is what SSL rides on top of) have delivery confirmation. So if the problem was packet loss, TCP would just request the packet again. So it’s something else and something that only affects SSL – and only connections bound from my servers out to the Internet.
The reason I’m posting this is because during the hours I spent Googling this issue this week (and finding nothing) I saw a lot of complaints about SSL timeouts and no solutions. So if you’re getting timeouts like this, check your underlying infrastructure and you might just be surprised. To verify that it’s a network problem, set up a VLAN using PPTP. Set up NAT on the external linux machine that is your VLAN server. Then disable the default gateway on the machine having the issue (the VLAN client) and verify that all traffic is routing via your VLAN. Then try and reproduce the SSL timeout and if it doesn’t occur, it’s probably your layer 2 or 3 infrastructure.