SSL Network problem follow-up

It’s now exactly a week since I blogged about my SSL issues over our network. To summarize, when fetching documents on the web via HTTPS from my servers, the connection would just hang halfway through until it timed out. I had confirmed that it wasn’t the infamous PMTU ICMP issue that is common if you’re fetching documents via HTTPS from a misconfigured web server. It was being caused by inbound HTTPS data packets getting dropped and when the retransmit would occur the retransmitted packets would also get dropped. Exactly the same packet every time would get dropped.

Last night we solved it. We’ve been working with Cisco for the last week and have been through several of their engineers with no progress. I was seeing packets arriving on my provider’s switch (we have a great working relationship and share a lot of data like sniffer logs) – but the packet was not arriving on my switch We had isolated it to the layer 2 infrastructure.

Last night we decided to throw a hail-mary and my provider changed the switch module my two HSRP uplinks were connected to from one 24 port module to another. And holycrap it fixed the problem. We then reconfigured routes and everything else so that the only thing that had changed was the 24 port module. And it was still fixed.

This is the strangest thing I’ve seen and the Cisco guys we were working with echoed that. It’s extremely rare for Layer 2 infrastructure which is fairly brain-dead to cause errors with packets that have a higher level protocol like HTTPS in common. These devices examine the layer-2 header with the MAC address and either forward the entire packet or not. The one thing we did notice is that the packets that were getting dropped were the last data packet in a PDU (protocol data unit) and were therefore slightly shorter by about 100 bytes than the other packets in the PDU that were stuffed full of data.

But we’ve exorcised the network ghosts and data is flowing smoothly again.