How to reliably limit the amount of bandwidth your room mate or bad office colleague uses

Update: It seems I’ve created a monster. I’ve had my first two Google searchers arrive on this blog entry searching for “limit roomate downloading” and “netgear limit roomate”. Well after years of experimenting with QoS this is the best method I’ve found to do exactly that, so enjoy.

For part of the year I’m on a rural wifi network that, on a good day, gives me 3 megabits per second download speed and 700kbps upload speed. I’ve tried multiple rural providers, had them rip out their equipment because of the packet loss (that means you Skybeam), I’ve shouted at Qwest to upgrade the local exchange so we can get DSL, but for now I’m completely and utterly stuck on a 3 megabits downlink using Mile High Internet.

I have an occasional room-mate, my nephew, who downloads movies on iTunes and it uses about 1.5 to 3 megabits. I’ve tried configuring quality of service (QoS) on various routers including Netgear and Linksys/Cisco and the problem is that I need a zero latency connection for my SSH sessions to my servers. So while QoS might be great if everyone’s using non-realtime services like iTunes downloads and web browsing, when you are using SSH or a VoIP product like Skype, it really sucks when someone is hogging the bandwidth.

The problem arises because of the way most streaming movie players download movies. They don’t just do it using a smooth 1 megabit stream. They’ll suck down as much as your connection allows, buffer it and then use very little bandwidth for a few seconds, and then hog the entire connection again. If you are using SSH and you hit a key, it takes a while for the router to say: “Oh, you wanted some bandwidth, ok fine let me put this guy on hold. There. Now what did you want from me again? Hey you still there? Oh you just wanted one real-time keystroke. And now you’re gone. OK I guess I’ll let the other guy with a lower priority hog the bandwidth again until you hit another keystroke.”

So the trick, if you want to effectively deal with the movie downloading room-mate is to limit the amount of bandwidth they can use. That way netflix, iTunes, youtube, amazon unbox or any other streaming service has to use a constant 1 megabit rather than bursting to 3 megabits and then dropping to zero – and you always have some bandwidth available without having to wait for the router to do it’s QoS thing.

Here’s how you do it.

First install DD-WRT firmware on your router. I use a Netgear WNDR3300 router and after using various Linksys/Cisco routers I swear by this one. It has two built in radios so you can create two wireless networks, one on 2Ghz and one of 5Ghz. It’s also fast and works 100% reliably.

Then look up your router on dd-wrt’s site and download DD-WRT for your router and install it. I use version “DD-WRT v24-sp2 (10/10/09) std – build 13064″. There are newer builds available, but when I wrote this this was the recommended version.

Once you’re all set up and you have  your basic wireless network with DD-WRT, make sure you disable QoS (it’s disabled by default).

Then configure SSH on DD-WRT. It’s a two step process. First you have to click the “Services” tab and enable SSHd. Then you have to click the Administration tab and enable SSH remote management.

Only the paid version of DD-WRT supports per user bandwidth limits, but I’m going to show you how to do it free with a few shell commands. I actually tried to buy the paid version of DD-WRT to do this, but their site is confusing and I couldn’t get confirmation they actually support this feature. So perhaps the author can clarify in a comment.

Because you’re going to enter shell commands, I recommend adding a public key for password-less authentication when you log in to DD-WRT. It’s on the same DD-WRT page where you enabled  the SSHd.

Tip: Remember that with DD-WRT, you have to “Save” any config changes you make and then “Apply settings”. Also DD-WRT gets confused sometimes when you make a lot of changes, so just reboot after saving and it’ll unconfuse itself.

Now that you have SSHd set up, remote ssh login enabled and hopefully your public ssh keys all set up, here’s what you do.

SSH to your router IP address:

ssh root@192.168.1.1

Enter password.

Type “ifconfig” and check which interface your router has configured as your internal default gateway. The IP address is often 192.168.1.1. The interface is usually “br0″.

Lets assume it’s br0.

Enter the following command which clears all traffic control settings on interface br0:

tc qdisc del dev br0 root

Then enter the following:


tc qdisc add dev br0 root handle 1: cbq \
avpkt 1000 bandwidth 2mbit

tc class add dev br0 parent 1: classid 1:1 cbq \
rate 700kbit allot 1500 prio 5 bounded isolated

tc filter add dev br0 parent 1: protocol ip \
prio 16 u32 match ip dst 192.168.1.133 flowid 1:1

tc filter add dev br0 parent 1: protocol ip \
prio 16 u32 match ip src 192.168.1.133 flowid 1:1

These commands will rate limit the IP address 192.168.1.133 to 700 kilobits per second.

If you’ve set up automatic authentication and you’re running OS X, here’s a perl script that will do all this for you:

#!/usr/bin/perl

my $ip = $ARGV[0];
my $rate = $ARGV[1];

$ip =~ m/^\d+\.\d+\.\d+\.\d+$/ &&
$rate =~ m/^\d+$/ ||
die “Usage: ratelimit.pl\n”;

$rate = $rate . ‘kbit';

print `ssh root\@192.168.1.1 “tc qdisc del dev br0 root”`;

print `ssh root\@192.168.1.1 “tc qdisc add dev br0 root handle 1: cbq avpkt 1000 bandwidth 2mbit ; tc class add dev br0 parent 1: classid 1:1 cbq rate $rate allot 1500 prio 5 bounded isolated ; tc filter add dev br0 parent 1: protocol ip prio 16 u32 match ip dst $ip flowid 1:1 ; tc filter add dev br0 parent 1: protocol ip prio 16 u32 match ip src $ip flowid 1:1″`;

You’ll see a few responses for DD-WRT when you run the script and might see an error about a file missing but that’s just because you tried to delete a rule on interface br0 that might not have existed when the script starts.

These rules put a hard limit on how  much bandwidth an IP address can use. What you’ll find is that even if you rate limit your room mate to 1 megabit, as long as you have 500 kbit all to yourself, your SSH sessions will have absolutely no latency, Skype will not stutter, and life will be good again. I’ve tried many different configurations with various QoS products and have not ever achieved results as good as I’ve gotten with these rules.

Notes: I’ve configured the rules on the internal interface even though most QoS rules are generally configured on an external interface because it’s the only thing that really really seems to work. The Cisco engineers among you may disagree, but go try it yourself before you comment. I’m using the Linux ‘tc’ command and the man page is here.

PS: If you are looking for a great router to install DD-WRT on, try the Cisco-Linksys E3200. It has a ton of RAM and the CPU is actually faster at 500 MHz than the E4200 which is more expensive and only has a 480 MHz CPU. It also is the cheapest Gigabit Ethernet E series router that Cisco-Linksys offers. Here is the Cisco-Linksys E3200’s full specs on DD-WRT’s site. The E3200 is fully DD-WRT compatible but if you are lazy and don’t want to mess with DD-WRT, check out the built in QoS (Quality of Service) that the E3200 has built in on this video.

Bandwidth providers: Please follow Google’s lead in helping startups, the environment and yourselves

There’s a post on Hacker News today pointing to a few open source javascript libraries that Google is hosting on their content distribution network. ScriptSrc.net has a great UI that gives you an easy way to link to the libs from your web pages. Developers and companies can link to these scripts from their own websites and gain the following benefits:

  • Your visitor may have already cached the script on another website so your page will load faster
  • The script is hosted on a different domain which allows your browser to create more concurrent connections while fetching your content – another speed increase.
  • It saves you the bandwidth of having to serve that content up yourself which can result in massive cost savings if you’re a high traffic site.
  • Just like your visitor already cached the content, their workstation or local DNS server may also have the CDN’s IP address cached which further speeds load time.

While providing a service like this does cost Google or the providing company more in hosting, it provides an overall efficiency gain. Less bandwidth and CPU is used on the Web as a whole by Google providing this service. That means less cooling is required in data centers, less networking hardware needs to be manufactured to support the traffic on the web and so on.

The environment benefits as a whole by Google or another large provider hosting these frequently loaded scripts for us.

The savings are passed on to lone developers and startups who are using the scripts. For smaller companies who are trying to minimize costs while dealing with massive growth this can result in a huge cost savings that helps them to continue to innovate.

The savings are also passed on to bandwidth providers like NTT, AT&T, Comcast, Time Warner, Qwest and other bandwidth providers who’s customers consume less bandwidth as a result.

So my suggestion is that Google and bandwidth providers collaborate to come up with a package of the most used open source components online and keep the list up to date. Then provide local mirrors of each of these packages with a fallback mechanism if the package isn’t available. Google should define an IP address similar to their easy to remember DNS ip address 8.8.8.8 that hosts these scripts. Participating ISP’s route traffic destined for that IP address to a local mirror using a system similar to IP Anycast. An alternative URL is provided via a query string. e.g.

http://9.9.9.9/js/prototype.1.5.0.js?fallback=http://mysite.com/myjs/myprototype.1.5.0.js

If the local ISP isn’t participating the request is simply routed to Google’s 9.9.9.9 server as per normal.

If the local ISP (or Google) doesn’t have a copy of the script in their mirror it just returns a 302 redirect to the fallback URL which the webmaster has provided and which usually points to the webmaster’s own site. A mechanism for multiple fallbacks can easily be created e.g. fallback1, fallback2, etc.

Common scripts, icon libraries and flash components can be hosted this way. There may even be scenarios where a company (like Google) is used by such a large percentage of the Net population that it makes sense to put them on the 9.9.9.9 mirror system so that local bandwidth providers can serve up commonly used components rather than have to fetch them from via their upstream providers. Google’s logo for example.

SSL Network problem follow-up

It’s now exactly a week since I blogged about my SSL issues over our network. To summarize, when fetching documents on the web via HTTPS from my servers, the connection would just hang halfway through until it timed out. I had confirmed that it wasn’t the infamous PMTU ICMP issue that is common if you’re fetching documents via HTTPS from a misconfigured web server. It was being caused by inbound HTTPS data packets getting dropped and when the retransmit would occur the retransmitted packets would also get dropped. Exactly the same packet every time would get dropped.

Last night we solved it. We’ve been working with Cisco for the last week and have been through several of their engineers with no progress. I was seeing packets arriving on my provider’s switch (we have a great working relationship and share a lot of data like sniffer logs) – but the packet was not arriving on my switch We had isolated it to the layer 2 infrastructure.

Last night we decided to throw a hail-mary and my provider changed the switch module my two HSRP uplinks were connected to from one 24 port module to another. And holycrap it fixed the problem. We then reconfigured routes and everything else so that the only thing that had changed was the 24 port module. And it was still fixed.

This is the strangest thing I’ve seen and the Cisco guys we were working with echoed that. It’s extremely rare for Layer 2 infrastructure which is fairly brain-dead to cause errors with packets that have a higher level protocol like HTTPS in common. These devices examine the layer-2 header with the MAC address and either forward the entire packet or not. The one thing we did notice is that the packets that were getting dropped were the last data packet in a PDU (protocol data unit) and were therefore slightly shorter by about 100 bytes than the other packets in the PDU that were stuffed full of data.

But we’ve exorcised the network ghosts and data is flowing smoothly again.

SSL Timeouts and layer 3 infrastructure

I’ve spent the last 5 days agonizing over a very hard problem on my network. Using curl, LWP::UserAgent, openssl, wget or any other SSL client, I’d see connections either timeout or hang halfway through the transfer. Everything else works fine including secure protocols like SSH and TLS. In fact inbound SSL connections work great too. It’s just when I connect to an external SSL host that it hiccups.

If you remember your OSI model, SSL is well above layer 3 (IP addresses and routers) and layer 2 (LAN traffic routed via MAC addresses). So the last place I planned to look was the network infrastructure.

I eliminated specific clients by trying others and I eliminated the OS by spinning up virtual machines running other versions of Linux. I elminated my physical hardware by reproducing it on a non Dell server and having one of the ops guys repro it on his OS X macbook.

And just to prove it was the network, which is all that was left, I set up a VPN from one of my machines that tunnelled all traffic over the VPN to a machine on an external network that acted as the router, thereby encapsulating the layer 2 and 3 traffic in a layer 4 and 5 VPN. And the problem went away. So I knew it was the network.

Tonight a few minutes ago my colo provider took down my local router and I gracefully failed over to the redundant router, and lo and behold the problem has gone away.

I still don’t know what it is, but what I do know is that a big chunk of layer 3 infrastructure has been changed and it’s fixed a layer 5 problem. What’s weird is that TCP connections (which is what SSL rides on top of) have delivery confirmation. So if the problem was packet loss, TCP would just request the packet again. So it’s something else and something that only affects SSL – and only connections bound from my servers out to the Internet.

The reason I’m posting this is because during the hours I spent Googling this issue this week (and finding nothing) I saw a lot of complaints about SSL timeouts and no solutions. So if you’re getting timeouts like this, check your underlying infrastructure and you might just be surprised. To verify that it’s a network problem, set up a VLAN using PPTP. Set up NAT on the external linux machine that is your VLAN server. Then disable the default gateway on the machine having the issue (the VLAN client) and verify that all traffic is routing via your VLAN. Then try and reproduce the SSL timeout and if it doesn’t occur, it’s probably your layer 2 or 3 infrastructure.