A while back, Jobster CTO Phil Bogle blogged about some of the tricks I’ve used to do fast location queries in SQL. The link to my SQL query to generate the zip lookup table for radius searches is now dead (a cybersquatter stole my domain name and I don’t want to discuss it!). So here’s the original blog post:
If you need to build a radius search for something on your website, then creating a zip code distance lookup table performs much better than calculating distance for every search. This monster chunk of SQL will create that table for you. It takes about 12 hours to run on a fast machine.
insert into zip_dist2 (fromzip, tozip, dist) select z1.zip, z2.zip, ROUND((3956 * 2 * atan2(sqrt((POW(sin(((z1.lat * (atan2(1,1) * 4) / 180) – (z2.lat * (atan2(1,1) * 4) / 180))/2.0),2) + cos((z2.lat * (atan2(1,1) * 4) / 180)) * cos((z1.lat * (atan2(1,1) * 4) / 180)) * POW(sin(((z1.lon * (atan2(1,1) * 4) / 180) – (z2.lon * (atan2(1,1) * 4) / 180))/2.0),2))),sqrt(1-(POW(sin(((z1.lat * (atan2(1,1) * 4) / 180) – (z2.lat * (atan2(1,1) * 4) / 180))/2.0),2) + cos((z2.lat * (atan2(1,1) * 4) / 180)) * cos((z1.lat * (atan2(1,1) * 4) / 180)) * POW(sin(((z1.lon * (atan2(1,1) * 4) / 180) – (z2.lon * (atan2(1,1) * 4) / 180))/2.0),2))))),2) as distance from zip_data as z1, zip_data as z2 where (3956 * 2 * atan2(sqrt((POW(sin(((z1.lat * (atan2(1,1) * 4) / 180) – (z2.lat * (atan2(1,1) * 4) / 180))/2.0),2) + cos((z2.lat * (atan2(1,1) * 4) / 180)) * cos((z1.lat * (atan2(1,1) * 4) / 180)) * POW(sin(((z1.lon * (atan2(1,1) * 4) / 180) – (z2.lon * (atan2(1,1) * 4) / 180))/2.0),2))),sqrt(1-(POW(sin(((z1.lat * (atan2(1,1) * 4) / 180) – (z2.lat * (atan2(1,1) * 4) / 180))/2.0),2) + cos((z2.lat * (atan2(1,1) * 4) / 180)) * cos((z1.lat * (atan2(1,1) * 4) / 180)) * POW(sin(((z1.lon * (atan2(1,1) * 4) / 180) – (z2.lon * (atan2(1,1) * 4) / 180))/2.0),2))))) < 100 and z2.zip != z1.zip
This generates a lookup table that contains two zip codes and their distance from each other for every zip in the USA within 100 miles of each other. It uses the Haversine formula to calculate the distance between two points on the surface of a sphere.
I used this when designing our vertical search engine to create the lookup tables we use for our radius search. I use MySQL for this. You’re also going to need a zip_data table that contains zip codes and their respective latitudes and longitudes. You can buy this data for about $50 from the many retailers online.
Pingback: Open Source Seattle | Blog Archive | Free data for the taking
It’s not. I’ve tried it. MySQL’s implementation of the trig functions is very fast and it’s very good at fast disk access/caching. There may be a faster algorithm or some additional optimizations. One that occurs to me is using a bounding square before the haversine calculation which would take advantage of mysql’s indexing.
This might be useful in terms of collecting the data: http://www.census.gov/geo/www/gazetteer/places2k.html
Aside from that, SQL seems a bit inefficient for this job. I would probably do it in python or perl, then import. I’m guessing that would be a lot faster.