Blogger to customers: Your blog will now run on multiple domains so we can censor it

The world's largest blog host by a wide margin, Blogger (also known as Blogspot.com), has now actively started redirecting visitors to country-code top-level domains (ccTLDs) based on which country they are in.

I run a real-time analytics service and we have roughly 700,000 Blogspot customers. At 1AM on January 30th (UTC) we saw a huge number of new domains appear on our radar. Most of these were new Blogspot domains ending in the .in country code, and several other ccTLDs appeared as well.

The way this new configuration works is as follows. If you have example.blogspot.com as your blog:

  • If visitors arrive from a country in which this is not enabled by Blogger, they will see example.blogspot.com as per usual.
  • If visitors arrive from a country that has requested, or may in the future request, that Google censor content, the visitor is redirected to example.blogspot.ccTLD, where ccTLD is replaced with a country top-level domain: example.blogspot.in in India or example.blogspot.com.au in Australia, for example.

The effect of this is:

  1. Blog owners are likely to be looking at their blog on a different domain than their visitors. E.g. you will see your blog on example.blogspot.co.nz if you are in New Zealand, while your visitors will be visiting it on domains like example.blogspot.co.za, example.blogspot.in, example.blogspot.com.au, etc.
  2. Because your blog now lives on multiple domains, your content is duplicated on the Web. Google claims to deal with this by setting a canonical tag in your HTML that points to the .com domain, so crawlers will not be confused.
  3. Your visitors are now spread across as many websites as Google has country top-level domains for Blogger. Rather than having a single page about Bordeaux wines, you instantly have 10 or 20 pages about Bordeaux wines, all identical in every way except the URL, with your visits spread evenly across them.
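Point 2 above can be checked from the outside: fetch any ccTLD copy of a post and confirm its canonical link still points at the .com domain. Here is a minimal sketch in Python's standard library; the page snippet and URL are made up for illustration.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            a = dict(attrs)
            if a.get("rel") == "canonical":
                self.canonical = a.get("href")

def find_canonical(html):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical

# A ccTLD copy of a post should still declare the .com URL as canonical:
page = '''<html><head>
<link rel="canonical" href="http://example.blogspot.com/2012/01/post.html"/>
</head><body>...</body></html>'''
print(find_canonical(page))
# -> http://example.blogspot.com/2012/01/post.html
```

If the canonical tag is present and correct, search crawlers should fold the ccTLD duplicates back into the .com page, though the visitor-facing URL split remains.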

A URL, or Uniform Resource Locator, has always been a canonical string that represents the location of a page on the Web. Modifying the world's largest blog hosting service to break this convention in order to enable Web censorship, by Google no less, leaves me deeply concerned. I can only speculate that either Google is throwing Blogspot under the bus, or Google's view of their company and its role on the Web has become deeply flawed.

WordPress Security: Hardening and Malware list removal

Big News [April 24th, 2012]: I’ve launched Wordfence to permanently fix your WordPress site’s security issues. Click here to learn more.

I spent some time yesterday reaching out to folks I know to get input on WordPress security, on avoiding being listed as malware, and on how to get removed from the malware list. Rand Fishkin, the founder of SEOmoz and all-round SEO god, was kind enough to introduce me to Justin Briggs, an SEO consultant and guru. Justin quickly came back with the following advice:

WordPress is certainly more susceptible to malicious attacks due to its popularity and the large number of sites that can be compromised with an exploit.
The best preemptive solution is to keep up on updates and increase security associated with WordPress.
Here are two good articles on ways to improve WordPress security.
WordPress offers an article on hardening WordPress:
If a site is compromised, Google will make an effort to get in touch with you. They outline how they attempt this here:
http://www.google.com/support/webmasters/bin/answer.py?answer=163633#3
They also offer some additional tips:
Once a site has been cleaned up, you can send a request to Google:
A friend's site was exploited several months ago. It was a bit of work to get it cleaned up, but the warning was removed relatively quickly after submitting the request to Google.
I contacted friends who are current and former Google employees, but had no luck getting in touch with the malware team. In general it's hard to connect with folks inside the big G on questions that are usually handled by support teams [as I've been politely told in the past]. 🙂

SEO: Don’t use private registration

This one is short and sweet. A new domain wasn't getting any SEO traffic after 2 months. As soon as the registration was made non-private, i.e. we removed the DomainsByProxy mask on who owns the domain, it started getting traffic and has been growing ever since.

Correlation does not equal causation, but it does give me pause.

While ICANN has made it clear that the whois database has one purpose only, Google publicly stated that they became a registrar to "increase the quality of our search results".


Posted in SEO

SEO: Google may treat blogs differently

A hobby site of mine has around 300,000 pages indexed and good PageRank. It gets a fair amount of SEO traffic, which has been growing, and the rate at which Google indexes the site has been climbing steadily; it is now indexing around 2 to 3 pages per second.

About a week ago I added a new page that was linked to from most other pages on the site. The page had a query string variable called "ref". The instant it went live, Googlebot went crazy indexing the page, treating every permutation of "ref" as a different page, even though the generated page was identical every time. The page quickly appeared in Google's index. I solved it by telling Googlebot to ignore "ref" through Webmaster Tools and by temporarily disallowing indexing via robots.txt.
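The general fix for this class of problem, when you control the crawler or analytics side rather than relying on Webmaster Tools, is to normalize away tracking parameters so every "ref" permutation collapses to one URL. A rough sketch; the parameter name "ref" comes from the story above, everything else is illustrative:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_params(url, ignored=("ref",)):
    """Drop named query parameters (e.g. tracking refs) so variant
    URLs collapse to a single canonical address."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in ignored]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_params("http://example.com/page?ref=sidebar&id=7"))
# -> http://example.com/page?id=7
print(strip_params("http://example.com/page?ref=footer&id=7"))
# -> http://example.com/page?id=7
```

Both "ref" variants map to the same string, which is exactly the deduplication Googlebot was failing to do on its own.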

A week later I added another new page. This time I used WordPress as a CMS and created a URL, let's call it "/suburl/", and published the new page as "/suburl/blog-entry-name.html". Again I linked to it from every page on the site.

Googlebot took a sniff at "/suburl/" and at "/suburl/?feed=rss2", and a day later it grabbed "/suburl/author/authorname", but it never put the page in its search index and hasn't visited since. The bot continues to crawl the rest of the site aggressively.

Back in 2009, Matt Cutts (Google search quality team) mentioned that “WordPress takes care of 80-90% of (the mechanics of) Search Engine Optimization (SEO)”.

A different interpretation is that “WordPress gives Google a machine readable platform with many heuristics that can be used to more accurately assess page quality”.

One of those heuristics is age of the blog and number of blog entries. Creating a fresh blog on a fresh domain or subdomain and publishing a handful of affiliate targeted pages is a common splog (spam blog) tactic. So it’s possible that Google saw my one-page-blog and decided the page doesn’t get put in the index until the blog has credibility.

So from now on when I have content to put online, I’m going to consider carefully whether I’m going to publish it using WordPress as a CMS with just a handful of blog entries, or if I’m going to hand-publish it (which has worked well for me so far).

Let me know if your mileage varies.

Posted in SEO

How much traffic do the biggest typo domains get?

There's an article on Search Engine Land today about domaining and how Google and Yahoo "make money off a twitter typo domain". I'm not sure I'm as excited about exposing this travesty of justice as SEL is, but I was curious how much traffic typo domains get:

Alexa domain typo traffic

In my brief research I found that facebok.com was by far the biggest winner, with twiter.com running a distant second. But their traffic dropped off to a trickle in the middle of this year. I wonder if Facebook itself or a popular app mistyped a URL somewhere and then fixed it.

Other variations of Facebook, Twitter, Google and MySpace didn't yield much. For comparison I entered a high-traffic site whose exact numbers I have access to, and by my estimates facebok.com was getting just under half a million uniques per month. Nothing compared to the real FB, but slapping remnant advertising on it would yield $1,000 to $5,000 per month. Twiter.com gets around a quarter million uniques per month, netting around $500 to $2,500 on remnant ads.

How to easily cross-post your linkbait

In my recent podcast we chatted about Linkbait. Linkbait is simply the act of writing a headline for a blog entry or page that will generate a very high click rate and then publicizing that page. If you’re not sure how to write great headlines, start with this page of 10 Sure-Fire headline formulas that work.

If you’re writing great headlines for your blog entries and are looking for places to publicize them, check out socialposter.com. It’s a bookmarklet you drag onto your browser bar. Then you go to the page you want to promote, drag your mouse to select the text on the page you want to use as the summary, and then click the bookmarklet. It lets you easily cross-post to these websites:

Digg.com
Netscape.com
Reddit.com
Del.icio.us
Stumbleupon.com
Google.com/Bookmarks
Myweb2.search.yahoo.com
Technorati.com
Indianpad.com
Socialogs.com
Furl.net
Diigo.com
Wirefan.com
Bibsonomy.org
Looklater.com
Blinklist.com
Blogmemes.net
Bluedot.us
Myjeeves.ask.com
Simpy.com
Backflip.com
Spurl.net
Newsvine.com
Netvouz.com
Grupl.com
Blinkbits.com
Bmaccess.net
Shadows.com
Ma.gnolia.com
Scuttle.org
Smarking.com
Blogmarks.net
Plugim.com
Linkagogo.com
Dotnetkicks.com
Mister-wong.de
Favorites.live.com
Wdclub.com
Yigg.de

Posted in SEO

Competitive intelligence tools

In an earlier post I suggested that too much competitive analysis too early might be a bad idea. But it got me thinking about the tools that are available for gathering competitive intelligence about a business and what someone else might be using to gather data about my business.

Archive.org

One of my favorites! Use archive.org to see how your competitor's website evolved from the early days until now. If they have a robots.txt blocking ia_archiver (archive.org's web crawler) then you're not going to see anything, but most websites don't block it. Here's Google's early home page from 1998.

For extra intel, combine Alexa with archive.org: find out when your competitor's traffic spiked, then look at their pages from those dates on archive.org to try to figure out what they did right.

Yahoo Site Explorer

Site Explorer is useful for seeing who's linking to your competitor, i.e. who you should be getting backlinks from.

Netcraft Site Report

Netcraft has a toolbar of its own. Take a look at the site rank to get an indication of traffic, and click the number to see who has roughly the same traffic. The page also shows useful details like which hosting facility your competitor uses.

Google pages indexed

What interests me more than PageRank is the number of pages of content a website has, and which of those are indexed and ranking well. Search for 'site:example.com' on Google to see all pages that Google has indexed for a given website. Smart website owners don't optimize for individual keywords or phrases; instead they provide a ton of content that Google indexes, then work on getting a good overall rank for the site and search engine traffic across a wide range of keywords. I blogged about this recently on a friend's blog; it's called the long-tail approach.

If I’m looking at which pages my competitor has indexed, I’m very interested in what specific content they’re providing. So often I’ll skip to result 900 or more and see what the bulk of their content is. You may dig up some interesting info doing this.

Technorati Rank, Links and Authority

If you're researching a competing blog, use Technorati. Look at the rank, blog reactions (inbound links, really) and the Technorati authority. Authority is the number of blogs that have linked to the blog you're researching in the last 6 months.

Alexa

Sites like Alexa, Comscore and Compete are incredibly inaccurate and easy to game; just read this piece by the CEO of Plenty of Fish. Alexa provides only an approximation of traffic, and it's subject to anomalies that throw the stats wildly off. Like the time Digg.com overtook Slashdot.org in traffic: someone on Digg posted an article about the win, all the Digg visitors went to Alexa to look at the stats, and many installed the toolbar. The result was a big jump in Digg's traffic according to Alexa when nothing had actually changed.

Google PageRank

PageRank is only updated about once every 2 or more months. New sites can be getting a ton of traffic and have no PageRank, while older sites can have huge PageRank but very little content and only rank well for a few keywords. Install the Google Toolbar to see PageRank for any site; you may have to enable it in advanced options.

nmap

This may get you blocked by your ISP and may even be illegal, so I'm mentioning it only for informational purposes and because it may be used on you. nmap is a port scanning tool that will tell you what services someone is running on their server, what operating system they're running, what other machines are on the same subnet, and so on. It's a favorite of hackers looking for potential targets, and it has the potential to slow down or harm a server. It's also quite easy to detect when someone runs it against your server and to find out who they are. So don't go and load this on your machine and run it.

Compete

Compete is basically an Alexa clone. I never use it because I've checked sites that I have real data on and Compete seems way off. They claim to provide demographics too, but if the basics are wrong, how can you trust the demographics?

whois

I use the unix command-line whois, but you can use whois.net if you're not a geek. We use a domain proxy service to preserve our privacy, but many people don't. You'll often dig up interesting data in whois, like who the parent company of your competitor is, or who founded the company and still owns the domain name. Try googling any corporate or personal names you find and you might come up with even more data.
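If you end up checking whois for many competitor domains, it's handy to script the field extraction. A best-effort sketch; the sample record below is entirely made up, and real whois output varies wildly between registrars:

```python
import re

def whois_field(raw, field):
    """Pull a single labeled field (e.g. 'Registrant Organization')
    out of raw whois text. Labels differ by registrar, so treat this
    as a heuristic parse, not a spec."""
    m = re.search(rf"^{re.escape(field)}:\s*(.+)$",
                  raw, re.MULTILINE | re.IGNORECASE)
    return m.group(1).strip() if m else None

# Hypothetical whois output, for illustration only:
sample = """Domain Name: EXAMPLE.COM
Registrant Organization: Acme Holdings Inc.
Creation Date: 2004-03-17
"""
print(whois_field(sample, "Registrant Organization"))
# -> Acme Holdings Inc.
```

You would feed this the output of the command-line whois (e.g. via subprocess); the function itself just does the text scraping.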

HTML source of competitors site

Just glance at the headers, footers and any comments in the middle of the pages. Sometimes you can tell what server platform they're running, and sometimes a careless developer has commented out code that's part of an as-yet-unreleased feature.

Personal blogs of competitors and staff

If you’re researching linebuzz.com and you’re my competitor, then it’s a good idea to keep an eye on this blog. I sometimes talk about the technology we use and how we get stuff done. Same applies for your competitors. Google the founders and management team, find their blogs and read them regularly.

dig (not Digg.com)

dig is another unix tool; it queries DNS servers. Much of this data is available from netcraft.com, mentioned above, but you can use dig to find out who your competitor uses for email with 'dig mx example.com', and you can do a reverse lookup on an IP address, which may help you find out who their ISP is (Netcraft gives you this too).

Another useful thing dig gives you is an indication of how your competitor is set up for web hosting: whether they're using round-robin DNS or a single IP behind a load balancer.
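The A-record side of this can be scripted with nothing but the Python standard library (which, note, has no MX lookup; that's what 'dig mx' is for). Getting more than one address back suggests round-robin DNS. The 'localhost' lookup is used here only so the example resolves without leaving the machine:

```python
import socket

def a_records(hostname):
    """Return all IPv4 addresses for a hostname, or [] if it doesn't
    resolve. More than one address usually means round-robin DNS
    rather than a single server."""
    try:
        _, _, addrs = socket.gethostbyname_ex(hostname)
        return addrs
    except socket.gaierror:
        return []

addrs = a_records("localhost")  # resolves locally via /etc/hosts
print(addrs)
if len(addrs) > 1:
    print("Multiple A records: likely round-robin DNS")
```

Run the same function against a competitor's hostname and compare the address list with what Netcraft reports about their hosting setup.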

traceroute

Another unix tool. Run '/usr/sbin/traceroute www.example.com' and you'll get the path your traffic takes to reach your competitor's servers. Look at the last few router hostnames before the final destination. You may learn which country and/or city your competitor's servers are based in and which hosting provider they use. There's a rather crummy web-based traceroute here.

Google alerts

Set up Google news, blog and search alerts for both your competitors' brands and your own, because your competitors may mention you in a blog comment or somewhere else.

There is a lot more information available via SEC filings, Secretary of State websites and so on – perhaps the subject of a future entry.