Alternative to Using robots.txt

In my post Common Web Site Structuring Mistakes, I mentioned that robots.txt tends to defeat its own purpose of keeping search engine crawlers, more commonly known as robots, out of certain areas of your web site. Because anyone can read the file, it can end up revealing more than it restricts. In light of this, is there another way to get the function of robots.txt without revealing so much information? Lucky for you, there is.

Say hello to the robots meta tag… It does much the same job as robots.txt, one page at a time, but is far less obvious. Here’s what one looks like:

<meta name="robots" content="noindex,nofollow">

As with all meta tags, the above goes in the <head> section of your HTML. That line tells the robot that the page carrying the tag should not be indexed, and that the links on the page should not be followed. Compliant robots will then effectively ignore the existence of the page.
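For context, here is a minimal sketch of where the tag sits in a complete page. The title and body text are placeholders made up for illustration; only the robots meta line comes from the example above.

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <!-- Placeholder title, not from the original post -->
  <title>Private notes</title>
  <!-- Asks compliant robots not to index this page or follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
<body>
  <!-- Placeholder content -->
  <p>Content that should stay out of search results.</p>
</body>
</html>
```

Because the tag lives inside the page itself, it has to be added to every page you want excluded, and a robot has to fetch the page before it can see the instruction.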

Yes, I did say “compliant robots”. The last time I checked, the robots/crawlers from major search engines such as Google, Yahoo!, MSN, and Altavista supported the robots meta tag as well as robots.txt. Some of the more “evil” robots, however, don’t even follow the robots.txt rules; these are often the ones spammers use to harvest email addresses.

Another item you might want to consider adding to your robots meta tag is noarchive. This tells compliant search engines not to cache the contents of the page. So if your page appears in, let’s say, Google’s search results, it won’t have a Cached link under it.

So in summary:

<meta name="robots" content="noindex">
This tells the robot not to index the page.

<meta name="robots" content="nofollow">
This tells the robot not to follow links on the page.

<meta name="robots" content="noarchive">
This tells the robot not to archive the page.

<meta name="robots" content="noindex,nofollow,noarchive">
Multiple directives are possible and the options should be separated by commas.

A final caveat: just as not all robots obey the rules of robots.txt, not all of them honor the robots meta tag. There’s no 100% sure way to keep your pages from being found, other than not publishing them at all, of course.
