This post is the first part of the series Ultimate Guide to SEO. To get the most out of this article you will want to read the entire series.
What is the Robots Exclusion Protocol (REP)?
Webmasters often need to keep private data out of the search index, and I use robots.txt on this site to keep the search engines away from parts of the site that contain duplicate content. This is where the Robots Exclusion Protocol, better known as robots.txt, comes in. Since it was introduced in the early ‘90s, REP has become the de facto standard by which web publishers specify which parts of their site they want public and which parts they want to keep private.
Wikipedia’s definition of REP:
The robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard complements Sitemaps, a robot inclusion standard for websites.
The REP is still evolving based on the needs of the entire internet community; however, there isn’t a true standard followed by all of the major search engines. Although Google has worked with Microsoft and Yahoo, each engine has its own implementation of the protocol, which is why it is important to understand the differences between them.
This article explores Google’s implementation thoroughly and briefly touches on where Microsoft’s and Yahoo’s implementations differ. My explanations are based on the detailed documentation Google has released on how it implements REP.
Robots.txt Directives
According to Google’s documentation these directives are implemented by all three major search engines: Google, Microsoft, and Yahoo.
Disallow
Allow
$ Wildcard Support
Wildcard Support
Sitemaps Location
Disallow
Google: Tells a crawler not to index your site; your site’s robots.txt file still needs to be crawled to find this directive, but the disallowed pages will not be crawled
Use Cases: ‘No Crawl’ page from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.
Mark: This is probably the most commonly used directive. Since Googlebot penalizes duplicate content, I use Disallow extensively in my robots.txt to hide duplicate content. This directive is also useful if you want to hide private or subscription data from bots. Although subscription pages will probably be password protected, you should add them to the disallow list as well, just in case.
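A minimal sketch of a Disallow block, assuming hypothetical /private/ and /archives/ paths that you don’t want crawled:

User-agent: *
# keep subscription data and duplicate archive listings away from all bots
Disallow: /private/
Disallow: /archives/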
Allow
Google: Tells a crawler the specific pages on your site you want indexed so you can use this in combination with Disallow
Use Cases: This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it
Mark: The allow clause will trump the disallow. This is helpful if you want to allow a specific page in a directory that would normally be disallowed.
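A minimal sketch of Allow trumping Disallow, assuming a hypothetical /members/ directory in which only the signup page should be crawled:

User-agent: *
Disallow: /members/
# the one exception inside the blocked directory
Allow: /members/signup.html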
$ Wildcard Support
Google: Tells a crawler to match everything from the end of a URL – large number of directories without specifying specific pages
Use Cases: ‘No Crawl’ files with specific patterns, for example, files with certain filetypes that always have a certain extension, say pdf
Mark: If you have an upload folder for your blog or website and you want to restrict a specific filetype but allow images to be indexed, you could use this directive.
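For example, a sketch that blocks every PDF in a hypothetical /uploads/ folder while leaving the images in it crawlable:

User-agent: *
# $ anchors the match to the end of the URL, so only .pdf files are blocked
Disallow: /uploads/*.pdf$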
* Wildcard Support
Google: Tells a crawler to match a sequence of characters
Use Cases: ‘No Crawl’ URLs with certain patterns, for example, disallow URLs with session ids or other extraneous parameters
Mark: This is probably the second most used directive, in conjunction with Disallow. It allows you to match multiple directories at once. One word of caution: test your robots.txt with Google’s Webmaster Tools.
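A sketch that keeps URLs carrying a session id out of the crawl (the sessionid parameter name is hypothetical):

User-agent: *
# matches any URL that contains a session id anywhere in its path or query string
Disallow: /*sessionid=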
Sitemaps Location
Google: Tells a crawler where it can find your Sitemaps
Use Cases: Point to other locations where feeds exist to help crawlers find URLs on a site
Mark: It is a good idea to create a sitemap for your website and include it in your robots.txt. You can also tell Google where your sitemap is through Webmaster Tools.
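For example, assuming your sitemap sits at the root of your domain, a single line anywhere in robots.txt will do:

Sitemap: http://www.example.com/sitemap.xml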
HTML META Directives
Not only can you provide rules that search engine bots must follow through robots.txt, you can also specify rules per HTML page. This is often required for sites that want the search spider to follow links through to other pages but to refrain from indexing a specific page itself. I use this method on this blog: I want the search spiders to follow my links from category and archive pages but to exclude the category and archive listings themselves, since they contain duplicate content.
The following HTML META directives are implemented by all three major search engines: Google, Microsoft, and Yahoo.
NOINDEX META Tag
NOFOLLOW META Tag
NOSNIPPET META Tag
NOARCHIVE META Tag
NOODP META Tag
I will first give you the exact description given by Google’s documentation and then give you my own explanation along with examples where necessary.
NOINDEX META Tag
Google: Tells a crawler not to index a given page.
Use Cases: Don’t index the page. This allows pages that are crawled to be kept out of the index.
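As a sketch, the tag goes in the page’s <head> section. The crawler can still fetch the page and follow its links; the page itself simply stays out of the results.
Example: <meta name="robots" content="noindex">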
NOFOLLOW META Tag
Google: Tells a crawler not to follow a link to other content on a given page.
Use Cases: Prevent publicly writable areas from being abused by spammers looking for link credit. By using NOFOLLOW you let the robot know that you are discounting all outgoing links from this page.
Mark: A good place to put this tag is on outgoing links in comment areas. Wikipedia uses this method on all external links placed on wiki pages. It should also be noted that this can be used not only for entire pages but for individual links:
Example: <a href="http://Somespamlink.com" rel="nofollow">Some comment spam</a>
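To discount every outgoing link on a page rather than a single link, the same value can go in a robots META tag in the page’s <head>:
Example: <meta name="robots" content="nofollow">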
NOSNIPPET META Tag
Google: Tells a crawler not to display snippets in the search results for a given page.
Use Cases: Present no snippet for the page on Search Results.
NOARCHIVE META Tag
Google: Tells a search engine not to show a “cached” link for a given page.
Use Cases: Do not make available to users a copy of the page from the Search Engine Cache.
NOODP META Tag
Google: Tells a crawler not to use a title and snippet from the Open Directory Project for a given page.
Use Cases: Do not use the ODP (Open Directory Project) title and snippet for this page.
Mark: If you do a quick search for Marksanborn.net you will see that my Google snippet is:
Includes howtos and guides for Linux, PHP, Security, SEO and Software.
This matches the description in The Open Directory Project (http://www.dmoz.com) exactly. If I wanted to change my snippet I would add this directive to my index page.
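Note that these META directives can be combined in a single robots tag. A sketch that suppresses the snippet, the cached copy, and the ODP description all at once:
Example: <meta name="robots" content="nosnippet, noarchive, noodp">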
Targeting specific Search Spiders (user-agents)
Every visitor to a website identifies itself with a user-agent string, and you can use that same string in your robots.txt to specify different rules for different search spiders. For example, you could deny access to your archive articles because of Google’s duplicate content penalty while still allowing the same archives to Yahoo’s bot (see the sketch after the user-agent list below).
Googlebot can be identified not only by its user-agent string but also through reverse DNS based authentication, which provides an alternative way to verify the crawler’s identity.
Here are some common user-agents for search engines:
Googlebot - Google
Googlebot-Image - Google Image
msnbot-Products - Windows Live Search
Mediapartners-Google - Google Adsense
Yahoo! Slurp - Yahoo!
Baiduspider+( - Baidu
ia_archiver - Alexa
Ask Jeeves - Ask Jeeves
Gigabot/ - Gigabot
msnbot-media/ - MSN Media
W3C_*Validator - W3C Validator
Feedfetcher-Google - Google Feedfetcher
msnbot-NewsBlogs/ - MSN News Blogs
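Putting the user-agent string together with the directives above, here is a minimal sketch of per-bot rules (the /archives/ path is hypothetical; Yahoo’s crawler matches on Slurp):

# keep Googlebot out of the archive listings
User-agent: Googlebot
Disallow: /archives/

# let Yahoo's crawler see everything (an empty Disallow blocks nothing)
User-agent: Slurp
Disallow: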
So where can I put these rules?
These exclusion rules can be applied to HTML and non-HTML documents alike. The most common place for them is robots.txt, a file that compliant bots check before crawling and whose rules they follow for the entire domain. All you have to do is create a text file named ‘robots.txt’ in the root directory of your domain.
These robot exclusions can also be applied to non-HTML files such as PDF and video files using the X-Robots-Tag; you place these directives in the file’s HTTP response header.
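As a sketch, the header itself is a single line in the HTTP response, for example:

X-Robots-Tag: noindex, noarchive

Assuming an Apache server with mod_headers enabled, you could send it for every PDF on the site from your configuration or .htaccess file:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>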
Google also has a Webmaster Tools feature that simulates its bots crawling your site. It shows you which URLs were excluded by your robots.txt, and you can even modify your robots.txt for testing purposes through the tool. This helps make sure you don’t accidentally exclude the wrong directories or pages.
Robots.txt for Wordpress
Since the default installation of Wordpress is full of duplicate content, I have created a robots.txt file to focus the Google bot’s attention on indexing the actual articles. Before I implemented my robots.txt file, my RSS feed page and archive listings were indexed ahead of the actual articles. Here is an example of my robots.txt that removes most of the duplicate content in Wordpress. If you want to know more about Wordpress’ duplicate content problems, check out Duplicate Content Causes SEO Problems in Wordpress.
User-agent: *
Disallow: /wp-
Disallow: /search
Disallow: /feed
Disallow: /comments/feed
Disallow: /feed/$
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
Disallow: /*/*/*/trackback/$
References
Below are links to the official documentation for REP from the three major search engines: Google, Microsoft and Yahoo.
Google’s Policy
Microsoft’s Policy
Yahoo’s Policy
Robots.txt for Publishers (PDF)
Read the rest of The Ultimate SEO Guide.