2016-09-09



This article is part of an SEO series from WooRank. Thank you for supporting the partners who make SitePoint possible.

A robots.txt file is a plain text file that specifies whether crawlers should or shouldn't access specific folders, subfolders or pages, along with other information about your site. The file uses the Robots Exclusion Standard, a protocol established in 1994 for websites to communicate with crawlers and other bots. It's absolutely essential that you use a plain text file: creating a robots.txt file with HTML or a word processor adds extra markup that search engine crawlers can't read, so they'll ignore your instructions.

How Does It Work?

When a site owner wants to give some guidance to web crawlers, they put their robots.txt file in the root directory of their site, e.g. https://www.example.com/robots.txt. Bots that follow this protocol will fetch and read the file before fetching any other file from the site. If the site doesn’t have a robots.txt, the crawler will assume the webmaster didn’t want to give any specific instructions and will go on to crawl the entire site.

Robots.txt is made up of two basic parts: User-agent and directives.

User-Agent

User-agent is the name of the spider being addressed, while the directive lines provide the instructions for that particular user-agent. The User-agent line always goes before the directive lines in each set of directives. A very basic robots.txt looks like this:
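User-agent: Googlebot
Disallow: /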

These directives instruct the user-agent Googlebot, Google’s web crawler, to stay away from the entire server — it won’t crawl any page on the site. If you want to give instructions to multiple robots, create a set of user-agents and disallow directives for each one.
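For example:

User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /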

Now both Google and Bing’s user-agents know to avoid crawling the entire site. If you want to set the same requirement for all robots, you can use what’s called a wildcard, represented with an asterisk (*). So if you want to allow all robots to crawl your entire site, your robots.txt file should look like this:
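User-agent: *
Disallow: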

It's worth noting that search engines will choose the most specific user-agent directives they can find. So, for example, say you have four sets of user-agents: one using a wildcard (*), one for Googlebot, one for Googlebot-News and one for Bingbot, and your site is visited by the Googlebot-Image user-agent. That bot will follow the instructions for Googlebot, as that is the most specific set of directives that applies to it.
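For instance, take a file with these four groups (the disallowed paths here are just placeholders):

User-agent: *
Disallow: /not-for-bots/

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Googlebot-News
Disallow: /not-for-google-news/

User-agent: Bingbot
Disallow: /not-for-bing/

Googlebot-Image has no group of its own here, so it obeys the Googlebot group and stays out of /not-for-google/.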

The most common search engine user-agents are:

User-Agent              Search Engine    Field
baiduspider             Baidu            General
baiduspider-image       Baidu            Images
baiduspider-mobile      Baidu            Mobile
baiduspider-news        Baidu            News
baiduspider-video       Baidu            Video
bingbot                 Bing             General
msnbot                  Bing             General
msnbot-media            Bing             Images & Video
adidxbot                Bing             Ads
Googlebot               Google           General
Googlebot-Image         Google           Images
Googlebot-Mobile        Google           Mobile
Googlebot-News          Google           News
Googlebot-Video         Google           Video
Mediapartners-Google    Google           AdSense
AdsBot-Google           Google           AdWords
slurp                   Yahoo!           General
yandex                  Yandex           General

Disallow

The second part of robots.txt is the disallow line. This directive tells spiders which pages they aren’t allowed to crawl. You can have multiple disallow lines per set of directives, but only one user-agent.

You don’t have to put any value for the disallow directive; bots will interpret an empty disallow value to mean that you aren’t disallowing anything and will access the entire site. As we mentioned earlier, if you want to deny access to the entire site to a bot (or all bots), use a slash (/).
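For example, this group lets bots crawl everything:

User-agent: *
Disallow:

while this one blocks the entire site:

User-agent: *
Disallow: /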

You can get granular with disallow directives by specifying specific pages, directories, subdirectories and file types. To block crawlers from a specific page, use that page's relative path in the disallow line. For example, to block a single page, say /old-page.html:
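User-agent: *
# Replace with the path of the page you want to block
Disallow: /old-page.html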

Block access to whole directories the same way. For example, to block everything in a directory, say /archive/:
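User-agent: *
# Replace with the directory you want to keep crawlers out of
Disallow: /archive/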

You can also use robots.txt to block bots from crawling certain file types by using a wildcard and file type in the disallow line:
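User-agent: *
Disallow: /*.ppt
Disallow: /images/*.jpg
Disallow: /duplicatecontent/copy*.html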

While the robots.txt protocol technically doesn't support the use of wildcards, search engine bots are able to recognize and interpret them. So in the directives above, a robot will treat the asterisk as a wildcard that matches any sequence of characters in the URL path.

For example, it would be able to figure out that www.example.com/presentations/slideshow.ppt and www.example.com/images/example.jpg are disallowed, while www.example.com/presentations/slideshowtranscript.html isn't. The third directive disallows crawling of any file in the /duplicatecontent/ directory that starts with 'copy' and ends in '.html'. So these pages are blocked:

/duplicatecontent/copy.html

/duplicatecontent/copy1.html

/duplicatecontent/copy2.html

/duplicatecontent/copy.html?id=1234

However, it would not disallow any instances of ‘copy.html’ stored in another directory or subdirectory.

One issue you might encounter with your robots.txt file is that some URLs you actually want crawled contain one of your excluded patterns. From our earlier example of Disallow: /images/*.jpg, that directory might contain a file called 'description-of-.jpg.html'. That page would not be crawled because it matches the exclusion pattern. To resolve this, add a dollar sign ($) to signify the end of the URL. This tells search engine crawlers to avoid only URLs that end with the excluded pattern. So Disallow: /images/*.jpg$ blocks only files that end in '.jpg' while allowing files that merely include '.jpg' somewhere in the file name.

Allow

Sometimes you might want to exclude every file in a directory but one. You can do this the hard way by writing a disallow line for every file except the one you want crawled. Or you can use the Allow directive. It works pretty much like you would expect it to: add the 'Allow' line to the group of directives for a user-agent. For example, to block a directory, say /reports/, while still allowing one file inside it:
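User-agent: *
# Block the whole directory, then allow a single file back in
Disallow: /reports/
Allow: /reports/summary.html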
