This article is part of an SEO series from WooRank. Thank you for supporting the partners who make SitePoint possible.
A robots.txt file is a plain text file that tells crawlers which folders, subfolders or pages of your site they should or shouldn’t access, along with other information about your site. The file uses the Robots Exclusion Standard, a protocol established in 1994 for websites to communicate with crawlers and other bots. It’s absolutely essential that you use a plain text file: creating a robots.txt file in HTML or with a word processor adds extra markup that search engine crawlers can’t read, which can cause them to ignore your instructions.
How Does It Work?
When a site owner wants to give some guidance to web crawlers, they put their robots.txt file in the root directory of their site, e.g. https://www.example.com/robots.txt. Bots that follow this protocol will fetch and read the file before fetching any other file from the site. If the site doesn’t have a robots.txt, the crawler will assume the webmaster didn’t want to give any specific instructions and will go on to crawl the entire site.
Robots.txt is made up of two basic parts: User-agent and directives.
User-Agent
User-agent is the name of the spider being addressed, while the directive lines provide the instructions for that particular user-agent. The User-agent line always goes before the directive lines in each set of directives. A very basic robots.txt looks like this:
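    User-agent: Googlebot
    Disallow: /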
These directives instruct the user-agent Googlebot, Google’s web crawler, to stay away from the entire server — it won’t crawl any page on the site. If you want to give instructions to multiple robots, create a set of user-agents and disallow directives for each one.
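For example, to address Googlebot and Bingbot separately:

    User-agent: Googlebot
    Disallow: /

    User-agent: Bingbot
    Disallow: /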
Now both Google and Bing’s user-agents know to avoid crawling the entire site. If you want to set the same requirement for all robots, you can use what’s called a wildcard, represented with an asterisk (*). So if you want to allow all robots to crawl your entire site, your robots.txt file should look like this:
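    User-agent: *
    Disallow: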
It’s worth noting that search engines will choose the most specific user-agent directives they can find. So, for example, say you have four sets of directives: one using a wildcard (*), one for Googlebot, one for Googlebot-News and one for Bingbot, and your site is visited by the Googlebot-Images user-agent. That bot will follow the instructions for Googlebot, as that is the most specific set of directives that applies to it.
The most common search engine user-agents are:
User-Agent              Search Engine    Field
baiduspider             Baidu            General
baiduspider-image       Baidu            Images
baiduspider-mobile      Baidu            Mobile
baiduspider-news        Baidu            News
baiduspider-video       Baidu            Video
bingbot                 Bing             General
msnbot                  Bing             General
msnbot-media            Bing             Images & Video
adidxbot                Bing             Ads
Googlebot               Google           General
Googlebot-Image         Google           Images
Googlebot-Mobile        Google           Mobile
Googlebot-News          Google           News
Googlebot-Video         Google           Video
Mediapartners-Google    Google           AdSense
AdsBot-Google           Google           AdWords
slurp                   Yahoo!           General
yandex                  Yandex           General
Disallow
The second part of robots.txt is the disallow line. This directive tells spiders which pages they aren’t allowed to crawl. You can have multiple disallow lines per set of directives, but only one user-agent.
You don’t have to put any value in the disallow directive; bots will interpret an empty disallow value to mean that you aren’t disallowing anything and will crawl the entire site. As we mentioned earlier, if you want to deny a bot (or all bots) access to the entire site, use a slash (/).
You can get granular with disallow directives by specifying specific pages, directories, subdirectories and file types. To block crawlers from a specific page, use that page’s relative link in the disallow line:
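For instance, using a hypothetical page at /old-page.html:

    User-agent: *
    Disallow: /old-page.html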
Block access to whole directories the same way:
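Here /private/ is a placeholder for whatever directory you want to keep crawlers out of:

    User-agent: *
    Disallow: /private/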
You can also use robots.txt to block bots from crawling certain file types by using a wildcard and file type in the disallow line:
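Something like the following blocks PowerPoint files site-wide, .jpg images inside /images/ and the duplicate ‘copy’ pages discussed below:

    User-agent: *
    Disallow: /*.ppt
    Disallow: /images/*.jpg
    Disallow: /duplicatecontent/copy*.html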
While the Robots Exclusion Standard technically doesn’t support wildcards in directive paths, search engine bots are able to recognize and interpret them. So in the directives above, a robot will expand the asterisk to match any sequence of characters in the path or filename.
For example, it would be able to figure out that www.example.com/presentations/slideshow.ppt and www.example.com/images/example.jpg are disallowed, while www.example.com/presentations/slideshowtranscript.html isn’t. The third line disallows crawling of any file in the /duplicatecontent/ directory whose name starts with ‘copy’ and ends in ‘.html’. So these pages are blocked:
/duplicatecontent/copy.html
/duplicatecontent/copy1.html
/duplicatecontent/copy2.html
/duplicatecontent/copy.html?id=1234
However, it would not disallow any instances of ‘copy.html’ stored in another directory or subdirectory.
One issue you might encounter with your robots.txt file is that an excluded pattern can match URLs you actually want crawled. Take our earlier example of Disallow: /images/*.jpg: that directory might contain a file called ‘description-of-.jpg.html’. That page would not be crawled because it matches the exclusion pattern. To resolve this, add a dollar sign ($) to the end of the pattern to signify that it marks the end of the URL. This tells search engine crawlers to avoid only URLs that end in the exclusion pattern. So Disallow: /images/*.jpg$ blocks only files that end in ‘.jpg’, while still allowing files that merely include ‘.jpg’ in their names.
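Applied to the image example above, the anchored pattern looks like this:

    User-agent: *
    Disallow: /images/*.jpg$
    # Blocks /images/example.jpg but leaves /images/description-of-.jpg.html crawlable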
Allow
Sometimes you might want to exclude every file in a directory but one. You can do this the hard way by writing a disallow line for every file except the one you want crawled. Or you can use the Allow directive. It works pretty much like you would expect it to: Add the ‘Allow’ line to the group of directives for a user-agent:
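A minimal sketch, using placeholder names for the directory and the one page you want crawled:

    User-agent: *
    Disallow: /directory/
    Allow: /directory/open-page.html

Bots that support Allow apply the most specific matching rule, so the single page stays crawlable while the rest of the directory remains blocked.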