What is a bot?
Internet bots — or web robots — are software applications that run repetitive tasks across the web. And while they may sound like something out of sci-fi, they are, for the most part, less “Minority Report” than “Wall-E.” While it’s true that bots can be designed with malicious intentions, the most common bots — web crawlers and other web robots — are quite useful.
Internet search providers use web robots to index websites and turn those scans into searchable results. This is an essential function for people who care about search result rankings and SEO. Without bots, no one would find your website when they searched Google. In fact, there would be no Google.
However, there are circumstances when robots can have a negative impact on your site. Most of the time it is not their fault; they’re just doing what they’re designed to do. But if you have a bot or two indexing your site at the same time as your regular visitors, it can really slow things down.
A robots.txt file can help
That’s where the robots.txt file comes in. It is a plain-text file that bots look for when they arrive, and it tells them how to behave on your site. Think of your robots.txt file as your website’s house rules.
Most legitimate bots obey the rules in a robots.txt file, but not every bot honors every directive. Googlebot, for example, ignores the Crawl-delay directive; its crawl rate has to be adjusted through Google Webmaster Tools instead. Still, it’s important to manage as many bots as possible, so it pays to have a robots.txt file in place. This way you can promote good SEO by allowing search engine bots to do their work, while preventing them from slowing your website down in the process.
Creating a robots.txt file
Note: The following instructions assume you will create your robots.txt file directly on your server via the command line, but you can also create a plain-text file named “robots.txt” in a desktop text editor (such as Apple’s TextEdit in plain-text mode or Windows Notepad) and upload the final file once all your instructions are complete.
To use a robots.txt file, you first need to create a file called robots.txt and place it in the webroot of your site: public_html for cPanel and httpdocs for Plesk. Once it is created, you fill the file with instructions, which can become quite elaborate. In this article, we’re only going to discuss a few of the most common directives: User-agent, Crawl-delay, and Disallow.
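If you would rather script this step than type the file by hand, the following is a minimal Python sketch of it. The function name write_robots_txt is made up for illustration, and a temporary directory stands in for your real webroot (public_html or httpdocs); adjust both to your own setup.

```python
from pathlib import Path
import tempfile


def write_robots_txt(webroot, rules):
    """Write the given rules to <webroot>/robots.txt and return the path."""
    path = Path(webroot) / "robots.txt"
    # Keep one directive per line and end the file with a single newline.
    path.write_text(rules.rstrip("\n") + "\n", encoding="utf-8")
    return path


# Assumption: a temporary directory stands in for the real webroot.
webroot = tempfile.mkdtemp()
robots_path = write_robots_txt(webroot, "User-agent: *\nDisallow: /tmp/")
print(robots_path.read_text(encoding="utf-8"))
```

The file must end up at the top level of the webroot, because bots request it from the root of the site.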
The User-agent directive is the first entry in a robots.txt file, as it defines which bots should follow the rules beneath it. You can provide specific instructions to different bots. For example, if you have a North American site that caters to a North American audience and you are getting a lot of traffic from bots based in other regions, such as Yandex (Russia) or Baidu (China), you can name them and have them follow a specific set of rules. Generally, however, you will be writing rules for all bots. Use “*” to set the User-agent to all bots. The text in the file should look like this:

    User-agent: *
Next is the Crawl-delay directive, which is especially useful if bot traffic is slowing your sites down. Crawl-delay dictates how long a bot should wait between requests to the same site, measured in seconds. So, by adding “Crawl-delay: 10” to the previous entry in our example robots.txt file, we are telling all bots to wait 10 seconds between requests to the site.

    User-agent: *
    Crawl-delay: 10
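To see how a well-behaved bot might read this entry, here is a small sketch using Python’s standard urllib.robotparser module. The bot name ExampleBot is a placeholder, not a real crawler, and the rules are parsed from a string rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# The example entry from this article, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical name; the "*" group applies to it.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```

A polite crawler would then sleep for that many seconds between requests. A bot that does not honor the directive simply never asks for it.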
The last directive we’ll discuss is the aptly named Disallow. Disallow tells bots not to crawl the specified content. This is useful for having bots skip over large files or folders that would take a long time to crawl but are not valuable for search, such as:

    /cgi-bin/
    /tmp/
    /junk/
Make no mistake, Disallow is not a way to hide sensitive information. In fact, the robots.txt file is publicly viewable so people, bots, and whatever else can see exactly what you’re telling Internet bots to ignore. Using a robots.txt file to keep malicious bots and users away from sensitive files is like drawing a treasure map with an X marks the spot, and a little note that says, “Treasure here, please avoid.” If you need to protect information on your server from intrusive eyes you need to set permissions for that data accordingly. You can find more information on permissions here.
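To make the point concrete, the sketch below shows how little effort it takes to list everything a robots.txt file asks bots to avoid. The helper name disallowed_paths and the /private-reports/ directory are invented for this illustration.

```python
def disallowed_paths(robots_txt):
    """Return every path named in a Disallow line, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file that tries to "hide" a private directory.
sample = """\
User-agent: *
Disallow: /tmp/
Disallow: /private-reports/
"""
exposed = disallowed_paths(sample)
print(exposed)
```

Anyone who can download the file can run something like this, which is why file permissions, not Disallow, are the right tool for sensitive data.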
To continue our example, let’s say you’ve decided that you want Internet bots to avoid the time-consuming task of crawling your cgi-bin, tmp, and junk directories. Adding the disallow rule for those three locations updates our robots.txt file to:
    User-agent: *
    Crawl-delay: 10
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /junk/
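One way to check rules like these before uploading them is to run them through Python’s standard urllib.robotparser module. In this sketch, example.com, ExampleBot, and the file names in the URLs are placeholders rather than real endpoints.

```python
from urllib.robotparser import RobotFileParser

# The example file from this article, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URLs: one ordinary page and two in disallowed directories.
urls = [
    "https://example.com/index.html",
    "https://example.com/tmp/cache.dat",
    "https://example.com/cgi-bin/form.pl",
]
allowed = {url: parser.can_fetch("ExampleBot", url) for url in urls}
for url, ok in allowed.items():
    print(url, "allowed" if ok else "disallowed")
```

Only the first URL should come back as allowed, since the other two sit under disallowed directories.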
Your robots.txt file can have separate entries for different bots or groups of bots. For example, if you wanted to disallow all bots from the cgi-bin, tmp and junk directories, and you wanted to block a specific User-agent from indexing your site, say for geographical differences, you could do something like the following. Yandex is a legitimate Russian search engine bot, but if your site’s content does not cater to a Russian audience you may be able to squeeze a little additional performance by disallowing them.
    User-agent: *
    Crawl-delay: 10
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /junk/

    User-agent: Yandex
    Disallow: /

Please keep in mind that this is only an example; it may not be right for your application. The robots.txt file is website coding, which is beyond the scope of ServInt’s support. The goal of this article is to empower you to take control over your sites, so please take this knowledge and use it as you see fit.
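The effect of separate entries can be checked the same way. The sketch below, again using Python’s standard urllib.robotparser module, asks whether two different bots may fetch the same page. ExampleBot and example.com are placeholders, and the plain name “Yandex” stands in for whatever user-agent string the real crawler sends.

```python
from urllib.robotparser import RobotFileParser

# Two groups separated by a blank line: one for all bots, one for Yandex.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

User-agent: Yandex
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/index.html"  # placeholder URL
yandex_ok = parser.can_fetch("Yandex", page)
other_ok = parser.can_fetch("ExampleBot", page)
print("Yandex:", yandex_ok)
print("ExampleBot:", other_ok)
```

A bot that matches a named group follows that group instead of the “*” group, so in this sketch the Yandex entry blocks the whole site for that bot while other bots can still reach the page.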
For more information on robots.txt files as well as more variables, please visit one of these sources: