Your Web site isn't always accessed by human users. Many search engines index your Web site by using Web robots — programs that traverse Web sites for indexing purposes. These robots often index information they shouldn't — and sometimes don't index what they should. The following section examines ways to control (most) robot access to your Web site.
Frequently used search engines such as Yahoo!, AltaVista, Excite, and Infoseek use automated robot or spider programs that search Web sites and index their contents. This is usually desirable, but on occasion, you may find yourself wanting to stop these robots from accessing a certain part of your Web site.
If content in a section of your Web site expires frequently (daily, for example), you don't want the search robots to index it. When a user at the search-engine site clicks a link to the old content and finds that the page no longer exists, she isn't happy. That user may then move on to the next search result without ever visiting your site.
Sometimes you may want to disable the indexing of your content (or part of it), because the robots can overwhelm Web sites by requesting too many documents too rapidly. Efforts are underway to create standards of behavior for Web robots. In the meantime, the Robot Exclusion Protocol enables Web site administrators to place a robots.txt file on their Web sites, indicating where robots shouldn't go.
For example, a large archive of bitmap images is useless to a robot that is trying to index HTML pages. Serving these files to the robot wastes resources on your server and at the robot's location.
This protocol is currently voluntary, and etiquette is still evolving for robot developers as they gain experience with Web robots. The most popular search engines, however, abide by the Robot Exclusion Protocol. Here is what a robot or spider program does:
1. When a compliant Web robot visits a site called www.domain.com, it first checks for the existence of the URL:

http://www.domain.com/robots.txt
2. If this URL exists, the robot parses its contents for directives that tell the robot which parts of the site it may and may not index. As a Web server administrator, you can create directives that make sense for your site. Only one robots.txt file may exist per site; this file contains records that may look like the following:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~kabir/
In the preceding record:
■ The User-agent directive tells the robot that the directives that follow apply to all robots.
■ The three Disallow directives tell the robot not to access the /cgi-bin/, /tmp/, and /~kabir/ directories.
You need a separate Disallow line for every URL prefix you want to exclude; you cannot list several prefixes on one line. For example, a record should not read like this:

Disallow: /cgi-bin/ /tmp/ /~kabir/
You should not have blank lines in a record, because blank lines delimit multiple records. Regular expressions aren't supported in the User-agent and Disallow lines; the asterisk in the User-agent field is a special value that means any robot. Specifically, you can't have lines like either of these:

User-agent: *bot*
Disallow: /tmp/*
Everything not explicitly disallowed is considered accessible by the robot (some examples follow).
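You can check how a compliant robot interprets such a record with Python's standard urllib.robotparser module. The following sketch parses the example record shown earlier; the robot name AnyBot and the URLs are only illustrative:

```python
import urllib.robotparser

# Parse the example robots.txt record from the text.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /~kabir/",
])

# Paths under a disallowed prefix are off-limits;
# everything not explicitly disallowed is accessible.
print(rp.can_fetch("AnyBot", "http://www.domain.com/cgi-bin/search"))  # False
print(rp.can_fetch("AnyBot", "http://www.domain.com/index.html"))      # True
```

Remember that compliance is voluntary: can_fetch tells a well-behaved robot what it should do, but nothing in the protocol prevents a rogue robot from ignoring the file.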
To exclude all robots from the entire server, use the following configuration:

User-agent: *
Disallow: /
To permit all robots complete access, use the following configuration:
User-agent: *
Disallow:
You can create the same effect by deleting the robots.txt file. To exclude a single robot called WebCrawler, add these lines:
User-agent: WebCrawler
Disallow: /
To allow a single robot called WebCrawler to access the site, use the following configuration:
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
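You can verify this two-record configuration the same way with Python's urllib.robotparser; note the blank line separating the two records (the name OtherBot is illustrative):

```python
import urllib.robotparser

# Two records: WebCrawler may go anywhere; every other robot is shut out.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: WebCrawler",
    "Disallow:",
    "",                  # blank line delimits the two records
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("WebCrawler", "/index.html"))  # True
print(rp.can_fetch("OtherBot", "/index.html"))    # False
```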
To forbid robots to index a single file called /daily/changes_to_often.html, use the following configuration:

User-agent: *
Disallow: /daily/changes_to_often.html