How the Web crawler uses the robots exclusion protocol

Unless you configure Web crawler properties to ignore a Web server's robots.txt file, the crawler tries to comply with the robots exclusion protocol and does not crawl a Web site if rules in that site's robots.txt file disallow crawling.

When the crawler is configured to honor robots.txt files, a download is considered successful when the crawler can retrieve the robots.txt file from a Web server or can confirm that no robots.txt file exists. The download is considered a failure when the crawler can neither obtain the rules nor confirm whether a robots.txt file exists.

A successful download does not mean that the crawler has permission to crawl because rules in the robots.txt file can disallow crawling. A download failure temporarily prohibits crawling because the crawler cannot determine what the rules are.
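
For example, a robots.txt file can be downloaded successfully and still restrict the crawler. The following sketch uses the Python standard library module urllib.robotparser to show this distinction; the sample rules, paths, and the "example-crawler" user agent name are assumptions for illustration, not values that the Web crawler uses.

    from urllib.robotparser import RobotFileParser

    # Sample rules as they might be downloaded from a site (an assumption for
    # illustration only).
    rules_lines = [
        "User-agent: *",
        "Disallow: /private/",
    ]

    parser = RobotFileParser()
    parser.parse(rules_lines)

    # The download succeeded, but the rules still decide what may be crawled.
    print(parser.can_fetch("example-crawler", "http://www.example.com/private/page.html"))  # False
    print(parser.can_fetch("example-crawler", "http://www.example.com/public/page.html"))   # True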

These are the steps that the crawler takes when attempting to download the robots.txt file:
  1. When the crawler discovers a new site, it tries to obtain the server's IP address. If this attempt fails, crawling is not possible.
  2. When at least one IP address is available, the crawler tries to download the robots.txt file by using HTTP (or HTTPS) GET.
  3. If the socket connection times out, is broken, or another low-level error occurs (such as an SSL certificate problem), the crawler logs the problem, and repeats the attempt on every IP address known for the target server.
  4. If no connection is made after the crawler tries all addresses, the crawler waits two seconds, then tries all the addresses one more time.
  5. If a connection is made and HTTP headers are exchanged, the crawler examines the return status. If the status code is 500 or higher, the crawler interprets this as a bad connection and continues trying other IP addresses. For any other status code, the crawler stops trying alternative IP addresses and proceeds according to the status code (a sketch of this connection logic follows this list).
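
The following is a minimal sketch, in Python, of the connection logic in these steps. It is a simplified illustration under stated assumptions, not the crawler's own code: it uses plain HTTP only, and the helper names fetch_robots and http_get and the printed message are invented for the example.

    import http.client
    import socket
    import time

    RETRY_DELAY_SECONDS = 2   # pause between the two passes over the address list

    def http_get(ip, host, path, timeout=10):
        """Issue a plain HTTP GET to one address, sending the original Host header."""
        conn = http.client.HTTPConnection(ip, timeout=timeout)
        try:
            conn.request("GET", path, headers={"Host": host})
            response = conn.getresponse()
            return response.status, response.read()
        finally:
            conn.close()

    def fetch_robots(host):
        """Try to download /robots.txt, cycling through the host's known IP addresses."""
        try:
            ips = sorted({info[4][0] for info in socket.getaddrinfo(host, 80)})
        except socket.gaierror:
            return None                      # no IP address: crawling is not possible
        for attempt in range(2):             # all addresses, then one more pass
            for ip in ips:
                try:
                    status, body = http_get(ip, host, "/robots.txt")
                except (OSError, http.client.HTTPException) as err:
                    print(f"robots.txt fetch from {ip} failed: {err}")
                    continue                 # low-level error: try the next address
                if status >= 500:
                    continue                 # 500 or higher is treated as a bad connection
                return status, body          # any other status ends the search
            if attempt == 0:
                time.sleep(RETRY_DELAY_SECONDS)   # wait two seconds before the second pass
        return None                          # no usable response from any address
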
After the crawler receives an HTTP status code below 500, or after the crawler tries all IP addresses twice, the crawler proceeds as follows (this decision logic is sketched after the list):
  1. If no HTTP status below 500 was received, the site is disqualified for the time being.
  2. If an HTTP status of 400, 404, or 410 was received, the site is qualified for crawling with no rules.
  3. If an HTTP status of 200 through 299 was received, the following conditions direct the next action:
    • If the content was truncated, the site is disqualified for the time being.
    • If the content parsed without errors, the site is qualified for crawling with the rules that were found.
    • If the content parsed with errors, the site is qualified for crawling with no rules.
  4. If any other HTTP status was returned, the site is disqualified for the time being.
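
The following sketch illustrates how a download result might map to these crawl decisions, continuing the Python example above. The outcome names and the parse_rules stand-in are assumptions for illustration; the crawler's actual parser and internal states are not shown.

    from enum import Enum
    from urllib.robotparser import RobotFileParser

    class Outcome(Enum):
        DISQUALIFIED = "disqualified for the time being"
        NO_RULES = "qualified for crawling with no rules"
        RULES = "qualified for crawling with the rules that were found"

    def parse_rules(body):
        """Small stand-in for the crawler's robots.txt parser (an assumption)."""
        text = body.decode("utf-8", errors="replace")
        if "<html" in text.lower():          # for example, an error page served with status 200
            raise ValueError("content is not a robots.txt file")
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        return parser

    def qualify(result, truncated=False):
        """Map a download result (None, or a (status, body) pair) to a crawl decision."""
        if result is None:                   # no HTTP status below 500 was received
            return Outcome.DISQUALIFIED, None
        status, body = result
        if status in (400, 404, 410):        # treated as "no robots.txt file exists"
            return Outcome.NO_RULES, None
        if 200 <= status <= 299:
            if truncated:                    # incomplete content cannot be trusted
                return Outcome.DISQUALIFIED, None
            try:
                return Outcome.RULES, parse_rules(body)
            except ValueError:               # content parsed with errors
                return Outcome.NO_RULES, None
        return Outcome.DISQUALIFIED, None    # any other status, for example 3xx or 403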

When the crawler attempts to download the robots.txt file for a site, it updates a persistent timestamp for that site called the robots date. If a site is disqualified because the robots.txt information is not available, the persistent robots failure count is incremented.

When the retry interval is reached, the crawler tries again to retrieve robots.txt information for the failed site. If the number of successive failures reaches the maximum number of failures allowed, the crawler stops trying to retrieve the robots.txt file for the site and disqualifies the site for crawling.

After a site is qualified for crawling (the check for robots.txt file rules succeeds), the failure count is set to zero. The crawler uses the results of the download until the interval for checking rules elapses. At that time, the site must be qualified again.
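
The bookkeeping described in the preceding paragraphs can be summarized in a short sketch. The field and parameter names (robots_date, failure_count, retry_interval, recheck_interval, max_failures) are illustrative assumptions, not the names of actual crawler properties.

    import time
    from dataclasses import dataclass

    @dataclass
    class SiteRobotsState:
        robots_date: float = 0.0    # when the crawler last attempted the download
        failure_count: int = 0      # consecutive failed attempts (persistent)
        rules: object = None        # parsed rules, or None when no rules apply
        qualified: bool = False     # whether the site is currently qualified for crawling

    def record_attempt(state, qualified, rules=None):
        """Update the persistent per-site state after one robots.txt download attempt."""
        state.robots_date = time.time()      # the "robots date" timestamp
        if qualified:
            state.failure_count = 0          # success resets the failure count
            state.rules = rules
            state.qualified = True
        else:
            state.failure_count += 1
            state.qualified = False

    def should_check_robots(state, retry_interval, recheck_interval, max_failures, now=None):
        """Decide whether the crawler should try to download robots.txt for the site now."""
        now = time.time() if now is None else now
        if state.qualified:
            # The saved result is reused until the interval for checking rules elapses.
            return now - state.robots_date >= recheck_interval
        if state.failure_count >= max_failures:
            return False                     # the crawler has stopped trying for this site
        return now - state.robots_date >= retry_interval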

Tip:

To help troubleshoot problems, you can request a site report when you monitor the Web crawler. Select options to view the following information:
  • The contents of the robots.txt file, which shows whether rules forbid the Web crawler from accessing the site.
  • The date and time that the crawler last attempted to download the robots.txt file. The crawler does not attempt another download until the retry interval elapses.
  • The number of consecutive failed attempts to download the robots.txt file.