Unless you configure Web crawler properties to ignore a Web server's
robots.txt file, the crawler tries to comply with the Robots Exclusion
Protocol and does not crawl Web sites whose robots.txt rules disallow
crawling.
When the crawler is configured to honor robots.txt files, a download
is considered successful when the crawler can retrieve the robots.txt
file from a Web server or confirm that a robots.txt file does not exist.
The download is considered a failure when the crawler can neither obtain
the rules nor confirm whether a robots.txt file exists.
A successful download does not mean that the crawler has permission
to crawl because rules in the robots.txt file
can disallow crawling. A download failure temporarily prohibits crawling
because the crawler cannot determine what the rules are.
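For example, this minimal Python sketch (using the standard
urllib.robotparser module rather than the crawler's own parser; the
agent name and URL are placeholders) shows how a successfully downloaded
robots.txt file can still disallow crawling:

    from urllib.robotparser import RobotFileParser

    # Rules that were downloaded successfully but forbid all crawling.
    rules = [
        "User-agent: *",
        "Disallow: /",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # The download succeeded, yet the rules deny access to every page.
    print(parser.can_fetch("my-crawler", "http://example.com/index.html"))  # False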
These are the steps that the crawler takes when attempting to download
the
robots.txt file:
- When the crawler discovers a new site, it tries to obtain the
server's IP address. If this attempt fails, crawling is not possible.
- When at least one IP address is available, the crawler tries to
download the robots.txt file by using HTTP
(or HTTPS) GET.
- If the socket connection times out or is broken, or if another
low-level error occurs (such as an SSL certificate problem), the crawler
logs the problem and repeats the attempt on every IP address known for
the target server.
- If no connection is made after the crawler tries all addresses,
the crawler waits two seconds, then tries all the addresses one more
time.
- If a connection is made and HTTP headers are exchanged, the crawler
examines the return status code. If the status code is 500 or higher,
the crawler interprets this as a bad connection and continues trying
other IP addresses. For any other status code, the crawler stops trying
alternative IP addresses and proceeds according to the status code.
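The following Python sketch outlines this connection sequence. The
function names, port, and timeout are illustrative assumptions, not the
product's implementation, and HTTPS handling and logging are omitted:

    import socket
    import time
    import http.client

    def resolve_addresses(host):
        # Obtain the server's IP addresses; crawling is impossible if this fails.
        try:
            return sorted({info[4][0] for info in socket.getaddrinfo(host, 80)})
        except socket.gaierror:
            return []

    def download_robots(host, addresses):
        # Try every known address; after one full pass, wait two seconds and retry once.
        for attempt in range(2):
            for ip in addresses:
                try:
                    conn = http.client.HTTPConnection(ip, 80, timeout=30)
                    conn.request("GET", "/robots.txt", headers={"Host": host})
                    response = conn.getresponse()
                except OSError:
                    continue  # low-level error: log it and try the next address
                if response.status >= 500:
                    continue  # treated as a bad connection: keep trying other addresses
                return response.status, response.read()  # proceed according to this status
            if attempt == 0:
                time.sleep(2)  # all addresses failed: wait, then try them once more
        return None, None  # no usable status code was received

For example, download_robots("example.com", resolve_addresses("example.com"))
returns a status code and the response body, or (None, None) when no
connection could be made.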
After the crawler receives an HTTP status code below 500, or after
the crawler tries all IP addresses twice, the crawler proceeds as
follows:
- If no HTTP status below 500 was received, the site is disqualified
for the time being.
- If an HTTP status of 400, 404, or 410 was received, the site is
qualified for crawling with no rules.
- If an HTTP status of 200 through 299 was received, the following
conditions direct the next action:
- If the content was truncated, the site is disqualified for the
time being.
- If the content parsed without errors, the site is qualified for
crawling with the rules that were found.
- If the content parsed with errors, the site is qualified for crawling
with no rules.
- If any other HTTP status was returned, the site is disqualified
for the time being.
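A sketch of this decision logic follows. The parse_rules helper (assumed
to raise ValueError for malformed content), the truncation flag, and the
outcome strings are illustrative assumptions rather than the crawler's API:

    def qualify_site(status, content, truncated, parse_rules):
        # No HTTP status below 500 was received: disqualify the site for now.
        if status is None:
            return "disqualified", None
        # 400, 404, or 410: no robots.txt file exists, so crawl with no rules.
        if status in (400, 404, 410):
            return "qualified", []
        # 200 through 299: the outcome depends on the content that was returned.
        if 200 <= status <= 299:
            if truncated:
                return "disqualified", None  # truncated content cannot be trusted
            try:
                return "qualified", parse_rules(content)
            except ValueError:
                return "qualified", []  # content parsed with errors: crawl with no rules
        # Any other status (including redirects): disqualify for the time being.
        return "disqualified", None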
When the crawler attempts to download the robots.txt file
for a site, it updates a persistent timestamp for that site called
the robots date. If a site is disqualified because the robots.txt information
is not available, the persistent robots failure count is incremented.
When the retry interval is reached, the crawler tries again to
retrieve robots.txt information for the failed
site. If the number of successive failures reaches the maximum number
of failures allowed, the crawler stops trying to retrieve the robots.txt file
for the site and disqualifies the site for crawling.
After a site is qualified for crawling (that is, the check for robots.txt
rules succeeds), the failure count is reset to zero. The crawler uses
the results of the download until the interval for checking rules
elapses. At that time, the site must be qualified again.
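This bookkeeping might be sketched as follows; the SiteState class,
field names, and return values are assumptions for illustration, not the
crawler's actual data structures:

    import time
    from dataclasses import dataclass

    @dataclass
    class SiteState:
        robots_date: float = 0.0  # timestamp of the last robots.txt download attempt
        failure_count: int = 0    # consecutive failures to obtain robots.txt information

    def record_attempt(state, qualified, max_failures):
        state.robots_date = time.time()  # updated on every download attempt
        if qualified:
            state.failure_count = 0      # a successful qualification resets the count
            return "qualified"
        state.failure_count += 1
        if state.failure_count >= max_failures:
            return "disqualified"        # stop trying to retrieve robots.txt for this site
        return "retry later"             # try again when the retry interval elapses

    def recheck_due(state, retry_interval):
        # The results of a download are reused until the checking interval elapses.
        return time.time() - state.robots_date >= retry_interval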
Tip:
- If a server returns content but it contains syntax errors, or
if the server uses a robots protocol other than the 1994 version,
or if the content contains something other than robots rules (such
as a soft error page), the crawler acts as though the server does
not have an applicable rules file and crawls the site. This action
is usually correct because collection administrators do not control
site content or default server behavior. If a Web server administrator
does not want a site to be crawled, and does not want to install a
conforming rules file, the collection administrator can block the
site from the Web crawler by specifying the site's domain, IP address,
or HTTP prefix in the crawler's rules.
- If a server returns a 302 status code or another redirection code,
the crawler interprets the code to mean that the site has a robots.txt file
that should be used, but the file is not at the conforming location
(the site root). The Web server administrator must move the file to
the correct location so that the Web crawler can abide by the rules
in the file.
- If there are certificate problems (for example, the certificate
might be out of date, the certificate authority might not be trusted,
or the certificate might be self-signed and the crawler is not configured
to accept self-signed certificates), the crawler interprets the problem
as a failure to connect with the site and disqualifies the site. The
same problems would probably prevent the crawler from downloading other
pages from the site anyway. To enable the site to be crawled, the
collection administrator must configure the crawler to accept self-signed
certificates, add the site's certificate authority to the trusted keystore
file, or ask the Web server administrator to obtain an up-to-date
certificate.
- The Web crawler might be configured to use HTTP basic authentication
(including HTTP basic proxy authentication). When authentication is
configured, it also applies to downloads of robots.txt files. A status
code of 403 or 407, or another authentication-related response, indicates
an authorization problem, and the crawler disqualifies the site. (Only
HTTP basic authentication is supported; a minimal sketch of downloading
a robots.txt file with basic credentials follows this list.)
- If the robots.txt file for a site exceeds
the maximum length for a robots page, the collection administrator
can raise the configured maximum (the default value of one million
bytes should be sufficient).
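Where HTTP basic authentication is in use, a minimal sketch of downloading
a robots.txt file with basic credentials might look like the following.
The URL, user name, and password are placeholders, and the product
configures authentication through crawler properties rather than code:

    import base64
    import urllib.error
    import urllib.request

    def fetch_robots_with_basic_auth(url, user, password):
        # Send the credentials in an HTTP basic Authorization header.
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.status, response.read()
        except urllib.error.HTTPError as err:
            # A 403 or 407 response indicates an authorization problem;
            # the crawler would disqualify the site.
            return err.code, None

    # Example:
    # status, body = fetch_robots_with_basic_auth(
    #     "https://example.com/robots.txt", "crawler-user", "secret")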
To help troubleshoot problems, you can request a site report when you
monitor the Web crawler. Select options to view the contents of the
robots.txt file (to see whether rules forbid the Web crawler from
accessing the site), the date and time of the crawler's last attempt
to download the robots.txt file (the crawler does not attempt again
until the retry interval elapses), and the number of consecutive failed
attempts that the crawler made to download the robots.txt file.