Unless you configure Web crawler properties to ignore a Web server's
robots.txt file, the crawler tries to comply with the Robots Exclusion
Protocol and does not crawl Web sites whose robots.txt rules disallow
crawling.
When the crawler is configured to honor robots.txt files, a download
is considered successful when the crawler can retrieve the robots.txt
file from a Web server or confirm that a robots.txt file does not exist.
The download is considered a failure when the crawler can neither obtain
the rules nor confirm whether a robots.txt file exists.
A successful download does not mean that the crawler has permission
to crawl because rules in the robots.txt file
can disallow crawling. A download failure temporarily prohibits crawling
because the crawler cannot determine what the rules are.
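For example, this minimal Python sketch (using the standard
urllib.robotparser module rather than the crawler's own parser; the
agent name and URL are placeholders) shows how a successfully downloaded
robots.txt file can still disallow crawling:

    from urllib.robotparser import RobotFileParser

    # Rules that were downloaded successfully but forbid all crawling.
    rules = [
        "User-agent: *",
        "Disallow: /",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # The download succeeded, yet the rules deny access to every page.
    print(parser.can_fetch("my-crawler", "http://example.com/index.html"))  # False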
These are the steps that the crawler takes when attempting to download
the
robots.txt file:
- When the crawler discovers a new site, it tries to obtain the
server's IP address. If this attempt fails, crawling is not possible.
- When at least one IP address is available, the crawler tries to
download the robots.txt file by using HTTP
(or HTTPS) GET.
- If the socket connection times out or is broken, or if another
low-level error occurs (such as an SSL certificate problem), the crawler
logs the problem and repeats the attempt on every IP address known for
the target server.
- If no connection is made after the crawler tries all addresses,
the crawler waits two seconds, then tries all the addresses one more
time.
- If a connection is made and HTTP headers are exchanged, the crawler
examines the return status code. If the status code is 500 or higher,
the crawler interprets this as a bad connection and continues trying
other IP addresses. For any other status code, the crawler stops trying
alternative IP addresses and proceeds according to the status code.
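The following Python sketch outlines this connection sequence. The
function names, port, and timeout are illustrative assumptions, not the
product's implementation, and HTTPS handling and logging are omitted:

    import socket
    import time
    import http.client

    def resolve_addresses(host):
        # Obtain the server's IP addresses; crawling is impossible if this fails.
        try:
            return sorted({info[4][0] for info in socket.getaddrinfo(host, 80)})
        except socket.gaierror:
            return []

    def download_robots(host, addresses):
        # Try every known address; after one full pass, wait two seconds and retry once.
        for attempt in range(2):
            for ip in addresses:
                try:
                    conn = http.client.HTTPConnection(ip, 80, timeout=30)
                    conn.request("GET", "/robots.txt", headers={"Host": host})
                    response = conn.getresponse()
                except OSError:
                    continue  # low-level error: log it and try the next address
                if response.status >= 500:
                    continue  # treated as a bad connection: keep trying other addresses
                return response.status, response.read()  # proceed according to this status
            if attempt == 0:
                time.sleep(2)  # all addresses failed: wait, then try them once more
        return None, None  # no usable status code was received

For example, download_robots("example.com", resolve_addresses("example.com"))
returns a status code and the response body, or (None, None) when no
connection could be made.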
After the crawler receives an HTTP status code below 500, or after
the crawler tries all IP addresses twice, the crawler proceeds as
follows:
- If no HTTP status below 500 was received, the site is disqualified
for the time being.
- If an HTTP status of 400, 404, or 410 was received, the site is
qualified for crawling with no rules.
- If an HTTP status of 200 through 299 was received, the following
conditions direct the next action:
- If the content was truncated, the site is disqualified for the
time being.
- If the content parsed without errors, the site is qualified for
crawling with the rules that were found.
- If the content parsed with errors, the site is qualified for crawling
with no rules.
- If any other HTTP status was returned, the site is disqualified
for the time being.
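A sketch of this decision logic follows. The parse_rules helper (assumed
to raise ValueError for malformed content), the truncation flag, and the
outcome strings are illustrative assumptions rather than the crawler's API:

    def qualify_site(status, content, truncated, parse_rules):
        # No HTTP status below 500 was received: disqualify the site for now.
        if status is None:
            return "disqualified", None
        # 400, 404, or 410: no robots.txt file exists, so crawl with no rules.
        if status in (400, 404, 410):
            return "qualified", []
        # 200 through 299: the outcome depends on the content that was returned.
        if 200 <= status <= 299:
            if truncated:
                return "disqualified", None  # truncated content cannot be trusted
            try:
                return "qualified", parse_rules(content)
            except ValueError:
                return "qualified", []  # content parsed with errors: crawl with no rules
        # Any other status (including redirects): disqualify for the time being.
        return "disqualified", None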
When the crawler attempts to download the robots.txt file
for a site, it updates a persistent timestamp for that site called
the robots date. If a site is disqualified because the robots.txt information
is not available, the persistent robots failure count is incremented.
When the retry interval is reached, the crawler tries again to
retrieve robots.txt information for the failed
site. If the number of successive failures reaches the maximum number
of failures allowed, the crawler stops trying to retrieve the robots.txt file
for the site and disqualifies the site for crawling.
After a site is qualified for crawling (that is, the check for robots.txt
rules succeeds), the failure count is reset to zero. The crawler uses
the results of the download until the interval for checking rules
elapses. At that time, the site must be qualified again.
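This bookkeeping might be sketched as follows; the SiteState class,
field names, and return values are assumptions for illustration, not the
crawler's actual data structures:

    import time
    from dataclasses import dataclass

    @dataclass
    class SiteState:
        robots_date: float = 0.0  # timestamp of the last robots.txt download attempt
        failure_count: int = 0    # consecutive failures to obtain robots.txt information

    def record_attempt(state, qualified, max_failures):
        state.robots_date = time.time()  # updated on every download attempt
        if qualified:
            state.failure_count = 0      # a successful qualification resets the count
            return "qualified"
        state.failure_count += 1
        if state.failure_count >= max_failures:
            return "disqualified"        # stop trying to retrieve robots.txt for this site
        return "retry later"             # try again when the retry interval elapses

    def recheck_due(state, retry_interval):
        # The results of a download are reused until the checking interval elapses.
        return time.time() - state.robots_date >= retry_interval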
Tip:
- If a server returns content but it contains syntax errors, or
if the server uses a robots protocol other than the 1994 version,
or if the content contains something other than robots rules (such
as a soft error page), the crawler acts as though the server does
not have an applicable rules file and crawls the site. This action
is usually correct because collection administrators do not control
site content or default server behavior. If a Web server administrator
does not want a site to be crawled, and does not want to install a
conforming rules file, the collection administrator can block the
site from the Web crawler by specifying the site's domain, IP address,
or HTTP prefix in the crawler's rules.
- If a server returns a 302 status code or another redirection code,
the crawler interprets the code to mean that the site has a robots.txt file
that should be used, but the file is not at the conforming location
(the site root). The Web server administrator must move the file to
the correct location so that the Web crawler can abide by the rules
in the file.
- If there are certificate problems (for example, the certificate
might be out of date, the certificate authority might not be trusted,
or the certificate might be self-signed and the crawler is not configured
to accept self-signed certificates), the crawler interprets the problem
as a failure to connect with the site and disqualifies the site. The
same problems would probably prevent the crawler from downloading other
pages from the site anyway. To enable the site to be crawled, the
collection administrator must configure the crawler to accept self-signed
certificates, add the site's certificate authority to the trusted keystore
file, or ask the Web server administrator to obtain an up-to-date
certificate.
- The Web crawler might be configured to use HTTP basic authentication
(including HTTP basic proxy authentication). When authentication is
configured, it also applies to downloads of robots.txt files. A status
code of 403 or 407, or another authentication-related response, indicates
an authorization problem, and the crawler disqualifies the site. (Only
HTTP basic authentication is supported; a minimal sketch of downloading
a robots.txt file with basic credentials follows this list.)
- If the robots.txt file for a site exceeds
the maximum length for a robots page, the collection administrator
can raise the configured maximum (the default value of one million
bytes should be sufficient).
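Where HTTP basic authentication is in use, a minimal sketch of downloading
a robots.txt file with basic credentials might look like the following.
The URL, user name, and password are placeholders, and the product
configures authentication through crawler properties rather than code:

    import base64
    import urllib.error
    import urllib.request

    def fetch_robots_with_basic_auth(url, user, password):
        # Send the credentials in an HTTP basic Authorization header.
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.status, response.read()
        except urllib.error.HTTPError as err:
            # A 403 or 407 response indicates an authorization problem;
            # the crawler would disqualify the site.
            return err.code, None

    # Example:
    # status, body = fetch_robots_with_basic_auth(
    #     "https://example.com/robots.txt", "crawler-user", "secret")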
To help troubleshoot problems, you can request a site report when you
monitor the Web crawler. Select options to view the contents of the
robots.txt file (to see whether rules forbid the Web crawler from
accessing the site), the date and time of the crawler's last attempt
to download the robots.txt file (the crawler does not attempt again
until the retry interval elapses), and the number of consecutive failed
attempts that the crawler made to download the robots.txt file.