User agent configuration

To crawl a Web site that uses the Robots Exclusion protocol, ensure that the robots.txt file on the Web site grants access to the user agent name that you configure for the Web crawler.

When the system is started, the Web crawler loads the user agent name that you configure for it. Before the crawler downloads a page from a Web site that it has not previously visited (or that it has not visited for some time), the crawler first tries to download a file called robots.txt. This file is in the root directory of the Web site.
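As a minimal sketch of where the crawler looks for this file, the following Python function derives the robots.txt location from any page URL on the same site. The function name robots_url and the host www.example.com are hypothetical, chosen for illustration only; the Web crawler's own implementation is not shown in this topic.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url.

    The file is expected in the root directory of the site, so only the
    scheme and host (with port, if any) of the page URL are kept.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only for illustration.
print(robots_url("http://www.example.com/docs/guide/index.html?lang=en"))
```

Every page on the same host maps to the same robots.txt URL, which is why a crawler can fetch the file once and reuse it for some time.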

If the robots.txt file does not exist, the Web site is open to unrestricted crawling. If the file does exist, it specifies which areas of the site (directories) are off limits to crawlers. The robots.txt file specifies permissions for individual crawlers by identifying each crawler by its user agent name.
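The following sketch illustrates how a robots.txt entry restricts a named crawler to certain directories. It uses the RobotFileParser class from the Python standard library, whose matching rules might differ in detail from those of the Web crawler. The user agent name ExampleCrawler, the directory names, and the helper is_allowed are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: ExampleCrawler may crawl everything
# except the /private/ and /tmp/ directories.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /private/
Disallow: /tmp/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/private/report.html"))
print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/public/index.html"))
```

The first call reports that the page is off limits because its path falls under a disallowed directory; the second reports that the page can be crawled.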

The Robots Exclusion protocol is voluntary, but the Web crawler tries to comply with it.

Web site administrators often specify a final entry that bars access to all crawlers that are not explicitly granted access. If you are configuring a new Web crawler and you know that some of the Web sites that you want to crawl use the Robots Exclusion protocol, ask the Web site administrators to add an entry for your crawler to their robots.txt files.

Be sure to specify the same user agent name in the Web crawler's properties and in all robots.txt files that belong to the Web sites of interest.
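To illustrate why the names must match, the sketch below parses a hypothetical robots.txt file that grants access to one crawler and ends with a catch-all entry that bars all others. The name ExampleCrawler, the misspelled variant, and the helper can_crawl are assumptions for this example. The check uses the RobotFileParser class from the Python standard library, which might not match names in exactly the same way as the Web crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt file: an empty Disallow line grants ExampleCrawler
# full access, and the final wildcard entry bars every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow:

User-agent: *
Disallow: /
"""

PAGE = "http://www.example.com/index.html"


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if user_agent is permitted to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The configured name matches the robots.txt entry, so access is granted.
print(can_crawl("ExampleCrawler", PAGE))

# A misspelled name matches no specific entry and falls through to the
# catch-all entry, so access is denied.
print(can_crawl("ExampelCrawler", PAGE))
```

A small mismatch between the configured user agent name and the robots.txt entry is enough to lock the crawler out of a site that ends its file with a catch-all entry.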

If none of the Web sites to be crawled use the Robots Exclusion protocol, the value that you specify for the user agent property typically does not matter. However, some application servers, JSPs, and servlets tailor their responses to the user agent name, for example to return different content that works around browser incompatibilities. In these situations, the user agent name that you specify for the Web crawler might matter regardless of the Robots Exclusion protocol. If you need to crawl these types of sites, consult with the Web site administrators to ensure that the Web crawler is allowed access.
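A server can tailor its response because every HTTP request carries the user agent name in the User-Agent header. The following sketch shows how a client sets that header by using the Python standard library. The name ExampleCrawler and the functions build_request and fetch are hypothetical and do not represent the Web crawler's own code.

```python
from urllib.request import Request, urlopen

# Hypothetical user agent name; use the value configured for your crawler.
USER_AGENT = "ExampleCrawler"


def build_request(url: str, user_agent: str = USER_AGENT) -> Request:
    """Create an HTTP request that identifies the client by user_agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, user_agent: str = USER_AGENT, timeout: float = 10.0) -> bytes:
    """Download url and return the response body that the server chose to send."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()


request = build_request("http://www.example.com/index.html")
print(request.get_header("User-agent"))
```

Calling fetch for the same URL with two different user agent names and comparing the results is one way to find out whether a site varies its content by user agent.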