Global Web crawl space configuration

You can configure a global crawl space for Web crawlers, which enables you to better control the removal of URLs from the index.

Each Web crawler is configured with a crawl space that defines the URLs that are to be crawled or not crawled. Discovered URLs that are in the crawl space are retained (in a database) for later crawling; URLs that are not in the crawl space are discarded. If the crawler starts with an empty database, the crawl space definition and database remain consistent while the crawler runs.
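
For example, a local crawl space might be delimited with rules such as the following. The host names are hypothetical and only the general allow/forbid rule form is shown; see the description of the crawl.rules file for the exact syntax and for how rules of different types are evaluated:
  allow domain www.example.com
  allow domain sales.example.com
  forbid prefix http://www.example.com/cgi-bin/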

Sometimes a crawler is stopped and its crawl space is reduced (for example, by new rules that forbid pages from being crawled). When the crawler is restarted, its crawl space definition and database become inconsistent: the database contains URLs (some crawled, some not) that are no longer in the new, smaller crawl space.

If a collection has only one Web crawler, that crawler can restore consistency by changing the status codes for these URLs to 760 (which marks them as excluded) and by requesting the removal of the now-excluded pages from the index.

If you divide the crawl space between two or more Web crawlers (for example, to ensure that some pages are crawled more often than the rest), each new Web crawler maintains its own, initially empty, database tables and crawls a different part of the Web crawl space. The original crawler's crawl space is then reduced to whatever is left after the parts to be crawled by the other crawlers are removed. Problems arise when the original crawler attempts to restore consistency by removing the moved pages from the index: because the moved pages are now being crawled by other crawlers, they should remain in the index.
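
For example, suppose that the hypothetical host sales.example.com is moved from the original crawler to a new crawler so that its pages can be crawled more frequently. The two local crawl spaces might then be partitioned like this (again, only a sketch of the rule form):
  Original crawler's crawl.rules (reduced crawl space):
    allow domain www.example.com
    forbid domain sales.example.com
  New crawler's crawl.rules:
    allow domain sales.example.com
The pages from sales.example.com that the original crawler already added to the index are exactly the pages that must not be removed when the original crawler restores consistency, because the new crawler is now responsible for them.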

By configuring a higher-level, global crawl space, you can identify URLs that the original crawler is no longer to crawl but that are not to be removed from the index. URLs that are no longer in any crawler's crawl space continue to be marked for exclusion by the discovery processes and are removed from the index when they are recrawled.

The global crawl space is defined by a configuration file named global.rules, which must exist in the crawler configuration directory (the presence of a global.rules file enables the global crawl space function). If this file exists, it is read during crawler initialization. If this file does not exist, the crawler operates with a single-level crawl space, and removes documents from the index as necessary to maintain consistency between its crawl space definition and database.

If a global crawl space exists, the crawler still includes or excludes URLs according to its local rules as before, but it requests the removal of a URL from the index only if the URL is not in any crawler's crawl space.

The global.rules file has the same syntax as the local crawl.rules file, except that it can contain only domain name rules. This restriction means that a crawl space can be partitioned between crawlers only on the basis of DNS host names, not IP addresses or HTTP prefix patterns. URLs that are excluded by URL prefix or IP address rules in the local crawl space (as defined in the crawl.rules file) are unaffected by the global crawl space; such URLs are still excluded.
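
Continuing the hypothetical partition above, the shared global.rules file contains only domain rules, and it allows every host that is still crawled by some crawler in the collection:
  allow domain www.example.com
  allow domain sales.example.com
With this global crawl space in place, the original crawler excludes sales.example.com locally but finds the domain allowed globally, so it leaves those pages in the index.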

The global crawl space is used only to prevent the removal from the index of URLs that are excluded from one crawler's crawl space by a local domain rule. The following rules apply, in order:
  1. If a URL from the crawler's database is excluded by a local prefix rule or address rule, the URL is assigned status code 760 and it is removed from the index. The URL will not be crawled again.
  2. If a URL from the crawler's database is excluded by a local domain rule, and there is no global crawl space, the URL is assigned status code 760, and it is removed from the index. The URL will not be crawled again.
  3. If a URL from the crawler's database is excluded by a local domain rule, but explicitly allowed by a rule in the global crawl space, the URL is assigned status code 761. The crawler will not crawl the URL again, but it is not removed from the index (it is assumed to be in some other crawler's local crawl space).
  4. If a URL from the crawler's database is excluded by a local domain rule, and not explicitly allowed by a rule in the global crawl space, the URL is assigned status code 760, and removed from the index.
Because the global crawl space is consulted only to prevent the deletion of URLs that the local crawl space already excludes, the default result from the global crawl space, when no rule applies to a candidate URL, is to forbid the URL (as in rule 4).
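
The following Python sketch summarizes this decision logic. The rule representation, function name, and host names are illustrative assumptions, not part of the crawler's interface; only the status codes 760 and 761 and the order of the checks come from the rules above.
  # Hypothetical sketch of the removal decision for a URL in the crawler's database.
  from urllib.parse import urlparse

  EXCLUDED_AND_REMOVED = 760     # excluded; the page is removed from the index
  EXCLUDED_KEPT_IN_INDEX = 761   # excluded from this crawler's crawl space, but kept in the index

  def removal_status(url, local_rules, global_rules=None):
      host = urlparse(url).hostname or ""
      # Rule 1: a local prefix exclusion always removes the URL (address rules behave the same way).
      for action, rule_type, value in local_rules:
          if action == "forbid" and rule_type == "prefix" and url.startswith(value):
              return EXCLUDED_AND_REMOVED
      # Exclusion by a local domain rule.
      for action, rule_type, value in local_rules:
          if action == "forbid" and rule_type == "domain" and host.endswith(value):
              if global_rules is None:                 # Rule 2: no global crawl space
                  return EXCLUDED_AND_REMOVED
              for g_action, g_type, g_value in global_rules:
                  if g_action == "allow" and g_type == "domain" and host.endswith(g_value):
                      return EXCLUDED_KEPT_IN_INDEX    # Rule 3: explicitly allowed globally
              return EXCLUDED_AND_REMOVED              # Rule 4, and the global default
      return None  # the URL is still in the local crawl space

  # The hypothetical partition from the earlier example:
  local_rules = [("forbid", "domain", "sales.example.com"),
                 ("allow", "domain", "www.example.com")]
  global_rules = [("allow", "domain", "www.example.com"),
                  ("allow", "domain", "sales.example.com")]
  print(removal_status("http://sales.example.com/q3.html", local_rules, global_rules))  # 761
  print(removal_status("http://old.example.org/a.html",
                       [("forbid", "domain", "example.org")], global_rules))            # 760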

The global.rules file must exist in the master_config directory of every crawler that shares the global crawl space. You must carefully edit all copies of the global.rules file and the individual local crawl.rules files to ensure that they remain mutually consistent.
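
Because the copies are maintained by hand, it is worth verifying periodically that every crawler sees the same global.rules file. The following is a minimal sketch of such a check in Python; the master_config paths are hypothetical and must be replaced with your own crawler directories:
  import filecmp

  # Hypothetical crawler configuration directories.
  config_dirs = ["/opt/crawler_web1/master_config", "/opt/crawler_web2/master_config"]
  reference = config_dirs[0] + "/global.rules"
  for directory in config_dirs[1:]:
      candidate = directory + "/global.rules"
      if filecmp.cmp(reference, candidate, shallow=False):
          print(candidate, "matches", reference)
      else:
          print(candidate, "DIFFERS from", reference)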