To ensure that users access only the Web sites that you want them to search, you specify rules to limit what the Web crawler can crawl.
When a Web crawler crawls a Web page, it discovers links to other pages and puts those links in a queue to be crawled next. Crawling and discovery can be repeated as long as time and memory resources permit. When you configure a Web crawler, you specify where the crawler is to begin crawling. From these initial URLs (which are called start URLs) the Web crawler can reach any document on the Web that is connected by direct or indirect links.
To limit the crawl space, configure the Web crawler to crawl certain URLs thoroughly and ignore links that point outside the area of interest. Because the crawler, by default, accepts any URL that it discovers, you must specify rules that identify which URLs you want to include in the collection, and eliminate the rest of the pages.
Each crawling rule has the following form:

action type target

where action is forbid or allow; type is domain, IP address, or URL prefix (HTTP or HTTPS); and target depends on the value of type. You can specify an asterisk (*) as a wildcard character, in limited ways, to specify targets that match a pattern.

A domain rule applies to the host name portion of a URL. For example, the following rule allows the domain www.ibm.com to be crawled:

allow domain www.ibm.com
You can specify an asterisk as a wildcard character, which causes the rule to apply to any host name that matches the rest of the pattern. For example, the following rule specifies that no domains that begin with server and end in ibm.com are to be crawled:

forbid domain server*.ibm.com
Host name matching is case sensitive, whether you specify an explicit domain name or a domain name pattern. For example, *.user.ibm.com matches joe.user.ibm.com and mary.smith.user.ibm.com, but not joe.user.IBM.com.

A domain rule can also specify a port number. A rule without a port applies to all ports on that domain; a rule with a port applies only to that port:

allow domain sales.ibm.com
allow domain sales.ibm.com:443
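The case-sensitive, wildcard-based host name matching described above can be sketched in Python. This is a hypothetical illustration, not the product's implementation; `fnmatchcase` treats `*` as matching any run of characters and never normalizes case, which mirrors the documented behavior.

```python
# Sketch of case-sensitive domain-rule matching (illustrative only).
from fnmatch import fnmatchcase

def domain_matches(pattern: str, host: str) -> bool:
    """Return True if the host name matches the domain pattern.

    An asterisk in the pattern matches any run of characters;
    matching is case sensitive, as described for domain rules.
    """
    return fnmatchcase(host, pattern)

# Examples taken from the rules above:
print(domain_matches("*.user.ibm.com", "joe.user.ibm.com"))         # True
print(domain_matches("*.user.ibm.com", "mary.smith.user.ibm.com"))  # True
print(domain_matches("*.user.ibm.com", "joe.user.IBM.com"))         # False
print(domain_matches("server*.ibm.com", "server01.ibm.com"))        # True
```

Note that `fnmatchcase` also honors `?` and `[...]` patterns, which is more permissive than the "limited ways" in which the crawler accepts wildcards; a real implementation would restrict the pattern syntax to `*` only.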
A prefix rule controls the crawling of URLs that begin with a specified string. The target is a single URL, which typically contains one or more asterisks to signify a pattern. For example, an asterisk is often specified as the final character in the prefix string.
allow prefix http://sales.ibm.com/public/*
forbid prefix http://sales.ibm.com/*
A prefix target can contain more than one asterisk. For example:

forbid prefix http://sales.ibm.com/*fs/*
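Prefix matching with embedded asterisks can be sketched by translating the pattern into an anchored regular expression. This is an illustrative sketch, not the product's code; the rule list and URLs are taken from the examples above.

```python
# Sketch of prefix-rule matching: * matches any run of characters,
# and the pattern is anchored at the start of the URL.
import re

def prefix_matches(pattern: str, url: str) -> bool:
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex, url) is not None

# Rules are tested in order; the first rule that applies wins.
rules = [
    ("allow",  "http://sales.ibm.com/public/*"),
    ("forbid", "http://sales.ibm.com/*"),
]

def first_match(url: str):
    for action, pattern in rules:
        if prefix_matches(pattern, url):
            return action
    return None  # no prefix rule applies

print(first_match("http://sales.ibm.com/public/faq.html"))   # allow
print(first_match("http://sales.ibm.com/internal/org.html")) # forbid
```

Because the first applicable rule wins, the allow rule for the public area must precede the broader forbid rule, as in the example rules above.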
An address rule applies to the IP address of a host. The target is an IP address followed by a netmask. For example:

allow address 9.0.0.0 255.0.0.0
The netmask enables you to specify pattern matching. For an address rule to apply to a candidate IP address, the IP address in the rule and the candidate IP address must be identical, except where masked off by zeros in the netmask. The address rule defines a pattern, and the netmask defines the significant bits in the address pattern. A zero in the netmask acts as a wildcard and signifies that any value that is specified in that same bit position in the address matches.
In the preceding example, the allow rule applies to any IP address with 9 in the first octet, and any value at all in the last three octets.
The following rule is useful as the final rule in your list of address rules. It matches any IP address, because the all-zero netmask makes every bit insignificant; the rule therefore forbids all addresses that are not allowed by a preceding rule in your list.
forbid address :: ::
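The netmask semantics described above reduce to a bitwise comparison: a candidate address matches a rule when the two addresses agree on every bit that is set in the netmask. A hypothetical sketch using Python's standard `ipaddress` module:

```python
# Sketch of address-rule matching: zero bits in the netmask act as
# wildcards; set bits mark the significant positions that must agree.
import ipaddress

def address_matches(rule_addr: str, netmask: str, candidate: str) -> bool:
    rule = int(ipaddress.ip_address(rule_addr))
    mask = int(ipaddress.ip_address(netmask))
    cand = int(ipaddress.ip_address(candidate))
    return (rule & mask) == (cand & mask)

# "allow address 9.0.0.0 255.0.0.0": any address with 9 in the first octet
print(address_matches("9.0.0.0", "255.0.0.0", "9.12.34.56"))   # True
print(address_matches("9.0.0.0", "255.0.0.0", "10.12.34.56"))  # False

# "forbid address :: ::": the all-zero netmask matches every address
print(address_matches("::", "::", "2001:db8:0:1::1"))          # True
```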
When a Web crawler uses a proxy server, the IP address of the proxy server is the only IP address that the crawler has for another host. If IP address rules are used to constrain the crawler to a subnet of IP addresses, the constraint causes almost all URLs to be classified with status code 760 (which indicates that they are forbidden by the Web space).
The crawler applies the crawling rules at various times during the process of discovering and crawling URLs. The order of the rules is important, but only within the rules of each type. It makes a difference whether an address rule comes before or after another address rule, but it makes no difference whether an address rule comes before or after a prefix rule, because the crawler does not apply rules of different types at the same time.
Within the set of rules for a single type, the crawler tests a candidate domain, address, or URL against each rule, from the first specified rule to the last, until it finds a rule that applies. The action specified for the first rule that applies is used.
The list of domain rules typically ends with the following rule:

forbid domain *

This final rule is critical, because it prevents the crawl space from including the entire Internet. See the preceding discussion about address rules for examples of how to specify the final rule in your list of address rules to prevent the crawler from crawling Web sites that are outside the corporate network.
The prefix section does not typically end with a comparable catch-all rule. The suggested final domain and address rules ensure that the crawler does not crawl beyond the enterprise network, and they do so more efficiently than testing URL prefixes can.
The crawler can apply prefix rules more efficiently if you group the rules by action (forbid or allow). For example, instead of specifying short sequences of allow and forbid rules that alternate with each other, specify one long sequence of rules that stipulate one action followed by one long sequence of rules that stipulate the other action. You can interweave allow and forbid rules to achieve the goals of your crawl space, but grouping the allow rules together and the forbid rules together can improve crawler performance.
These options provide additional ways for you to specify content for the crawl space. You can exclude certain types of documents based on the document's file extension, and you can include certain types of documents based on the document's MIME type. When you specify which MIME types you want the crawler to crawl, consider that the MIME type is often set incorrectly in Web documents.
The maximum URL path depth is the number of slashes in a URL from its site root. This option enables you to prevent the crawler from being drawn into recursive file system structures of infinite depth. The crawl depth does not correspond to the levels that the crawler traverses when it follows links from one document to another.
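Counting path depth as described above amounts to counting the slashes in the path portion of the URL. A minimal sketch, assuming the depth is simply the slash count of the parsed path:

```python
# Sketch: URL path depth as the number of slashes in the path portion
# of the URL, i.e. levels below the site root (illustrative only).
from urllib.parse import urlparse

def url_path_depth(url: str) -> int:
    return urlparse(url).path.count("/")

print(url_path_depth("http://w3.ibm.com/a/b/c.html"))  # 3
print(url_path_depth("http://w3.ibm.com/"))            # 1
```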
The maximum link depth controls how many documents the crawler is to include when it follows links from one document to another. If a link to a document exceeds the maximum link depth, the document is excluded from the crawl space.
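Maximum link depth can be pictured as a breadth-first walk from the start URLs, in which each discovered link is one level deeper than the page on which it was found. The following is a hypothetical sketch, not the crawler's actual algorithm; the `links` mapping stands in for fetching a page and extracting its links:

```python
# Sketch of maximum link depth: pages reached by following more than
# max_link_depth links from a start URL are excluded from the crawl.
from collections import deque

def crawl_order(start_urls, links, max_link_depth):
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    included = []
    while queue:
        url, depth = queue.popleft()
        included.append(url)
        if depth == max_link_depth:
            continue  # links found here would exceed the maximum
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return included

# A chain a -> b -> c -> d with a maximum link depth of 2:
links = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(crawl_order(["a"], links, max_link_depth=2))  # ['a', 'b', 'c']
```

With a maximum link depth of 2, page d is excluded because reaching it requires following three links from the start URL.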
If you change these advanced properties, you must restart the Web crawler so that the index can be updated. For example, if you specify a larger number for the maximum link depth, previously excluded pages might be included. If you specify a smaller number for the maximum link depth, pages that were previously included are removed from the index. Documents that exceed the maximum link depth are removed the next time they are crawled.
Start URLs are the URLs with which the crawler begins crawling; they are inserted into the crawl every time the crawler is started. If the start URLs were already discovered, they are not crawled or recrawled sooner than other Web sites that you allow in the crawling rules.
A start URL is important the first time that a Web crawler is started and the crawl space is empty. A start URL is also important when you add a URL that was not previously discovered to the list of start URLs in a crawl space.
Start URLs must be fully qualified URLs, not just domain names. You must specify the protocol and, if the port is not 80, the port number. For example, the following start URLs are valid:

http://w3.ibm.com/
http://sales.ibm.com:9080/

The following start URL is not valid, because it does not specify a protocol:

www.ibm.com
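A quick way to check that a start URL is fully qualified is to parse it and verify that both a protocol and a host are present. A minimal sketch using Python's standard library (illustrative, not the product's validation logic):

```python
# Sketch: a start URL is fully qualified only if it includes a
# protocol (http or https) and a host name.
from urllib.parse import urlparse

def is_fully_qualified(url: str) -> bool:
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(is_fully_qualified("http://w3.ibm.com/"))         # True
print(is_fully_qualified("http://sales.ibm.com:9080/")) # True
print(is_fully_qualified("www.ibm.com"))                # False
```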
You must include the start URLs in your crawling rules; the crawler cannot begin crawling from a specified start URL if the crawling rules do not allow that URL to be crawled.

If a start URL specifies an IPv6 address, enclose the address in square brackets. For example, the following start URLs specify the same host in full and in compressed notation:

http://[2001:db8:0:1:0:0:0:1]
http://[2001:db8:0:1::1]