To ensure that users access only the Web sites that you want them to search, you specify rules to limit what the Web crawler can crawl.
When a Web crawler crawls a Web page, it discovers links to other pages and puts those links in a queue to be crawled next. Crawling and discovery can be repeated as long as time and memory resources permit. When you configure a Web crawler, you specify where the crawler is to begin crawling. From these initial URLs (which are called start URLs) the Web crawler can reach any document on the Web that is connected by direct or indirect links.
To limit the crawl space, configure the Web crawler to crawl certain URLs thoroughly and ignore links that point outside the area of interest. Because the crawler, by default, accepts any URL that it discovers, you must specify rules that identify which URLs you want to include in the collection, and eliminate the rest of the pages.
Each crawling rule has the following form:

action type target

where action is forbid or allow; type is domain, IP address, or URL prefix (HTTP or HTTPS); and target depends on the value of type. You can specify an asterisk (*) as a wildcard character, in limited ways, to specify targets that match a pattern.

A domain rule applies to the host name portion of a URL. For example, the following rule allows the domain www.ibm.com to be crawled:

allow domain www.ibm.com
You can specify an asterisk as a wildcard character, which causes the rule to apply to any host name that matches the rest of the pattern. For example, the following rule specifies that no domains that begin with server and end in ibm.com are to be crawled:

forbid domain server*.ibm.com
Host name matching is case sensitive, whether you specify an explicit domain name or a domain name pattern. For example, *.user.ibm.com matches joe.user.ibm.com and mary.smith.user.ibm.com, but not joe.user.IBM.com.

A domain rule can also specify a port number. A rule without a port applies to all ports on that domain; a rule with a port applies only to that port:

allow domain sales.ibm.com
allow domain sales.ibm.com:443
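The case-sensitive, wildcard-based host name matching described above can be sketched in Python. This is a hypothetical illustration, not the product's implementation; `fnmatchcase` treats `*` as matching any run of characters and never normalizes case, which mirrors the documented behavior.

```python
# Sketch of case-sensitive domain-rule matching (illustrative only).
from fnmatch import fnmatchcase

def domain_matches(pattern: str, host: str) -> bool:
    """Return True if the host name matches the domain pattern.

    An asterisk in the pattern matches any run of characters;
    matching is case sensitive, as described for domain rules.
    """
    return fnmatchcase(host, pattern)

# Examples taken from the rules above:
print(domain_matches("*.user.ibm.com", "joe.user.ibm.com"))         # True
print(domain_matches("*.user.ibm.com", "mary.smith.user.ibm.com"))  # True
print(domain_matches("*.user.ibm.com", "joe.user.IBM.com"))         # False
print(domain_matches("server*.ibm.com", "server01.ibm.com"))        # True
```

Note that `fnmatchcase` also honors `?` and `[...]` patterns, which is more permissive than the "limited ways" in which the crawler accepts wildcards; a real implementation would restrict the pattern syntax to `*` only.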
A prefix rule controls the crawling of URLs that begin with a specified string. The target is a single URL, which typically contains one or more asterisks to signify a pattern. For example, an asterisk is often specified as the final character in the prefix string.
allow prefix http://sales.ibm.com/public/*
forbid prefix http://sales.ibm.com/*
A prefix target can contain more than one asterisk. For example:

forbid prefix http://sales.ibm.com/*fs/*
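Prefix matching with embedded asterisks can be sketched by translating the pattern into an anchored regular expression. This is an illustrative sketch, not the product's code; the rule list and URLs are taken from the examples above.

```python
# Sketch of prefix-rule matching: * matches any run of characters,
# and the pattern is anchored at the start of the URL.
import re

def prefix_matches(pattern: str, url: str) -> bool:
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex, url) is not None

# Rules are tested in order; the first rule that applies wins.
rules = [
    ("allow",  "http://sales.ibm.com/public/*"),
    ("forbid", "http://sales.ibm.com/*"),
]

def first_match(url: str):
    for action, pattern in rules:
        if prefix_matches(pattern, url):
            return action
    return None  # no prefix rule applies

print(first_match("http://sales.ibm.com/public/faq.html"))   # allow
print(first_match("http://sales.ibm.com/internal/org.html")) # forbid
```

Because the first applicable rule wins, the allow rule for the public area must precede the broader forbid rule, as in the example rules above.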
An address rule applies to the IP address of a host. The target is an IP address followed by a netmask. For example:

allow address 9.0.0.0 255.0.0.0
The netmask enables you to specify pattern matching. For an address rule to apply to a candidate IP address, the IP address in the rule and the candidate IP address must be identical, except where masked off by zeros in the netmask. The address rule defines a pattern, and the netmask defines the significant bits in the address pattern. A zero in the netmask acts as a wildcard and signifies that any value that is specified in that same bit position in the address matches.
In the preceding example, the allow rule applies to any IP address with 9 in the first octet, and any value at all in the last three octets.
The following rule is useful as the final rule in your list of address rules. It matches any IP address, because the all-zero netmask makes every bit insignificant; the rule therefore forbids all addresses that are not allowed by a preceding rule in your list.
forbid address :: ::
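The netmask semantics described above reduce to a bitwise comparison: a candidate address matches a rule when the two addresses agree on every bit that is set in the netmask. A hypothetical sketch using Python's standard `ipaddress` module:

```python
# Sketch of address-rule matching: zero bits in the netmask act as
# wildcards; set bits mark the significant positions that must agree.
import ipaddress

def address_matches(rule_addr: str, netmask: str, candidate: str) -> bool:
    rule = int(ipaddress.ip_address(rule_addr))
    mask = int(ipaddress.ip_address(netmask))
    cand = int(ipaddress.ip_address(candidate))
    return (rule & mask) == (cand & mask)

# "allow address 9.0.0.0 255.0.0.0": any address with 9 in the first octet
print(address_matches("9.0.0.0", "255.0.0.0", "9.12.34.56"))   # True
print(address_matches("9.0.0.0", "255.0.0.0", "10.12.34.56"))  # False

# "forbid address :: ::": the all-zero netmask matches every address
print(address_matches("::", "::", "2001:db8:0:1::1"))          # True
```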
When a Web crawler uses a proxy server, the IP address of the proxy server is the only IP address that the crawler has for another host. If IP address rules are used to constrain the crawler to a subnet of IP addresses, the constraint causes almost all URLs to be classified with status code 760 (which indicates that they are forbidden by the Web space).
The crawler applies the crawling rules at various times during the process of discovering and crawling URLs. The order of the rules is important, but only within the rules of each type. It makes a difference whether an address rule comes before or after another address rule, but it makes no difference whether an address rule comes before or after a prefix rule, because the crawler does not apply rules of different types at the same time.
Within the set of rules for a single type, the crawler tests a candidate domain, address, or URL against each rule, from the first specified rule to the last, until it finds a rule that applies. The action specified for the first rule that applies is used.
The list of domain rules typically ends with the following rule:

forbid domain *

This final rule is critical, because it prevents the crawl space from including the entire Internet. See the preceding discussion about address rules for examples of how to specify the final rule in your list of address rules to prevent the crawler from crawling Web sites that are outside the corporate network.
The prefix section does not typically end with a comparable catch-all rule. The suggested final domain and address rules ensure that the crawler does not crawl beyond the enterprise network, and they do so more efficiently than testing URL prefixes can.
The crawler can apply prefix rules more efficiently if you group the rules by action (forbid or allow). For example, instead of specifying short sequences of allow and forbid rules that alternate with each other, specify one long sequence of rules that stipulate one action followed by one long sequence of rules that stipulate the other action. You can interweave allow and forbid rules to achieve the goals of your crawl space, but grouping the allow rules together and the forbid rules together can improve crawler performance.
These options provide additional ways for you to specify content for the crawl space. You can exclude certain types of documents based on the document's file extension, and you can include certain types of documents based on the document's MIME type. When you specify which MIME types you want the crawler to crawl, consider that the MIME type is often set incorrectly in Web documents.
The maximum URL path depth is the number of slashes in a URL from its site root. This option enables you to prevent the crawler from being drawn into recursive file system structures of infinite depth. The crawl depth does not correspond to the levels that the crawler traverses when it follows links from one document to another.
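Counting path depth as described above amounts to counting the slashes in the path portion of the URL. A minimal sketch, assuming the depth is simply the slash count of the parsed path:

```python
# Sketch: URL path depth as the number of slashes in the path portion
# of the URL, i.e. levels below the site root (illustrative only).
from urllib.parse import urlparse

def url_path_depth(url: str) -> int:
    return urlparse(url).path.count("/")

print(url_path_depth("http://w3.ibm.com/a/b/c.html"))  # 3
print(url_path_depth("http://w3.ibm.com/"))            # 1
```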
The maximum link depth controls how many documents the crawler is to include when it follows links from one document to another. If a link to a document exceeds the maximum link depth, the document is excluded from the crawl space.
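Maximum link depth can be pictured as a breadth-first walk from the start URLs, in which each discovered link is one level deeper than the page on which it was found. The following is a hypothetical sketch, not the crawler's actual algorithm; the `links` mapping stands in for fetching a page and extracting its links:

```python
# Sketch of maximum link depth: pages reached by following more than
# max_link_depth links from a start URL are excluded from the crawl.
from collections import deque

def crawl_order(start_urls, links, max_link_depth):
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    included = []
    while queue:
        url, depth = queue.popleft()
        included.append(url)
        if depth == max_link_depth:
            continue  # links found here would exceed the maximum
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return included

# A chain a -> b -> c -> d with a maximum link depth of 2:
links = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(crawl_order(["a"], links, max_link_depth=2))  # ['a', 'b', 'c']
```

With a maximum link depth of 2, page d is excluded because reaching it requires following three links from the start URL.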
If you change these advanced properties, you must restart the Web crawler so that the index can be updated. For example, if you specify a larger number for the maximum link depth, previously excluded pages might be included. If you specify a smaller number for the maximum link depth, pages that were previously included are removed from the index. Documents that exceed the maximum link depth are removed the next time they are crawled.
Start URLs are the URLs with which the crawler begins crawling; they are inserted into the crawl every time the crawler is started. If the start URLs were already discovered, they are not crawled or recrawled sooner than other Web sites that you allow in the crawling rules.
A start URL is important the first time that a Web crawler is started and the crawl space is empty. A start URL is also important when you add a URL that was not previously discovered to the list of start URLs in a crawl space.
Start URLs must be fully qualified URLs, not just domain names. You must specify the protocol and, if the port is not 80, the port number. For example, the following start URLs are valid:

http://w3.ibm.com/
http://sales.ibm.com:9080/

The following start URL is not valid, because it does not specify a protocol:

www.ibm.com
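A quick way to check that a start URL is fully qualified is to parse it and verify that both a protocol and a host are present. A minimal sketch using Python's standard library (illustrative, not the product's validation logic):

```python
# Sketch: a start URL is fully qualified only if it includes a
# protocol (http or https) and a host name.
from urllib.parse import urlparse

def is_fully_qualified(url: str) -> bool:
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(is_fully_qualified("http://w3.ibm.com/"))         # True
print(is_fully_qualified("http://sales.ibm.com:9080/")) # True
print(is_fully_qualified("www.ibm.com"))                # False
```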
You must include the start URLs in your crawling rules; the crawler cannot begin crawling from a specified start URL if the crawling rules do not allow that URL to be crawled.

If a start URL specifies an IPv6 address, enclose the address in square brackets. For example, the following start URLs specify the same host in full and in compressed notation:

http://[2001:db8:0:1:0:0:0:1]
http://[2001:db8:0:1::1]