To include pages from Web sites in a collection, you must
configure a Web crawler.
You can use the Web crawler to crawl any number of Hypertext Transfer
Protocol (HTTP) servers and secure HTTP (HTTPS) servers. The crawler
visits a Web site and reads the data on the site. It then follows links
in documents to crawl additional documents. The Web crawler can crawl
and extract links from individual pages or framesets (pages
that are created with HTML frames).
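The crawl loop described above (visit a page, read its content, then follow the links it contains) can be sketched generically. This is an illustrative model only, not the product's implementation; the tiny in-memory "site" and page names are assumptions made for demonstration.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, start):
    """Breadth-first crawl over an in-memory site: {url: html}."""
    visited, queue = set(), [start]
    while queue:
        url = queue.pop(0)
        if url in visited or url not in site:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(site[url])
        queue.extend(parser.links)  # follow links to additional documents
    return visited

# Hypothetical three-page site used only for illustration.
site = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/index.html">home</a>',
    "/b.html": "<p>no links here</p>",
}
print(sorted(crawl(site, "/index.html")))  # ['/a.html', '/b.html', '/index.html']
```

A real crawler adds politeness delays, revisit policies, and frameset handling on top of this basic loop.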
The crawled data can be in any of many common formats and can come
from various sources within your intranet or the Internet. Common
formats include HTML, PDF, Microsoft Word, Lotus® WordPro, and
Extensible Markup Language (XML).
When you create the crawler, a wizard helps you do these tasks:
- Specify properties that control how the crawler operates and uses
system resources. The crawler properties control how the crawler crawls
all Web pages in the crawl space.
- Specify rules to allow and forbid visits to Web sites. When you
specify crawling rules, you can test the rules and verify that the
crawler is able to access the sites that you want to include in the
crawl space.
- Specify options to include certain types of files and exclude
files with certain file extensions.
- Specify rules for how the Web crawler handles soft error pages.
- Configure document-level security options. If security was enabled
when the collection was created, the crawler can associate security
data with documents in the index. This data enables enterprise search applications
to enforce access controls based on the stored access control lists
or security tokens.
- Specify options for crawling password-protected Web sites (the
Web servers to be crawled must use HTTP basic authentication or HTML
forms to prompt for passwords).
- Specify options to crawl Web sites that are served by a proxy
server.
- Specify schedules for crawling specific Web servers
at specific times.
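The allow and forbid rules, together with the file-extension filters, can be thought of as an ordered rule list that the crawler checks for each candidate URL. The prefix-based rule format below is an assumption chosen for illustration and is not the product's actual rule syntax.

```python
# Hypothetical crawling rules: the first matching prefix rule wins.
RULES = [
    ("forbid", "http://example.com/private/"),
    ("allow", "http://example.com/"),
]
# Hypothetical extension filter: exclude files with these extensions.
EXCLUDED_EXTENSIONS = {".exe", ".zip"}

def may_crawl(url):
    """Apply forbid/allow prefix rules, then the extension filter."""
    for action, prefix in RULES:
        if url.startswith(prefix):
            if action == "forbid":
                return False
            break  # allowed by this rule; go on to the extension check
    else:
        return False  # no rule matched: default is to forbid
    return not any(url.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS)

print(may_crawl("http://example.com/docs/guide.html"))    # True
print(may_crawl("http://example.com/private/report.pdf")) # False
print(may_crawl("http://example.com/setup.exe"))          # False
```

Testing rules against a few representative URLs in this way, before crawling, is the same idea as the rule-verification step the wizard offers.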
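A soft error page is a page that the server returns with a success status code even though its content is really an error message (for example, a customized "page not found" page). One detection heuristic can be sketched as follows; the indicator phrases here are assumptions for illustration, and a real crawler would use configurable, site-specific rules instead.

```python
# Hypothetical indicator phrases; placeholders for configurable rules.
SOFT_ERROR_PHRASES = ("page not found", "no longer available")

def is_soft_error(status_code, body):
    """Flag pages that claim success but carry error content."""
    if status_code != 200:
        return False  # a real error status is handled separately
    text = body.lower()
    return any(phrase in text for phrase in SOFT_ERROR_PHRASES)

print(is_soft_error(200, "<h1>Page Not Found</h1>"))      # True
print(is_soft_error(200, "<h1>Quarterly Report</h1>"))    # False
```

Treating such pages as errors keeps useless placeholder content out of the index.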
Crawler connection credentials
When
you create the crawler, you can specify credentials that allow the
crawler to connect to the sources to be crawled. You can also configure
connection credentials when you specify general security settings
for the system. If you use the latter approach, multiple crawlers
and other system components can use the same credentials. For example,
the search servers can use the credentials when determining whether
a user is authorized to access content.
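For sites that use HTTP basic authentication, the stored user name and password are ultimately sent to the Web server as a base64-encoded Authorization header. A minimal sketch of how that header value is formed (the user name and password are placeholders):

```python
import base64

def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

Because base64 is an encoding, not encryption, such credentials should only be sent over HTTPS connections.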