To include pages from Web sites in a collection, you must
configure a Web crawler.
You can use the Web crawler to crawl any number of Hypertext Transfer
Protocol (HTTP) servers and secure HTTP (HTTPS) servers. The crawler
visits a Web site and reads the data on the site. It then follows links
in documents to crawl additional documents. The Web crawler can crawl
and extract links from individual pages or framesets (pages
that are created with HTML frames).
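The crawl loop described above (visit a page, read its content, then follow the links it contains) can be sketched generically. This is an illustrative model only, not the product's implementation; the tiny in-memory "site" and page names are assumptions made for demonstration.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, start):
    """Breadth-first crawl over an in-memory site: {url: html}."""
    visited, queue = set(), [start]
    while queue:
        url = queue.pop(0)
        if url in visited or url not in site:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(site[url])
        queue.extend(parser.links)  # follow links to additional documents
    return visited

# Hypothetical three-page site used only for illustration.
site = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/index.html">home</a>',
    "/b.html": "<p>no links here</p>",
}
print(sorted(crawl(site, "/index.html")))  # ['/a.html', '/b.html', '/index.html']
```

A real crawler adds politeness delays, revisit policies, and frameset handling on top of this basic loop.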
The crawled data can be in any of many common formats and can come
from various sources within your intranet or the Internet. Common
formats include HTML, PDF, Microsoft Word, Lotus® WordPro, and
Extensible Markup Language (XML).
When you create the crawler, a wizard helps you do these tasks:
- Specify properties that control how the crawler operates and uses
system resources. The crawler properties control how the crawler crawls
all Web pages in the crawl space.
- Specify rules to allow and forbid visits to Web sites. When you
specify crawling rules, you can test the rules and verify that the
crawler is able to access the sites that you want to include in the
crawl space.
- Specify options to include certain types of files and exclude
files with certain file extensions.
- Specify rules for how the Web crawler handles soft error pages.
- Configure document-level security options. If security was enabled
when the collection was created, the crawler can associate security
data with documents in the index. This data enables enterprise search applications
to enforce access controls based on the stored access control lists
or security tokens.
- Specify options for crawling password-protected Web sites (the
Web servers to be crawled must use HTTP basic authentication or HTML
forms to prompt for passwords).
- Specify options to crawl Web sites that are served by a proxy
server.
- Specify schedules for crawling specific Web servers
at specific times.
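The allow and forbid rules, together with the file-extension filters, can be thought of as an ordered rule list that the crawler checks for each candidate URL. The prefix-based rule format below is an assumption chosen for illustration and is not the product's actual rule syntax.

```python
# Hypothetical crawling rules: the first matching prefix rule wins.
RULES = [
    ("forbid", "http://example.com/private/"),
    ("allow", "http://example.com/"),
]
# Hypothetical extension filter: exclude files with these extensions.
EXCLUDED_EXTENSIONS = {".exe", ".zip"}

def may_crawl(url):
    """Apply forbid/allow prefix rules, then the extension filter."""
    for action, prefix in RULES:
        if url.startswith(prefix):
            if action == "forbid":
                return False
            break  # allowed by this rule; go on to the extension check
    else:
        return False  # no rule matched: default is to forbid
    return not any(url.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS)

print(may_crawl("http://example.com/docs/guide.html"))    # True
print(may_crawl("http://example.com/private/report.pdf")) # False
print(may_crawl("http://example.com/setup.exe"))          # False
```

Testing rules against a few representative URLs in this way, before crawling, is the same idea as the rule-verification step the wizard offers.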
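A soft error page is a page that the server returns with a success status code even though its content is really an error message (for example, a customized "page not found" page). One detection heuristic can be sketched as follows; the indicator phrases here are assumptions for illustration, and a real crawler would use configurable, site-specific rules instead.

```python
# Hypothetical indicator phrases; placeholders for configurable rules.
SOFT_ERROR_PHRASES = ("page not found", "no longer available")

def is_soft_error(status_code, body):
    """Flag pages that claim success but carry error content."""
    if status_code != 200:
        return False  # a real error status is handled separately
    text = body.lower()
    return any(phrase in text for phrase in SOFT_ERROR_PHRASES)

print(is_soft_error(200, "<h1>Page Not Found</h1>"))      # True
print(is_soft_error(200, "<h1>Quarterly Report</h1>"))    # False
```

Treating such pages as errors keeps useless placeholder content out of the index.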
Crawler connection credentials
When
you create the crawler, you can specify credentials that allow the
crawler to connect to the sources to be crawled. You can also configure
connection credentials when you specify general security settings
for the system. If you use the latter approach, multiple crawlers
and other system components can use the same credentials. For example,
the search servers can use the credentials when determining whether
a user is authorized to access content.
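For sites that use HTTP basic authentication, the stored user name and password are ultimately sent to the Web server as a base64-encoded Authorization header. A minimal sketch of how that header value is formed (the user name and password are placeholders):

```python
import base64

def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

Because base64 is an encoding, not encryption, such credentials should only be sent over HTTPS connections.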