All search engines include four main components:
- Crawling, Seeds, and Connectors: The crawler collects the raw data that will eventually form the search
results. When crawling web pages, the crawler starts at the user-specified seed URLs and
downloads those pages. It then locates hyperlinks on the downloaded pages and
schedules the newly discovered pages for further crawling. Configuration information
determines which pages need to be crawled and how to crawl them.
- Converting: The converter processes the raw data discovered by the crawler and
produces one or more pieces of indexable data. The raw data may be encoded in any number of
formats, including archives, compressed files, PDFs, or Microsoft Word files. Most search
engines do not expose the conversion step. In the Watson Explorer Engine Search Engine, this
step is heavily customizable, supporting extremely flexible processing of the raw data,
including sophisticated metadata extraction and, optionally, metadata generation. Watson Explorer
Engine also provides an advanced title extractor that infers document titles from documents
in PDF, Word, and other formats. The final output of the conversion process is XML in
the IBM XML format.
- Indexing: The indexer processes the textual data produced by the converter and
builds data structures to facilitate the efficient search and retrieval of this information.
In the Watson Explorer Engine Search Engine, indexing also produces signatures that will be
used for near-duplicate elimination at search time. The indexer service for each collection
is the process that actually serves the results.
- Searching: The search process (called the query-service) runs continuously and
proxies requests to the correct indexer service.
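The crawl loop described in the first component (seed URLs, download, link extraction, scheduling) can be sketched in a few lines of Python. This is an illustrative breadth-first crawler over an in-memory link graph, not Watson Explorer Engine code; `fetch_links` and `link_graph` are hypothetical stand-ins for real HTTP fetching and HTML parsing.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: start from the seed URLs, download each page,
    extract its hyperlinks, and schedule unseen URLs for further crawling."""
    frontier = deque(seeds)   # pages scheduled for crawling
    seen = set(seeds)         # avoid re-crawling the same URL
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):   # hyperlinks found on the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)   # newly discovered page
    return crawled

# Toy link graph standing in for real web pages.
link_graph = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}
order = crawl(["http://example.com/"], lambda u: link_graph.get(u, []))
```

In a real crawler the fetch step would also honor the configuration information mentioned above (which URLs to include, crawl depth, politeness rules, and so on).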
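The signatures produced during indexing can be approximated with a simple shingle-based scheme: hash every short run of consecutive words, and compare the resulting sets at search time. The sketch below is a generic near-duplicate signature, not the proprietary one used by Watson Explorer Engine; `jaccard` overlap is one common way to compare such signatures.

```python
def signature(text, shingle_size=3):
    """Hash every run of `shingle_size` consecutive words into a set.
    Near-duplicate documents share most of their shingles, so comparing
    the sets flags candidates for elimination at search time."""
    words = text.lower().split()
    return {hash(" ".join(words[i:i + shingle_size]))
            for i in range(len(words) - shingle_size + 1)}

def jaccard(a, b):
    """Overlap of two signatures: 1.0 means identical shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = signature("the quick brown fox jumps over the lazy dog")
s2 = signature("the quick brown fox jumped over the lazy dog")
s3 = signature("completely different text about search engines here")
```

Here `s1` and `s2` differ by one word and share most shingles, while `s3` shares none, so a threshold on the overlap would treat the first pair as near-duplicates.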
The Watson Explorer Engine administration tool includes a configuration tab with a
sub-section for each of these components. Additionally, the Query Service configuration specifies options that apply to all collections.
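The relationship between the long-running query-service and the per-collection indexer services can be pictured as a simple router. This is a conceptual sketch with hypothetical names (`QueryService`, `register`, `search`), not the product's actual interfaces.

```python
class QueryService:
    """Conceptual query-service: one continuously running front end that
    proxies each search request to the indexer service of the named
    collection, which is the process that actually serves the results."""
    def __init__(self):
        self.indexers = {}   # collection name -> indexer service (a callable here)

    def register(self, collection, indexer):
        self.indexers[collection] = indexer

    def search(self, collection, query):
        # Proxy the request to the indexer serving that collection.
        return self.indexers[collection](query)

qs = QueryService()
qs.register("news", lambda q: [f"news hit for {q}"])
qs.register("docs", lambda q: [f"docs hit for {q}"])
results = qs.search("news", "crawler")
```

In the real system the indexer services are separate processes, one per collection, and the query-service forwards requests over the network rather than by function call.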
The data and configuration for a search are collectively called a collection. There is no limit to
the number of collections that may be created. Each collection contains the live data
and (potentially) the staging data. The live data is used for the current search. The
staging data accumulates information while a new copy of the collection is being
crawled and indexed. These concepts are explained in more detail in the Live vs. Staging section. A collection may also have a working copy, which is used to change and test a new
configuration.
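The live/staging split can be pictured as two slots per collection: searches always read the live slot while a recrawl fills the staging slot, which is then promoted in one step. This is a conceptual sketch with hypothetical names (`Collection`, `add_staged`, `promote`), not the actual Watson Explorer Engine implementation.

```python
class Collection:
    """Conceptual model of a collection's live and staging data."""
    def __init__(self, name):
        self.name = name
        self.live = []      # data used for the current search
        self.staging = []   # data accumulated by an in-progress recrawl

    def add_staged(self, doc):
        # A recrawl writes here; live searches are unaffected.
        self.staging.append(doc)

    def promote(self):
        # Swap the freshly crawled and indexed copy in as the live data.
        self.live, self.staging = self.staging, []

c = Collection("news")
c.live = ["old-doc"]
c.add_staged("new-doc-1")
c.add_staged("new-doc-2")
c.promote()
```

The point of the two slots is that queries never see a half-built index: the staging data only becomes visible once the new copy of the collection is complete.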
A new collection starts as a copy of the configuration of an existing collection, normally
the default collection. The default collection can be used to specify organization-wide
default options. For example, if a proxy is required, the proxy information can be entered in
the default collection, and it will be used for all subsequently created collections.
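This inheritance can be modeled as a configuration copy at creation time: settings such as a proxy entered on the default collection are picked up by every collection created afterwards, and can still be overridden per collection. The dictionary-based sketch below uses hypothetical setting names.

```python
import copy

# Organization-wide defaults, as entered on the default collection.
default_collection = {
    "proxy": "http://proxy.example.com:8080",
    "crawl-depth": 5,
}

def new_collection(overrides=None, template=default_collection):
    """A new collection starts as a copy of the template's configuration;
    individual settings may then be overridden."""
    config = copy.deepcopy(template)
    config.update(overrides or {})
    return config

intranet = new_collection({"crawl-depth": 2})
```

Note that the copy happens once, at creation: later changes to the default collection do not propagate to collections that already exist.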
The crawler is based on a recursive rule system, offering extensive control over the crawl.
Documents are generated using an open, extensible framework in which you can insert arbitrary
programs and scripts, as well as XSL transformations or Watson Explorer Engine
transformations.
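The extensible conversion framework can be pictured as a pipeline of pluggable stages, each taking one representation of a document and producing the next, with custom programs, scripts, or transformations slotted in anywhere. The sketch below is a generic pipeline with hypothetical stage names (`decode`, `extract_title`, `to_xml`), not the product's converter API.

```python
def run_pipeline(raw, stages):
    """Apply each conversion stage in order. Any callable can be a stage,
    which is what makes the framework open: arbitrary programs, scripts,
    or transformations can be inserted into the chain."""
    data = raw
    for stage in stages:
        data = stage(data)
    return data

# Hypothetical stages: decode raw bytes, infer a title, emit simple XML.
def decode(raw_bytes):
    return raw_bytes.decode("utf-8")

def extract_title(text):
    # Crude title inference: take the first line as the title.
    first_line, _, body = text.partition("\n")
    return {"title": first_line.strip(), "body": body.strip()}

def to_xml(doc):
    return (f"<document><title>{doc['title']}</title>"
            f"<content>{doc['body']}</content></document>")

xml = run_pipeline(b"Annual Report\nRevenue grew.", [decode, extract_title, to_xml])
```

The real converter ends the same way conceptually: whatever stages run in between, the final output is XML in the IBM XML format.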