Search Engine

All search engines include four main components:

  1. Crawling, Seeds, and Connectors: The crawler collects the raw data that will eventually form the search results. When crawling web pages, the crawler starts at the user-specified seed URLs and downloads those pages. It then locates hyperlinks on the downloaded pages and schedules the newly discovered pages for further crawling (a minimal sketch of this loop follows the list). Configuration information determines which pages need to be crawled and how to crawl them.
  2. Converting: The converter processes the raw data discovered by the crawler and produces one or more pieces of indexable data. The raw data may be encoded in any number of formats, including archives, compressed files, PDFs, or Microsoft Word files. Most search engines do not expose the conversion step. In the Watson Explorer Engine Search Engine, this step is heavily customizable, supporting extremely flexible processing of the raw data that includes sophisticated metadata processing and, optionally, metadata generation. Watson Explorer Engine also provides an advanced title extractor that infers document titles from PDF, Word, and other formats. The final output of the conversion process is XML in the IBM XML format.
  3. Indexing: The indexer processes the textual data produced by the converter and builds data structures that enable efficient search and retrieval of this information (a conceptual sketch follows the list). In the Watson Explorer Engine Search Engine, indexing also produces signatures that are used for near-duplicate elimination at search time. The indexer service for each collection is the process that actually serves the results.
  4. Searching: The search process (called the query-service) runs continuously and proxies requests to the correct indexer service.
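
The crawl described in item 1 is essentially a breadth-first traversal: it starts from the seed URLs, downloads each page, extracts its hyperlinks, and schedules any pages it has not seen before. The following is a minimal, self-contained sketch of that loop in Python; it is purely illustrative and does not reflect how the Watson Explorer Engine crawler is actually implemented or configured.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: download pages, extract links, schedule new ones."""
    queue = deque(seed_urls)   # pages scheduled for crawling
    seen = set(seed_urls)      # URLs already discovered, to avoid re-crawling
    pages = {}                 # URL -> raw page data handed to the converter

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue           # skip pages that fail to download
        pages[url] = html
        # Locate hyperlinks on the downloaded page and schedule new ones.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```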

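Item 3 describes building search-time data structures. The sketch below shows the classic form such a structure takes, an inverted index mapping terms to document identifiers, plus a simple content signature of the kind that can drive near-duplicate detection. It is a conceptual illustration only, not the Watson Explorer Engine indexer.

```python
from collections import defaultdict
import hashlib
import re

def build_index(docs):
    """docs: mapping of document id -> plain text produced by the converter."""
    inverted = defaultdict(set)   # term -> set of document ids containing it
    signatures = {}               # document id -> content signature for near-dup checks
    for doc_id, text in docs.items():
        terms = re.findall(r"\w+", text.lower())
        for term in terms:
            inverted[term].add(doc_id)
        # A crude signature: hash of the sorted term set. Real near-duplicate
        # elimination uses fuzzier signatures (e.g., shingling or MinHash).
        signatures[doc_id] = hashlib.md5(
            " ".join(sorted(set(terms))).encode()
        ).hexdigest()
    return inverted, signatures

def search(inverted, query):
    """Return ids of documents containing every query term."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = inverted.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result
```
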
The Watson Explorer Engine administration tool provides a configuration tab with a sub-section for each of these components. Additionally, the Query Service configuration specifies options that apply to all collections.

The data and configuration for a search are together called a collection. There is no limit to the number of collections that may be created. Each collection contains the live data and (potentially) the staging data. The live data is used for the current search. The staging data is used to accumulate information while a new copy of the collection is being crawled and indexed. These concepts are explained in more detail in the Live vs. Staging section. To change and test a new configuration, a collection may also have a working copy.
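
One way to picture the relationship between a collection's live data, staging data, and working copy is as three independent slots, with staged data promoted to live at a well-defined point. The sketch below models only that idea; the field names are hypothetical and make no claim about how Watson Explorer Engine stores collections internally.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Collection:
    """Illustrative model of one collection's copies; names are hypothetical."""
    name: str
    live: Optional[dict] = None            # data serving the current search
    staging: Optional[dict] = None         # data accumulated by a new crawl and index
    working_config: Optional[dict] = None  # configuration changes being tested

    def promote_staging(self):
        """Once a staged crawl and index complete, the new data becomes live."""
        if self.staging is not None:
            self.live, self.staging = self.staging, None
```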

A new collection starts as a copy of the configuration of an existing collection, normally the default collection. The default collection can therefore be used to specify organization-wide default options. For example, if a proxy is required, the proxy information can be entered in the default collection, and it will be used for all subsequently created collections.
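
Because a new collection begins as a copy of an existing collection's configuration, any setting placed in the default collection (such as proxy details) flows into every collection created afterwards, and can still be overridden per collection. The sketch below illustrates only that copy-then-override behavior; the option names are invented for illustration and are not Watson Explorer Engine settings.

```python
import copy

# Hypothetical organization-wide defaults kept in the "default" collection.
default_config = {
    "proxy_host": "proxy.example.com",   # example values, not real setting names
    "proxy_port": 8080,
}

def create_collection(name, base_config=default_config, **overrides):
    """New collections start from a copy of an existing collection's configuration."""
    config = copy.deepcopy(base_config)
    config.update(overrides)
    return {"name": name, "config": config}

news = create_collection("news")                            # inherits the proxy settings
intranet = create_collection("intranet", proxy_port=3128)   # overrides one option
```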

The crawler is based on a recursive rule system, offering extensive control over the crawl. Documents are generated using an open, extensible framework into which you can insert arbitrary programs and scripts as well as XSL or Watson Explorer Engine transformations.
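
As one concrete example of the kind of transformation that can participate in such a pipeline, the snippet below applies a small XSL stylesheet to a converted document using the widely available lxml library. Both the stylesheet and the input structure are invented for illustration; they are not the IBM XML format or a Watson Explorer Engine transformation.

```python
from lxml import etree

# A tiny, hypothetical XSL stylesheet that extracts a title and body text.
stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/page">
    <document>
      <title><xsl:value-of select="head/title"/></title>
      <content><xsl:value-of select="body"/></content>
    </document>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)

# A hypothetical converted page; a real pipeline would feed crawler output here.
page = etree.XML(b"<page><head><title>Example</title></head><body>Hello</body></page>")
result = transform(page)
print(str(result))
```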