Parse and index administration

To enhance your ability to find relevant documents or explore deviations and trends in data, you can specify options for how documents and metadata are to be parsed and analyzed before they are added to the index. To ensure that users always have access to the latest information, incremental index updates occur automatically when new documents are ready to be indexed.

The indexing services provide both document processing and indexing capabilities. First, the index service reads crawled documents in the file queue of the data store. To prepare a document for indexing, the document processor tasks extract and analyze the text and metadata for each document. Document processing includes:

Using parsers, such as the text extractor filters and HTML, XML, and text parsers, to analyze content.
Using Apache Unstructured Information Management Architecture (UIMA) annotators to tokenize content and extract entities.

After the text data and metadata are analyzed and tokenized, the index service builds a main text index. It might also build other indexes, such as a category index and a document cache. With incremental indexing, documents can be available for searching in a short amount of time without reindexing the entire collection.

Index partitions

When you create a collection, you can specify whether you want to create multiple index partitions. Partitions enable the system to scale to multiple millions of documents and index documents in parallel. The index partitions are accessed as if they were one index.

If you add search servers and index servers to your system, you can choose the servers that you want to use with each collection. Depending on how many search servers you add, each server can search all partitions independently, or you can choose to distribute the index partitions among the servers. In the latter case, the number of partitions that each search server searches is based on the number of partitions divided by the number of selected search servers.
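
The division described above can be illustrated with a short Python sketch. It is not product code: the function name distribute_partitions, the server names, and the round-robin assignment order are assumptions made for illustration only, and the product might assign partitions differently.

```python
def distribute_partitions(num_partitions, servers):
    """Assign partition numbers to search servers in round-robin order.

    Illustrative only: the round-robin order is an assumption, not the
    documented behavior of the product.
    """
    if not servers:
        raise ValueError("at least one search server is required")
    assignment = {server: [] for server in servers}
    for partition in range(num_partitions):
        # Cycle through the selected servers so that the load stays even.
        server = servers[partition % len(servers)]
        assignment[server].append(partition)
    return assignment


# Hypothetical example: 6 partitions spread over 3 selected search servers.
layout = distribute_partitions(6, ["search1", "search2", "search3"])
for server, partitions in layout.items():
    print(server, partitions)
```

In this example, each search server is responsible for 6 / 3 = 2 partitions. When the number of partitions is not an exact multiple of the number of servers, this sketch gives some servers one more partition than others.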

When you add an index server, you must ensure that the index server and master server share the same data directory (ES_NODE_ROOT). If you choose to use a custom location for the index when you create a collection, ensure that the directory is shared. In a multiple server configuration, the location that you specify must exist on all servers, and the default Watson Explorer Content Analytics administrator ID must have permission to access the directory and subdirectories.

Important: If storage is not shared among the index and search servers, the index must be copied from the index server to the search server. The copy operation can take time, depending on the size of the index. Until the index copy operation is completed, search results might be incomplete.

If an index server fails, you can edit the collection settings to remove the server from the list of selected servers and reconfigure the index servers that are used by the collection.

To avoid performance degradation, keep each partition below 20 million documents. For example, a single index with 40 million documents might require 10 seconds to return search results. If you expect such a collection to grow, you can achieve better performance by creating three index partitions with up to 15 million documents each.
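
The sizing arithmetic in this example can be checked with a small Python sketch. The function name partitions_needed and its parameters are hypothetical. The calculation is a simple ceiling division, not a sizing tool that is provided by the product.

```python
def partitions_needed(expected_documents, max_docs_per_partition):
    """Return the smallest number of partitions that keeps every
    partition at or below max_docs_per_partition documents."""
    if expected_documents < 0 or max_docs_per_partition <= 0:
        raise ValueError("document counts must be positive")
    # Ceiling division with integers; a collection always has at least one partition.
    return max(1, -(-expected_documents // max_docs_per_partition))


# 40 million documents with a target of at most 15 million per partition.
count = partitions_needed(40_000_000, 15_000_000)
print(count)
```

For the figures in the example, the result is three partitions. Choosing a per-partition target below the 20 million guideline leaves room for the collection to grow before any partition reaches that size.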