Custom text processing

You can improve the quality and precision of search results by integrating custom text processing algorithms with collections.

Watson Explorer Content Analytics supports the Apache Unstructured Information Management Architecture (UIMA), which is a framework for creating, discovering, composing, and deploying text analysis functions. Application developers create and test analysis algorithms for the content to be searched, then create a processing engine archive (.pear file) that includes all of the resources required to use the archive. To be able to query collections with your custom analysis algorithms, you must add the archive (which contains the text analysis engine) to the system.

In addition to the system text analysis engine, a collection that is based on a solution package can be associated with other text analysis engines, known as solution text analysis engines, that are provided in the solution package or are installed in the collection by exporting a UIMA pipeline from Content Analytics Studio.

The analysis logic component in a text analysis engine is called an annotator. Each annotator performs specific linguistic analysis tasks. A text analysis engine can contain any number of annotators, or it can be a composite of several text analysis engines, each of which contains its own custom annotators.

The information produced by the annotators is referred to as the analysis results. Analysis results, which correspond to the information that you want to search for, are written to a data structure called a common analysis structure.
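
The relationship between annotators and the common analysis structure can be pictured with a minimal sketch. This is not the UIMA API: ToyCas, Annotation, person_annotator, phone_annotator, and run_pipeline are hypothetical names invented for this illustration, and the sample text is made up. The sketch only shows the idea that each annotator reads the document text and writes its analysis results into one shared structure.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One analysis result: a typed span over the document text."""
    type: str
    begin: int
    end: int
    text: str


@dataclass
class ToyCas:
    """Hypothetical stand-in for a common analysis structure:
    the document text plus every annotation written so far."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        # Return all annotations of the requested type.
        return [a for a in self.annotations if a.type == type_name]


def phone_annotator(cas):
    """Toy annotator: marks strings shaped like 555-0142 as PhoneNumber."""
    for m in re.finditer(r"\b\d{3}-\d{4}\b", cas.text):
        cas.annotations.append(
            Annotation("PhoneNumber", m.start(), m.end(), m.group())
        )


def person_annotator(cas, known_names=("Alex",)):
    """Toy annotator: marks names from a fixed list as Person."""
    for name in known_names:
        for m in re.finditer(rf"\b{re.escape(name)}\b", cas.text):
            cas.annotations.append(
                Annotation("Person", m.start(), m.end(), m.group())
            )


def run_pipeline(text, annotators):
    """Run each annotator in order against one shared structure,
    the way a composite engine chains its annotators."""
    cas = ToyCas(text)
    for annotate in annotators:
        annotate(cas)
    return cas


cas = run_pipeline("Call Alex at 555-0142.", [person_annotator, phone_annotator])
for annotation in cas.annotations:
    print(annotation.type, annotation.begin, annotation.end, annotation.text)
```

In a real deployment, the annotators are packaged in the .pear file and the analysis results are held in the UIMA common analysis structure rather than in a Python list.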

When you configure text processing options for a collection, you perform the following tasks:

Select the system text analysis engine that you want to use for annotating documents in the collection.
If your collection contains XML documents with meaningful markup, and you want to use this markup in your custom text analysis, you can associate mapping files with the collection and map the output of the XML mapping files to the common analysis structure.
For example, you can map the content of <addressee> and <customer> elements to Person annotations in the common analysis structure. These annotations can then be accessed by your custom annotators, which might detect additional information (for example, they might detect the gender of the Person). You can also map Person annotations to the index, which allows users to search for Persons without having to know the original names of the XML elements.

If you want to allow users to specify the original XML elements in queries, then you do not need to define any XML mappings. Instead, you can configure parsing options and enable native XML search for the collection.
Map the common analysis structure to the index, which enables the annotated documents to be searched with semantic search.
For example, depending on the entities and relationships that are detected by the annotators, users can search for concepts that occur in the same sentence (such as a specific person and any competitor name), or a keyword and a concept (such as the name Alex and a phone number).
Map the common analysis structure to a relational database. You can map data to IBM® DB2® tables or Oracle tables. This type of mapping enables the results of analysis to be used in database applications such as data mining. It also enables you to use SQL queries to search the data outside of Watson Explorer Content Analytics.
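
The XML mapping and relational database mapping tasks can be illustrated with a rough sketch. It does not use the product's mapping file format or its database mapping configuration. The ELEMENT_TO_ANNOTATION dictionary, the annotation table and its columns, the function names, and the sample document are all assumptions made for this example, and an in-memory SQLite database stands in for DB2 or Oracle. The sketch maps the content of <addressee> and <customer> elements to Person annotations, stores them as rows, and then retrieves them with SQL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping: XML element name -> annotation type.
ELEMENT_TO_ANNOTATION = {"addressee": "Person", "customer": "Person"}


def map_xml_to_annotations(doc_id, xml_text):
    """Turn the content of mapped XML elements into annotation rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter():
        annotation_type = ELEMENT_TO_ANNOTATION.get(element.tag)
        if annotation_type and element.text:
            rows.append((doc_id, annotation_type, element.text.strip()))
    return rows


def store_annotations(connection, rows):
    """Write annotation rows to a table (SQLite here, as a stand-in
    for a DB2 or Oracle table)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS annotation "
        "(doc_id TEXT, type TEXT, covered_text TEXT)"
    )
    connection.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)
    connection.commit()


# Made-up sample document for illustration only.
sample = """<letter>
  <addressee>Alex Chen</addressee>
  <customer>Maria Lopez</customer>
  <subject>Invoice</subject>
</letter>"""

connection = sqlite3.connect(":memory:")
store_annotations(connection, map_xml_to_annotations("doc-1", sample))

# Query by annotation type, without knowing the original element names.
persons = [
    row[0]
    for row in connection.execute(
        "SELECT covered_text FROM annotation "
        "WHERE type = 'Person' ORDER BY covered_text"
    )
]
print(persons)
```

The final query asks for Person rows without referring to addressee or customer, which mirrors the point that users can search for Persons without knowing the original names of the XML elements.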