Basic concepts in text analysis processing

Basic concepts that are used in text analysis processing include annotators, analysis results, feature structures, types, type systems, annotations, and the common analysis structure.

Annotators contain the logic that analyzes text and that discovers and records descriptive data about the document as a whole (referred to as document metadata) and about parts of the document. This descriptive data is referred to as analysis results. Analysis results that describe part of a document annotate a contiguous substring of the text (also referred to as a span of text).

The annotators that discover and analyze text are contained in an analysis engine, a central concept in the Unstructured Information Management Architecture (UIMA). An analysis engine might contain a single annotator, or it might be a composite of many engines, each of which in turn contains annotators.
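The composite arrangement described above can be sketched in plain Java. This is a minimal illustration and does not use the UIMA API: the names EngineSketch, PrimitiveEngineSketch, and AggregateEngineSketch are hypothetical, and analysis results are reduced to strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an analysis engine: reads text, appends analysis results.
interface EngineSketch {
    void process(String text, List<String> results);
}

// An engine that wraps exactly one annotator (represented here only by its name).
final class PrimitiveEngineSketch implements EngineSketch {
    private final String annotatorName;

    PrimitiveEngineSketch(String annotatorName) {
        this.annotatorName = annotatorName;
    }

    @Override
    public void process(String text, List<String> results) {
        // A real annotator would inspect the text; this one only records that it ran.
        results.add(annotatorName + " analyzed " + text.length() + " characters");
    }
}

// A composite engine that delegates to its child engines in order.
final class AggregateEngineSketch implements EngineSketch {
    private final List<EngineSketch> children;

    AggregateEngineSketch(EngineSketch... children) {
        this.children = new ArrayList<>(Arrays.asList(children));
    }

    @Override
    public void process(String text, List<String> results) {
        for (EngineSketch child : children) {
            child.process(text, results);
        }
    }
}
```

Because a composite is itself an engine, composites can be nested to any depth, and the caller treats a single annotator and a whole pipeline in the same way.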

A feature structure is the underlying data structure that represents an analysis result. A feature structure is an attribute-value structure. Each feature structure has a type, and every type has a specified set of valid features (also called attributes or properties), much like a Java™ class. Each feature has a range type that indicates the type of value that the feature must have, such as String. All annotators in UIMA store data in feature structures.

For example, the text span "James Matthew Bloggs" might be covered by an annotation of type Person with the features personName, age, nationality, and profession.
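The relationship between a type, its features, and their range types can be sketched in plain Java. This is an illustrative model only, not the UIMA API: the class names SketchType and SketchFeatureStructure are hypothetical, and Java classes such as String.class stand in for range types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a type: a name plus its valid features and their range types.
final class SketchType {
    final String name;
    final Map<String, Class<?>> features = new LinkedHashMap<>();

    SketchType(String name) {
        this.name = name;
    }

    // Declares a feature and returns this type so that declarations can be chained.
    SketchType feature(String featureName, Class<?> rangeType) {
        features.put(featureName, rangeType);
        return this;
    }
}

// Hypothetical feature structure: attribute-value pairs checked against the type.
final class SketchFeatureStructure {
    final SketchType type;
    private final Map<String, Object> values = new LinkedHashMap<>();

    SketchFeatureStructure(SketchType type) {
        this.type = type;
    }

    void set(String featureName, Object value) {
        Class<?> range = type.features.get(featureName);
        if (range == null) {
            throw new IllegalArgumentException(
                "Type " + type.name + " has no feature " + featureName);
        }
        if (!range.isInstance(value)) {
            throw new IllegalArgumentException(
                "Feature " + featureName + " requires a " + range.getSimpleName());
        }
        values.put(featureName, value);
    }

    Object get(String featureName) {
        return values.get(featureName);
    }
}
```

With a Person type that declares personName as a String and age as an Integer, setting age to a String value is rejected, which mirrors the role of the range type.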

The type system defines the types of objects (feature structures) that can be discovered in the input text. It defines all possible feature structures in terms of types and features (attributes), much like a class hierarchy in Java. You can define any number of different types in a type system. A type system is domain- and application-specific.
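The comparison with a class hierarchy can be made concrete with a small sketch. The TypeSystemSketch class below is hypothetical and is not the UIMA type system API; it assumes single inheritance and shows only how a subtype inherits the features of its parent type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical type system: each type has an optional parent and its own features.
final class TypeSystemSketch {
    private final Map<String, String> parentOf = new HashMap<>();
    private final Map<String, List<String>> declaredFeatures = new HashMap<>();

    // Pass null as the parent to define a root type.
    void addType(String name, String parent, String... features) {
        if (parent != null && !declaredFeatures.containsKey(parent)) {
            throw new IllegalArgumentException("Unknown parent type: " + parent);
        }
        parentOf.put(name, parent);
        declaredFeatures.put(name, Arrays.asList(features));
    }

    // All valid features of a type: inherited features first, then its own.
    List<String> featuresOf(String name) {
        if (!declaredFeatures.containsKey(name)) {
            throw new IllegalArgumentException("Unknown type: " + name);
        }
        List<String> all = new ArrayList<>();
        String parent = parentOf.get(name);
        if (parent != null) {
            all.addAll(featuresOf(parent));
        }
        all.addAll(declaredFeatures.get(name));
        return all;
    }

    // True if the type is the given ancestor or descends from it.
    boolean isSubtype(String type, String ancestor) {
        for (String t = type; t != null; t = parentOf.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, if a root type Annotation declares begin and end, and Person is added as a subtype that declares personName and age, then the features of Person are begin, end, personName, and age.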

Most of the text analysis annotators produce their analysis results in the form of annotations. Annotations are a special kind of feature structure that is designated for linguistic analysis processing. An annotation spans or covers a piece of input text and is defined in terms of its beginning and end positions in the input text.

For example, for the text "100.55 US Dollars", an annotator that recognizes monetary expressions creates an annotation of type monetaryExpression that covers the text and has the feature currencySymbol set to "$".
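The monetary expression example can be sketched as a toy annotator in plain Java. The classes MoneyAnnotation and MoneyAnnotatorSketch are hypothetical and do not use the UIMA API. The sketch assumes that begin is the offset of the first covered character and that end is the offset one past the last covered character, and the regular expression recognizes only the literal phrase "US Dollars".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical annotation: a typed span with begin and end offsets plus features.
final class MoneyAnnotation {
    final String type = "monetaryExpression";
    final int begin; // offset of the first covered character
    final int end;   // offset one past the last covered character (assumption)
    final Map<String, String> features = new LinkedHashMap<>();

    MoneyAnnotation(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // Returns the piece of input text that this annotation covers.
    String coveredText(String documentText) {
        return documentText.substring(begin, end);
    }
}

// Toy annotator that recognizes amounts such as "100.55 US Dollars".
final class MoneyAnnotatorSketch {
    private static final Pattern US_DOLLARS =
        Pattern.compile("\\d+(?:\\.\\d+)? US Dollars");

    static List<MoneyAnnotation> annotate(String documentText) {
        List<MoneyAnnotation> found = new ArrayList<>();
        Matcher matcher = US_DOLLARS.matcher(documentText);
        while (matcher.find()) {
            MoneyAnnotation annotation =
                new MoneyAnnotation(matcher.start(), matcher.end());
            annotation.features.put("currencySymbol", "$");
            found.add(annotation);
        }
        return found;
    }
}
```

Because the annotation stores only offsets, the covered text is always recovered from the document itself and is not copied into the analysis result.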

All feature structures are represented in a central data structure called the common analysis structure. All data exchange is handled by using the common analysis structure.

The common analysis structure contains the following objects:

The text document
The type system description that indicates the types, subtypes, and their features
Analysis results that describe the document or regions of the document
A repository that supports access to and iteration over the analysis results
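
The four parts listed above can be pictured as one container object. The following sketch is a simplified, hypothetical model in plain Java, not the UIMA CAS interface: CasSketch and CasEntry are invented names, the type system description is reduced to a set of type names, and the repository is a list that can be filtered by type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical analysis result: a typed span over the document text.
final class CasEntry {
    final String type;
    final int begin;
    final int end;

    CasEntry(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical container that mirrors the parts of the common analysis structure.
final class CasSketch {
    private final String documentText;                         // the text document
    private final Set<String> knownTypes;                      // stands in for the type system description
    private final List<CasEntry> results = new ArrayList<>();  // the analysis results

    CasSketch(String documentText, String... typeNames) {
        this.documentText = documentText;
        this.knownTypes = new HashSet<>(Arrays.asList(typeNames));
    }

    String getDocumentText() {
        return documentText;
    }

    // Stores a result after checking it against the type names and the text length.
    void add(CasEntry entry) {
        if (!knownTypes.contains(entry.type)) {
            throw new IllegalArgumentException("Unknown type: " + entry.type);
        }
        if (entry.begin < 0 || entry.begin > entry.end || entry.end > documentText.length()) {
            throw new IllegalArgumentException("Span is outside the document");
        }
        results.add(entry);
    }

    // Repository-style access: all results of one type, ordered by begin offset.
    List<CasEntry> select(String type) {
        List<CasEntry> selected = new ArrayList<>();
        for (CasEntry entry : results) {
            if (entry.type.equals(type)) {
                selected.add(entry);
            }
        }
        selected.sort(Comparator.comparingInt((CasEntry entry) -> entry.begin));
        return selected;
    }
}
```

In this model, annotators exchange data only by adding results to the container and selecting results from it, which reflects the statement that all data exchange is handled through the common analysis structure.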