XML markup in analysis and search

You can map information in XML structures that are in a document directly to a common analysis structure without writing a UIMA annotator.

If the documents in your collection are in XML and you want to use native XML search to query the documents, you do not need to create a mapping configuration file. You can enable native XML search by configuring parsing options in the administration console.

If the documents in your collection are in XML and you want to exploit the XML markup during text analysis or semantic search, you can map XML elements to the common analysis structure. You might want to map XML elements in the following cases:

The semantics of certain XML elements are precise and can be used in further text analysis steps. These analysis steps can operate directly on the annotations and features created from the XML structures, and are shielded from the potentially different formats of the original documents. For example, the element <addressee> in billing documents usually contains customer names. With a mapping from XML elements to the common analysis structure, the content of this element can be mapped directly to annotations of type Customer. An annotator can then infer a Customer-located-at relationship by using the information that surrounds the Customer annotation.
You want to limit the processing scope of a custom annotator to specified areas in the XML input. For example, you might want to limit the analysis to the content of the <technicianComment> tags only in an annotator that detects car problems.
You want to restrict both text analysis processing and subsequent search to certain parts of the XML document, and filter out irrelevant or non-textual content.
You want to map XML tags that have different names to a common span that is to be used in semantic search. For example, you can map both <mainHeading> and <doc> to title.
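
The idea behind these cases can be illustrated with a minimal Python sketch. It is not the product's parser or mapping file format: the TAG_TO_TYPE dictionary and the annotate function are hypothetical names introduced only for illustration, and the sketch returns the covered text of each mapped element rather than real feature structures with offsets.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping table: XML tag name -> annotation type name.
# Differently named tags can share one target type.
TAG_TO_TYPE = {
    "mainHeading": "title",
    "doc": "title",
    "addressee": "Customer",
}


def annotate(xml_text, tag_to_type=TAG_TO_TYPE):
    """Return (annotation_type, covered_text) pairs in document order."""
    root = ET.fromstring(xml_text)
    annotations = []
    for element in root.iter():  # visits the root and all descendants
        annotation_type = tag_to_type.get(element.tag)
        if annotation_type is not None:
            covered = "".join(element.itertext()).strip()
            annotations.append((annotation_type, covered))
    return annotations


if __name__ == "__main__":
    bill = "<bill><mainHeading>Invoice 7</mainHeading><addressee>Jane Doe</addressee></bill>"
    for annotation in annotate(bill):
        print(annotation)
```

A downstream analysis step would then work only with the title and Customer annotations, regardless of which tag names the source documents used.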

In these cases, you must create an XML elements to the common analysis structure mapping file that defines which XML elements map to which feature structures. The feature structures that you define in the mapping file are created when the documents are parsed, and can then be accessed by the custom annotators.

You can use more than one XML elements to the common analysis structure mapping file for a document collection. Which mapping file is used for which XML document is determined by the <identifier> element. The <identifier> element in the mapping file must match the root element in the XML document. For example, if the root element of your document is doc, the value of the <identifier> element in the mapping file must also be "doc".

If no match is found, the program will search for a mapping file with the <identifier> element set to Default. If no default mapping is found, the textual sections of the document (with no tag information) are mapped to the document annotation in the common analysis structure.
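
The lookup order described above (root element match, then Default, then plain text) can be sketched as follows. This is an illustration under assumptions, not the product's implementation: select_mapping, plain_text, and the dictionary of mappings keyed by identifier are hypothetical.

```python
import xml.etree.ElementTree as ET


def select_mapping(xml_text, mappings):
    """Pick a mapping for a document.

    mappings is a hypothetical dict: identifier value -> mapping object.
    Returns the mapping whose identifier equals the root element name,
    else the one registered as "Default", else None.
    """
    root_tag = ET.fromstring(xml_text).tag
    if root_tag in mappings:
        return mappings[root_tag]
    return mappings.get("Default")


def plain_text(xml_text):
    """Fallback when no mapping applies: text content with tags dropped."""
    return "".join(ET.fromstring(xml_text).itertext())


if __name__ == "__main__":
    mappings = {"doc": "doc-mapping", "Default": "default-mapping"}
    print(select_mapping("<doc><p>text</p></doc>", mappings))
    print(select_mapping("<report><p>text</p></report>", mappings))
```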

If you want to extract information that is only contained in relevant parts of a document, while ignoring irrelevant parts, simply specify which XML elements in the documents contain relevant information. This is referred to as content extraction. For example, you can extract the input specified in the title and body elements, while ignoring the input in author, date, ID, and publisher.
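
As a rough illustration of content extraction, the following Python sketch keeps the text of the title and body elements and drops everything else. The function names and the sample document are assumptions made for this example; the product performs this step from the mapping file, not from code like this.

```python
import xml.etree.ElementTree as ET


def _collect(element, relevant_tags, parts):
    """Append the text of relevant elements; skip all other content."""
    if element.tag in relevant_tags:
        text = "".join(element.itertext()).strip()
        if text:
            parts.append(text)
        return  # do not descend, so nested relevant elements are not duplicated
    for child in element:
        _collect(child, relevant_tags, parts)


def extract_content(xml_text, relevant_tags):
    """Return the text of the relevant elements, one element per line."""
    parts = []
    _collect(ET.fromstring(xml_text), set(relevant_tags), parts)
    return "\n".join(parts)


if __name__ == "__main__":
    sample = (
        "<doc><title>Brake noise</title><author>A. Smith</author>"
        "<date>2009-01-15</date><body>Squeal when braking.</body></doc>"
    )
    print(extract_content(sample, ["title", "body"]))
```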

Content extraction can improve analysis processing for the following types of XML documents:

Documents that contain large quantities of content that is not subject to analysis, for example, binary attachments. Using content extraction reduces the document size significantly, which speeds up processing and avoids analysis errors that stem from unsuitable data.
Documents in which document text is interspersed with irrelevant text, for example, documents that contain editorial information within <note> tags. Ignoring this information leads to better results when analyzing the document content.

Native XML search and the content extraction options in the XML elements to the common analysis structure mapping are mutually exclusive, because either all content or only specified content can be considered. If you specify content extraction, native XML search is ignored. Without content extraction, you can use both XML elements to common analysis structure mapping and native XML search.

All the types and features that you use in your configuration file must be described in the type system description of your custom analysis steps. You can create a type system descriptor in your UIMA environment by using the Component Descriptor Editor Eclipse plug-in. This plug-in allows you to create a descriptor file without needing to know about the necessary XML syntax.
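
Before packaging, it can help to check that every type and feature named in a mapping is declared in the type system descriptor. The sketch below assumes the standard Apache UIMA type system descriptor layout (typeDescription and featureDescription elements in the http://uima.apache.org/resourceSpecifier namespace). The sample descriptor, the com.example type names, and the shape of the "used" dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by Apache UIMA descriptor files.
UIMA_NS = {"u": "http://uima.apache.org/resourceSpecifier"}

# Hypothetical descriptor declaring one annotation type with one feature.
SAMPLE_DESCRIPTOR = """
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <types>
    <typeDescription>
      <name>com.example.Customer</name>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>name</name>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
"""


def declared_types(descriptor_xml):
    """Return {type name: set of feature names} declared in the descriptor."""
    root = ET.fromstring(descriptor_xml)
    declared = {}
    for type_desc in root.iterfind(".//u:typeDescription", UIMA_NS):
        name = type_desc.findtext("u:name", default="", namespaces=UIMA_NS).strip()
        features = {
            feature.findtext("u:name", default="", namespaces=UIMA_NS).strip()
            for feature in type_desc.iterfind("u:features/u:featureDescription", UIMA_NS)
        }
        declared[name] = features
    return declared


def missing_declarations(used, declared):
    """List types, and type:feature pairs, that are used but not declared."""
    missing = []
    for type_name, features in used.items():
        if type_name not in declared:
            missing.append(type_name)
            continue
        for feature in features - declared[type_name]:
            missing.append(f"{type_name}:{feature}")
    return sorted(missing)


if __name__ == "__main__":
    used = {"com.example.Customer": {"name", "city"}, "com.example.Title": set()}
    print(missing_declarations(used, declared_types(SAMPLE_DESCRIPTOR)))
```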

After you have built and tested the custom analysis, use the UIMA PEAR (Processing Engine Archive) generation wizard to create an archive that contains the custom analysis files including the type system description. Then, you can upload the custom analysis archive and your XML elements to the common analysis structure mapping files by using the administration console.