XML markup in analysis and search

You can map information in XML structures that are in a document directly to a common analysis structure without writing a UIMA annotator.

If the documents in your collection are in XML and you want to use native XML search to query the documents, you do not need to create a mapping configuration file. You can enable native XML search by configuring parsing options in the administration console.

If the documents in your collection are in XML and you want to exploit the XML markup during text analysis or semantic search, you can map XML elements to the common analysis structure. You might want to map XML elements in the following cases:

In these cases, you must create an XML elements to the common analysis structure mapping file that defines which XML elements map to which feature structures. The feature structures that you define in the mapping file are created when the documents are parsed, and the custom annotators can then access them.
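To illustrate the idea of turning XML elements into feature structures at parse time, the following Python sketch walks a document, collects its text, and records one annotation per mapped element. It is not the product's parser or mapping file format. The Annotation class, the map_elements function, and the com.example type names are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for a feature structure that covers a text span."""
    type_name: str
    begin: int
    end: int


def map_elements(xml_text, element_to_type):
    """Return (document_text, annotations) for the mapped XML elements.

    element_to_type is an assumed, simplified mapping from element name to
    type name; a real mapping file carries more information than this.
    """
    chunks = []
    annotations = []
    pos = 0  # current offset in the collected document text

    def emit(text):
        nonlocal pos
        if text:
            chunks.append(text)
            pos += len(text)

    def visit(elem):
        begin = pos
        emit(elem.text)
        for child in elem:
            visit(child)
            emit(child.tail)
        if elem.tag in element_to_type:
            annotations.append(Annotation(element_to_type[elem.tag], begin, pos))

    visit(ET.fromstring(xml_text))
    return "".join(chunks), annotations


text, annotations = map_elements(
    "<doc><title>Hello</title><body>World</body></doc>",
    {"title": "com.example.Title", "body": "com.example.Body"},
)
```

In this sketch, each annotation stores the begin and end offsets of the element's text, which is the information a downstream annotator would need to locate the marked-up span.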

You can use more than one XML elements to the common analysis structure mapping file for a document collection. Which mapping file is used for which XML document is determined by the <identifier> element. The <identifier> element in the mapping file must match the root element in the XML document. For example, if the root element of your document is doc, the value of the <identifier> element in the mapping file must also be "doc".

If no match is found, the program searches for a mapping file with the <identifier> element set to Default. If no default mapping is found, only the textual sections of the document (with no tag information) are mapped to the document annotation in the common analysis structure.
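The selection rules above can be sketched in Python as follows. This is a simplified model rather than the product's implementation. The MAPPINGS dictionary, the file names in it, and both function names are assumptions made for this example, and the dictionary key stands in for the value of each mapping file's <identifier> element.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: <identifier> value -> mapping file name.
MAPPINGS = {
    "doc": "doc-mapping.xml",
    "Default": "default-mapping.xml",
}


def select_mapping(xml_text, mappings):
    """Pick the mapping whose identifier equals the document's root element.

    Falls back to the "Default" mapping, and returns None if neither exists.
    """
    root_name = ET.fromstring(xml_text).tag
    if root_name in mappings:
        return mappings[root_name]
    return mappings.get("Default")


def plain_text(xml_text):
    """Fallback when no mapping applies: the text content without tag information."""
    return "".join(ET.fromstring(xml_text).itertext())
```

For example, a document whose root element is doc selects doc-mapping.xml, and a document with any other root element selects default-mapping.xml. If the registry has no Default entry, select_mapping returns None and only the plain text would be used.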

If you want to extract information that is only contained in relevant parts of a document, while ignoring irrelevant parts, simply specify which XML elements in the documents contain relevant information. This is referred to as content extraction. For example, you can extract the input specified in the title and body elements, while ignoring the input in author, date, ID, and publisher.
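As a rough illustration of content extraction, the following Python sketch keeps the text of the title and body elements and skips everything else. It models the concept only. The extract_content function and the sample document are hypothetical, and the product configures extraction in the mapping file, not in code like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with relevant and irrelevant elements.
SAMPLE = (
    "<doc>"
    "<title>XML search</title>"
    "<author>J. Doe</author>"
    "<date>2010-01-01</date>"
    "<body>Mapping elements.</body>"
    "</doc>"
)


def extract_content(xml_text, relevant):
    """Return the text of each relevant element, in document order.

    Once a relevant element is found, its whole subtree is taken as one
    piece of text, so nested relevant elements are not reported twice.
    """
    def walk(elem):
        if elem.tag in relevant:
            yield "".join(elem.itertext()).strip()
            return
        for child in elem:
            yield from walk(child)

    return list(walk(ET.fromstring(xml_text)))


extracted = extract_content(SAMPLE, {"title", "body"})
```

Here the author and date elements contribute nothing to the extracted text, which mirrors the way content extraction limits analysis to the elements that you name.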

Content extraction can improve analysis processing for the following types of XML documents:

Native XML search and the content extraction options in the XML elements to the common analysis structure mapping are mutually exclusive, because either all content or only specified content can be considered. If you specify content extraction, native XML search is ignored. Without content extraction, you can use both the XML elements to the common analysis structure mapping and native XML search.

All the types and features that you use in your configuration file must be described in the type system description of your custom analysis steps. You can create a type system descriptor in your UIMA environment by using the Component Descriptor Editor Eclipse plug-in. With this plug-in, you can create a descriptor file without having to know the underlying XML syntax.

After you have built and tested the custom analysis, use the UIMA PEAR (Processing Engine Archive) generation wizard to create an archive that contains the custom analysis files including the type system description. Then, you can upload the custom analysis archive and your XML elements to the common analysis structure mapping files by using the administration console.