Creating an XML elements to the common analysis structure mapping file

In an XML to the common analysis structure mapping file, you can employ the full range of configuration options for mapping XML to UIMA data types.

About this task

The following example shows a sample XML document and an XML to the common analysis structure mapping file for it.

The sample police report has XML tags for the crime type, crime date, crime location, reporting officer, the police precinct where the officer is employed, suspect description, and abstract. This is followed by a body section. For example:
<report>
  <doc>
    <crimeType>Car theft</crimeType>
    <crimeDate>04/23/05 09:23 pm</crimeDate>
    <crimeLocation>27 Main Street, Brynston, Springfield, New Jersey</crimeLocation>
    <reportingOfficer rank="Lt">Jakob
      <lastName>Collins</lastName>
    </reportingOfficer>
    <policePrecinct>14th Precinct</policePrecinct>
    <suspectDescription>Male, dark haired, dark glasses, 
      blue jeans with dark, probably black, 
      jacket</suspectDescription>
    <abstract>A Mercedes CLK was stolen on 04/23/2005 from a parking
      lot in front of the Blue Lagoon restaurant on 
      27 Main Street, Brynston.(serial number: 32 2761 50871)</abstract>
    <body>A Mercedes CLK was stolen on 04/23/2005 from a parking 
      lot in front of the Blue Lagoon restaurant on 27 Main Street,
      Brynston.(serial number: 32 2761 50871)
      It has a black color and wide Michelin tires.
      Eyewitnesses in front of the restaurant saw two darkly dressed 
      males drive away in the car at high speed. The car was 
      found abandoned on Aliway Ave in Brooklyn. The fuel tank was empty.
      The seats were badly stained and the back seat was vandalized. 
      Nothing was stolen out of the car....</body>
  </doc> 
  <image>
    <!-- image of the crime scene as a base64-encoded string -->
  </image>
</report>
Based on the sample report, an XML to the common analysis structure mapping file might have the following structure. The sample uses the type system that is defined for the police report scenario.
<?xml version="1.0"?>
<xmlCasInitializerConfiguration
  xmlns="http://www.ibm.com/2005/uima/jedii_ci_xml">
  
  <identifier>Default</identifier>
  <description>Sample configuration</description> 

  <contentElements>
    <element>/report/doc</element>
  </contentElements>

  <elementToTypeMappings>
    <elementToTypeMapping>
      <element>//doc//reportingOfficer</element>
      <type>com.ibm.omnifind.types.Person</type>
      <featureValueAssignment>
        <feature>role</feature>
        <basicValue default="Reporting officer">
        </basicValue>
      </featureValueAssignment>
      <featureValueAssignment>
        <feature>gender</feature>
        <basicValue default="male" 
         useAttributeValue="sex"/>
      </featureValueAssignment>
      <featureValueAssignment>
        <feature>surName</feature>
        <values concatenate="true" delimiter=" ">
          <basicValue useAttributeValue="rank" default="Lt"/>
          <basicValue useElementContent="lastName"/>
        </values>
      </featureValueAssignment>
    </elementToTypeMapping>
    <elementToTypeMapping>
      <element>//doc</element>
      <type>com.ibm.omnifind.types.PoliceReport</type>
      <featureValueAssignment>
        <feature>crimeDescription</feature>
        <basicValue useElementContent="abstract" trim="true">
        </basicValue>
      </featureValueAssignment>
    </elementToTypeMapping>
  </elementToTypeMappings>

</xmlCasInitializerConfiguration>
The mapping file is split into two sections:
<contentElements> element
Use this element to restrict processing to specific content. The sample mapping file extracts the content in the <doc> section of a document and ignores the other sections. In the XML police report, the image might be large and is of no use for text processing. Because <doc> is specified as a content element and <image> is not, the image is filtered out before any text processing begins.
<elementToTypeMappings> element
Use this element to specify which XML elements in the document map to which feature structures in the common analysis structure. Each mapping is specified in an <elementToTypeMapping> element.

If you use the content extraction option, the XML elements that are specified in the <elementToTypeMappings> section must be contained within the XML elements that are specified in the <contentElements> section.
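
To make the effect of a content element concrete, the following Python sketch mimics the filtering on a shortened version of the sample report. It is only an illustration that uses the standard library. The function name extract_content and the shortened report are assumptions, and the product's parser does not necessarily work this way.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the sample police report.
REPORT = """<report>
  <doc>
    <crimeType>Car theft</crimeType>
    <abstract>A Mercedes CLK was stolen.</abstract>
  </doc>
  <image>aGVsbG8=</image>
</report>"""


def extract_content(xml_text, content_paths):
    """Collect the non-blank text found under each absolute path such as /report/doc."""
    root = ET.fromstring(xml_text)
    pieces = []
    for path in content_paths:
        steps = path.strip("/").split("/")
        if steps[0] != root.tag:
            continue  # the path does not start at this document's root element
        # ElementTree paths are relative to the root, so drop the first step.
        nodes = [root] if len(steps) == 1 else root.findall("/".join(steps[1:]))
        for node in nodes:
            pieces.extend(t.strip() for t in node.itertext() if t.strip())
    return pieces


content = extract_content(REPORT, ["/report/doc"])
print(content)
```

Only text from inside <doc> is returned. The base64 image data is never passed on to later processing.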

Procedure

To create an XML to the common analysis structure mapping file:

  1. Create an XML file. To avoid XML syntax errors, use an XML editor or XML authoring tool to validate the XML. The XSD schema for the mapping file, XMLCasInitSchema.xsd, is in the ES_INSTALL_ROOT/configurations/parserservice/jediidata directory.
  2. Include your mappings in an <xmlCasInitializerConfiguration xmlns="http://www.ibm.com/2005/uima/jedii_ci_xml"> element. The namespace (specified in the xmlns attribute) must be exactly as shown.
  3. Add a <contentElements> element if you want to extract specific content from sections in the document, and an <elementToTypeMappings> element that specifies which individual XML elements in the document you want to map to which feature structures in the common analysis structure.
  4. Add an <identifier> element and a <description> element. The identifier determines which mapping is used for which XML document. The identifier must contain the name of the root element of the document, such as doc. If the identifier is set to Default, the root element of the document is irrelevant and the mapping is applied to any XML document.
  5. Add a <contentElements> element if you want to extract information that is contained only in relevant parts of a document. It has the following component element:
    • One or more <element> elements, each of which contains the path of an XML element in the document and follows XPath syntax, for example <element>/doc/crimeType</element>.
  6. Add an <elementToTypeMappings> element if you want to specify which XML elements in the document to map to which feature structures in the common analysis structure. It has the following component elements:
    • One or more <elementToTypeMapping> elements. This element must have the following nested elements:
      • An <element> element that specifies the path of an XML element and follows XPath syntax. A leading forward slash (/) means that a full path is given. For example, /doc/abstract means abstract under the root element doc. Two forward slashes (//) mean any path subset. For example, //reportingOfficer//birthDate means that birthDate must occur within reportingOfficer, although other elements can occur between these two.
      • A <type> element, which specifies a type that is defined in the type system description. It must be of type Annotation.
      • Zero or more <featureValueAssignment> elements.
  7. In a <featureValueAssignment> element, name a feature of type String in the <feature> element and assign a value in the <basicValue> element. Multiple <basicValue> elements can be grouped within a <values> element.

    The <basicValue> element can have attributes. These include useAttributeValue, useElementContent, default, and trim.
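
Before you upload a mapping file, a quick structural check can catch the most common mistakes from the preceding steps. The following Python sketch is an illustration only and does not replace validation against XMLCasInitSchema.xsd. The function check_mapping_file is hypothetical, and it tests only well-formedness, the root element and namespace, and the nested elements that the procedure describes.

```python
import xml.etree.ElementTree as ET

NS = "http://www.ibm.com/2005/uima/jedii_ci_xml"


def q(name):
    """Return the namespace-qualified tag name that ElementTree uses."""
    return "{%s}%s" % (NS, name)


def check_mapping_file(xml_text):
    """Return a list of problems found in a mapping file; an empty list means none."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return ["not well-formed: %s" % err]
    if root.tag != q("xmlCasInitializerConfiguration"):
        return ["root element or namespace is not as expected"]
    problems = []
    for name in ("identifier", "description"):
        if root.find(q(name)) is None:
            problems.append("missing <%s>" % name)
    for mapping in root.iter(q("elementToTypeMapping")):
        for name in ("element", "type"):
            if mapping.find(q(name)) is None:
                problems.append("<elementToTypeMapping> without <%s>" % name)
    return problems


GOOD = """<xmlCasInitializerConfiguration xmlns="http://www.ibm.com/2005/uima/jedii_ci_xml">
  <identifier>Default</identifier>
  <description>Sample configuration</description>
  <elementToTypeMappings>
    <elementToTypeMapping>
      <element>//doc</element>
      <type>com.ibm.omnifind.types.PoliceReport</type>
    </elementToTypeMapping>
  </elementToTypeMappings>
</xmlCasInitializerConfiguration>"""

print(check_mapping_file(GOOD))
print(check_mapping_file(GOOD.replace("jedii_ci_xml", "other")))
```

The second call changes the namespace, so the check reports that the root element or namespace is not as expected.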

Example

Use useAttributeValue if you want to use the value of an attribute as the value for a feature. For example:
<elementToTypeMapping>
  <element>/doc//reportingOfficer</element>
  <type>com.ibm.omnifind.types.Person</type>
  <featureValueAssignment>
    <feature>role</feature>
    <basicValue default="Reporting officer"/>
  </featureValueAssignment>
  <featureValueAssignment>
    <feature>gender</feature>
    <basicValue default="male" useAttributeValue="sex"/>
  </featureValueAssignment>
</elementToTypeMapping>
This example results in the following output:
  • For each <reportingOfficer> XML tag that occurs somewhere within a <doc> XML tag in the document, a feature structure of type com.ibm.omnifind.types.Person is created.
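  • If the <reportingOfficer> tag contains an attribute sex, the feature gender of the newly created feature structure is set to the value of the attribute. Otherwise, gender is set to the default value male.
  • The feature role is always set to the default value Reporting officer, because no attribute or element content is specified for it.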
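
The fallback from attribute value to default can be sketched in Python as follows. This is an illustration of the described behavior, not the product's code. The helper basic_value and its parameter names are assumptions that mirror the attributes of the <basicValue> element.

```python
import xml.etree.ElementTree as ET


def basic_value(element, use_attribute_value=None, default=None):
    """Return the named attribute's value if it is present, otherwise the default."""
    if use_attribute_value is not None and use_attribute_value in element.attrib:
        return element.attrib[use_attribute_value]
    return default


# The reporting officer from the sample report: it has a rank attribute but no sex attribute.
officer = ET.fromstring(
    '<reportingOfficer rank="Lt">Jakob<lastName>Collins</lastName></reportingOfficer>'
)

role = basic_value(officer, default="Reporting officer")
gender = basic_value(officer, use_attribute_value="sex", default="male")
print(role, gender)
```

Because the sample element has no sex attribute, both features receive their default values.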
Use the attribute useElementContent to add content as the value of a feature. For example, in the following mapping snippet:
<elementToTypeMapping>
  <element>//doc</element>
  <type>com.ibm.omnifind.types.PoliceReport</type>
  <featureValueAssignment>
    <feature>crimeDescription</feature>
    <basicValue useElementContent="abstract" trim="true"/>
  </featureValueAssignment>
</elementToTypeMapping>
the text covered by the element <abstract> in <doc> becomes the value of the feature crimeDescription. Because trim is set to true, all leading and trailing blanks are removed.
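
As a rough illustration of useElementContent and trim, the following Python sketch reads the text covered by a child element and optionally strips the surrounding blanks. The helper element_content and the shortened <doc> element are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def element_content(parent, child_name, trim=False):
    """Return the text covered by the first child with the given name, or None."""
    child = parent.find(child_name)
    if child is None:
        return None
    text = "".join(child.itertext())
    return text.strip() if trim else text


# Hypothetical, shortened <doc> element with blanks around the abstract text.
doc = ET.fromstring("<doc><abstract>  A Mercedes CLK was stolen.  </abstract></doc>")

trimmed = element_content(doc, "abstract", trim=True)
untrimmed = element_content(doc, "abstract")
print(repr(trimmed))
print(repr(untrimmed))
```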
More than one value can be specified within the <values> element in the following cases:
  • The feature to be set is of type StringArray.
  • Several strings are concatenated into one string by using the delimiter attribute and therefore map to a feature of type String. For example, the title Mr. is a constant, the rank is the value of an attribute, and the last name is covered by an XML element:
    <elementToTypeMapping>
      <element>//doc//reportingOfficer</element>
      <type>com.ibm.omnifind.types.Person</type>
      <featureValueAssignment>
        <feature>surName</feature>
        <values concatenate="true" delimiter=" ">
          <basicValue default="Mr."/>
          <basicValue useAttributeValue="rank" 
             default="Lt."/>
          <basicValue useElementContent="lastName"/>
        </values>
      </featureValueAssignment>
    </elementToTypeMapping>
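
The two cases can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation. The helper combine is hypothetical, and skipping values that are not set is an assumption about how missing values are handled.

```python
import xml.etree.ElementTree as ET


def combine(values, concatenate=False, delimiter=""):
    """Join the available values into one string, or keep them as a list."""
    present = [v for v in values if v is not None]  # assumption: unset values are skipped
    return delimiter.join(present) if concatenate else present


officer = ET.fromstring(
    '<reportingOfficer rank="Lt">Jakob<lastName>Collins</lastName></reportingOfficer>'
)

values = [
    "Mr.",                         # constant from default="Mr."
    officer.get("rank", "Lt."),    # useAttributeValue="rank" with default="Lt."
    officer.findtext("lastName"),  # useElementContent="lastName"
]

sur_name = combine(values, concatenate=True, delimiter=" ")  # feature of type String
as_array = combine(values)                                   # feature of type StringArray
print(sur_name)
print(as_array)
```

With the sample reporting officer, the concatenated value is "Mr. Lt Collins", because the rank attribute is present and its default is not needed.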

String feature values are extracted from the mapping file as is. The values retain any leading or trailing blanks. However, names of types and features are trimmed of any blanks. For example, <type> com.ibm.omnifind.types.Person </type> becomes <type>com.ibm.omnifind.types.Person</type>.

Set conditions on attributes by using the <condition> element. For example, the feature structure of type com.ibm.omnifind.types.Person is created only if <suspectDescription> occurs in the document with attribute armed set to yes:
<elementToTypeMapping>
  <element>//suspectDescription</element>
  <type>com.ibm.omnifind.types.Person</type>
  <condition attribute="armed" value="yes"/>
</elementToTypeMapping>
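
A condition of this kind amounts to an attribute comparison, as the following Python sketch shows. The helper condition_holds and the two sample elements are assumptions for illustration. An exact string comparison is also an assumption, because the matching rules are not described here.

```python
import xml.etree.ElementTree as ET


def condition_holds(element, attribute, value):
    """Return True only if the element carries the attribute with exactly this value."""
    return element.get(attribute) == value


armed = ET.fromstring('<suspectDescription armed="yes">Male, dark haired</suspectDescription>')
unmarked = ET.fromstring("<suspectDescription>Male, dark haired</suspectDescription>")

armed_matches = condition_holds(armed, "armed", "yes")
unmarked_matches = condition_holds(unmarked, "armed", "yes")
print(armed_matches, unmarked_matches)
```

The <suspectDescription> element in the sample police report has no armed attribute, so under this reading the condition would not be met for it.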
Based on the sample police report and the defined mapping file, the following feature structures are created:
com.ibm.omnifind.types.PoliceReport
  • covered text = "Car theft 04/23/05 09:23 pm 27 Main Street, Brynston, Springfield, New Jersey Jakob Collins 14th Precinct Male, dark haired, dark glasses, blue jeans with dark, probably black, jacket A Mercedes CLK was ... Nothing was stolen out of the car."
  • begin = 2
  • end = 904
  • knownSuspects = null
  • crimeDescription = "A Mercedes CLK was stolen on 04/23/2005 from a parking lot in front of the Blue Lagoon restaurant on 27 Main Street, Brynston.(serial number: 32 2761 50871)"
com.ibm.omnifind.types.Person
  • covered text = "Jakob Collins"
  • begin = 112
  • end = 127
  • role = "Reporting officer"
  • firstName = null
  • surName = "Lt Collins"
  • gender = "male"

What to do next

After you create the mapping file, you must upload it to the Watson Explorer Content Analytics server. In the administration console, select the XML to the common analysis structure mapping file with your other custom analysis selections.