XML SAX Parser

Directory Integrator, Version 7.1.1

The XML SAX Parser is based on the Apache Xerces library. It is used for reading large XML documents that the DOM-based XML Parser cannot handle because of memory constraints. It extracts the data enclosed within the 'Group tag' supplied in the configuration and creates an Entry with the attributes present in that data.

You can specify multiple group tags by separating the tag names with commas. This causes the SAX parser to break on any of the tags specified. With multiple group tags, the SAX parser uses a first-in-wins approach: the group tag that is encountered first is the tag that closes the group. For example, if A and B are group tags and the document has a structure where B is a child of A, then A is the tag that closes the entry (A is found before B and thus takes precedence).

Once a group tag has been found, then any nested occurrence of group tags will have no effect on the current Entry.

If no group tags have been defined, the entire XML document will be returned as a single Entry.
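
The group tag rules above can be illustrated with a small sketch. It is written in Python with the standard xml.sax module rather than with Xerces, and it is not the parser's actual implementation; the names GroupSplitter and split_on_group_tags are hypothetical. It only records which tag closed each entry, so that the first-in-wins rule and the "no group tags" case are visible.

```python
import xml.sax


class GroupSplitter(xml.sax.ContentHandler):
    """Illustrative only: records which group tag closes each entry."""

    def __init__(self, group_tags):
        super().__init__()
        self.group_tags = set(group_tags)
        self.open_group = None  # group tag that opened the current entry
        self.depth = 0          # nesting depth of that same tag name
        self.closed_by = []     # one item per entry: the tag that closed it

    def startElement(self, name, attrs):
        if self.open_group is None:
            # With no group tags configured, the first tag seen (the root)
            # opens a single entry that spans the whole document.
            if name in self.group_tags or not self.group_tags:
                self.open_group = name
                self.depth = 1
        elif name == self.open_group:
            # A nested occurrence of the open group tag is an ordinary tag.
            self.depth += 1
        # Any other group tag inside an open entry has no effect.

    def endElement(self, name):
        if name == self.open_group:
            self.depth -= 1
            if self.depth == 0:
                self.closed_by.append(name)
                self.open_group = None


def split_on_group_tags(xml_text, group_tags):
    """Return the closing tag of every entry found in xml_text."""
    handler = GroupSplitter(group_tags)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.closed_by
```

In this sketch, a document such as <R><A><B>x</B></A><B>y</B></R> with group tags A and B yields two entries: the first is closed by A (the nested B is ignored), and the second by the stand-alone B.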

The entry attribute name is composed of the surrounding tag names with "@" as the separator. The name of an XML attribute is appended to the name of its tag with "#" as the separator. For example, consider the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<DocRoot>
	<Entry>
		<Company>
			<Name incorporated="yes">IBM Corporation</Name>
			<Country>USA</Country>
		</Company>
	</Entry>
	<Entry>
		<Company>
			<Name incorporated="no">Smith Brothers</Name>
			<Country>USA</Country>
		</Company>
	</Entry>
</DocRoot>

Using "Entry" as the Group Tag, the above XML document yields two entries as follows:

Entry 1

Attribute name: DocRoot@Entry@Company@Name 
Attribute value: IBM Corporation  
Attribute name: DocRoot@Entry@Company@Name#incorporated 
Attribute value: yes 
Attribute name: DocRoot@Entry@Company@Country
Attribute value: USA

Entry 2

Attribute name: DocRoot@Entry@Company@Name
Attribute value: Smith Brothers
Attribute name: DocRoot@Entry@Company@Name#incorporated
Attribute value: no
Attribute name: DocRoot@Entry@Company@Country
Attribute value: USA
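
The naming scheme shown above can be reproduced with a short sketch. It uses Python's standard xml.sax module instead of Xerces and is only an approximation of the behavior described here; EntryBuilder and parse_entries are hypothetical names. As a simplification, each entry is a plain dictionary, so a repeated tag overwrites the earlier value instead of producing a multi-valued attribute.

```python
import xml.sax


class EntryBuilder(xml.sax.ContentHandler):
    """Illustrative only: builds one dict per occurrence of the group tag."""

    def __init__(self, group_tag):
        super().__init__()
        self.group_tag = group_tag
        self.path = []        # tag names from the root down to the current tag
        self.text = []        # character data collected for the current tag
        self.current = None   # entry being built, or None outside a group
        self.group_level = 0  # length of path when the group was opened
        self.entries = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self.current is None and name == self.group_tag:
            self.current = {}
            self.group_level = len(self.path)
        if self.current is not None:
            tag_path = "@".join(self.path)
            # XML attributes become "<tag path>#<attribute name>".
            for attr_name in attrs.getNames():
                self.current[tag_path + "#" + attr_name] = attrs.getValue(attr_name)

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.current is not None:
            if value:
                self.current["@".join(self.path)] = value
            # Only the tag that opened the group closes the entry.
            if len(self.path) == self.group_level:
                self.entries.append(self.current)
                self.current = None
        self.text = []
        self.path.pop()


def parse_entries(xml_text, group_tag):
    """Return a list of dicts, one per entry enclosed by group_tag."""
    handler = EntryBuilder(group_tag)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.entries
```

Calling parse_entries on the sample document with "Entry" as the group tag returns two dictionaries whose keys follow the DocRoot@Entry@Company@... pattern listed above.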

The attribute name can be shortened by specifying a 'Remove Prefix' value in the configuration. For example, a 'Remove Prefix' value of "DocRoot@Entry@Company" in the above example results in an Entry containing attributes such as:

Attribute name: Name 
Attribute value: IBM Corporation
Attribute name: Name#incorporated
Attribute value: yes
Attribute name: Country
Attribute value: USA
...
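
The effect of 'Remove Prefix' can be sketched as a simple renaming step. The function below is a hypothetical Python helper, not part of the product. It assumes that the "@" separator following the prefix is removed together with the prefix (as the example above suggests) and that names not starting with the prefix are left unchanged.

```python
def remove_prefix(entry, prefix, separator="@"):
    """Return a copy of entry with prefix (and the separator that follows it)
    stripped from every attribute name that starts with it."""
    full_prefix = prefix + separator
    shortened = {}
    for name, value in entry.items():
        if name.startswith(full_prefix):
            name = name[len(full_prefix):]
        shortened[name] = value
    return shortened


entry = {
    "DocRoot@Entry@Company@Name": "IBM Corporation",
    "DocRoot@Entry@Company@Name#incorporated": "yes",
    "DocRoot@Entry@Company@Country": "USA",
}
short_entry = remove_prefix(entry, "DocRoot@Entry@Company")
```

Here short_entry has the attribute names Name, Name#incorporated and Country, with the values unchanged.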

When the Connector is initialized, the XML Parser tries to perform Document Type Definition (DTD) verification if a DTD tag is present. The parser reads multi-valued attributes, although only one of the values is shown when browsing the data in the Schema tab.

If the XML file has nested entry tags, all Entry tags enclosed within the outermost Entry tag are treated as normal XML tags. For example:

<entry>
	<entry>
		<company>IBM</company>
	</entry>
</entry>

Here the entry will contain the following attribute:

attribute name: entry@entry@company
attribute value: IBM

Configuration

Group Tag: The name(s) of the XML group tag(s) that enclose entries. Specify multiple tags by separating the tag names with commas. If this parameter is not specified, the root tag is used and the entire XML document is returned as a single Entry.
Remove prefix: Specify the prefix to remove from the attribute names.
Ignore attributes: Asks the parser to ignore attributes of the group tag and its children.
Character Encoding: Character Encoding to be used; the default is UTF-8. Also see Character encoding.
Document Validation: Checking this field requests validation of the file against the DTD or XML Schema used.
Use XSD Validation: If this field is checked, XSD is used instead of DTD to validate the XML file.
Namespace Aware: Checking this field requests a namespace-aware XML parser.
Read Timeout: The time, in seconds, after which the parser stops if no data is received.
Detailed Log: If this field is checked, additional log messages are generated.

Character encoding

The default and recommended Character Encoding to use when deploying the XML SAX Parser is UTF-8. This will preserve data integrity of your XML data in most cases. When you are forced to use a different encoding, the Parser will handle the various encodings in the following way:

When reading a file, the parser looks for the encoding in the following order:

1. If the parser's CharacterSet config parameter is set and is not set to UTF-8, the encoding is set to the value specified in this parameter. However, when the encoding specified is UTF-32 or UTF-16, check #2 is still attempted and, if successful, overrides this check.
2. The XML you are parsing is checked for an encoding attribute in the XML declaration. If the encoding attribute is found, its value is used.
3. If none of the above applies, the default encoding of the JRE is used (normally UTF-8).
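
The lookup order above can be expressed as a small decision function. This is a Python sketch of one reading of these rules, not the parser's actual code. The name resolve_encoding, the regular expression used to find the XML declaration, and the jre_default argument (which stands in for the JRE default encoding) are all assumptions made for illustration.

```python
import re

# Matches the encoding attribute of an XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?>
_DECLARED_ENCODING = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def resolve_encoding(configured, xml_head, jre_default="UTF-8"):
    """Pick an encoding from the configured value, the XML declaration
    found in xml_head (the first characters of the document), or the default."""
    match = _DECLARED_ENCODING.search(xml_head)
    declared = match.group(1) if match else None

    # Check 1: an explicit, non-UTF-8 configuration value is used ...
    if configured and configured.upper() != "UTF-8":
        # ... unless it is UTF-16 or UTF-32 and check 2 succeeds.
        if configured.upper() in ("UTF-16", "UTF-32") and declared:
            return declared
        return configured

    # Check 2: the encoding attribute of the XML declaration.
    if declared:
        return declared

    # Check 3: fall back to the platform default.
    return jre_default
```

For example, under this reading a configured value of ISO-8859-1 is used even if the XML declaration names UTF-8, while a configured value of UTF-16 gives way to an encoding named in the declaration.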
