How does WebSphere Message Broker (WMB) handle the XML message encoding?
Are there any recommendations on how to properly handle the encoding of XML messages?
You have an application that creates an XML message and sets the XML prolog encoding to "ISO-8859-15".
<?xml version="1.0" encoding="ISO-8859-15" standalone ="yes"?>
The application sets the MQMD Format field to MQSTR, and puts the message to a WebSphere MQ queue that is read by a message flow.
Your message broker is running on a Solaris 8 server, and has a locale of "iso_8859_1". The broker's queue manager was created with a default locale of "iso_8859_1". The MQInput node Convert checkbox is checked. The CodedCharSetId and Encoding are defaulted to the queue manager settings.
The broker's default XML parser is used to parse the XML message. At this stage, the XML parser sees an encoding of "ISO-8859-15" in the prolog and parses the XML message as such. However, the MQGET will already have converted the data to "ISO-8859-1", thereby creating an encoding inconsistency.
When an XML document is written as the message buffer for a WebSphere MQ message, then, to WebSphere MQ, this is a buffer like any other. MQ considers the buffer as hexadecimal bytes and it is up to the application to give this data meaning.
Although MQ does not treat this as XML data, to maintain data consistency on the MQ transport, the CodedCharSetId, Encoding and Format must be provided so that the necessary conversions can take place. When specified correctly, these fields allow character data and numeric data to be identified in the buffer and converted correctly. When a buffer contains XML data, the Format of MQSTR is often used to indicate that the data is character data.
When a data conversion takes place on a WebSphere MQ queue manager, the CodedCharSetId, Encoding and Format fields of the message are used as the source parameters, and the details specified in the MQMD on the MQGET are used as the target information. The conversion is the same whether it takes place on a WebSphere MQ channel or on an MQGET from a WebSphere MQ application. In both cases an MQGET is issued with the MQGMO_CONVERT option, which requests that the queue manager present the message buffer in the target code page and encoding specified on the MQGET. When the queue manager performs this conversion, it does not use any information that is within the message buffer. So in this specific scenario, the queue manager does not treat the buffer as an XML message, and therefore does not use any encoding that is specified in the XML prolog.
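The point that conversion is driven only by the descriptor fields, and never by the buffer contents, can be illustrated with a minimal Python sketch. This is not the MQ API: `convert_buffer` is a hypothetical stand-in for the queue manager's conversion, and the Python codec names are assumed equivalents of the code pages involved.

```python
def convert_buffer(buffer: bytes, source_codec: str, target_codec: str) -> bytes:
    """Hypothetical stand-in for MQGMO_CONVERT on character data.

    Only the codec names passed by the caller are used; nothing inside
    the buffer (such as an XML prolog) is inspected or rewritten.
    """
    return buffer.decode(source_codec).encode(target_codec)


xml_text = ('<?xml version="1.0" encoding="ISO-8859-15" standalone ="yes"?>'
            '<name>Café</name>')

# Bytes as written by the sending application.
original = xml_text.encode("iso8859-15")

# Bytes after a conversion to ISO-8859-1 has been requested on the get.
converted = convert_buffer(original, "iso8859-15", "iso8859-1")

# The prolog is carried through as ordinary characters, so it still names
# ISO-8859-15 even though the bytes are now ISO-8859-1.
prolog_survives = b'encoding="ISO-8859-15"' in converted
```

After the conversion the prolog and the actual byte encoding disagree, which is exactly the situation described in the scenario.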
Now let us consider WebSphere MQ Integrator (WMQI) in this scenario. The message flow is the Message Broker application that gives meaning to the message buffer and treats it as an XML document. Before processing such a message buffer, the message flow must first retrieve the message from its input queue by using an input source such as an MQInput node. This node issues an MQGET on the named input queue and, if the call is successful, a message buffer is returned to the message flow. Like any other message buffer, this consists of hexadecimal bytes that are represented in the CodedCharSetId and Encoding given in the MQMD associated with the buffer.
If the application does not want to receive data in the CodedCharSetId that is specified by each individual input message, it can issue the MQGET with MQGMO_CONVERT and, in the MQMD of this MQGET, specify the CodedCharSetId and Encoding in which it requires the data. The MQInput node has three attributes (Convert, CodedCharSetId and Encoding) that can be set to request that WebSphere MQ perform the conversion before the data is returned by the MQGET. If the Convert checkbox is checked and no CodedCharSetId or Encoding is specified, these default to the code page of the broker's queue manager and MQENC_NATIVE respectively. Alternatively, the user can specify the CodedCharSetId and Encoding on the MQInput node, and the message buffer will be converted to these values on input. If an MQGET is issued with MQGMO_CONVERT and the message data is already in the code page being requested, the queue manager does not perform any conversion.
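The defaulting rules described above can be sketched as follows. This is an illustrative Python model only: `InputNodeSettings`, `conversion_target` and `needs_conversion` are hypothetical names, `MQENC_NATIVE` is kept as a symbolic placeholder because its numeric value depends on the platform, and the "no conversion needed" check is simplified to compare code pages only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

# Symbolic placeholder; the real numeric value is platform dependent.
MQENC_NATIVE = "MQENC_NATIVE"


@dataclass
class InputNodeSettings:
    """Hypothetical model of the three MQInput node conversion attributes."""
    convert: bool = False
    coded_char_set_id: Optional[int] = None
    encoding: Optional[Union[int, str]] = None


def conversion_target(settings: InputNodeSettings,
                      qmgr_ccsid: int) -> Optional[Tuple[int, Union[int, str]]]:
    """Return the (CodedCharSetId, Encoding) to request, or None if Convert is unchecked."""
    if not settings.convert:
        return None
    ccsid = (settings.coded_char_set_id
             if settings.coded_char_set_id is not None else qmgr_ccsid)
    encoding = settings.encoding if settings.encoding is not None else MQENC_NATIVE
    return ccsid, encoding


def needs_conversion(message_ccsid: int,
                     target: Optional[Tuple[int, Union[int, str]]]) -> bool:
    """Simplified check: conversion is skipped when the code pages already match."""
    return target is not None and message_ccsid != target[0]


# Convert checked, nothing else set: defaults to the queue manager's code page.
defaulted = conversion_target(InputNodeSettings(convert=True), qmgr_ccsid=819)

# Convert checked with explicit values.
explicit = conversion_target(InputNodeSettings(True, 500, 785), qmgr_ccsid=819)
```

With a queue manager code page of 819, `defaulted` is `(819, "MQENC_NATIVE")` and `explicit` is `(500, 785)`.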
However, in the majority of cases for message flows, it is not necessary to request that WebSphere MQ perform the conversion. This is because the message buffer is usually assigned an owning parser, which is responsible for creating a message tree from the bitstream. For consistency across the broker platforms, this message tree is created in Unicode, and therefore during parsing an automatic conversion takes place between the bitstream's code page and UTF-8.
Therefore, in these situations, performing a conversion on the MQInput node is not needed. For example, if an AIX broker were being used (CodedCharSetId=819) and an input message was received in code page 437, the owning parser would perform conversions 437 -> UTF-8 when constructing the message tree. If the user checked the Convert box on the MQInput node, this would request that the MQGET convert the data to code page 819, and the owning parser would then perform conversions 819 -> UTF-8. However, in this latter scenario the performance overhead of WebSphere MQ data conversion has been added. The user could also have checked the Convert box on the MQInput node and specified a CodedCharSetId and Encoding of 500 and 785 respectively. This would request that EBCDIC data be retrieved by the MQGET, so an EBCDIC bitstream would be presented to the parser, and the parser would perform conversions from code page 500 to UTF-8. In all cases, the user would be presented with the same fields in the message tree, and so adding this conversion on the MQInput node is unnecessary.
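The equivalence of the three paths in this example can be checked with a small Python sketch. It is an illustration under stated assumptions, not broker code: `mq_convert` and `parse_to_unicode` are hypothetical stand-ins for the MQGET conversion and the owning parser, and the codec table assumes Python's `cp437`, `cp500` and `iso8859-1` codecs correspond to code pages 437, 500 and 819.

```python
# Assumed mapping from CCSID to a Python codec name (illustrative only).
CCSID_TO_CODEC = {437: "cp437", 500: "cp500", 819: "iso8859-1"}


def mq_convert(buffer: bytes, source_ccsid: int, target_ccsid: int) -> bytes:
    """Stand-in for the conversion requested by the Convert checkbox."""
    text = buffer.decode(CCSID_TO_CODEC[source_ccsid])
    return text.encode(CCSID_TO_CODEC[target_ccsid])


def parse_to_unicode(buffer: bytes, ccsid: int) -> str:
    """Stand-in for the owning parser turning the bitstream into Unicode."""
    return buffer.decode(CCSID_TO_CODEC[ccsid])


# Input message as it arrives, in code page 437.
incoming = "<name>Café</name>".encode("cp437")

# Path 1: no conversion on input, parser converts 437 -> Unicode.
direct = parse_to_unicode(incoming, 437)

# Path 2: Convert checked with defaults, parser converts 819 -> Unicode.
via_819 = parse_to_unicode(mq_convert(incoming, 437, 819), 819)

# Path 3: Convert checked with CodedCharSetId 500, parser converts 500 -> Unicode.
via_500 = parse_to_unicode(mq_convert(incoming, 437, 500), 500)
```

All three variables hold the same Unicode text; the second and third paths only add an extra conversion step.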
However, if the user is manipulating the message in the BLOB domain, they might want the bitstream to be presented in a common code page. If this is the case, they can use the conversion on the MQInput node to achieve this. This is especially appropriate if the user has already written WebSphere MQ data conversion exits to convert their application structures, for which no modelling has been done in Message Broker.
From the description so far, it can be seen that the XML parser (Xerces) in the broker is presented with a bitstream in a given code page whether or not data conversion is used. The XML parser then attempts to parse this bitstream as an XML document and produce an XML message tree. This XML document could contain an encoding attribute in the XML prolog; however, this is not used to control the parsing in any way. This is because all data parsed by the XML parser is treated as character data in the code page of the bitstream, so an encoding named inside the document is not consulted.
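The effect of ignoring the prolog can be seen with a single byte whose meaning differs between the two code pages in this scenario. The sketch below is illustrative Python, not the broker's parser: `decode_for_parser` is a hypothetical helper, and the codec table assumes CCSID 819 corresponds to ISO-8859-1 and CCSID 923 to ISO-8859-15.

```python
# Assumed mapping from CCSID to a Python codec name (illustrative only).
CCSID_TO_CODEC = {819: "iso8859-1", 923: "iso8859-15"}

# Byte 0xA4 is the currency sign in ISO-8859-1 and the euro sign in ISO-8859-15.
payload = b'<?xml version="1.0" encoding="ISO-8859-15"?><price>10 \xa4</price>'


def decode_for_parser(buffer: bytes, mqmd_ccsid: int) -> str:
    """Decode using only the code page from the message descriptor.

    The encoding named in the XML prolog is deliberately not consulted.
    """
    return buffer.decode(CCSID_TO_CODEC[mqmd_ccsid])


as_819 = decode_for_parser(payload, 819)  # descriptor says ISO-8859-1
as_923 = decode_for_parser(payload, 923)  # descriptor says ISO-8859-15
```

With a descriptor code page of 819 the byte is read as the generic currency sign, regardless of the prolog; only a descriptor code page of 923 yields the euro sign.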
Therefore, when transporting XML messages in WebSphere MQ messages, we recommend that the MQMD CodedCharSetId and Encoding fields accurately reflect the contents of the XML message.
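On the sending side, one way to follow this recommendation is to derive the bytes and the descriptor code page from the same value. The following Python sketch is a hypothetical illustration rather than MQ client code: `build_message` and the dictionary it returns are invented for the example, and the table assumes CCSID 819 is ISO-8859-1, 923 is ISO-8859-15 and 1208 is UTF-8.

```python
# Assumed mapping from CCSID to a Python codec name (illustrative only).
CCSID_TO_CODEC = {819: "iso8859-1", 923: "iso8859-15", 1208: "utf-8"}


def build_message(xml_text: str, ccsid: int):
    """Encode the XML text and describe it with the same code page.

    Returns the payload bytes and a plain dict standing in for the
    descriptor fields the sending application would set.
    """
    payload = xml_text.encode(CCSID_TO_CODEC[ccsid])
    descriptor = {"CodedCharSetId": ccsid, "Format": "MQSTR"}
    return payload, descriptor


xml_text = '<?xml version="1.0" encoding="ISO-8859-15"?><price>10 €</price>'
payload, descriptor = build_message(xml_text, 923)

# A receiver that trusts only the descriptor recovers the original text.
round_trip = payload.decode(CCSID_TO_CODEC[descriptor["CodedCharSetId"]])
```

Because the payload and the descriptor are produced from one `ccsid` argument, they cannot drift apart the way they do in the scenario above.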