Improved mechanism to create and update XIT documents
Starting with IBM Content Collector V2.2 Fix Pack 2 and V3.0, the IBM FileNet P8 Connector creates and updates the XML Instance Text (XIT) document in the FileNet P8 data model for IBM Legacy Content Search Engine (formerly known as Verity or Autonomy) differently. This change improves resilience in the processing of duplicate email documents and of email documents that failed to be processed completely in a previous archiving attempt.
Re-collection of unique email documents
If the archiving task route processes an email that has no duplicates and processing for that document fails, the following objects exist in the FileNet P8 repository:
- A distinct email instance (DEI) object without content. This DEI object is not checked in.
- Possibly, an email instance (EI) object that references the DEI object.
When the failed item is re-collected, checking for duplicates in the P8 Create Document task shows that the DEI object already exists.
Before IBM Content Collector V2.2 Fix Pack 2, checks on the DEI were performed before an item was flagged as a duplicate on the task route. These checks included checking the DEI in and out to acquire a lock on the item across threads and, in scale-out mode, across machines. This made processing more complex and therefore more error-prone. Because no XIT object existed for a failed item that was flagged as a duplicate, the task responsible for handling the XIT perpetually timed out and blocked threads in the task route service, which prevented other items from being processed.
Now, the processing in the P8 Create Document task works as follows:
1. Check whether the DEI exists in the FileNet P8 repository. The identifier is the document ID.
2. If the DEI exists, check whether it has content.
- If the DEI has no content, flag it as unique.
- If the DEI has content, check whether an XIT exists on the DEI. If not, flag the DEI as unique.
With these processing steps, the DEI is never checked back out after it has been checked in with its content in the P8 Save Prepared Text as XML task, thus saving resources used for connecting to and sending and receiving data from FileNet P8 Content Engine.
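The decision steps above can be sketched in a few lines of Python. This is an illustration only, not the connector's implementation: the function name and its boolean inputs are hypothetical, and the outcomes for "no DEI exists" and "DEI has content and an XIT" are assumptions, because the steps above do not state them explicitly.

```python
def classify_dei(dei_exists: bool, has_content: bool = False, has_xit: bool = False) -> str:
    """Return "unique" or "duplicate" for an incoming email document.

    Hypothetical helper that mirrors the documented checks; the real task
    queries the FileNet P8 repository by document ID instead of taking flags.
    """
    # Assumption: if no DEI exists yet, the item is new and therefore unique.
    if not dei_exists:
        return "unique"
    # A DEI without content is left over from a failed archiving attempt.
    if not has_content:
        return "unique"
    # A DEI with content but without an XIT was not processed completely either.
    if not has_xit:
        return "unique"
    # Assumption: content and XIT both present means an earlier, complete archive.
    return "duplicate"
```

A failed item that is re-collected therefore takes the same path as a new item, for example `classify_dei(True, has_content=False)` yields `"unique"`.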
To handle the duplicate item processing that would have been prevented by the check in and check out performed by the P8 Create Document task, the P8 Save Prepared Text as XML task has been updated to detect duplicates.
Additionally, when the P8 Save Prepared Text as XML task creates an XIT, as opposed to updating one, it performs all of its operations in an atomic fashion. This removes the need to have the FileNet P8 Connector perform a cleanup in the event of an error and reduces the chance that malformed objects are left in the object store.
The P8 Save Prepared Text as XML task determines whether it is dealing with a duplicate based on the information returned by FileNet P8 Content Engine during the creation of an XIT. When a particular error is returned, the task switches from trying to create the XIT to trying to archive metadata that is used later to update an existing XIT.
The errors that the P8 Save Prepared Text as XML task monitors to determine if an email document is a duplicate are:
Of these errors, E_OBJECT_MODIFIED occurs most often. Seeing these errors in the FileNet P8 Content Engine server logs is normal behavior for the P8 Connector when it processes duplicates.
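The switch from creating an XIT to archiving update metadata can be pictured as a try-and-fall-back pattern. The following Python sketch rests on assumptions: `EngineError`, `save_prepared_text`, and the callables passed in are hypothetical stand-ins, not FileNet P8 Content Engine API names, and the error set contains only the one code that is named in this text.

```python
# Only E_OBJECT_MODIFIED is named in this text; any further codes are not assumed here.
DUPLICATE_ERROR_CODES = {"E_OBJECT_MODIFIED"}


class EngineError(Exception):
    """Hypothetical stand-in for an error returned by the content engine."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def save_prepared_text(doc_id, create_xit, archive_metadata):
    """Try to create the XIT; on a duplicate error, archive update metadata instead."""
    try:
        create_xit(doc_id)
        return "xit_created"
    except EngineError as err:
        if err.code in DUPLICATE_ERROR_CODES:
            # Another thread or node created the XIT first: treat the item as a duplicate.
            archive_metadata(doc_id)
            return "metadata_archived"
        raise  # Any other error is a real failure and is passed on.
```

The design point is that duplicate detection costs no extra round trip: the outcome of the create call itself carries the information.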
Re-collection of duplicate email documents
The only way to update document content in FileNet P8 is to check out the document, update the content, and check it back in. Updating the XIT in this way costs time and increases the chance of errors in a multithreaded or scale-out environment. In addition, the FileNet P8 Content Engine API does not allow the check-out and check-in operations to be part of a single transaction. Therefore, the IBM FileNet P8 Connector cannot make use of automatic rollback if an error occurs. Rolling back requires additional FileNet P8 Content Engine API calls and server operations, which might fail as well.
To address these problems, the following changes were made:
- A new transient data model component named ICCMailSearchUpdateAnnotation was added. This is an Annotation subclass.
When processing a duplicate, the P8 Save Prepared Text as XML task creates an instance of this annotation instead of modifying the XIT. The annotation contains the information required to update the XIT. The annotation instance is linked to the DEI.
- A maintenance task was added to the FileNet P8 Connector. This task retrieves the annotations and performs the XIT updates in an atomic and threadsafe manner. It deletes the annotations when the update has been completed successfully.
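The interplay of annotations and maintenance task can be modeled in memory as follows. This is a minimal sketch under stated assumptions: the data shapes (tuples of DEI ID and metadata, a dictionary of XIT contents) and the function name are hypothetical, and the sketch does not reproduce how the connector actually stores or applies the update information.

```python
from collections import defaultdict


def run_maintenance_pass(annotations, xits):
    """Apply pending annotation data to XITs and return the annotations that remain.

    annotations: list of (dei_id, metadata) tuples (hypothetical shape)
    xits: dict mapping dei_id to the list of metadata entries in its XIT
    """
    # Group annotations by DEI so that each XIT is rewritten once per pass.
    pending = defaultdict(list)
    for dei_id, metadata in annotations:
        pending[dei_id].append(metadata)

    remaining = []
    for dei_id, items in pending.items():
        if dei_id not in xits:
            # The update cannot be applied; keep the annotations for a later run.
            remaining.extend((dei_id, m) for m in items)
            continue
        # Build the new content first and swap it in with a single assignment.
        xits[dei_id] = xits[dei_id] + items
        # Successfully applied annotations are dropped, which models their deletion.
    return remaining
```

Because an annotation is removed only after its update succeeded, a failed pass loses no data; the same annotations are picked up again on the next run.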
By default, the ICCMailSearchUpdateAnnotation is added, if necessary, to the P8 object store automatically when the P8 Connector runs. The installation of this class requires administrative privileges on the P8 object store that is in use; if the P8 Connector user that is configured does not have the appropriate permissions, the installation of the annotation class fails and the P8 tasks cannot completely archive the email until the user manually installs the new component. The user must run the AFUComplianceInstaller.exe tool if using IBM Content Collector V2.2 Fix Pack 2 or the ICCComplianceInstaller.exe tool if using IBM Content Collector V3.0. The tools can be found under the ctms folder in the Content Collector installation directory and must be run on a command line with administrative privileges against the P8 object store that Content Collector is configured with. For example:
AFUComplianceInstaller.exe -username P8Administrator -password password -connection http://server:port/wsi/FNCEWS40MTOM/ -domain P8Domain -objectstore P8OS -datamodel email -version 2 -classpath "<icc_installation_root>\ctms"
ICCComplianceInstaller.exe -username P8Administrator -password password -connection http://server:port/wsi/FNCEWS40MTOM/ -domain P8Domain -objectstore P8OS -datamodel em -version 2 -classpath "<icc_installation_root>\ctms"
The maintenance task runs only on the primary node, which ensures that only one thread on one machine updates the content of any given XIT. To improve the rate at which XIT updates can be performed, the task uses a thread pool of initially 15 threads. The task runs against all IBM FileNet P8 object stores that are configured in IBM Content Collector Configuration Manager.
The maintenance task performs queries on each object store for ICCMailSearchUpdateAnnotation instances. If an object store does not contain any ICCMailSearchUpdateAnnotation instances or if it is not possible to connect to an object store, the object store is skipped, and the task proceeds with the next object store in the Content Collector configuration.
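The loop over object stores, including the skip conditions, might look like the following Python sketch. All names are hypothetical (`ObjectStoreUnavailable`, `process_object_stores`, and the two callables), and each work item is assumed to bundle the pending updates for one XIT so that no two threads touch the same XIT.

```python
from concurrent.futures import ThreadPoolExecutor


class ObjectStoreUnavailable(Exception):
    """Hypothetical error for an object store that cannot be reached."""


def process_object_stores(stores, fetch_work_items, update_xit, pool_size=15):
    """Visit each configured object store and update XITs with a thread pool.

    Returns the number of work items that were processed.
    """
    updated = 0
    for store in stores:
        try:
            work_items = fetch_work_items(store)
        except ObjectStoreUnavailable:
            continue  # Cannot connect: skip this object store.
        if not work_items:
            continue  # No pending annotations: skip this object store.
        with ThreadPoolExecutor(max_workers=pool_size) as pool:
            # map() blocks until all items of this store have been processed.
            updated += sum(1 for _ in pool.map(update_xit, work_items))
    return updated
```

The default of 15 workers matches the default pool size stated in this document; a failing `update_xit` call would surface as an exception here, whereas a production task would log it and continue.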
In a clustered scale-out environment with multiple groups of IBM Content Collector nodes and multiple primary nodes, the maintenance task might run against the same object store on different machines in parallel. By default, the maintenance task runs on each of these primary nodes. The task can detect any update conflicts that might occur when different primary nodes try to update the same document. In this case, a message is logged, and the task proceeds to the next item. However, handling update conflicts in this way requires monitoring exceptions and can become costly over time. As a best practice, enable the maintenance task on the nodes in one cluster only and disable the task on all nodes (primary and extension nodes) in the other clusters.
By default, the maintenance task is enabled, has a thread pool size of 15 threads, and runs every 15 minutes on an interval schedule. To change these settings for IBM Content Collector V2.2 Fix Pack 2:
1. Close IBM Content Collector Configuration Manager.
2. Navigate to the \ctms\ADF subdirectory of the Content Collector installation directory and edit the P84x.adf file. Change the values set for the <enableMaintenanceTask>, <maintenanceTaskThreadPoolSize>, and <xitScheduleIntervalMinutes> elements:
   - <enableMaintenanceTask> can be '0' (disabled) or '1' (enabled).
   - <maintenanceTaskThreadPoolSize> can be any integer value greater than '0'.
   - <xitScheduleIntervalMinutes> can be any integer value greater than '0'.
3. Save the P84x.adf file.
4. Start the Configuration Manager.
5. Go to the IBM FileNet P8 Connector configuration section and resave the connector settings. For example, change the log settings to a different value and then back to the original value to activate the save button.
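Before saving the file, it can help to check the edited values against the allowed ranges. The following Python sketch does that for an XML fragment; the `<settings>` wrapper element and the function name are hypothetical, because the actual layout of P84x.adf is not shown in this document. Only the three element names and their value ranges are taken from the steps above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the real P84x.adf structure around these elements may differ.
SAMPLE = """
<settings>
  <enableMaintenanceTask>1</enableMaintenanceTask>
  <maintenanceTaskThreadPoolSize>15</maintenanceTaskThreadPoolSize>
  <xitScheduleIntervalMinutes>15</xitScheduleIntervalMinutes>
</settings>
"""


def read_maintenance_settings(xml_text):
    """Parse the three maintenance task elements and validate their values."""
    root = ET.fromstring(xml_text)

    def text_of(tag):
        node = root.find(f".//{tag}")
        if node is None or node.text is None:
            raise ValueError(f"missing <{tag}>")
        return node.text.strip()

    enabled = text_of("enableMaintenanceTask")
    if enabled not in ("0", "1"):
        raise ValueError("enableMaintenanceTask must be 0 or 1")
    pool_size = int(text_of("maintenanceTaskThreadPoolSize"))
    interval = int(text_of("xitScheduleIntervalMinutes"))
    if pool_size <= 0 or interval <= 0:
        raise ValueError("pool size and interval must be greater than 0")
    return {"enabled": enabled == "1", "pool_size": pool_size, "interval_minutes": interval}
```

Calling `read_maintenance_settings(SAMPLE)` returns the parsed values as a dictionary; an out-of-range value raises `ValueError`.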
To change the settings for IBM Content Collector V3.0 or later:
1. Start IBM Content Collector Configuration Manager.
2. In the Connectors view, select the IBM FileNet P8 Connector.
3. Select the Maintenance Task tab.
4. Select whether to enable or disable the task for the primary server cluster.
5. Optional: Modify the thread count to suit your environment capabilities.
6. Optional: Modify the scheduling of the task to suit your environment capabilities. Note that the smallest interval that can be configured in an interval schedule is 15 minutes.
7. Save the configuration.
To check the number of annotations that remain to be consolidated for indexing, you can run a simple query against the Annotation table of the FileNet P8 object store. The number of items that are of the class ICCMailSearchUpdateAnnotation is the number of email duplicates that still need to be added to the XIT for indexing. An example query to use is:
SELECT count(*) FROM ClassDefinition cd, Annotation a WHERE a.object_class_id=cd.object_id AND symbolic_name='ICCMailSearchUpdateAnnotation'
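To see what the query returns, it can be tried against a small stand-in database. The following Python sketch uses an in-memory SQLite database with two toy tables that contain only the columns the query references; this schema is an assumption for illustration and is far simpler than a real FileNet P8 object store database. The query text itself is the one given above.

```python
import sqlite3

QUERY = (
    "SELECT count(*) FROM ClassDefinition cd, Annotation a "
    "WHERE a.object_class_id=cd.object_id "
    "AND symbolic_name='ICCMailSearchUpdateAnnotation'"
)


def build_demo_db():
    """Create a toy database: one update annotation class with two instances."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ClassDefinition (object_id TEXT, symbolic_name TEXT)")
    conn.execute("CREATE TABLE Annotation (object_id TEXT, object_class_id TEXT)")
    conn.executemany(
        "INSERT INTO ClassDefinition VALUES (?, ?)",
        [("c1", "ICCMailSearchUpdateAnnotation"), ("c2", "Annotation")],
    )
    conn.executemany(
        "INSERT INTO Annotation VALUES (?, ?)",
        [("a1", "c1"), ("a2", "c1"), ("a3", "c2")],
    )
    return conn


def count_pending(conn):
    """Return the number of annotations that still wait to be consolidated."""
    return conn.execute(QUERY).fetchone()[0]
```

With the toy data, `count_pending(build_demo_db())` counts the two ICCMailSearchUpdateAnnotation instances and ignores the annotation of the other class.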
In addition to these changes, three new counters were introduced for monitoring runtime data related to the maintenance task:
- Number of XITs Updated
- Rate of XITs Updated
- Number of Maintenance Task Invocations
Use these counters to roughly determine the throughput of the maintenance task and how long the task runs with respect to the configured schedule interval.
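As a rough worked example, the counters can be combined with the annotation count to estimate how long the current backlog takes to clear. The formula and the function name below are assumptions, not a documented method: the sketch divides the number of XITs updated by the number of invocations to get an average per run, and it ignores annotations that arrive in the meantime.

```python
import math


def estimate_drain_minutes(backlog, xits_updated, invocations, interval_minutes=15):
    """Estimate the minutes needed to work off `backlog` pending annotations.

    backlog: result of the annotation count query
    xits_updated, invocations: values read from the monitoring counters
    interval_minutes: configured schedule interval (default per this document: 15)
    """
    if xits_updated <= 0 or invocations <= 0:
        raise ValueError("counters must be greater than 0 to estimate throughput")
    per_run = xits_updated / invocations  # average XIT updates per task run
    runs_needed = math.ceil(backlog / per_run)
    return runs_needed * interval_minutes
```

For example, with 1000 pending annotations, 2000 XITs updated over 10 invocations, and the default interval, the average is 200 updates per run, so 5 runs or about 75 minutes are needed.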