[Java programming language only]

Two-phase commit and error recovery

[Version 8.6 and later] The two-phase commit protocol coordinates all the partitions that participate in a distributed transaction on whether to commit or roll back the transaction.

In a distributed data grid, partitions are distributed across multiple Java™ virtual machines (JVM). These JVMs can be on more than one system. A transaction that writes to multiple partitions might involve transactional decisions that affect more than one system. When the transaction is committed with a two-phase commit protocol, this commit process ensures that the entire transaction is persisted, or none of transaction is persisted. The two-phase commit process ensures this outcome despite partition, system, or communication failures. If a failure occurs in the second phase, the WebSphere® eXtreme Scale client attempts to resolve the failure automatically, unless the error meets certain criteria for which you can manually intervene.

A transaction that is enabled to write to multiple partitions uses the two-phase commit protocol. A two-phase commit protocol ensures that the commit process is consistent across all partitions and systems. WebSphere eXtreme Scale acts as the coordinator that controls the two-phase commit process. The partitions that are involved in the transaction are called the participants or resource managers (RM). During the second phase of the commit protocol, the coordinator delegates one of the partitions to act as the transaction manager (TM). The TM is responsible for tracking the decision of each transaction and recovering the transaction if a failure occurs.

First phase:
When an application commits a transaction, WebSphere eXtreme Scale client starts the first phase by sending a prepare to commit request to each partition identified as an RM. Each partition applies the transaction changes to the backing maps and holds all locks to ensure data integrity. The RM notifies WebSphere eXtreme Scale client. After all partitions identified as an RM respond with success, WebSphere eXtreme Scale client begins the second phase of the commit protocol.
Second phase:
If at least one partition fails during the first phase, then the coordinator rolls back all partitions during the second phase. If all RM partitions respond with success, then the WebSphere eXtreme Scale client delegates one of the partitions to act as the TM partition. As the coordinator, WebSphere eXtreme Scale begins the second phase of the commit protocol by sending a commit or a rollback request to all partitions that are involved in the transaction. Each partition that is identified as an RM then either applies or rolls back the changes to the backing map and releases all the locks. The RM then notifies WebSphere eXtreme Scale client. If at least one partition failed during the second phase, then the delegated TM partition automatically recovers the transaction. Automatic recovery ensures all the partitions that are involved in the transaction are consistent.
In doubt phase:
The indoubt phase is the period between when the RM partition successfully processes the first phase, and is waiting to begin the second phase. During the indoubt period, the RM partition does not know whether to commit or roll back the transaction. The RM partition holds onto locks. Holding the locks can result in an increase in lock contention for other transactions.

Error recovery during a two-phase commit

If a failure occurs during the first phase, WebSphere eXtreme Scale client rolls back the transaction. If one of the partitions fails to commit the transaction, then the TM ensures that the transaction is committed by periodically attempting to commit the transaction. An example of log messages that occur in this scenario follow:

00000099 TransactionLog I CWOBJ8705I: Automatic resolution of transaction WXS-40000139-DF01-216D-E002-1CB456931719 at RM:TestGrid:TestSet2:20 is still waiting for a decision. Another attempt to resolve the transaction will occur in 30 seconds.

Allow WebSphere eXtreme Scale client to resolve the transaction. Attempt to intervene manually only if the transaction is not recovered within 1 minute or the application is experiencing a high volume of lock contention because it is an indoubt transaction. For more information about how to manually recover a transaction, see Troubleshooting lock timeout exceptions for a multi-partition transaction.