Why you should use DB2 Multisystem

Performance improvements can be quite significant for certain queries.

Testing has shown that for queries that process a large amount of data but return a relatively small result set, the performance gain is almost proportional to the number of systems the file is distributed across. For example, suppose you have a 5 million (5 000 000) record file that you want to query for the top 10 revenue producers. With DB2® Multisystem, partitioning the file equally across two systems cuts the response time for the query by nearly one-half. On three systems, the response time is nearly one-third the time of running the query on a single system. This best-case scenario does not apply to complex join operations, where data must be moved between nodes.
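
For the top 10 revenue producers query described above, a minimal sketch might look like the following, assuming a hypothetical REVENUE column in the ORDERS file (the column name is illustrative only). Each node scans only its own partition, and only the small intermediate results are combined:
  -- REVENUE is an assumed column, shown for illustration only
  SELECT NAME, REVENUE FROM PRODLIB/ORDERS
    ORDER BY REVENUE DESC
    FETCH FIRST 10 ROWS ONLY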

If a file is fairly small or is used primarily for single-record read or write processing, partitioning it yields little or no performance gain; a slight degradation in performance might even occur, because query performance becomes more dependent on the speed of the physical connection. Even in these situations, however, users on all the systems in the node group retain the advantage of being able to access the data, even though it is distributed, through the traditional local-file database methods with which they are familiar. In all environments, users gain this local-system transparency and the possible elimination of data redundancy across the systems in the node group.

Another parallelism feature, DB2 UDB Symmetric Multiprocessing, can also increase performance. With symmetric multiprocessing (SMP), when a partitioned file is processed and any of the systems are multiprocessor systems, you can achieve a multiplier effect in performance gains. If you partition a file across three systems and each system is a 4-way processor system, the functions of DB2 Multisystem and SMP work together: three systems, each running the query on four processors, give roughly twelve parallel processing streams. Using the previous 5 million record example, the response time is approximately one-twelfth of what it would have been had the query been run without any of the parallelism features. The file sizes and the details of the query affect the improvement that you actually see.

When you run queries, the bulk of the work is done in parallel, which improves the overall performance of query processing. The system divides the query and processes the appropriate parts on the appropriate systems. This highly efficient processing happens automatically; you do not need to specify anything to make it occur.

Note: The performance of some queries might not improve, especially if a large volume of data has to be moved.
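
For example, a join on a column other than the partitioning key can force records to be shipped between nodes before matching rows can be paired. The following sketch assumes a hypothetical CUSTOMER file, partitioned on a different key, and an assumed CUSTNUM column in both files:
  -- CUSTOMER and CUSTNUM are assumed, shown for illustration only
  SELECT O.ZIP, C.NAME, O.BALANCE
    FROM PRODLIB/ORDERS O INNER JOIN PRODLIB/CUSTOMER C
      ON O.CUSTNUM = C.CUSTNUM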

Each node has to process only the records that are physically stored on that node. If the query specifies selection against the partitioning key, the query optimizer might determine that only one node needs to be queried. In the following example, the ORDERS file is partitioned on the ZIP code field, which the SQL statement uses for selection:
  SELECT NAME, ADDRESS, BALANCE FROM PRODLIB/ORDERS WHERE ZIP='48009'

When the statement is run, the optimizer determines that only one node needs to be queried. Remember that all the records that contain the 48009 ZIP code are distributed to the same node.
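
This behavior follows from how the partitioning key is declared when the file is created. As a minimal sketch, assuming a node group named GROUP1 already exists and using illustrative column definitions, a partitioned ORDERS table might be defined like this:
  -- GROUP1 and the column definitions are assumptions for illustration
  CREATE TABLE PRODLIB/ORDERS
    (NAME CHAR(30), ADDRESS CHAR(50),
     BALANCE DECIMAL(11,2), ZIP CHAR(5))
    IN GROUP1 PARTITIONING KEY (ZIP)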

In the next SQL statement example, the processor capability of all the IBM® i models in the node group can be used to process the statement in parallel. Because ZIP is the partitioning key, all the records for a given ZIP code reside on the same node, so each node can compute its groups entirely from local data:
  SELECT ZIP, SUM(BALANCE) FROM PRODLIB/ORDERS GROUP BY ZIP
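
By contrast, grouping on a column that is not the partitioning key would require the partial results from each node to be combined. A sketch, assuming a hypothetical STATE column in the ORDERS file:
  -- STATE is an assumed column, shown for illustration only
  SELECT STATE, SUM(BALANCE) FROM PRODLIB/ORDERS GROUP BY STATE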

Another advantage of having the optimizer direct I/O requests only to systems that contain pertinent data is that queries can still run if one or more of the systems are not active. An example is a file that is partitioned such that each branch of a business has its data stored on a different system. If one system is unavailable, file I/O operations can still be performed for data associated with the remaining branches. The I/O request fails for the branch that is not active.

The system uses two-phase commit protocols to ensure the integrity of the data. Because multiple systems are being accessed, if you request commitment control, all of the systems use protected conversations. A protected conversation means that if a system failure occurs in the middle of a transaction or in the middle of a single database operation, all of the changes made up to that point are rolled back.
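
As a brief sketch of what this protects, consider an update that runs under commitment control and changes rows stored on several nodes; two-phase commit ensures that all of the nodes either commit or roll back together (the statement itself is illustrative):
  -- All nodes commit the change together, or none do
  UPDATE PRODLIB/ORDERS SET BALANCE = 0 WHERE BALANCE < 1
  COMMIT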

When protected conversations are used, some commitment control options are changed at the remote nodes to enhance performance. The Wait for outcome option is set to Y, and the Vote read-only permitted option is set to Y. To further enhance performance, you can use the Change Commitment Options (QTNCHGCO) API to change the Wait for outcome option to N on the system where your queries are initiated. Refer to the APIs topic in the Information Center to understand the effects of these commitment option values.