IBM Support

IBM PureData System for Operational Analytics High Availability toolkit component update for March 2016.

Troubleshooting


Problem

The High Availability toolkit component in an IBM PureData System for Operational Analytics V1.0 FP4 or V1.1 GA environment may re-order logical partitions on failover, fail to fire event alerts, cause increased internal mirror I/O traffic, fail to start the database when nodes are excluded from TSA domains, and report inaccurate results from hals when multiple hals commands are run in parallel on the same host or when hals is run by non-root users.

Symptom

There are several symptoms addressed by the fixes in this update.

1. Logical partition assignments in the core warehouse instance db2nodes.cfg file may be out of order after certain types of failover sequences. One possible effect of the re-ordering is that the coordinator partition may change.

2. When the HA_TOOLS event firing mechanism is used, there are failover scenarios in which events will not fire.

3. I/O traffic as measured by iostat or topas shows very heavy usage on hdisk0 and hdisk1 on core nodes. This is especially true on nodes in an HA Group with 3 or more nodes.

4. In some troubleshooting scenarios one or more nodes may be excluded from the TSA domain associated with their HA Group. If an excluded node is the default connection node for the HA Group, hastartdb2 reports that there are no partitions to start.

5. When hals is run in parallel on the same host the output from hals may be incomplete. Also, when hals is run as a non-root user you may see permission errors on the /tmp filesystem.

6. The db2set parameter PMODEL may be overwritten if customized after deployment.
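
A quick way to record the current value before and after maintenance is to query the DB2 registry as the instance owner. This is only a hedged sketch; it assumes nothing beyond the affected registry variable name containing the string PMODEL, as described above.

  # Run as the core warehouse instance owner and save the output for later comparison.
  db2set -all | grep -i PMODEL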

Environment

1. This defect occurs on all versions of IBM PureData System for Operational Analytics environments.

2. This defect occurs on V1.0.0.3 or higher and V1.1.0.0 environments.

3. This defect occurs on V1.0.0.4 or V1.1.0.0 environments.

4. This defect occurs on all environments.

5. This defect occurs on V1.0.0.4 or V1.1.0.0 environments.

Diagnosing The Problem

On the management node, run the following:


$ miinfo -h $(hostname) -d -c | grep 'High availability toolkit'
High availability toolkit 2.0.0.3

If this reports 2.0.0.3 or earlier, then the system is at risk of the issues fixed in this update.
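
If the console is not available, the installed level can also be read directly from each node. This is a hedged alternative; it assumes that ${ALL} is the node group used elsewhere in this document and that the toolkit records its level in the version.txt file shown in the update package listing below.

  # Run as root on the management node; version.txt holds the installed ha_tools level.
  dsh -n ${ALL} "cat /usr/IBM/analytics/ha_tools/version.txt" | dshbak -c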

Resolving The Problem

PDOA V1 customers must upgrade to FP4 prior to applying this update. All PDOA V1.1 customers should apply this update.

This update is targeted for PDOA V1 FP5 and PDOA V1.1 FP1.

UPDATED TEXT: 2017-11-16:

The PDOA V1.0.0.5 and V1.1.0.1 fixpacks were delivered in March 2017. These fixpacks contain ha_tools version 2.0.0.5, which includes all of the fixes mentioned in this technote. Do not follow any of the instructions below if these fixpacks have been applied. Customers with environments that are experiencing these symptoms on PDOA V1.0.0.4 or V1.1.0.0 may follow these instructions if needed. Customers with environments on PDOA V1.0.0.3 should plan to apply PDOA V1.0.0.5 directly to resolve these ha_tools issues. See the IBM PureData System for Operational Analytics V1.0 FP5 Readme for more information.

This update affects active monitoring files used by Tivoli System Automation for GPFS filesystem mounts. Therefore, it is better to perform the following tasks during a planned outage. This allows time for correcting any mistakes made while applying the patch and for testing failover scenarios. While it is possible to make this change online, there are risks involved in doing so.

Warning: If you have made modifications to any High Availability toolkit files for another PMR, notify IBM Support to ensure that this update will not impact those changes.

Refer to the table at the end of this page to obtain the files below, or contact IBM Support to obtain the update files. The ID number for the update is 7248. There are two files associated with the update:


  • task7248_ha_tools_2.0.0.4.tar
  • task7248_ha_tools_2.0.0.4.cksums
 

Steps:

1. Make a backup copy of your existing HA_TOOLS directories. Run the following from the management node as root. (A rollback sketch using these backups appears at the end of this step.)

  • a. Establish one timestamp for all backups. All backups should run in this same session.
      mydate=$(date +%Y%m%d%H%M%S)    

    b. Make a backup of the ha_tools files on all nodes.
      dsh -n ${ALL} "cp -pr /usr/IBM/analytics/ha_tools /usr/IBM/analytics/ha_tools_${mydate}_2.0.0.3_backup"    

    c. Verify the backup from the same session. This preserves the ${mydate} variable.
      $ dsh -n ${ALL} "ls -lad /usr/IBM/analytics/ha_tools_${mydate}_2.0.0.3_backup"    flashdancehostname01: drwxr-xr-x    3 root     system         4096 Mar 10 10:01 /usr/IBM/analytics/ha_tools_20160310100604_2.0.0.3_backup    flashdancehostname03: drwxr-xr-x    3 root     system         4096 Mar 10 08:23 /usr/IBM/analytics/ha_tools_20160310100604_2.0.0.3_backup    flashdancehostname05: drwxr-xr-x    3 root     system         4096 Mar 10 08:23 /usr/IBM/analytics/ha_tools_20160310100604_2.0.0.3_backup    flashdancehostname02: drwxr-xr-x    3 root     system         4096 Mar 10 08:23 /usr/IBM/analytics/ha_tools_20160310100604_2.0.0.3_backup    flashdancehostname07: drwxr-xr-x    3 root     system         4096 Mar 10 08:23 /usr/IBM/analytics/ha_tools_20160310100604_2.0.0.3_backup    flashdancehostname04: drwxr-xr-x    3 root     system         4096 Mar 10 08:23 /usr/IBM/analytics/ha_tools_20160310100604_2.0.0.3_backup    flashdancehostname06: drwxr-xr-x    3 root     system         4096 Mar 10 08:23 /usr/IBM/analytics/ha_tools_20160310100604_2.0.0.3_backup    

    d. Back up the sapolicies files.
      dsh -n ${ALL} "cp -pr /usr/sbin/rsct/sapolicies /usr/sbin/rsct/sapolicies_${mydate}_2.0.0.3_backup"    

    e. Verify the backup of the policies directory:
      $ dsh -n ${ALL} "ls -lad /usr/sbin/rsct/sapolicies_${mydate}_2.0.0.3_backup"    flashdancehostname01: drwxr-xr-x    8 bin      bin             256 Nov 27 01:04 /usr/sbin/rsct/sapolicies_20160310100604_2.0.0.3_backup    flashdancehostname05: drwxr-xr-x    6 bin      bin             256 Nov 26 22:57 /usr/sbin/rsct/sapolicies_20160310100604_2.0.0.3_backup    flashdancehostname03: drwxr-xr-x    8 bin      bin             256 Nov 27 01:04 /usr/sbin/rsct/sapolicies_20160310100604_2.0.0.3_backup    flashdancehostname07: drwxr-xr-x    6 bin      bin             256 Nov 26 22:56 /usr/sbin/rsct/sapolicies_20160310100604_2.0.0.3_backup    flashdancehostname02: drwxr-xr-x    6 bin      bin             256 Nov 26 22:59 /usr/sbin/rsct/sapolicies_20160310100604_2.0.0.3_backup    flashdancehostname06: drwxr-xr-x    6 bin      bin             256 Nov 26 22:57 /usr/sbin/rsct/sapolicies_20160310100604_2.0.0.3_backup    flashdancehostname04: drwxr-xr-x    6 bin      bin             256 Nov 26 22:57 /usr/sbin/rsct/sapolicies_20160310100604_2.0.0.3_backup    

    f. Copy the two update files to /usr/IBM/analytics. The directory should appear similar to this on the management node.
      $ ls -la /usr/IBM/analytics/
      total 448
      drwxr-xr-x    4 root     system          256 Mar 10 19:28 .
      drwxr-xr-x    6 bin      bin             256 Jan 07 14:35 ..
      drwxr-xr-x    3 root     system         4096 Jan 19 21:47 ha_tools
      drwxr-xr-x    3 root     system         4096 Jan 19 21:47 ha_tools_20160310192632_2.0.0.3_backup
      -rw-r--r--    1 root     system          488 Mar 10 19:18 task7248_ha_tools_2.0.0.4.cksums
      -rw-r--r--    1 root     system       215040 Mar 10 19:14 task7248_ha_tools_2.0.0.4.tar

    g. Unpack the update tar file in /usr/IBM/analytics
      $ tar -xvf task7248_ha_tools_2.0.0.4.tar
      x ha_tools_2.0.0.4
      x ha_tools_2.0.0.4/ha_tools
      x ha_tools_2.0.0.4/ha_tools/HAscripts
      x ha_tools_2.0.0.4/ha_tools/HAscripts/crISAS_rsrc.ksh, 12052 bytes, 24 media blocks.
      x ha_tools_2.0.0.4/ha_tools/HAscripts/db2ISAS_start.ksh, 9191 bytes, 18 media blocks.
      x ha_tools_2.0.0.4/ha_tools/HAscripts/mountISAS_monitor.ksh, 13405 bytes, 27 media blocks.
      x ha_tools_2.0.0.4/ha_tools/hafunctions, 157370 bytes, 308 media blocks.
      x ha_tools_2.0.0.4/ha_tools/hals, 8235 bytes, 17 media blocks.
      x ha_tools_2.0.0.4/ha_tools/version.txt, 9 bytes, 1 media blocks.

    h. Verify the checksums for the updated files.
      $ find ha_tools_2.0.0.4 -type f | xargs cksum | while read cksum;do grep "$cksum" task7248_ha_tools_2.0.0.4.cksums || echo "ERROR: $cksum is incorrect.";done
      3279239073 12052 ha_tools_2.0.0.4/ha_tools/HAscripts/crISAS_rsrc.ksh
      2687186877 9191 ha_tools_2.0.0.4/ha_tools/HAscripts/db2ISAS_start.ksh
      805294149 13405 ha_tools_2.0.0.4/ha_tools/HAscripts/mountISAS_monitor.ksh
      3203761328 157370 ha_tools_2.0.0.4/ha_tools/hafunctions
      3385168046 8235 ha_tools_2.0.0.4/ha_tools/hals
      2981299425 9 ha_tools_2.0.0.4/ha_tools/version.txt

    i. Verify the permissions and file ownership.
      $ find ha_tools_2.0.0.4 -ls
      240114    1 drwxr-xr-x  3 root      system         256 Mar 10 19:10 ha_tools_2.0.0.4
      240115    1 drwxr-xr-x  3 root      system         256 Mar 10 19:10 ha_tools_2.0.0.4/ha_tools
      240116    1 drwxr-xr-x  2 root      system         256 Mar 10 19:08 ha_tools_2.0.0.4/ha_tools/HAscripts
      240117   12 -rwxr-xr-x  1 root      system       12052 Mar 10 19:07 ha_tools_2.0.0.4/ha_tools/HAscripts/crISAS_rsrc.ksh
      240118    9 -rwxr-xr-x  1 root      system        9191 Mar 10 19:08 ha_tools_2.0.0.4/ha_tools/HAscripts/db2ISAS_start.ksh
      240119   14 -rwxr-xr-x  1 root      system       13405 Mar 10 19:08 ha_tools_2.0.0.4/ha_tools/HAscripts/mountISAS_monitor.ksh
      240120  154 -rwxr-xr-x  1 root      system      157370 Mar 10 19:06 ha_tools_2.0.0.4/ha_tools/hafunctions
      240121    9 -rwxr-xr-x  1 root      system        8235 Mar 10 19:07 ha_tools_2.0.0.4/ha_tools/hals
      240122    1 -rw-r--r--  1 root      system           9 Mar 10 19:07 ha_tools_2.0.0.4/ha_tools/version.txt
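    Rollback note: if the update has to be backed out later, the backups taken in steps b and d can be copied back into place. The following is only a rough sketch; it assumes the same shell session (so ${mydate} is still set) and that the domains are stopped or in manual mode. Verify the paths before running anything.
      # Hypothetical rollback: copy the backed-up ha_tools files and db2 policy files
      # back over the updated copies on every node.
      dsh -n ${ALL} "cp -pr /usr/IBM/analytics/ha_tools_${mydate}_2.0.0.3_backup/* /usr/IBM/analytics/ha_tools/"
      dsh -n ${ALL} "cp -p /usr/sbin/rsct/sapolicies_${mydate}_2.0.0.3_backup/db2/* /usr/sbin/rsct/sapolicies/db2/"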

2. Decision point. To apply the update during a planned outage, continue with step 2. If you understand the risks and want to make the change online, go to step 3.

  • a. Stop the environment. First consult with your application teams and stop those applications before moving to step b.
    b. With all applications and users quiesced or off the system, bring the system services down. After each command, verify with 'hals' that the associated service is Offline.
      hastopdpm
      hastopapp
      hastopdb2
      hadomain -core stop
      hadomain -mgmt stop

    c. Proceed to step 4.
3. If you skipped step 2, put the domains into manual mode, then confirm the setting as sketched below.
  hadomain -core manual
  hadomain -mgmt manual

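  Before continuing, confirm that automation is actually suspended in both domains. This is only a hedged check; it assumes that lssamctrl reports the domain automation setting and that the Automation field shows Manual on the core and management domain nodes.
      # Run as root from the management node.
      dsh -n ${ALL} "lssamctrl | grep Automation" | dshbak -c
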
4. Apply the updates.
  • a. Apply the ha_tools update. This is run from the management node.
      dcp -n ${ALL} -pR /usr/IBM/analytics/ha_tools_2.0.0.4/ha_tools /usr/IBM/analytics/    

    b. Verify that the new files are copied by checking the checksums.
      find /usr/IBM/analytics/ha_tools_2.0.0.4 -type f | sed 's|/usr/IBM/analytics/ha_tools_2.0.0.4/||' | while read f;do cksum=$(grep "$f" /usr/IBM/analytics/task7248_ha_tools_2.0.0.4.cksums | while read a b;do echo $a;done);dsh -n ${ALL} "(cksum /usr/IBM/analytics/$f | grep \"$cksum\" > /dev/null && echo Checksum good for $f) || echo Checksum $cksum bad for $f." ;done | sort | dshbak -c
      HOSTS -------------------------------------------------------------------------
      hostname01, hostname02, hostname03, hostname04, hostname05, hostname06, hostname07
      -------------------------------------------------------------------------------
      Checksum good for ha_tools/HAscripts/crISAS_rsrc.ksh
      Checksum good for ha_tools/HAscripts/db2ISAS_start.ksh
      Checksum good for ha_tools/HAscripts/mountISAS_monitor.ksh
      Checksum good for ha_tools/hafunctions
      Checksum good for ha_tools/hals
      Checksum good for ha_tools/version.txt

    c. Apply the HAscripts update. This updates the policy files from ha_tools.
      find /usr/IBM/analytics/ha_tools_2.0.0.4/ha_tools/HAscripts/ -type f | while read f;do dcp -n ${ALL} -p $f /usr/sbin/rsct/sapolicies/db2/;done    

    d. Verify that the policy files are correct. This command should return no output.
      find /usr/IBM/analytics/ha_tools_2.0.0.4/ha_tools/HAscripts/ -type f | while read f;do g=$(basename $f);dsh -n ${ALL} diff /usr/IBM/analytics/ha_tools/HAscripts/$g /usr/sbin/rsct/sapolicies/db2/$g;done    

    e. Verify that the console is reporting the new level.
      $ miinfo -h $(hostname) -d -c | grep 'High availability toolkit'
      High availability toolkit                         2.0.0.4
      The reported level should be 2.0.0.4 or higher (it was 2.0.0.3 before the update).

    f. If the system is in manual mode, proceed to step 5. Otherwise proceed to step 6.
        
5. Check the system using hals first and then lssam on each of the domains. There should be no Failed or Pending states in the environment. Let the environment run for 30 minutes to verify that there are no monitoring issues. If there are no failed states then remove the system from manual mode.
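
A hedged way to scan a domain for these states from any online node is to grep the lssam output; this is only a quick filter and not a substitute for reviewing the full output.

  # Any line reported here needs investigation before leaving manual mode.
  lssam 2>&1 | egrep -i "failed|pending"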

  • a. Resume auto mode.
      hadomain -core auto
      hadomain -mgmt auto

    b. Verify no failures after auto mode. hals is one of the components included in the update.
      hals

    c. Skip to step 7.

6. Restart the system services. Note that hastartdb2 and hals are updated.

  • a. Start the domains.
      hadomain -core start
      hadomain -mgmt start

    b. Verify no failures exist after the domains start.
      hals

    c. Start the services.
      hastartdb2
      hastartapp
      hastartdpm

    d. Proceed to step 7.

7. Testing the new update will require failovers of the core database instance.

  • a. Useful commands to run in window sessions during testing.

    i. Logs. For this patch use the following command to tail the logs. This lets you see how the system behaves from the system log point of view.

      tail -f /var/log/syslog.out | egrep "db2ISAS_stop|db2ISAS_start|hastartdb2|crISAS_rsrc.ksh"    

    ii. hals. The hals command shows a high level view. For this update we test it running every 10 seconds as root or the instance owner.
      while sleep 10;do echo "$(date) $(hostname) $(id)";hals;done    

    iii. lssam. lssam is a low-level TSA command. Along with the logs, it allows a more granular view than hals. For multi-domain systems, run the following on any node expected to be online in the domain during the test. This command refreshes often, showing the state of the domain. Run it as root, and replace 'bcuaix' with the instance owner if different from the default.
       lssam -top -s "Name like 'db2_bcuaix_%'"    


    b. Setting up for testing. To watch the system start, you can do any or all of the following. A two-screen setup is good if you have the real estate. Here are two examples.

    • 1. First screen layout. System logs for a 2.5 DN environment, 6 sessions.

        Admin Node System Log Tail                              | HA Group 2 Standby System Log Tail
        Standby Admin Node System Log Tail                      | HA Group 2 DN1 System Log Tail
        Root session on management node for running commands    | HA Group 2 DN2 System Log Tail

        (A screenshot of the real screen layout appeared here.)

      2. Screen 2: hals and lssam sessions.

        root on mgmt: hals loop                                 | admin node: lssam
        instance owner on mgmt: hals loop                       |
        root on DN 1: hals loop                                 | dn1: lssam
        root on DN 1: hals loop                                 |
        instance owner on DN1: hals loop                        |

        (A screenshot of the real screen layout appeared here.)

    8. Perform testing.


    • a. Run several hafailover commands.

      • i. Verify that all logical partition numbers in /db2home/bcuaix/sqllib/db2nodes.cfg are sequential per host for all partitions assigned to that host after each failover. (A helper sketch for this check appears after item ii.)

        ii. If the event firing mechanism is used, verify that the event scripts are fired as expected.
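
        A hedged helper for the check in item i: it groups the logical partition numbers by host, in file order, so gaps or re-ordering are easy to spot. It assumes the default instance owner path shown above.

          # Print each host followed by its partition numbers in the order they appear in db2nodes.cfg.
          awk '{parts[$2] = parts[$2] " " $1} END {for (h in parts) print h ":" parts[h]}' /db2home/bcuaix/sqllib/db2nodes.cfg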

       

      b. For 1.5 DN environments it is possible to run scenarios with multiple failovers at the same time. Below is a simple way to test multiple failovers in a complete environment.


      • i. Identify the primary host for each partition set. The example below shows three partition sets. The primary host for each partition set is the first hostname in the Resource:Node[Membership] list. This is the most accurate method to determine the primary host for a partition set. As long as all of the domains are online this command will work. Other methods, such as using 'hals' or looking at 'db2nodes.cfg', are only accurate when the partitions are actually online.
            
          • $ dsh -n ${BCUDB2ALL} 'lsequ | grep db2_bcuaix | while read equ rest;do lsequ -e $equ | egrep "Name|Resource";done' 2>&1 | dshbak -c
            HOSTS -------------------------------------------------------------------------
            hostname02, hostname04
            -------------------------------------------------------------------------------
            Name = db2_bcuaix_0_1_2_3_4_5-rg_group-equ
            Resource:Node[Membership] = {hostname02:hostname02,hostname04:hostname04}

            HOSTS -------------------------------------------------------------------------
            hostname05, hostname06, hostname07
            -------------------------------------------------------------------------------
            Name = db2_bcuaix_16_17_18_19_20_21_22_23_24_25-rg_group-equ
            Resource:Node[Membership] = {hostname06:hostname06,hostname07:hostname07}
            Name = db2_bcuaix_6_7_8_9_10_11_12_13_14_15-rg_group-equ
            Resource:Node[Membership] = {hostname05:hostname05,hostname07:hostname07}




        ii. Stop the instance using 'hastopdb2'.

        iii. Log in to each primary host and run 'samctrl -u a $(hostname)'. When run on the primary host for that partition set, this excludes that node from the domain. This artificially makes the node unavailable to the partition set, forcing a failover. You can have at most one node excluded per HA Group. Note that in a PDOA environment an excluded node will unmount all GPFS filesystems that are managed by TSA on that node; GPFS filesystems that are not managed by TSA will stay mounted. (Re-including the nodes after the test is sketched after the NOTE below.)

        iv. A good practice when performing these tests is to mark the log. For example, the following command creates a system log entry on all core nodes. If you have set up your tail environment, verify that the log entry shows up. If it does not, the syslog may have wrapped and you will need to restart your tail. Notice that we use a keyword that will be picked out by the grep command in our tail screens. Once properly marked, run the test soon after to avoid syslog wrapping.

          dsh -n ${BCUDB2ALL} 'logger "************** RUNNING hastartdb2 test"'    

        v. With the nodes properly excluded, run 'hastartdb2'. If you have set up your monitoring environment, watch the various screens as the system reacts. If not, examine the system logs and hals output using another method.


        • NOTE: In this scenario we force a worse case than a single or even multiple failover event. Partition sets that are not failing over start at the same time as partition sets that will fail over. While looking at the logs you will likely see that the failover partitions fail to start the first time, resulting in a kill. This is expected. The partitions will attempt to start again and normally will start on the second try. In the case of multiple failovers, all partition sets that are involved in a failover will start one after the other. The order is dependent upon TSA and is not predictable. Larger environments in this test will therefore have more failover partition sets, and more failovers increase the time it takes for the database to start.
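
        After the test, bring the excluded nodes back into automation; otherwise they remain unavailable to their partition sets. This is only a hedged sketch using standard TSA commands (lssamctrl shows the excluded-node list, and samctrl -u d removes a node from it); adapt it to your own procedures.

          # Run as root on each node that was excluded with 'samctrl -u a'.
          lssamctrl                   # review the current ExcludedNodes list
          samctrl -u d $(hostname)    # remove this node from the excluded list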

File                                                        Link
Tar file containing fix.                                    task7248_ha_tools_2.0.0.4.tar
Checksum file. Remove the .txt extension after download.    task7248_ha_tools_2.0.0.4.cksums.txt

[{"Product":{"code":"SSH2TE","label":"PureData System for Operational Analytics A1801"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF002","label":"AIX"}],"Version":"1.0;1.1","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
13 December 2019

UID

swg21978758