Question & Answer
Question
Why is failover of shared file systems slow in the PureData System for Operational Analytics (PDOA) solution, causing automatic failovers of the core warehouse hosts to fail?
Cause
Detection and failover of shared file systems such as /db2home, which is mounted on all hosts through the General Parallel File System (GPFS) layer, is slow. Because a quick GPFS takeover is a prerequisite for a smooth failover on all hosts, the automatic failover eventually times out and fails.
Answer
A deeper look at the GPFS layer shows that, in most cases, the agent threads responsible for detecting the failure and carrying out the takeover of the /db2home file system are overloaded. To accelerate the takeover, the number of these worker threads needs to be increased.
The following log snippet, taken from a GPFS snap of the admin (manager) node, illustrates the slow takeover:
./gpfs.snap.Host02_07211954.out.tar.gz_unpack/Host02_07211954/mmfs.logs.Host02:Fri Jun 17 15:34:59.164 2016: GPFS: 6027-630 Node 172.23.1.4 (Host02) appointed as manager for db2home.
./gpfs.snap.Host02_07211954.out.tar.gz_unpack/Host02_07211954/mmfs.logs.Host02:Fri Jun 17 15:35:03.760 2016: GPFS: 6027-611 Recovery: db2home, delay 16 sec. for safe recovery.
./gpfs.snap.Host02_07211954.out.tar.gz_unpack/Host02_07211954/mmfs.logs.Host02:Fri Jun 17 15:40:08.750 2016: GPFS: 6027-643 Node 172.23.1.4 (Host02) completed take over for db2home.
The log snippet above shows that Host02 needed over five minutes (15:34:59 to 15:40:08) to take over /db2home; as a result, the automatic failover times out and fails.
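The takeover time can be computed directly from the timestamps of the 6027-630 ("appointed as manager") and 6027-643 ("completed take over") messages. The following is a minimal sketch, assuming GNU date and awk; the temporary file stands in for the real mmfs.logs.* files inside a gpfs.snap archive:

```shell
# Reproduce the two relevant log messages in a scratch file
# (in a real case, grep the mmfs.logs.* files from the snap instead).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Fri Jun 17 15:34:59.164 2016: GPFS: 6027-630 Node 172.23.1.4 (Host02) appointed as manager for db2home.
Fri Jun 17 15:40:08.750 2016: GPFS: 6027-643 Node 172.23.1.4 (Host02) completed take over for db2home.
EOF

# Extract "Mon DD YYYY HH:MM:SS" for a matching message and convert
# it to epoch seconds with GNU date.
ts() {
  date -d "$(awk -v pat="$1" \
    '$0 ~ pat { gsub(/\.[0-9]+/, "", $4); sub(/:$/, "", $5); print $2, $3, $5, $4 }' \
    "$LOG")" +%s
}

start=$(ts "appointed as manager")
end=$(ts "completed take over")
echo "takeover took $((end - start)) seconds"   # → takeover took 309 seconds
rm -f "$LOG"
```

Here 309 seconds is just over five minutes, far longer than the automatic failover timeout tolerates.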
To fix this, run the following command from the admin host (the GPFS manager node); the setting applies to the whole domain:
mmchconfig tscWorkerPool=128
Then recycle the GPFS software to make the change effective.
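Assuming a standard GPFS command set, the full sequence might look like the sketch below. Shutting down and restarting GPFS on all nodes with mmshutdown/mmstartup is one common way to perform the recycle; it unmounts all GPFS file systems, including /db2home, so plan an outage window:

```shell
# Run from the admin host; the setting applies cluster-wide.
mmchconfig tscWorkerPool=128

# Recycle GPFS so the change takes effect
# (all GPFS file systems are unmounted during this window).
mmshutdown -a
mmstartup -a

# Verify the new setting.
mmlsconfig | grep tscWorkerPool
```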
Related Information
Product Synonym
PDOA
Document Information
Modified date:
17 October 2019
UID
swg21991409