Host exceeded average power usage threshold
Users receive an event alert email (nzevent) from a host in an IBM PureData System for Analytics (also known as Netezza/NPS) 100-1 appliance.
The alert is system initiated and is generated when the system is starting up.
The following event message is received when the system is starting up:
CRITICAL: NPS system xxxx - host 1004 Needs attention. System initiated.
error string:Host exceeded average power usage threshold
average power usage value: 100 threshold: 95
event source:System initiated
The threshold value for average power usage might not be set correctly.
Diagnosing the problem
To check whether the alert is caused by a hardware issue, check the average power usage of the host server through the AMM or with the ipmitool command. If the average power usage is within the safe range or is reported with an OK status, the alert is not caused by a hardware issue and can be resolved by updating the NPS registry parameter.
To retrieve the average power usage of the system, run the following commands on the system.
1. Log in as the root user on the host server and query the average power usage via the AMM.
[root@XXXX]# ssh mm001 fuelg -T blade
system> fuelg -T blade
PM Capability: Dynamic Power Measurement with capping
Effective CPU Speed: 2399 MHz
Maximum CPU Speed: 2400 MHz
-pcap 959 (min: 598, max: 959)
Maximum Power: 122
Minimum Power: 107
Average Power: 111
Data captured at 03/01/14 16:40:55
2. Log in as the root user on the SPU and query the average power usage with the ipmitool command.
[root@svntz001np ~]# ipmitool sdr | grep -i power
Avg Power | 110 Watts | ok
Host Power | 0x00 | ok
Avg Power 2 | disabled | ns
Avg Power 3 | disabled | ns
Avg Power 4 | disabled | ns
Avg Power 5 | disabled | ns
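The comparison between the reported reading and a threshold can be scripted. The following is a minimal sketch, assuming the `ipmitool sdr` output has the pipe-separated layout shown above and that the "Avg Power" reading is a whole number of Watts. The function names `avg_power_watts` and `check_avg_power` are hypothetical helpers, not part of NPS or ipmitool.

```shell
#!/bin/sh
# Hypothetical helper: read `ipmitool sdr` style output on stdin and print
# the numeric value of the "Avg Power" sensor (the unnumbered one only).
avg_power_watts() {
    awk -F'|' '$1 ~ /^ *Avg Power *$/ { split($2, f, " "); print f[1]; exit }'
}

# Hypothetical helper: compare the reading on stdin with a threshold in Watts.
# $1 = threshold. Returns 0 if within the threshold, 1 if above, 2 if no reading.
check_avg_power() {
    threshold=$1
    watts=$(avg_power_watts)
    if [ -z "$watts" ]; then
        echo "No Avg Power reading found"
        return 2
    fi
    if [ "$watts" -gt "$threshold" ]; then
        echo "Avg Power ${watts}W exceeds threshold ${threshold}W"
        return 1
    fi
    echo "Avg Power ${watts}W is within threshold ${threshold}W"
}
```

For example, `ipmitool sdr | check_avg_power 95` reports whether the current reading is above a threshold of 95. The script only compares numbers; whether the reading is in a safe range for the hardware still has to be judged from the sensor status.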
Resolving the problem
Update the NPS parameter to a more appropriate value, based on the readings from the AMM or ipmitool commands.
The higher the value, the fewer warnings are generated.
1. Pause the system as the nz user (nzsystem pause).
Are you sure you want to pause the system (y|n)? [n] y
2. Change the threshold value.
nzsystem set -arg "sysmgr.hostPwrAvgThresholdToRiseEvent=115"
Are you sure you want to change the system configuration (y|n)? [n] y
3. Resume the system (nzsystem resume).
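The three steps above can be sketched as one sequence. This is an illustrative wrapper rather than an IBM-supplied script: `run`, `set_power_threshold` and the `DRY_RUN` variable are hypothetical names, and the sketch assumes that `nzsystem pause` and `nzsystem resume` are the commands used to pause and resume the system. By default it only prints the commands so that they can be reviewed first.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: pause, set the threshold, resume.
# $1 = new threshold value. Run as the nz user.
# When executed for real, nzsystem still asks for the y|n confirmations
# shown in the steps above.
set_power_threshold() {
    run nzsystem pause
    run nzsystem set -arg "sysmgr.hostPwrAvgThresholdToRiseEvent=$1"
    run nzsystem resume
}
```

Calling `set_power_threshold 115` with the default settings prints the three commands without touching the system. Setting `DRY_RUN=0` before the call runs them in order.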