A fix is available
APAR status
Closed as program error.
Error description
New netmon functionality to support HACMP on VIO
Local fix
Problem summary
HACMP customers using VIO within their clusters have been experiencing problems in specific scenarios where an entire CEC is unplugged from the network, but the HACMP node within it does not detect a local adapter down event, because traffic passed between the VIO clients looks like normal external traffic from the perspective of the LPAR's OS.

Note: The remainder of this problem description was written based on review of a VIO customer's situation, and IVE was added later as another environment where this fix applies. Therefore, not all of the supporting explanation for VIO may apply directly to IVE, but the final solution is still valid.

There is already a restriction against two HACMP nodes in the same cluster using the same VIO server, because this would mean heartbeats can be passed between the nodes via the server even when no real network connectivity exists. The problem addressed by this APAR is not the same as that issue, although there are similarities.

In HACMP, heartbeating is used as a reliable means of monitoring an adapter's state over a long period of time. When heartbeating is broken, a decision has to be made as to whether the local adapter has gone bad, or the neighbor (or something between them) has a problem. The local node only needs to take action if the local adapter is the problem; if its own adapter is good, then we assume it is still reachable by other clients regardless of the neighbor's state (the neighbor is responsible for acting on its own local adapter failures).

This decision (local vs remote bad) is made based on whether any network traffic can be seen on the local adapter, using the inbound byte count of the interface. Where VIO is involved, this test becomes unreliable, since there is no way to distinguish whether inbound traffic came in from the VIO server's connection to the outside world or just from a neighboring VIO client. (It is a design point of VIO that its virtual adapters be indistinguishable to the LPAR from a real adapter.)
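The weakness of the byte-count test can be illustrated with a small sketch. This is not the netmon code; the function names and the idea of a single combined counter built from two sources are assumptions made purely for illustration.

```python
def local_adapter_looks_up(count_before: int, count_after: int) -> bool:
    """Traditional-style test: any change in the inbound byte counter
    is read as "the adapter received traffic, so it is good"."""
    return count_after != count_before


def sample_counter(start: int, external_bytes: int, vio_client_bytes: int) -> int:
    """Hypothetical model of what the LPAR sees: one combined counter.
    The two sources cannot be told apart from inside the LPAR."""
    return start + external_bytes + vio_client_bytes


before = 5000

# CEC unplugged and no neighboring VIO clients talking: counter is flat.
quiet = sample_counter(before, external_bytes=0, vio_client_bytes=0)

# CEC unplugged, but a neighboring VIO client is still sending traffic.
fooled = sample_counter(before, external_bytes=0, vio_client_bytes=1200)

print(local_adapter_looks_up(before, quiet))   # False
print(local_adapter_looks_up(before, fooled))  # True
```

In the second case no external connectivity exists, yet the test still reports a good adapter, which is the scenario this APAR addresses.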
Problem conclusion
A long-term solution will require cooperative design work between VIO and HACMP/RSCT, so that customers can take advantage of VIO's benefits while HACMP remains aware enough of what is happening "below the surface" to react appropriately when it needs to.

In the meantime, an intermediate solution for customers who are already using VIO is being provided. This fix allows customers to declare that a given adapter should only be considered up if it can ping a set of specified targets.

IMPORTANT: For this fix to be effective, the customer *must* select targets that are outside the VIO environment, and not reachable simply by hopping from one VIO server to another.

NOTE: Neither HACMP nor RSCT will be able to detect if this restriction is violated.

NOTE: This applies to IVE as well -- however IVE is set up, you should not use targets which occupy another partition in the same physical box.

Keep the single-point-of-failure rule in mind when selecting targets; do not use targets that are all on the same physical machine, and do not make all your targets adapters from the same HACMP cluster (otherwise any given node in that cluster cannot keep its adapters up when it is the only one powered on). Some good choices for targets are name servers and gateways, or reliable external IP addresses that will respond to a ping. These targets must be maintained through changes in the enterprise network infrastructure.

How to use this fix:
-------------------
Up to 32 different targets can be provided for each interface. If *any* given target is pingable, the adapter will be considered up.

Targets are specified using the existing netmon.cf configuration file (see standard documentation for its location), using this new format:

!REQD <owner> <target>

Parameters:
----------
!REQD : An explicit string; it *must* be at the beginning of the line (no leading spaces).

<owner> : The interface this line is intended to be used by; that is, the code monitoring the adapter specified here will determine its own up/down status by whether it can ping any of the targets specified in these lines. The owner can be specified as a hostname, IP address, or interface name. In the case of a hostname or IP address, it *must* refer to the boot name/IP (no service aliases). In the case of a hostname, it must be resolvable to an IP address or the line will be ignored. The string "!ALL" will specify all adapters.

<target> : The IP address or hostname you want the owner to try to ping. As with normal netmon.cf entries, a hostname target must be resolvable to an IP address in order to be usable.

The "traditional" format of the netmon.cf file has not changed -- one hostname or IP address per line. Any adapters not matching one of the "!REQD" lines will still use the traditional lines as they always have: as extra targets for pinging in addition to known local or defined adapters, with the intent of increasing the inbound byte count of the interface. Any adapter matching one or more "!REQD" lines (as the owner) will ignore any traditional lines.

The order of lines is unimportant; you can mix "!REQD" lines with traditional ones in any way. However, if using a full 32 traditional lines, do not put them all at the very beginning of the file -- otherwise each adapter will read in all the traditional lines (since those lines apply to any adapter by default), stop at 32, and quit reading the file there. The reverse is not a problem, as "!REQD" lines which do not apply to a given adapter will be skipped over and not count toward the 32 maximum.

Comments (lines beginning with "#") are allowed on or between lines and will be ignored.

If more than 32 "!REQD" lines are specified for the same owner, any extra will simply be ignored (just as with traditional lines).
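The selection rules above can be sketched as a small parser. This is one reading of the described behavior, not the RSCT implementation: the function name, the `adapter_names` parameter (a set holding the adapter's interface name, boot IP, and boot hostname), the handling of malformed lines, and the assumption that reading stops once 32 usable lines have been collected are all hypothetical. Hostname resolution is left out.

```python
MAX_ENTRIES = 32  # per-adapter limit stated in the text


def targets_for_adapter(netmon_cf_text, adapter_names):
    """Return (mode, targets) for one adapter.

    mode is "required" if any !REQD line names this adapter (or !ALL),
    otherwise "traditional".
    """
    required, traditional = [], []
    for line in netmon_cf_text.splitlines():
        # Assumed: the reader quits once 32 usable lines are collected.
        if len(required) + len(traditional) >= MAX_ENTRIES:
            break
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        if stripped.startswith("!REQD"):
            fields = line.split()
            # !REQD must start the line and carry exactly owner + target;
            # anything else is assumed to be ignored.
            if not line.startswith("!REQD") or len(fields) != 3 or fields[0] != "!REQD":
                continue
            _, owner, target = fields
            if owner == "!ALL" or owner in adapter_names:
                required.append(target)
            # Non-matching !REQD lines are skipped and do not count.
        else:
            traditional.append(stripped)
    if required:
        return "required", required  # traditional lines are ignored
    return "traditional", traditional


sample = "9.12.4.11\n!REQD en2 100.12.7.9\n9.12.4.13\n!REQD en2 100.12.7.10\n"
print(targets_for_adapter(sample, {"en2"}))
print(targets_for_adapter(sample, {"en0"}))
```

Under these assumptions, a file that opens with 32 traditional lines never reaches a later "!REQD" line, which matches the ordering caution above.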
Some brief examples just to explain the syntax:
----------------------------------------------
(effective netmon.cf files should not be this small)

9.12.4.11
!REQD en2 100.12.7.9
9.12.4.13
!REQD en2 100.12.7.10

-- most adapters will use netmon in the traditional manner, pinging 9.12.4.11 and 9.12.4.13 along with other local adapters or known remote adapters, and will only care about the interface's inbound byte count for results.
-- interface en2 will only be considered up if it can ping either 100.12.7.9 or 100.12.7.10.

!REQD host1.ibm 100.12.7.9
!REQD host1.ibm host4.ibm
!REQD 100.12.7.20 100.12.7.10
!REQD 100.12.7.20 host5.ibm

-- The adapter owning "host1.ibm" will only be considered up if it can ping 100.12.7.9 or whatever host4.ibm resolves to.
-- The adapter owning 100.12.7.20 will only be considered up if it can ping 100.12.7.10 or whatever host5.ibm resolves to.
-- It is possible that 100.12.7.20 is the IP address for "host1.ibm" (we can't tell from this example); if that is true, then all four targets belong to that adapter.

!REQD !ALL 100.12.7.9
!REQD !ALL 110.12.7.9
!REQD !ALL 111.100.1.10
!REQD en1 9.12.11.10

-- All adapters will be considered up only if they can ping 100.12.7.9, 110.12.7.9, or 111.100.1.10.
-- en1 has one additional target: 9.12.11.10
-- (In this example having any traditional lines would be pointless, since all of the adapters have been defined to use the new method.)

Important notes:
---------------
This APAR will only take effect *if* valid updates in this new format are made to the netmon.cf file. As long as you only use the netmon.cf file in the traditional manner (or do not use it at all), then you can safely apply this APAR without changing your cluster's behaviour in any way.

Similarly, any interfaces which are not included as an "owner" of one of the "!REQD" netmon.cf lines will continue to behave in the old manner, even if you are using this new function for other interfaces.
This fix does *not* change heartbeating behavior itself in any way; it only changes how the decision is made as to whether a local adapter is up or down. The new logic is therefore used at startup (before heartbeating rings are formed), during heartbeat failure (when contact with a neighbor is initially lost), or during periods when heartbeating is not possible (such as when a node is the only one up in the cluster).

WARNING: It is *not* recommended that any customers use this new function unless they absolutely have to because of their VIO environment.

Why: Invoking this fix changes the definition of a "good" adapter from:
  * Am I able to receive *any* network traffic?
to:
  * Can I successfully ping certain addresses? (regardless of how much traffic I can see)

This fact alone makes it inherently more likely for an adapter to be falsely considered down, since the second definition is more restrictive. For this same reason, customers who find they must take advantage of this new functionality are encouraged to be as generous as possible with the number of targets they provide for each interface (up to the limit of 32).
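The two definitions of a "good" adapter can be compared in a short sketch. It is an illustration under stated assumptions, not the shipped logic: `adapter_is_up` and its parameters are hypothetical names, and `can_ping` is a caller-supplied function standing in for a real ping so the example runs without network access.

```python
from typing import Callable, Iterable

MAX_TARGETS = 32  # per-interface limit stated in the text


def adapter_is_up(required_targets: Iterable[str],
                  saw_inbound_traffic: bool,
                  can_ping: Callable[[str], bool]) -> bool:
    """Decide the local adapter state.

    With !REQD targets: up only if *any* target answers a ping,
    regardless of how much inbound traffic was seen.
    Without them: up if any inbound traffic was seen (old behavior).
    """
    targets = list(required_targets)[:MAX_TARGETS]  # extras are ignored
    if targets:
        return any(can_ping(t) for t in targets)
    return saw_inbound_traffic


# Hypothetical reachability table used in place of a real ping.
reachable = {"100.12.7.10"}

# One of two targets answers: the adapter is considered up.
print(adapter_is_up(["100.12.7.9", "100.12.7.10"], False, reachable.__contains__))

# Traffic is visible, but the only target is silent: considered down.
print(adapter_is_up(["100.12.7.9"], True, reachable.__contains__))
```

The second call shows why the new definition is more restrictive, and why a generous target list reduces the chance of a false down.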
Temporary fix
Comments
APAR Information
APAR number
IZ01331
Reported component name
RSCT/RMC FOR CS
Reported component ID
5765F07AP
Reported release
247
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Submitted date
2007-07-11
Closed date
2007-07-18
Last modified date
2009-09-16
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
RSCT/RMC FOR CS
Fixed component ID
5765F07AP
Applicable component levels
R247 PSY U813884
UP07/09/26 I 1000
PTF to Fileset Mapping
U813528 rsct.basic.rte 2.4.8.0
U813884 rsct.basic.rte 2.4.7.5
U813548 rsct.basic.rte 2.4.7.4