IBM Support


A fix is available


APAR status

  • Closed as program error.

Error description

  • New netmon functionality to support HACMP on VIO

Local fix

Problem summary

  • HACMP customers using VIO within their clusters have been
    experiencing problems with specific scenarios where an
    entire CEC is unplugged from the network, but the HACMP
    node within does not detect a local adapter down event,
    because traffic being passed between the VIO clients
    looks like normal external traffic from the perspective
    of the LPAR's OS.
    Note: The remainder of this problem description was written
    based on review of a VIO customer's situation, and IVE was
    added later as another environment where this fix applies.
    Therefore, it may be that not all of the supporting explanation
    for VIO applies directly to IVE, but the final solution is
    still valid.
    There is already a restriction against two HACMP nodes in
    the same cluster using the same VIO server, because this
    would mean heartbeats can be passed between the nodes
    via the server even when no real network connectivity
    exists.  The problem addressed by this APAR is not the
    same as that issue, although there are similarities.
    In HACMP, heartbeating is used as a reliable means of
    monitoring an adapter's state over a long period of time.
    When heartbeating is broken, a decision has to be made as
    to whether the local adapter has gone bad, or the neighbor
    (or something between them) has a problem.
    The local node only needs to take action if the local
    adapter is the problem; if its own adapter is good,
    then we assume it is still reachable by other clients
    regardless of the neighbor's state (the neighbor is
    responsible for acting on its own local adapter failures).
    This decision (local vs remote bad) is made based on
    whether any network traffic can be seen on the local
    adapter, using the inbound byte count of the interface.
    Where VIO is involved, this test becomes unreliable since
    there is no way to distinguish whether inbound traffic
    came in from the VIO server's connection to the outside
    world, or just from a neighboring VIO client.
    (It is a design point of VIO that its virtual adapters
    be indistinguishable to the LPAR from real adapters.)
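
The local-versus-remote decision described above can be sketched roughly as follows. This is only an illustration of the idea, not HACMP or RSCT code: the function name `classify_heartbeat_loss`, the return strings, and the notion of sampling the counter twice around a probing interval are all assumptions made for the example.

```python
def classify_heartbeat_loss(bytes_in_before: int, bytes_in_after: int) -> str:
    """Classify a heartbeat failure from two inbound byte count samples.

    Hypothetical helper: the samples are assumed to be taken before and
    after an interval in which other hosts were prompted to send traffic.
    """
    if bytes_in_after > bytes_in_before:
        # Something arrived, so the local adapter is assumed to be good and
        # the fault is attributed to the neighbor or the path in between.
        # Under VIO the traffic may have come from a neighboring client of
        # the same VIO server, which is why the test is unreliable there.
        return "local_up"
    # No inbound traffic at all: treat the local adapter as failed.
    return "local_down"
```

With VIO, a CEC that is unplugged from the network can still show a rising inbound byte count, so a check of this shape would report `local_up` even though no external connectivity exists.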

Problem conclusion

  • A long term solution will require cooperative design work
    between both VIO and HACMP/RSCT so that customers can take
    advantage of VIO's benefits, but HACMP is still aware
    enough of what is happening "below the surface" to react
    appropriately when it needs to.
    In the meantime, an intermediate solution for customers
    who are already using VIO is being provided.
    This fix allows customers to declare that a given adapter
    should only be considered up if it can ping a set of
    specified targets.
    IMPORTANT:  For this fix to be effective, the customer
    ---------   *must* select targets that are outside the
               VIO environment, and not reachable simply
               by hopping from one VIO server to another.
               NOTE:  Neither HACMP nor RSCT will be able
               to detect if this restriction is violated.
               NOTE:  This applies to IVE as well -- however
               IVE is set up, you should not use targets
               which occupy another partition in the same
               physical box.
               Keep the single-point-of-failure rule in mind
               when selecting targets; do not use targets that
               are all on the same physical machine, and do
               not make all your targets adapters from the
               same HACMP cluster (otherwise any given node
               in that cluster cannot keep its adapters up
               when it is the only one powered on).
               Some good choices for targets are name servers
               and gateways, or reliable external IP
               addresses that will respond to a ping.  These
               targets must be maintained through changes in
               the enterprise network infrastructure.
    How to use this fix:
    Up to 32 different targets can be provided for each
    interface.  If *any* given target is pingable, the adapter
    will be considered up.  Targets are specified using the
    existing configuration file (see standard
    documentation for its location), using this new format:
      !REQD <owner> <target>
      !REQD : An explicit string; it *must* be at the
              beginning of the line (no leading spaces).
      <owner> : The interface this line is intended to be
                used by; that is, the code monitoring the
                adapter specified here will determine its
                own up/down status by whether it can ping
                any of the targets (below) specified in
                these lines.
                The owner can be specified as a hostname, IP
                address, or interface name.  In the case of
                hostname or IP address, it *must* refer to
                the boot name/IP (no service aliases).
                In the case of a hostname, it must be
                resolvable to an IP address or the line will
                be ignored.
                The string "!ALL" will specify all adapters.
      <target> : The IP address or hostname you want the
                 owner to try to ping.
                 As with normal entries, a hostname
                 target must be resolvable to an IP address
                 in order to be usable.
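
As a rough illustration of the line format just described, the sketch below splits one line of the file into an owner and a target. It is an assumption-laden example, not the actual netmon parser: the names `Entry` and `parse_line` are hypothetical, the address used in the usage note is a documentation placeholder, hostname resolution is left out, and silently skipping a malformed or indented "!REQD" line is a guess at reasonable behavior.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    owner: Optional[str]  # None for a traditional (owner-less) line
    target: str           # IP address or hostname to ping


def parse_line(line: str) -> Optional[Entry]:
    """Parse one line; return None for lines this sketch would skip."""
    line = line.rstrip("\n")
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None  # blank line or comment
    if stripped.startswith("!REQD"):
        # "!REQD" must be at the very beginning of the line.
        if not line.startswith("!REQD"):
            return None  # assumption: an indented "!REQD" line is ignored
        fields = line.split()
        # Expect exactly three fields: !REQD <owner> <target>
        if len(fields) != 3 or fields[0] != "!REQD":
            return None
        return Entry(owner=fields[1], target=fields[2])
    # Traditional format: one hostname or IP address per line.
    return Entry(owner=None, target=stripped.split()[0])
```

For example, `parse_line("!REQD en2 192.0.2.1")` yields an entry owned by `en2`, while the same text with a leading space is skipped.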
    The "traditional" format of the file has not
    changed -- one hostname or IP address per line.
     (Any adapters not matching one of the "!REQD" lines
      will still use the traditional lines as they always
      have: as extra targets for pinging in addition to
      known local or defined adapters, with the intent of
      increasing the inbound byte count of the interface.)
    Any adapter matching one or more "!REQD" lines (as the
    owner) will ignore any traditional lines.
    Order from one line to the other is unimportant; you can
    mix "!REQD" lines with traditional ones in any way.
    However, if using a full 32 traditional lines, do not put
    them all at the very beginning of the file -- otherwise
    each adapter will read in all the traditional lines
    (since those lines apply to any adapter by default), stop
    at 32 and quit reading the file there.  The reverse case
    is not a problem, as "!REQD" lines which do not
    apply to a given adapter will be skipped over and not
    count toward the 32 maximum.
    Comments (lines beginning with "#") are allowed on or
    between lines and will be ignored.
    If more than 32 "!REQD" lines are specified for the same
    owner, any extra will simply be ignored (just as with
    traditional lines).
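
The selection rules above (the 32-line limit, "!REQD" lines overriding traditional ones, and the ordering pitfall) can be modeled with a short sketch. This is one reading of the description, not the real implementation: `effective_targets`, the `(owner, target)` tuple layout, and identifying an adapter by a set of names are all hypothetical.

```python
MAX_TARGETS = 32  # per-interface limit stated in the text


def effective_targets(entries, adapter_ids):
    """Return (mode, targets) for one adapter.

    entries     -- (owner, target) pairs in file order; owner is None
                   for a traditional line
    adapter_ids -- names identifying this adapter (interface name, boot
                   IP, boot hostname); a hypothetical representation
    """
    read = []
    for owner, target in entries:
        applies = owner is None or owner == "!ALL" or owner in adapter_ids
        if not applies:
            continue  # "!REQD" lines for other owners do not count
        read.append((owner, target))
        if len(read) == MAX_TARGETS:
            break  # reading stops once 32 applicable lines are taken
    required = [t for o, t in read if o is not None]
    if required:
        # Any matching "!REQD" line makes the adapter ignore traditional lines.
        return "required", required
    return "traditional", [t for _, t in read]
```

In this model, 32 traditional lines placed ahead of an adapter's "!REQD" line fill the limit first, so the adapter never reaches its "!REQD" line and stays in traditional mode, which is the ordering problem the text warns about.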
    Some brief examples just to explain the syntax:
    (effective files should not be this small)
      !REQD en2
      !REQD en2
      -- most adapters will use netmon in the traditional
         manner, pinging and along with
         other local adapters or known remote adapters, and
         will only care about the interface's inbound byte
         count for results.
      -- interface en2 will only be considered up if it
         can ping either or
      -- The adapter owning "" will only be
         considered up if it can ping or
         whatever resolves to.
      -- The adapter owning will only be
         considered up if it can ping or
         whatever resolves to.
      -- It is possible that is the IP address
         for "" (we can't tell from this example);
         if that is true, then all four targets belong to
         that adapter.
      !REQD !ALL
      !REQD !ALL
      !REQD !ALL
      !REQD en1
      -- All adapters will be considered up only if they can
         ping,, or
      -- en1 has one additional target:
      -- (In this example having any traditional lines would
          be pointless, since all of the adapters have been
          defined to use the new method.)
    Important notes:
    This APAR will only take effect *if* valid updates in
    this new format are made to the file.  As long
    as you only use the file in the traditional
    manner (or do not use it at all), then you can safely
    apply this APAR without changing your cluster's behavior
    in any way.
    Similarly, any interfaces which are not included as an
    "owner" of one of the "!REQD" lines will
    continue to behave in the old manner, even if you are
    using this new function for other interfaces.
    This fix does *not* change heartbeating behavior itself in
    any way; it only changes how the decision is made as to
    whether a local adapter is up or down.  So this new logic
    will be used upon startup (before heartbeating rings are
    formed), during heartbeat failure (when contact with a
    neighbor is initially lost), or during periods when
    heartbeating is not possible (such as when a node is the
    only one up in the cluster).
    WARNING:  It is *not* recommended that any customers use
    -------   this new function unless they absolutely have
             to because of their VIO environment.
      Why: Invoking this fix changes the definition of a
      ---  "good" adapter from:
               * Am I able to receive *any* network traffic?
           to:
               * Can I successfully ping certain addresses?
                 (regardless of how much traffic I can see)
           This fact alone makes it inherently more likely
           for an adapter to be falsely considered down,
           since the second definition is more restrictive.
    For this same reason, customers who find they must take
    advantage of this new functionality are encouraged to be
    as generous as possible with the number of targets they
    provide for each interface (up to the limit).
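
The stricter definition of a "good" adapter can be expressed as a small sketch. It assumes a caller-supplied `ping` function that returns True when a target answers; both `adapter_is_up` and that callable are hypothetical names for illustration and do not reflect the actual netmon code.

```python
from typing import Callable, Iterable


def adapter_is_up(targets: Iterable[str], ping: Callable[[str], bool]) -> bool:
    """Return True if at least one required target answers a ping.

    Inbound traffic volume plays no part in this check; only the
    listed targets matter.
    """
    return any(ping(target) for target in targets)
```

Because a single reachable target is enough, every additional well-chosen target lowers the chance that the adapter is falsely declared down, which is the reason for being generous with targets.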

Temporary fix


Applicable component levels

  • R247 PSY U813884

       UP07/09/26 I 1000

PTF to Fileset Mapping

Document information

More support for: AIX family

Software version: 247

Operating system(s): AIX

Reference #: IZ01331

Modified date: 16 September 2009