IBM Support

PM98008: Fail fast clients have delayed recovery after failover

APAR status

  • Closed as program error.

Error description

  • Fail fast clients have delayed recovery after failover
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:  WebSphere eXtreme Scale users who are       *
    *                  running with fast fail clients that have    *
    *                  catalog servers and container servers       *
    *                  failing at the same time.                   *
    ****************************************************************
    * PROBLEM DESCRIPTION: During a failover where one or more     *
    *                      catalog servers fail at the same time   *
    *                      as one or more container servers,       *
    *                      fast fail clients take 30 seconds or    *
    *                      more to recover.                        *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    A fast fail client has a very short requestRetryTimeout value,
    or none at all, defined in the client properties or on the
    session. Therefore, the client does not retry the same request
    after a failure to route to the server. When catalog servers
    and container servers fail at the same time, the client-side
    code waits for a new list of catalog server endpoints before it
    requests new routing information from the catalog server. This
    wait normally prevents the client from calling a failed catalog
    server, a call that can result in longer recovery times. In
    this scenario, however, recovery is delayed even if there is a
    valid catalog server to contact.
    The WebSphere eXtreme Scale client logs show route table
    updates after receiving catalog server bootstrap updates. For
    example:
    [8/21/13 9:52:47:462 EDT] 00000061 LocationServi I
    CWOBJ2521I: The catalog server bootstrap addresses changed
    from host1:4809,host2:4809 to host1:4809.
    [8/21/13 9:52:47:476 EDT] 00000038 ClusterStore  I
    CWOBJ1132I: An updated routing entry for domain:grid:epoch
    domain3:GridC:1377093141770 was obtained from the catalog
    server.
    [8/21/13 9:52:48:038 EDT] 00000061 LocationServi I
    CWOBJ2521I: The catalog server bootstrap addresses changed
    from host5:3809,host6:3809 to host5:3809.
    [8/21/13 9:52:48:103 EDT] 00000038 ClusterStore  I
    CWOBJ1132I: An updated routing entry for domain:grid:epoch
    domain2:GridB:1377093142734 was obtained from the catalog
    server.
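
When checking client logs for this pattern, it can help to extract which catalog server endpoints dropped out of the bootstrap list in each CWOBJ2521I message. The following is a minimal sketch, not part of the product: the class name BootstrapChangeParser and the method removedEndpoints are hypothetical, and the regular expression assumes the message wording shown in the sample above, with wrapped log lines already joined into one line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading client logs; not a WebSphere eXtreme Scale API.
final class BootstrapChangeParser {
    // Assumes the CWOBJ2521I wording from the sample log, on a single joined line.
    private static final Pattern CHANGE = Pattern.compile(
        "CWOBJ2521I: The catalog server bootstrap addresses changed from (\\S+) to (\\S+?)\\.?$");

    /** Returns the endpoints that are in the old bootstrap list but not in the new one. */
    static List<String> removedEndpoints(String logLine) {
        Matcher m = CHANGE.matcher(logLine.trim());
        if (!m.find()) {
            return new ArrayList<>(); // not a bootstrap change message
        }
        List<String> removed = new ArrayList<>(Arrays.asList(m.group(1).split(",")));
        removed.removeAll(Arrays.asList(m.group(2).split(",")));
        return removed;
    }

    public static void main(String[] args) {
        String line = "[8/21/13 9:52:47:462 EDT] 00000061 LocationServi I   "
            + "CWOBJ2521I: The catalog server bootstrap addresses changed "
            + "from host1:4809,host2:4809 to host1:4809.";
        System.out.println(removedEndpoints(line)); // prints [host2:4809]
    }
}
```

Comparing the timestamp of each such message with the following CWOBJ1132I routing update gives a rough view of how long the client waited for the new endpoint list before refreshing its routes.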
    

Problem conclusion

  • Apply the interim fix (ifix) for better fast fail client recovery after a failure.
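
To make the fast fail behavior from the problem summary concrete, the following is a minimal sketch of a generic retry loop. It is a model only and does not reproduce the WebSphere eXtreme Scale client code: the class RetrySketch, the method callWithRetry, and the parameter retryTimeoutMillis are hypothetical names, and a timeout of 0 is used here to stand in for the "very short or no requestRetryTimeout" case.

```java
import java.util.function.Supplier;

// Hypothetical model of retry-until-timeout; not the product implementation.
final class RetrySketch {
    /**
     * Calls the operation until it succeeds or retryTimeoutMillis has elapsed.
     * With a timeout of 0 the operation is attempted exactly once, which is
     * how this sketch models a fast fail client.
     */
    static <T> T callWithRetry(Supplier<T> operation, long retryTimeoutMillis) {
        long deadline = System.nanoTime() + retryTimeoutMillis * 1_000_000L;
        while (true) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                // Give up once the retry window has closed.
                if (System.nanoTime() - deadline >= 0) {
                    throw e;
                }
                try {
                    Thread.sleep(10); // short pause before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

In this model, a caller with a timeout of 0 sees the first routing failure immediately and depends entirely on how quickly fresh routing information arrives, which is why the delay described in this APAR is most visible to fast fail clients.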
    

Temporary fix

Comments

APAR Information

  • APAR number

    PM98008

  • Reported component name

    WS EXTREME SCAL

  • Reported component ID

    5724X6702

  • Reported release

    850

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2013-09-27

  • Closed date

    2013-10-14

  • Last modified date

    2013-10-14

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WS EXTREME SCAL

  • Fixed component ID

    5724X6702

Applicable component levels

  • R860 PSY

       UP

Document Information

Modified date:
14 October 2013