OA40438: EIF: ERROR CODE 67 IS NOT HANDLED WHILE SENDING EVENTS

A fix is available

APAR status

  • Closed as program error.

Error description

  • 1) When the error code is 67 (i.e. E_IPC_BROKEN), the EIF sender
    should, at the very least, try to switch over or resend the event.
    
    
    2) The head pointer keeps getting moved even though the event
    has not been sent.
    
    
    I think both issues can be addressed if we add checks for
    E_IPC_BROKEN.
    
    With several events written to the cache, it looks like this:
    
    
    +4F97A3BD.00C1 maxsz:      65536
    +4F97A3BD.00C1 head :         54
    +4F97A3BD.00C1 tail :        746
    
    I see for the first event;
    
    (4F97A3BD.00F6-2:sockeif.c,814,"_imp_eipc_recv_data")
    <0x41EEBB50,0x0>
    recv on fd 5, sock_error 0xFFFFFFFF, error 67
    (4F97A3BD.00F7-2:eipc.c,564,"get_peer_response_timed") peer
    response
    PEER_RESPONSE_UNKNOWN
    
    +4F97A3BD.010B maxsz:      65536
    +4F97A3BD.010B head :        227
    +4F97A3BD.010B tail :        746
    
    The head has moved up by the number of bytes in the first event,
    which was 173 (from 54 to 227), even though an error was
    returned. The error seems to have been ignored and processing
    carried on.
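The expected cache behaviour can be illustrated with a small model. This is a minimal sketch only: `EventCache` and `commit` are hypothetical names, not the actual EIF cache code, and the only values taken from the log are head 54, tail 746, the event size of 173 bytes and error 67.

```python
from dataclasses import dataclass

E_IPC_BROKEN = 67  # error code named in this APAR


@dataclass
class EventCache:
    """Toy model of the cache file: head marks the next unsent byte."""
    head: int
    tail: int

    def commit(self, event_size: int, error: int) -> bool:
        """Advance head only when the send reported no error."""
        if error != 0:
            return False  # event stays in the cache for a later resend
        self.head += event_size
        return True


cache = EventCache(head=54, tail=746)
sent = cache.commit(event_size=173, error=E_IPC_BROKEN)
print(sent, cache.head)
```

In this model the head stays at 54 after the failed send and would only move to 227 once the same event is sent without error, which is the opposite of what the trace shows.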
    
    
    Then, for the second event, the same:
    
    (4F97A3BD.012B-2:sockeif.c,814,"_imp_eipc_recv_data")
    <0x41EEBB50,0x0>
    recv on fd 5, sock_error 0xFFFFFFFF, error 67
    (4F97A3BD.012C-2:eipc.c,564,"get_peer_response_timed") peer
    response
    PEER_RESPONSE_UNKNOWN
    
    I did not expect this: in general the first event gets lost and
    a connection is made so that the second can be sent. Again, the
    error appears not to be caught, and the cache moves the head up
    by one event's worth:
    
    +4F97A3BD.013A maxsz:      65536
    +4F97A3BD.013A head :        400
    +4F97A3BD.013A tail :        746
    
    Actually, what I didn't expect was this for the third event:
    
    (4F97A3BD.015A-2:sockeif.c,814,"_imp_eipc_recv_data")
    <0x41EEBB50,0x0>
    recv on fd 5, sock_error 0xFFFFFFFF, error 67
    (4F97A3BD.015B-2:eipc.c,564,"get_peer_response_timed") peer
    response
    PEER_RESPONSE_UNKNOWN
    
    But this time it's rapidly followed by;
    
    (4F97A3BD.015D-2:sockeif.c,338,"_imp_do_send") send 40 bytes
    (4F97A3BD.015E-2:socket_imp.c,1741,"send_to") 174 bytes on send
    rc=-1
    (4F97A3BD.015F-2:socket_imp.c,1639,"socket_put_event_conn")
    Connection
    Oriented send failed will wait 120 seconds before resend.
    
    Which I didn't see for the first two, it then does a count down;
    
    (4F97A3C4.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 110 seconds
    (4F97A3CB.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 99 seconds
    (4F97A3D2.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 89 seconds
    (4F97A3D9.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 78 seconds
    (4F97A3E0.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 68 seconds
    (4F97A3E7.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 57 seconds
    (4F97A3EE.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 47 seconds
    (4F97A3F5.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 36 seconds
    (4F97A3FC.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 26 seconds
    (4F97A403.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 15 seconds
    (4F97A40A.0000-2:socket_imp.c,1658,"socket_put_event_conn")
    resend
    approximate time remaining: 5 seconds
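The duration of this wait can be read off the trace stamps. The following is a minimal sketch, assuming the field before the dot in each stamp is a hexadecimal count of seconds since the Unix epoch (an assumption about the trace format, not something this APAR states). `trace_seconds` is a hypothetical helper.

```python
from datetime import datetime, timezone


def trace_seconds(stamp: str) -> int:
    """Return the seconds part of a stamp such as '4F97A3BD.015F'.

    Assumes the part before the dot is hexadecimal epoch seconds.
    """
    return int(stamp.split(".")[0], 16)


wait_announced = trace_seconds("4F97A3BD.015F")  # "will wait 120 seconds"
last_countdown = trace_seconds("4F97A40A.0000")  # "5 seconds" remaining
elapsed = last_countdown - wait_announced

print(datetime.fromtimestamp(wait_announced, tz=timezone.utc).isoformat())
print(elapsed)
```

Under that assumption, `elapsed` is the wall-clock time between the message announcing the wait and the last countdown line, and it can be compared with the remaining time the sender reports.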
    
    Then it looks like it closes the connection:
    
    (4F97A40D.0004-2:sockeif.c,255,"_imp_eipc_shutdown")
    _imp_eipc_shutdown
    fd 5  option 2 rc=-1
    (4F97A40D.0005-2:sockeif.c,259,"_imp_eipc_shutdown")
    _imp_eipc_shutdown
    shutdown - [sys errno 107] fd 5  option 2 rc=-1
    
    (Surely rc=-1 means the connection failed to close?)
    
    Then the connection is created;
    
    (4F97A40D.001E-2:socket_imp.c,1920,"_create_eipc_client")
    Connected to
    [legacy_01] fujiobj <fujiobj.test.com@10.22.58.99>:9998 1
    
    The third event gets sent;
    
    +4F97A40D.0034 maxsz:      65536
    +4F97A40D.0034 head :        573
    +4F97A40D.0034 tail :        919
    
    More events are added, but the first AND second return the
    failure error 67, yet only the third event then does anything
    about it: it closes the connection and re-establishes a good
    connection. It is purely my opinion (DS), but I think it should
    do for the first event what it did for the third.
    
    
    
    This is a really good log. It seems to do the right thing for
    the third event: when the connection is detected as bad, it does
    not remove the current event from the cache, but makes the
    connection and tries again. For events one and two it seems to
    ignore that the connection was bad, moves the cache marker up,
    and effectively loses/deletes the event.
    
    The customer's environment is ITM 6.2.2 FP03.
    
    I am curious about the 120 second delay to re-establish a
    connection and unsure what the original intention might have
    been. Surely it should:
    a) detect that the connection has gone when dealing with the
    first event.
    b) remake the connection without a 120 second delay.
    
    All files are on ecurep under the PMR.
    RHEL 5.5 64-bit is given as the customer environment in the PMR.
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED: All TEMS users.                              *
    ****************************************************************
    * PROBLEM DESCRIPTION: EIF: ERROR CODE 67 IS NOT HANDLED WHILE *
    *                      SENDING EVENTS.                         *
    ****************************************************************
    * RECOMMENDATION: Apply the PTF.                               *
    ****************************************************************
    If error code 67 (connection is broken) is seen while
    sending events to the Event Integration Facility (EIF)
    receiver, the EIF sender ignores it and carries on with
    subsequent events, even though the events are not being
    received by the EIF receiver.
    

Problem conclusion

  • The code has been modified to check for this error code and
    take action accordingly.  The action would be either to try
    to connect to a failover EIF receiver, if configured, or to
    keep the event in the cache file and mark it as unsent.
    This would ensure that for this error condition, events are
    not lost.  This fix is not applicable to 32-bit UNIX/Linux
    or Windows platforms.
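The decision described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped fix: `deliver` and the receiver callables are hypothetical, and each callable is assumed to return 0 on success or an error code such as 67 on failure.

```python
from typing import Callable, List

E_IPC_BROKEN = 67  # error code named in this APAR


def deliver(event: bytes, receivers: List[Callable[[bytes], int]]) -> str:
    """Try the primary receiver, then any failover receivers, in order.

    Returns "sent" if one receiver accepts the event, otherwise "unsent",
    meaning the event should stay in the cache file.
    """
    for send in receivers:
        if send(event) == 0:
            return "sent"
        # Nonzero code (for example E_IPC_BROKEN): fall through to the
        # next configured receiver, if there is one.
    return "unsent"


def broken(event: bytes) -> int:
    return E_IPC_BROKEN  # simulates a broken connection


def healthy(event: bytes) -> int:
    return 0  # simulates a successful send


with_failover = deliver(b"event-1", [broken, healthy])
without_failover = deliver(b"event-1", [broken])
print(with_failover, without_failover)
```

With a failover receiver configured the event is delivered there. Without one, the result is "unsent", so the caller would leave the event in the cache rather than advance past it.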
    

Temporary fix

Comments

APAR Information

  • APAR number

    OA40438

  • Reported component name

    MGMT SERVER DS

  • Reported component ID

    5608A2800

  • Reported release

    623

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2012-09-21

  • Closed date

    2012-10-03

  • Last modified date

    2012-11-01

  • APAR is sysrouted FROM one or more of the following:

    IV21752

  • APAR is sysrouted TO one or more of the following:

Modules/Macros

  • KEFLIB
    

Fix information

  • Fixed component name

    MGMT SERVER DS

  • Fixed component ID

    5608A2800

Applicable component levels

  • R623 PSY UA66835

       UP12/10/15 P F210

Fix is available

  • Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.



Document information


More support for:

Tivoli Management Server for Distributed Systems on z/OS

Software version:

623

Reference #:

OA40438

Modified date:

2012-11-01
