TroubleShooting: Object Request Broker (ORB) problems
Troubleshooting guidance for Object Request Broker (ORB) problems in IBM WebSphere Application Server. Working through the common issues described here before contacting IBM Support can save you time.
Resolving The Problem
This topic discusses common connectivity issues and errors that can occur when using ORB.
DNS switching is a method of load-balancing to a pool of backend servers. Clients connect to the backend servers using a single virtual hostname which the DNS server resolves to one of the actual IPs of the backend servers. For each subsequent hostname resolution call, the DNS selects a new IP in a round-robin fashion, which ultimately gives the effect of spraying the clients across all the backend servers. It also allows for removing servers from the pool for maintenance or other problems by removing a server's IP from the virtual hostname DNS configuration.
In order to utilize DNS switching, the following APARs and configuration properties need to be in place:
· WAS fix PI24231: IX90122 FAILS TO WORK PROPERLY WHEN SECURITY IS NOT ENABLED
The following SDK fixes are also highly recommended:
· IX90131: ORB THREADS MAY HANG WHILE CONNECTING TO AN INVALID HOSTNAME AFTER APPLYING APAR IX90122
· IX90132: NoSuchMethodError may be seen during ORB exception logging or after applying APAR IX90114
· IX90152: JAVA.NET.UNKNOWNHOSTEXCEPTION THROWN FROM COM.IBM.RMI.IIOP.ORB.LOOKUPLOCALOBJECT AFTER APPLYING IX90122
Custom JVM properties
· Set the network address cache time to live to 0 in the java security file (<javaHome>/lib/security/java.security)
Understanding, Tuning, and Testing the InetAddress Class and Cache
· Disable JNDI name caching and host name normalization on InitialContext lookup using one of the options below:
o Set the following environment properties when creating an InitialContext in your application code:
o Set the following JNDI properties as a JVM argument or in the jndi.properties file in WAS:
For more detail on JNDI caching:
Developing applications that use JNDI
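The two settings above can be sketched in Java as follows. This is a minimal illustration, not IBM-supplied code. `networkaddress.cache.ttl` is the standard Java security property for the InetAddress cache. The JNDI keys `com.ibm.websphere.naming.jndicacheobject` (value `none`) and `com.ibm.websphere.naming.hostname.normalizer` (value `...none...`) are assumed here to be the WebSphere naming properties this step refers to, so verify them against the JNDI documentation for your release. The class name, method names, and provider URL are hypothetical.

```java
import java.security.Security;
import java.util.Hashtable;

final class DnsSwitchingSettings {

    /** Programmatic equivalent of networkaddress.cache.ttl=0 in java.security. */
    static void disableAddressCache() {
        // Set this early: the JVM reads the value once, when its
        // address cache policy is first initialized.
        Security.setProperty("networkaddress.cache.ttl", "0");
    }

    /** Builds the environment to pass to new InitialContext(env). */
    static Hashtable<String, Object> initialContextEnvironment(String providerUrl) {
        Hashtable<String, Object> env = new Hashtable<>();
        env.put("java.naming.provider.url", providerUrl);
        // Assumed WebSphere property: turn off the JNDI name cache.
        env.put("com.ibm.websphere.naming.jndicacheobject", "none");
        // Assumed WebSphere property: turn off host name normalization.
        env.put("com.ibm.websphere.naming.hostname.normalizer", "...none...");
        return env;
    }
}
```

In a WebSphere client, the returned table would be passed to `new InitialContext(env)`. The same keys could instead be supplied as `-D` JVM arguments or in jndi.properties, as the second option describes.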
The following are common topics related to F5 Load Balancer functionality and configuration within a WebSphere environment.
· Load Balancing
· Inactive Connection Handling
· Active Health Monitoring
A typical EJB client will first use a providerURL (or corbaloc reference) to contact one of the WebSphere Name Servers. This corbaloc/corbaname reference contains the F5 virtual IP address and therefore goes through the F5, which load balances it to one of the actual WAS servers. However, all subsequent client requests (naming lookups, EJB calls) bypass the F5 and go directly to the WAS servers. This is due to the nature of the IIOP protocol: naming lookups and locateRequests return direct references to the servers rather than references that point to the F5 virtual IP.
In order to achieve a load-balancing effect on separate EJB calls, the client must obtain a new InitialContext and do a new naming lookup (which uses the corbaloc F5 virtual address) before each EJB call.
The Load balancer support for IBM FileNet P8 doc also reiterates this behavior pattern.
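The pattern described above, a new InitialContext and a new lookup before each EJB call, might look like the following sketch. It is an illustration under stated assumptions, not a WebSphere API: `FreshLookup` and its `ContextFactory` interface are hypothetical names, and the factory indirection exists only so the code can be exercised without a WebSphere runtime.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

final class FreshLookup {

    /** Hypothetical seam: how a Context is created for each call. */
    interface ContextFactory {
        Context create(Hashtable<String, Object> env) throws NamingException;
    }

    /** Default behavior: a brand-new InitialContext every time. */
    static final ContextFactory NEW_INITIAL_CONTEXT = env -> new InitialContext(env);

    /**
     * Creates a new context, performs the lookup, and closes the context,
     * so the next call starts again from the provider URL (the F5 virtual address).
     */
    static Object lookupFresh(ContextFactory factory,
                              Hashtable<String, Object> env,
                              String jndiName) throws NamingException {
        Context ctx = factory.create(env);
        try {
            return ctx.lookup(jndiName);
        } finally {
            ctx.close();
        }
    }
}
```

A client would call `FreshLookup.lookupFresh(FreshLookup.NEW_INITIAL_CONTEXT, env, name)` before each EJB invocation. The trade-off is an extra naming round trip per call in exchange for spreading calls across servers.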
Inactive Connection Handling
The F5 has a feature that allows it to monitor and remove "inactive" connections. Because the ORB keeps connections open and reuses them, a COMM_FAILURE can result when the ORB attempts to use a connection that the F5 removed due to inactivity. If this inactivity timeout is set on the F5, ensure the OS TCP keepalive settings are configured so that ORB connections are not removed as "inactive". See the Unexpected ORB Connection Removal section for more details on TCP keepalive settings.
Active Health Monitoring
Certain F5 "health monitoring" functionality can cause the F5 to send non-GIOP "health check" packets to the ORB server ports, causing unwanted exceptions in the ORB.
When active health monitoring is engaged on the F5, it will send a probe (non-GIOP message) to the WebSphere servers to determine if they are up and responding in a timely manner. If the probe is sent to any of the ORB server ports, IOExceptions will occur on the server. The number and frequency of the IOExceptions will depend on how often the probes are sent. Example of server IOExceptions:
|[9/1/15 9:33:14:157 CST] 00000d48 ORBRas 3 com.ibm.rmi.iiop.Connection doReaderWorkOnce:3259 RT=2957:P=988667:O=0:WSTCPTransportConnection[addr=126.96.36.199,port=4433,local=9810] The following exception was logged
java.io.IOException: bytesRead < 0
at com.ibm.rmi.iiop.Connection.readMoreData(Connection.java:1704)
at com.ibm.rmi.iiop.Connection.createInputStream(Connection.java:1504)
at com.ibm.rmi.iiop.Connection.doReaderWorkOnce(Connection.java:3250)
at com.ibm.rmi.transport.ReaderThread.run(ReaderPoolImpl.java:141)
[9/1/15 9:33:14:158 CST] 00000d48 ORBRas > com.ibm.rmi.iiop.Connection purge_calls:2019 RT=2957:P=988667:O=0:WSTCPTransportConnection[addr=188.8.131.52,port=4433,local=9810] Entry
Reason: CONN_ABORT (1) State: ESTABLISHED (2)
These exceptions occur because the ORB is expecting a GIOP message, and the F5 is simply establishing a connection and then sending FIN-ACK TCP packets to close the connection. This is what causes the IOExceptions on the ORB Reader Thread.
In order to prevent these exceptions, the F5 should be configured to send health-monitor probes to something other than the ORB server ports.
Network Address Translation (NAT)
The use of a NAT device to route client EJB/ORB traffic to WebSphere servers in a private network is not supported. Due to the nature of IIOP traffic, servers embed their host/port information within reply messages to the clients. The clients will then have the private IP server information and attempt to make direct connections via the private IP addresses rather than the "public" IP. In this case, an IIOP-aware NAT router is required, which will introspect the outgoing server messages, find any IORs and perform NAT on them. IBM neither supplies nor endorses any such products.
Unexpected ORB Connection Removal
The ORB caches connections, so when a firewall or F5 type device removes one of those ORB connections due to inactivity, the ORB may try to use that removed connection and an exception will result. Usually either a "connection reset" or "socket closed" message occurs:
|[8/9/17 10:12:30:226 EDT] 0000698b ORBRas 3
com.ibm.rmi.iiop.Connection doReaderWorkOnce:3385 -- ConnId: -987274822
al=46854] The following exception was logged
java.net.SocketException: Connection reset
There are two ways to avoid this scenario. Option #1 is recommended.
1. Configure TCP to send keepalive probes sooner.
Each client-side socket is created with the SO_KEEPALIVE option set, so keepalive probes are sent over the connection automatically. Most operating systems wait 2 hours before sending keepalive probes, which is longer than most firewall inactivity timeouts, so the default does not prevent connections from being removed. The solution is to change the OS TCP keepalive setting so that the keepalive interval is smaller than the firewall inactivity timeout, which keeps the ORB connections looking "active" to the firewall. The following details keepalive settings for various operating systems:
|Operating System||Keepalive Parameter|
For more details on setting the Keepalive parameter, see the TCP/IP keepalive settings and related DB2 registry variables page.
2. Bypass the ORB connection cache.
There is no "official" way to do this (i.e., no property that turns the cache on or off). Instead, setting both com.ibm.CORBA.MaxOpenConnections and MinOpenConnections to 1 on the client side keeps only one server connection in the cache at a time, which effectively means the ORB creates a new connection for each request. This is not a foolproof solution and may not have the desired effect 100% of the time.
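The two options above can be summarized in a short Java sketch. It is illustrative only: the class and method names are hypothetical, the timeout values are supplied by the caller, and the full name `com.ibm.CORBA.MinOpenConnections` is an assumption based on the prefix of the MaxOpenConnections property.

```java
import java.util.Properties;
import java.util.TreeSet;

final class ConnectionRemovalOptions {

    /** Option 1: keepalive helps only if probes start before the firewall gives up. */
    static boolean keepAliveBeatsFirewall(long keepAliveIdleSeconds,
                                          long firewallIdleTimeoutSeconds) {
        return keepAliveIdleSeconds < firewallIdleTimeoutSeconds;
    }

    /** Option 2: client-side ORB properties that effectively bypass the cache. */
    static Properties cacheBypassProperties() {
        Properties props = new Properties();
        props.setProperty("com.ibm.CORBA.MaxOpenConnections", "1");
        // Assumed full property name; the text gives only "MinOpenConnections".
        props.setProperty("com.ibm.CORBA.MinOpenConnections", "1");
        return props;
    }

    /** Renders the properties as -D arguments, sorted by name. */
    static String asJvmArguments(Properties props) {
        StringBuilder sb = new StringBuilder();
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D").append(name).append('=').append(props.getProperty(name));
        }
        return sb.toString();
    }
}
```

For example, with the 2-hour default (7200 seconds) and a hypothetical firewall timeout of 3600 seconds, `keepAliveBeatsFirewall(7200, 3600)` returns false, which is the situation the first option is meant to correct.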
When a client attempts to create a new ORB connection to a server, it's possible to see the following "connection refused" exception:
|[3/8/17 22:30:00:269 CET] 00001031 ORBRas 1 com.ibm.ws.orbimpl.transport.WSTCPTransportConnection connect:413 SchedulerWorkManager.Alarm Pool : 1 The following exception was logged
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
Root Cause 1: Firewall is blocking the connection attempt
Solution: Open the server port(s) in the firewall
Root Cause 2: The server is not listening on the port to which the client is connecting
· Check that the server JVM is up
· Check that the client is connecting to the correct server port
Root Cause 3: The server's backlog queue is full and unable to accept any new connections
· Ensure the server is not under a Denial of Service attack.
· Server may not be able to handle the client load, necessitating additional servers.
· Server may be experiencing a resource issue (CPU spike, etc.) such that incoming connections are not being processed in a timely manner.
Root Cause 4: Client is connecting to the server via TCP when SSL is required (server port is 0)
When a server's CSIv2 inbound communication transport settings are set to "SSL-required", its IORs are created with a TCP port of 0 in order to force or "require" incoming SSL connections be made by clients. The client's security configuration needs to be changed to enable outgoing SSL connections.
· Thin/Java clients:
Change settings within the sas.client.props file according to the following documentation:
(See Examples 1-4, which speak to sas.client.props settings)
Example 2: Configuring basic authentication, identity assertion, and client certificates
· WAS server acting as a client:
Adjust the settings in the admin console to allow for SSL connections. From the admin console, go to "Global security > CSIv2 outbound communications". In the "CSIv2 Transport Layer" section, ensure the "Transport" field is set to either "SSL-support" or "SSL-required".
Common Secure Interoperability Version 2 outbound communications settings
When the ORB has to make a new connection to a server, a timeout can occur when trying to connect the client socket to the server, resulting in a stack similar to that below:
|java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
At the most basic level, a connect timeout occurs when a TCP SYN packet is sent from a client to the server and the server does not respond with a SYN-ACK. There can be multiple reasons for the failure, which are discussed below. Once the client has received a SYN-ACK, the connection enters the server's backlog queue and waits for the server to do a socket.accept().
Unlike a "connection refused" exception which is more or less immediate, a connect timeout can take longer to fail depending on the ORB property com.ibm.CORBA.ConnectTimeout (set in seconds). If this property is set to 0, there will be no timeout at the ORB layer, and the underlying OS TCP settings will govern how long a connect attempt will wait before timing out.
For more information on this property, review the link below:
Object Request Broker custom properties
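To tell a refused connection from a connect timeout outside the ORB, a plain socket probe such as the following sketch can help. It is not part of WebSphere, and the class name and return strings are hypothetical. As with com.ibm.CORBA.ConnectTimeout, a timeout of 0 passed to `Socket.connect` means no Java-level timeout, which leaves the OS TCP settings in control. Note that a probe like this is a non-GIOP connection, so running it against an ORB server port can log the same IOExceptions described under Active Health Monitoring.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ConnectProbe {

    /**
     * Attempts a TCP connect and reports how it ended.
     * timeoutMillis = 0 waits as long as the operating system allows.
     */
    static String probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return "connected";
        } catch (SocketTimeoutException e) {
            // The Java-level timeout expired before the connect completed.
            return "timed out";
        } catch (ConnectException e) {
            // The OS reported the failure, e.g. "Connection refused"
            // or an OS-level "Connection timed out".
            return "connect error: " + e.getMessage();
        } catch (IOException e) {
            return "failed: " + e;
        }
    }
}
```

A call such as `ConnectProbe.probe(host, port, 5000)` returns quickly with a connect error when nothing is listening, and only after the timeout when packets are being silently dropped.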
The following are the OS TCP parameters which control connect timeouts:
|OS||Property name||Default value|
Root Cause 1: Server not up and running
Solution: Ensure server has been started and is listening on the server port to which the client is connecting.
Root Cause 2: Network issues (dropped packets, retransmission errors, congestion, etc.)
Solution: This requires analysis of network traffic to identify potential causes.
Root Cause 3: Server overload
Solution: If the client traffic is expected to be heavy at certain times and the server(s) just can't handle the incoming load of new connections, do one or more of the following:
· Increase com.ibm.CORBA.ConnectTimeout on the clients
· Add server JVM(s) to handle the increased load
Root Cause 4: Denial of Service attack
Solution: Identify the malicious client IPs and block them via a firewall
Root Cause 5: Firewall is blocking the traffic on a particular WAS server port
Solution: Ensure all WAS server ports are open on the firewall
SSL Handshake Problems
When a client tries to create a new SSL connection to a server, an SSL handshake must take place before the connection is fully completed. During this SSL handshake, both the client and server will set a timeout on socket reads for the duration of the handshake. This is to prevent threads from hanging on either side while reading the individual handshake messages. If a socket read takes too long, a timeout will occur and the SSL handshake can fail as follows:
Server side exception stack:
|[20:43:32:208 EST] 0000003c ORBRas 1 com.ibm.ws.security.orbssl.WSSSLServerSocketFactoryImpl getPeerCertificateChain LT=1:P=511663:O=0:port=9502 The following exception was logged
java.net.SocketException: Socket Closed
[20:43:32:208 EST] 0000003c ORBRas 3 com.ibm.ws.security.orbssl.WSSSLServerSocketFactoryImpl
LT=1:P=511663:O=0:port=9502 exception occurred when trying to set the
timeout back to 0, most likely the socket is closed since the handshake
took too long and reader thread times it out
Client side exception stack:
|[10/25/11 19:37:51:270 GMT] 00000033 ORBRas E com.ibm.ws.security.orbssl.WSSSLClientSocketFactoryImpl createSSLSocket ORB.thread.pool : 28 JSSL0130E: java.io.IOException: Signals that an I/O exception of some sort has occurred. Reason: Read timed out
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
SSL Handshake Timeout recommendations:
In general the following settings are recommended:
Values are in milliseconds.
Root Cause 1: Increased number of concurrent SSL connection attempts.
This scenario can occur if the overall number of clients has significantly increased or if many of the clients (for whatever reason) are trying to connect at the same time (e.g. after a server recycle). Depending on the scenario, some additional tuning may be necessary on the server to accommodate the additional load.
· For all WAS versions: Increase the com.ibm.ws.orb.SSLHandshakeTimeout on the clients.
This causes the clients to give the server more time to respond to the SSL handshake initiation.
· For WAS 9.0 and later: Set com.ibm.ws.orb.transport.DeferSSLHandshake=true
This is an additional measure to take which frees the ORB SSL ListenerThread from having to perform the SSL handshake and defers the handshake processing to the individual ReaderThreads created for each client.
· In some cases, it may be necessary to add servers to handle increased client load.
Root Cause 2: Server's ORB connection cache is set too low for the number of concurrent connections (TCP and SSL).
The ORB is designed to cache connections, so that the SSL handshake between the client and server is performed only once and the resulting connection can be reused for the lifetime of the client (or server) until recycling. When the Connection Cache reaches its maximum size, the server's ListenerThread attempts to clean up old, unused connections every time a new connection comes in. This can slow down the processing of new connections, so the server falls behind in handling new SSL connections and many of them ultimately fail with timeouts on the client side. ORB trace on the server side reveals both the size of the Connection Cache and how often the cleanup method is invoked.
Sample server trace:
|[7/17/17 13:22:39:705 EDT] 00000009 ORBRas 3 com.ibm.rmi.transport.ListenerThread run:181 LT=1:P=316379:O=0:port=9405 JORB0010I: A ListenerThread accepted the following socket: 9c121b2a[SSL_NULL_WITH_NULL_NULL: Socket[addr=/184.108.40.206,port=61940,localport=9405]].
[7/17/17 13:22:39:726 EDT] 00000009 ORBRas < com.ibm.rmi.transport.ConnectionTableImpl addConnection:321 LT=1:P=316379:O=0:port=9405 Exit
size=344 << ConnectionCache is over the MaxOpenConnections, so cleanUp happens
[7/17/17 13:22:39:726 EDT] 00000009 ORBRas > com.ibm.rmi.transport.ConnectionTableImpl cleanUp:183 LT=1:P=316379:O=0:port=9405 Entry
lowWaterMark=100, highWaterMark=240 << this is MaxOpenConnections value
Solution: Increase com.ibm.CORBA.MaxOpenConnections on the server, so that the Connection Cache cleanup is not occurring after each new incoming connection. The server trace above will give a good indication of how large the Cache becomes during heavy loads. It's also recommended to increase com.ibm.ws.orb.SSLHandshakeTimeout on the clients.
Root Cause 3: The network is slow or experiencing dropped packets.
Solution: Fix the network issues if possible and/or increase the SSLHandshakeTimeout on the client and/or server.
Root Cause 4: CPU utilization.
A steady increase or spike in CPU utilization on the machine can cause the server's ORB ListenerThread to fall behind in handling new SSL connections.
Solution: Using OS performance tools such as vmstat, determine if the CPU is running at full capacity or if a temporary spike occurred. Identify whether the situation is long-term or temporary and take appropriate measures to ensure sufficient CPU headroom.
Port hijacking can cause problems for both WAS clients and servers and can manifest with a variety of symptoms and problem scenarios. A port has been "hijacked" when an initial process binds to a particular port (usually a server port) and then another process starts up later and binds to that same port. This can cause data to be routed to the wrong JVM (usually traffic intended for a WAS server ends up going to an unknown non-WebSphere JVM).
The following are the most common scenarios in which port hijacking occurs. In general, when the client ORB trace shows it successfully sent a request message to the WAS server, but the WAS server trace shows no evidence of having received that message, then port hijacking should be suspected as a possible cause. In almost all cases, the solution is the same and is provided at the end of this section.
Scenario 1: A client ORB request fails due to CORBA.NO_RESPONSE
Here, the AppServer (client) during startup is trying to contact the NodeAgent (server).
The AppServer sends a request to the NodeAgent to get a reference to the LocationService, and this call fails with a CORBA.NO_RESPONSE exception. From the AppServer ORB trace:
|[10/18/17 18:45:35:068 PDT] 00000001 ORBRas 3 com.ibm.rmi.ras.Trace dump:84 P=534912:O=0:CT
Locate Request Message
Date: October 18, 2017 6:45:35 PM PDT
Thread Info: P=534912:O=0:CT
Local Port: 56818 (0xDDF2)
Local IP: 220.127.116.11
Remote Port: 9101 (0x238D) << this is the ORB_LISTENER port of the NodeAgent (NA)
Remote IP: 18.104.22.168
0000: 47494F50 01000003 00000017 00000006 GIOP............
0010: 0000000F 4C6F6361 74696F6E 53657276 ....LocationServ
0020: 696365 ice
[10/18/17 18:48:35:090 PDT] 00000001 ORBRas 3 com.ibm.rmi.iiop.Connection getCallStream:2436 P=534912:O=0:CT The following exception was logged
org.omg.CORBA.NO_RESPONSE: Request 6 timed out vmcid: IBM minor code: B01 completed: Maybe
Another JVM has hijacked the NA port (in this example, port 9101). This causes the locateRequest message from the AppServer to be sent to the hijacking JVM rather than the NA. Since this JVM does nothing with the request message from the AppServer, eventually the locateRequest times out with NO_RESPONSE causing the subsequent startup failure.
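The first 12 bytes of the hex dump above are the GIOP message header, and decoding them shows what the client actually sent. The following sketch is an illustration, not an ORB API: the class name is hypothetical, and the layout it assumes (4-byte magic, version, byte-order flag, message type, 4-byte size) is the standard GIOP header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class GiopHeader {

    // GIOP message type codes 0 through 7.
    private static final String[] TYPES = {
        "Request", "Reply", "CancelRequest", "LocateRequest",
        "LocateReply", "CloseConnection", "MessageError", "Fragment"
    };

    /** Describes the 12-byte header at the start of a GIOP message. */
    static String describe(byte[] header) {
        if (header.length < 12) {
            throw new IllegalArgumentException("a GIOP header is 12 bytes");
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GIOP".equals(magic)) {
            return "not GIOP";
        }
        int major = header[4];
        int minor = header[5];
        // Lowest bit of byte 6: 0 = big-endian, 1 = little-endian.
        ByteOrder order = (header[6] & 0x01) != 0
                ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
        int type = header[7] & 0xFF;
        int size = ByteBuffer.wrap(header, 8, 4).order(order).getInt();
        String name = type < TYPES.length ? TYPES[type] : "unknown(" + type + ")";
        return "GIOP " + major + "." + minor + " " + name + " size=" + size;
    }
}
```

For the bytes `47494F50 01000003 00000017` from the trace, `describe` returns `GIOP 1.0 LocateRequest size=23`. The 23 bytes that follow are the request ID (6, matching "Request 6 timed out"), the key length (15), and the 15-character key "LocationService". Bytes that do not start with the GIOP magic, such as an HTTP request, are reported as `not GIOP`.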
Scenario 2: A client ORB request fails with a connection reset.
Symptoms: This scenario resembles Scenario 1 in that the client successfully sends an ORB request message, supposedly to the server. Again, ORB trace on the client confirms the message was sent successfully; however, the server side shows no evidence of having received it. Network trace can also help confirm this scenario.
Explanation: Here, the hijacking process receives the ORB request message, doesn't know what to do with it and simply responds by resetting the socket (which is seen as an RST packet in the network trace).
Scenario 3: A client, when trying to establish a new SSL connection with the server, gets an SSL Handshake timeout.
|[1/12/18 11:33:36:131 EST] 00000001 ORBRas E com.ibm.ws.security.orbssl.WSSSLClientSocketFactoryImpl createSSLSocket P=732129:O=0:CT JSSL0130E: java.io.IOException: Signals that an I/O exception of some sort has occurred. Reason: Read timed out Remote Host: 22.214.171.124 Remote Port: 31002
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
Explanation: The client is attempting to establish a new SSL connection with a WAS server. In this scenario, another process has hijacked the WAS SSL server port. When the client tries to do an SSL Handshake, the non-WAS process does not respond and hence the client's read of the expected SSL Handshake message times out.
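The behavior in this scenario can be reproduced with standard JSSE classes: a client that sets a read timeout and starts an SSL handshake against a listener that never answers ends with a read timeout. The sketch below is only an illustration of that mechanism. It does not use the ORB's socket factories, and the class name and return strings are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

final class HandshakeProbe {

    /** Connects, then attempts an SSL handshake with a bounded read timeout. */
    static String handshake(String host, int port, int timeoutMillis) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            // Bounds every socket read performed during the handshake.
            socket.setSoTimeout(timeoutMillis);
            socket.startHandshake();
            return "handshake completed";
        } catch (IOException e) {
            // The timeout may be thrown directly or wrapped by the SSL layer.
            for (Throwable t = e; t != null; t = t.getCause()) {
                if (t instanceof SocketTimeoutException) {
                    return "handshake timed out";
                }
            }
            return "handshake failed: " + e;
        }
    }
}
```

Pointing this at a port held by a process that accepts TCP connections but never speaks SSL yields "handshake timed out", which matches the client-side symptom shown above.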
To determine if another JVM has hijacked a WebSphere server port (NodeAgent or AppServer), a combination of the following can be done:
1. Run netstat to detect if more than 1 JVM is listening on the port in question (In this example, the suspected hijacked port is 9101):
# netstat -ano | find "9101" (Windows)
# netstat -ano | grep 9101 (Unix)
2. The lsof command can also be used:
# lsof -i :9101
3. Run both netstat and lsof as close as possible to the time of the exception, because the hijacking JVM may be a short-lived process.
4. Check if the WebSphere port(s) are in the Operating System's ephemeral port range. If so, this could explain the conflict. Either ensure WAS server ports are outside the ephemeral port range or change the range to exclude the WAS server ports.
5. ORB trace and/or network trace may be needed if netstat and lsof fail to reveal an obvious port hijacking scenario.
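The ephemeral port check in step 4 comes down to a range comparison. The sketch below assumes the range is supplied as two whitespace-separated numbers, which is the format Linux uses in /proc/sys/net/ipv4/ip_local_port_range. Other operating systems expose the range differently, and the class name is hypothetical.

```java
final class EphemeralRange {

    /**
     * Returns true if the port lies inside the ephemeral range.
     * rangeText holds "low high", e.g. the contents of
     * /proc/sys/net/ipv4/ip_local_port_range on Linux.
     */
    static boolean contains(String rangeText, int port) {
        String[] parts = rangeText.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected \"low high\", got: " + rangeText);
        }
        int low = Integer.parseInt(parts[0]);
        int high = Integer.parseInt(parts[1]);
        return port >= low && port <= high;
    }
}
```

With a range of "32768 60999", port 9101 from the example above is outside the range. With a widened range such as "1024 65535", it falls inside and could be handed to another process as a client port.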
|WebSphere Application Server - Express||AIX, HP-UX, Linux, Solaris, Windows||6.0, 5.1, 5.0|
|Runtimes for Java Technology||Java SDK|
More support for:
WebSphere Application Server
Component: Object Request Broker (ORB)
Software version: 7.0, 8.0, 8.5, 8.5.5, 9.0
Operating system(s): AIX, HP-UX, Linux, Solaris, Windows
Software edition: Base, Express, Network Deployment
Reference #: 1237101
Modified date: 29 November 2018