
A DataStage parallel job running on multiple nodes on a single server machine fails with error "**** Parallel startup failed ****"

Troubleshooting


Problem

A parallel DataStage job whose configuration file is set up to run multiple nodes on a single server fails with the error: Message: main_program: **** Parallel startup failed ****

Resolving The Problem

The full text for this "parallel startup failed" error provides some additional information about possible causes:


    This is usually due to a configuration error, such as not having the Orchestrate install directory properly mounted on all nodes, rsh permissions not correctly set (via /etc/hosts.equiv or .rhosts), or running from a directory that is not mounted on all nodes. Look for error messages in the preceding output.


For the situation where a site is attempting to run multiple nodes on multiple server machines, the above statement is correct. More information on setting up ssh/rsh and parallel processing can be found in the following topics:
Configuring remote and secure shells
Configuring a parallel processing environment


However, in the case where all nodes are running on a single server machine, the "Parallel startup failed" message is usually an indication that the fastname defined in the configuration file does not match the name output by the server's "hostname" command.

In a typical node configuration file, the server name where each node runs is indicated by the fastname, as in this example (/opt/IBM/InformationServer/Server/Configurations/default.apt):
{
  node "node1"
  {
     fastname "server1"
     pools ""
     resource disk "/opt/resource/node1/Datasets" {pools ""}
     resource scratchdisk "/opt/resource/node1/Scratch" {pools ""}
  }
  node "node2"
  {
     fastname "server1"
     pools ""
     resource disk "/opt/resource/node2/Datasets" {pools ""}
     resource scratchdisk "/opt/resource/node2/Scratch" {pools ""}
  }
}

Log in to the DataStage server machine and, at the operating system command prompt, enter the command:
hostname

If the hostname output EXACTLY matches the fastname defined for the local nodes, then the job will run correctly on that server. However, if the "hostname" command outputs the name in a different format (such as with the domain name appended), then the nodes defined by fastname are treated as remote nodes, and a failed attempt is made to access them via rsh/ssh.
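
A quick way to compare the two values is shown below. This is only a sketch and assumes the default configuration file path used in the example above; each fastname returned by grep must match the hostname output exactly:

    hostname
    grep fastname /opt/IBM/InformationServer/Server/Configurations/default.apt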

Using the above example, if the hostname output is server1.mydomain.com, then prior to the "Parallel startup failed" error you will likely see the following error in the job log:
    Message: main_program: Accept timed out retries = 4
    server1: Connection refused

The above problem occurs even if your /etc/hosts file maps server1 and server1.mydomain.com to the same address. The issue is not an inability to resolve either name; it is that the fastname in the node configuration file does not exactly match the system hostname (or the value of APT_PM_CONDUCTOR_HOSTNAME, if defined).
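
For example, an /etc/hosts entry like the following (the address is hypothetical) does not prevent the error, because the check is a string comparison against fastname rather than an address lookup:

    192.0.2.10   server1.mydomain.com   server1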


You have several options to deal with this situation:
  • Change the fastname for the nodes in the configuration file to exactly match the output of the hostname command.
  • Set APT_PM_CONDUCTOR_HOSTNAME to the same value as fastname. This would need to be defined either in every project or in every job (see the sketch after this list).
  • Do NOT change the hostname of the server to match fastname. Information Server / DataStage stores some information based on the current hostname; if you change the hostname after installation, you will need to contact the support team for additional instructions to allow DataStage to work correctly with the new hostname.
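
As a sketch of the second option, the variable simply needs to be present in the environment of the job with the same value as the fastname ("server1" in the example configuration file above); in practice it is usually defined as a project-level or job-level environment variable rather than exported by hand:

    # Sketch only: make the conductor node name match the fastname
    export APT_PM_CONDUCTOR_HOSTNAME=server1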

One other possible cause for the error is name resolution via the domain name server (DNS). If the DNS server cannot resolve a hostname, resolves it to a bad address, resolves the short hostname and the long (with domain) hostname to different addresses, or times out while resolving the name, then DataStage may return the error discussed in this technote when trying to start a session on that hostname.

By default, an AIX server looks to the network for hostname resolution before looking in the /etc/hosts file, so if the DNS server cannot resolve the name, or the first server in the list of DNS servers is invalid or not running, timeouts can occur. Both problems can be checked and resolved via the following steps (a command sketch follows the list):
  • Ping the hostname used in the apt configuration file from both the DataStage engine machine and the client machine. If the names do not resolve the same, or do not resolve to valid addresses, have the network administrator correct the DNS problem, or update the /etc/hosts file on the server machine and the \Windows\system32\drivers\etc\hosts file on the client machine.
  • To have the name resolution process on the DataStage engine machine check the /etc/hosts file before going to the network, look for the file /etc/netsvc.conf (on AIX, this file determines the search order for name resolution; the name might vary on other platforms). Locate the hosts= line in this file. If it is commented out, then add this line to the file:
    hosts = local,bind
    which causes the local /etc/hosts file to be checked first. If the hosts= line already exists but either does not mention "local" or does not have local first, then move local to the start of the comma-delimited list and save the changes. The change takes effect immediately.
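
The checks above can be run from the operating system command prompt on the engine machine (Unix syntax shown). This is only a sketch; server1 and server1.mydomain.com stand in for the fastname and the fully qualified hostname from the earlier example:

    # Both forms of the name should resolve, and to the same address
    ping -c 1 server1
    ping -c 1 server1.mydomain.com
    nslookup server1
    nslookup server1.mydomain.com

    # On AIX, confirm that /etc/hosts is consulted before DNS
    grep -i "^hosts" /etc/netsvc.conf      # should show: hosts = local,bind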

[{"Product":{"code":"SSVSEF","label":"IBM InfoSphere DataStage"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF033","label":"Windows"},{"code":"PF027","label":"Solaris"},{"code":"PF016","label":"Linux"},{"code":"PF010","label":"HP-UX"},{"code":"PF002","label":"AIX"}],"Version":"8.5;8.1;8.0.1;8.0;7.5","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
16 June 2018

UID

swg21434065