IBM Support

Resolving hangs of Intel MPI jobs that request more than 100 cores in the cluster

Troubleshooting


Problem

An Intel MPI job that is started through blaunch can sometimes hang when it requests more than 100 cores. The hang occurs randomly. If you find many processes with the status "[blaunch] " on the host, this solution applies.
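To check a host for this symptom, the process list can be filtered for entries displayed as "[blaunch]". The following is a minimal sketch, not an IBM-provided tool: the function name count_blaunch is hypothetical, and it assumes the process list is supplied on standard input (for example, from ps on Linux).

```shell
#!/bin/sh
# Hypothetical helper: count the lines on standard input that contain the
# literal text "[blaunch]". The brackets are escaped so that grep treats
# them as plain characters; this also keeps grep's own command line from
# matching when the input comes from ps.
count_blaunch() {
    grep -c '\[blaunch\]'
}

# Possible use on a compute host (Linux procps syntax assumed):
#   ps -e -o pid,stat,args | count_blaunch
```

What counts as "many" is not defined by this document; the count is only a quick way to compare hosts that run the hung job.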

Symptom

The MPI job hangs in the cluster and never reaches the DONE status.

Cause

The installed Intel MPI version is too old.

Resolving The Problem

Upgrade Intel MPI to the latest version, which must be later than 4.1.0.036.
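The version requirement can be checked with a small comparison helper. This is a sketch under stated assumptions: the function name version_newer and the variable installed are hypothetical, the installed version is assumed to be known already as a dotted string (for example, from the installation directory name), and sort -V from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when version $1 is strictly newer
# than version $2. Uses GNU sort -V for version-aware ordering.
version_newer() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Example value only; replace it with the version installed on your cluster.
installed="4.1.3"

if version_newer "$installed" "4.1.0.036"; then
    echo "Intel MPI $installed is newer than 4.1.0.036"
else
    echo "Intel MPI $installed should be upgraded"
fi
```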

The following is an example of submitting the MPI job.

============================

#!/bin/sh
#BSUB -n t # t is the total number of slots requested
#BSUB -e intelmpi_%J.err
#BSUB -o intelmpi_%J.out
#BSUB -R "span[ptile=n]" # n is the number of processes to run on each host
export INTELMPI_TOP=/opt/mpi/intelmpi/impi/4.1.3
export PATH=$INTELMPI_TOP/bin:$PATH
export I_MPI_HYDRA_BOOTSTRAP=lsf
export I_MPI_HYDRA_BRANCH_COUNT=m # m is the number of hosts
export I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1
mpiexec.hydra intelmpi_program
============================
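In the script above, m is a placeholder for the number of hosts. Inside a running job, LSF sets the LSB_MCPU_HOSTS environment variable to a list of host name and slot count pairs (for example, "hostA 8 hostB 8"), so the host count could be derived instead of hard-coded. The following is a sketch only; the function name lsf_host_count is hypothetical and is not part of LSF or Intel MPI.

```shell
#!/bin/sh
# Hypothetical helper: print the number of hosts allocated to the job.
# LSB_MCPU_HOSTS holds "host slots host slots ...", so the number of
# hosts is half the number of words. Prints 0 if the variable is unset.
lsf_host_count() {
    # Word-split the variable into the function's positional parameters.
    set -- ${LSB_MCPU_HOSTS:-}
    echo $(( $# / 2 ))
}

# Possible use in the job script, in place of the placeholder m:
#   export I_MPI_HYDRA_BRANCH_COUNT=$(lsf_host_count)
```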

Setting I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1 is required. With I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1, Intel MPI uses a single blaunch -z call, which is the recommended way to run Intel MPI under LSF.
With I_MPI_LSF_USE_COLLECTIVE_LAUNCH=0, Intel MPI uses multiple blaunch -n calls, which is not stable with the current LSF and Intel MPI combination and is not recommended.


Document Information

Modified date:
17 June 2018

UID

isg3T1020816