Troubleshooting
Problem
An Intel MPI job launched through blaunch sometimes hangs when more than 100 cores are requested. The hang occurs randomly. You may find many processes showing the following status: "[blaunch]
Symptom
The MPI job hangs in the cluster and never reaches the done status.
Cause
The installed Intel MPI version is too old.
Resolving The Problem
Upgrade Intel MPI to the latest version, which must be later than 4.1.0.036.
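As a rough way to check an installation against that threshold, the sketch below compares two dotted version strings. It assumes GNU `sort` with the `-V` (version sort) option is available. The function name `version_newer_than` and the variable names are hypothetical, and the installed version is typed in by hand here (4.1.3, taken from the install path in the example job script) rather than queried from Intel MPI.

```shell
#!/bin/sh
# Minimum version named in this document; the installed version must be later.
MIN_VERSION="4.1.0.036"

# Hypothetical helper: returns 0 when $1 is strictly newer than $2.
# Relies on GNU sort -V (version sort), which is an assumption about the host.
version_newer_than() {
    [ "$1" = "$2" ] && return 1
    newest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)
    [ "$newest" = "$1" ]
}

# Example value only; replace with the version of your own installation.
INSTALLED_VERSION="4.1.3"

if version_newer_than "$INSTALLED_VERSION" "$MIN_VERSION"; then
    echo "Intel MPI $INSTALLED_VERSION is newer than $MIN_VERSION"
else
    echo "Intel MPI $INSTALLED_VERSION is not newer than $MIN_VERSION; upgrade is needed"
fi
```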
The following is an example of submitting the MPI job.
============================
#!/bin/sh
#BSUB -n t                    #t is the number of slots required
#BSUB -e intelmpi_%J.err
#BSUB -o intelmpi_%J.out
#BSUB -R "span[ptile=n]"      #n is the number of processes to run on each host

export INTELMPI_TOP=/opt/mpi/intelmpi/impi/4.1.3
export PATH=$INTELMPI_TOP/bin:$PATH
export I_MPI_HYDRA_BOOTSTRAP=lsf
export I_MPI_HYDRA_BRANCH_COUNT=m    #m is the number of hosts
export I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1

mpiexec.hydra intelmpi_program
============================
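The value m in the script above is a placeholder for the number of hosts. One possible way to fill it in at run time, rather than hard-coding it, is to count the hosts in the `LSB_MCPU_HOSTS` variable that LSF sets inside a job, which lists pairs of host name and slot count (for example `hostA 8 hostB 8`). This is only a sketch: the helper name `count_hosts` is hypothetical, and whether a computed branch count suits your cluster is an assumption to verify.

```shell
#!/bin/sh
# Hypothetical helper: counts hosts in a string of "host slots host slots ..."
# pairs, the format LSF uses for LSB_MCPU_HOSTS.
count_hosts() {
    # Unquoted on purpose so the shell splits the string on whitespace.
    set -- $1
    # Each host contributes two fields: its name and its slot count.
    echo $(( $# / 2 ))
}

# Only meaningful inside an LSF job, where LSB_MCPU_HOSTS is set.
if [ -n "${LSB_MCPU_HOSTS:-}" ]; then
    I_MPI_HYDRA_BRANCH_COUNT=$(count_hosts "$LSB_MCPU_HOSTS")
    export I_MPI_HYDRA_BRANCH_COUNT
fi
```

Outside an LSF job the variable is unset, so the snippet leaves `I_MPI_HYDRA_BRANCH_COUNT` untouched.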
Setting I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1 is required. With I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1, Intel MPI uses a single blaunch -z invocation, which is the recommended way to run Intel MPI under LSF.
With I_MPI_LSF_USE_COLLECTIVE_LAUNCH=0, Intel MPI uses multiple blaunch -n invocations, which is not stable in the current LSF and Intel MPI combination. This setting is not recommended.
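Because the job can hang when this variable is missing or set to 0, a job script could refuse to start in that case. The guard below is a minimal sketch of that idea; the function name `require_collective_launch` is hypothetical, and treating an unset variable as "not enabled" is a choice made for this check, not a statement about the Intel MPI default.

```shell
#!/bin/sh
# Hypothetical guard: succeeds only when collective launch is explicitly set to 1.
# An unset variable is treated the same as 0 for the purpose of this check.
require_collective_launch() {
    if [ "${I_MPI_LSF_USE_COLLECTIVE_LAUNCH:-0}" != "1" ]; then
        echo "I_MPI_LSF_USE_COLLECTIVE_LAUNCH must be set to 1" >&2
        return 1
    fi
    return 0
}
```

In a job script, a line such as `require_collective_launch || exit 1` placed just before `mpiexec.hydra` would stop the job early instead of letting it hang.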
Document Information
Modified date:
17 June 2018
UID
isg3T1020816