
Problem Running Platform MPI Jobs Requiring InfiniBand After Updating to RHEL6.3

Troubleshooting


Problem

After updating the OS from RHEL6.2 to RHEL6.3 in Platform HPC 3.2, Platform MPI InfiniBand jobs fail with the error: ibv_open_device() failed

Symptom

Step 1 - Install the Platform HPC product.
Step 2 - Update the OS from RHEL6.2 to RHEL6.3. Then, provision a package node.
Step 3 - Run a job that requests Platform MPI and the InfiniBand network resource:
1. su - hpcadmin
2. module clear
3. module load PMPI/modulefile
4. mpicc -mpi64 /opt/platform_mpi/help/hello_world.c -o /home/hpcadmin/hello_world
5. bsub -I mpirun -np 2 -IBV /home/hpcadmin/hello_world

Error message:

[hpcadmin@ip110test ~]$ bsub -I mpirun -np 2 -IBV /home/hpcadmin/hello_world
Job <988> is submitted to default queue <medium_priority>.

<<Waiting for dispatch ...>>

<<Starting on compute000>>

hello_world: Rank 0:0: MPI_Init: ibv_open_device() failed

hello_world: Rank 0:1: MPI_Init: ibv_open_device() failed
hello_world: Rank 0:0: MPI_Init: ibv_open_device() failed
hello_world: Rank 0:1: MPI_Init: ibv_open_device() failed
hello_world: Rank 0:0: MPI_Init: didn't find active interface/port
hello_world: Rank 0:1: MPI_Init: didn't find active interface/port
hello_world: Rank 0:0: MPI_Init: Can't initialize RDMA device
hello_world: Rank 0:1: MPI_Init: Can't initialize RDMA device
hello_world: Rank 0:0: MPI_Init: Internal Error: Cannot initialize RDMA protocol
hello_world: Rank 0:1: MPI_Init: Internal Error: Cannot initialize RDMA protocol
MPI Application rank 0 exited before MPI_Init() with status 1

Cause

This issue is caused by the upgrade of the rdma package.
The previous version of rdma (rdma-1.0-14.el6.noarch.rpm) provided /etc/udev/rules.d/90-rdma.rules, while the new version (rdma-3.3-3.el6.noarch.rpm) no longer ships that rules file. The rules file set the permissions of /dev/infiniband/rdma_cm and /dev/infiniband/uverbs0 to 666, which is required to run MPI jobs through InfiniBand.
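
One way to confirm the cause on an affected compute node is to check the device node permissions directly. The commands below are a minimal check, assuming the default device names rdma_cm and uverbs0; the names under /dev/infiniband/ can vary with the adapter and driver stack.

# Check the current permissions of the RDMA device nodes (names may vary)
ls -l /dev/infiniband/rdma_cm /dev/infiniband/uverbs0

# Without the old 90-rdma.rules file these nodes are no longer mode 666,
# so ibv_open_device() fails for non-root users such as hpcadmin.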

Resolving The Problem

The workaround for this issue is as follows:

I - Create 90-rdma.rules:

1. Create 90-rdma.rules under /etc/udev/rules.d/.
2. Insert the lines below into /etc/udev/rules.d/90-rdma.rules:

KERNEL=="umad*", SYMLINK+="infiniband/%k"
KERNEL=="issm*", SYMLINK+="infiniband/%k"
KERNEL=="ucm*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uat", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="ucma", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", SYMLINK+="infiniband/%k", MODE="0666"

3. Reboot the machine (an untested alternative to rebooting is sketched below).
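
If a reboot is not convenient, the sketch below shows an untested alternative (not part of the original workaround) that reloads the udev rules and re-triggers device events, plus a one-off chmod as a temporary measure until the next reboot.

# Untested alternative to a reboot (assumption, not from the original workaround):
udevadm control --reload-rules   # re-read the files under /etc/udev/rules.d/
udevadm trigger                  # replay device events so the new MODE is applied

# Temporary, non-persistent fallback until the node is rebooted:
chmod 666 /dev/infiniband/rdma_cm /dev/infiniband/uverbs*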

II - Create 60-ipath.rules:

1. Create 60-ipath.rules under /etc/udev/rules.d/.
2. Insert the following lines into /etc/udev/rules.d/60-ipath.rules:

KERNEL=="ipath", MODE="0666"
KERNEL=="ipath[0-9]*", MODE="0666"
KERNEL=="ipath_*", MODE="0600"
KERNEL=="kcopy[0-6][0-9]", NAME="kcopy/n", MODE="0666"

3. Reboot the machine.
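
After the reboot, verify that the rules took effect before resubmitting the job. The listing below is an example of what to look for; the exact device names depend on the installed InfiniBand hardware.

# Verify that the RDMA device nodes are now world readable/writable (mode 666)
ls -l /dev/infiniband/
# Entries such as rdma_cm and uverbs0 should show permissions crw-rw-rw-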

After applying the workaround, the MPI job runs successfully.

[{"Product":{"code":"SSDV85","label":"Platform Cluster Manager"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF016","label":"Linux"}],"Version":"3.2","Edition":"Enterprise","Line of Business":{"code":"LOB10","label":"Data and AI"}},{"Product":{"code":"SSZUCA","label":"IBM Spectrum Cluster Foundation"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":null,"Platform":[{"code":"","label":""}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
16 September 2018

UID

isg3T1019235