Troubleshooting
Problem
After updating the OS from RHEL6.2 to RHEL6.3 in Platform HPC 3.2, Platform MPI InfiniBand jobs fail with the error: ibv_open_device() failed
Symptom
Step 1 - Install the Platform HPC product.
Step 2 - Update the OS from RHEL6.2 to RHEL6.3. Then, provision a package node.
Step 3 - Run a job requesting Platform MPI and InfiniBand network resource:
1. su - hpcadmin
2. module clear
3. module load PMPI/modulefile
4. mpicc -mpi64 /opt/platform_mpi/help/hello_world.c -o /home/hpcadmin/hello_world
5. bsub -I mpirun -np 2 -IBV /home/hpcadmin/hello_world
Error message:
[hpcadmin@ip110test ~]$ bsub -I mpirun -np 2 -IBV /home/hpcadmin/hello_world
Job <988> is submitted to default queue <medium_priority>.
<<Waiting for dispatch ...>>
<<Starting on compute000>>
hello_world: Rank 0:0: MPI_Init: ibv_open_device() failed
hello_world: Rank 0:1: MPI_Init: ibv_open_device() failed
hello_world: Rank 0:0: MPI_Init: ibv_open_device() failed
hello_world: Rank 0:1: MPI_Init: ibv_open_device() failed
hello_world: Rank 0:0: MPI_Init: didn't find active interface/port
hello_world: Rank 0:1: MPI_Init: didn't find active interface/port
hello_world: Rank 0:0: MPI_Init: Can't initialize RDMA device
hello_world: Rank 0:1: MPI_Init: Can't initialize RDMA device
hello_world: Rank 0:0: MPI_Init: Internal Error: Cannot initialize RDMA protocol
hello_world: Rank 0:1: MPI_Init: Internal Error: Cannot initialize RDMA protocol
MPI Application rank 0 exited before MPI_Init() with status 1
Cause
This issue is caused by the upgrade of the rdma package.
The previous version of the package (rdma-1.0-14.el6.noarch.rpm) provides /etc/udev/rules.d/90-rdma.rules, but the new version (rdma-3.3-3.el6.noarch.rpm) no longer includes that rules file. The rules file set the permissions of /dev/infiniband/rdma_cm and /dev/infiniband/uverbs0 to 666, which is required to run MPI jobs over InfiniBand.
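To confirm this cause on an affected compute node, you can check the current permissions of the InfiniBand device nodes. The following is a minimal sketch and not part of the original procedure: the helper name check_ib_perms is hypothetical, and it assumes GNU stat as shipped with RHEL.

```shell
#!/bin/sh
# check_ib_perms (hypothetical helper): print the octal mode of each path
# given and flag any that is not 666, the mode the old 90-rdma.rules set.
# Returns 0 only if every path exists and has mode 666.
check_ib_perms() {
    rc=0
    for dev in "$@"; do
        if [ ! -e "$dev" ]; then
            echo "$dev: missing"
            rc=1
            continue
        fi
        mode=$(stat -c '%a' "$dev")   # GNU stat: octal permission bits
        if [ "$mode" = "666" ]; then
            echo "$dev: $mode ok"
        else
            echo "$dev: $mode (expected 666)"
            rc=1
        fi
    done
    return $rc
}
```

For example, running "rpm -q rdma" shows which rdma package version is installed, and "check_ib_perms /dev/infiniband/rdma_cm /dev/infiniband/uverbs0" reports whether the two device nodes named above have the required mode.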
Resolving The Problem
The workaround to this issue is as follows:
I - Create 90-rdma.rules:
1. create 90-rdma.rules under /etc/udev/rules.d/
2. insert the lines below into /etc/udev/rules.d/90-rdma.rules
KERNEL=="umad*", SYMLINK+="infiniband/%k"
KERNEL=="issm*", SYMLINK+="infiniband/%k"
KERNEL=="ucm*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uat", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="ucma", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", SYMLINK+="infiniband/%k", MODE="0666"
3. reboot the machine
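Steps 1 and 2 above can also be scripted, which may help when several nodes need the same file. This is a sketch under stated assumptions: the function name write_rdma_rules and its optional directory argument are hypothetical, and writing to the default directory requires root.

```shell
#!/bin/sh
# write_rdma_rules (hypothetical helper): write the seven rules listed above
# into DIR/90-rdma.rules. DIR defaults to /etc/udev/rules.d (needs root).
write_rdma_rules() {
    dir=${1:-/etc/udev/rules.d}
    # The quoted 'EOF' keeps %k and the double quotes literal.
    cat > "$dir/90-rdma.rules" <<'EOF'
KERNEL=="umad*", SYMLINK+="infiniband/%k"
KERNEL=="issm*", SYMLINK+="infiniband/%k"
KERNEL=="ucm*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uat", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="ucma", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", SYMLINK+="infiniband/%k", MODE="0666"
EOF
}
```

Calling write_rdma_rules with no argument as root writes the file to /etc/udev/rules.d/; the reboot in step 3 is still required. Passing a scratch directory as the argument lets you inspect the generated file first.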
II - Create 60-ipath.rules:
1. create 60-ipath.rules under /etc/udev/rules.d/
2. insert the following lines into /etc/udev/rules.d/60-ipath.rules
KERNEL=="ipath", MODE="0666"
KERNEL=="ipath[0-9]*", MODE="0666"
KERNEL=="ipath_*", MODE="0600"
KERNEL=="kcopy[0-6][0-9]", NAME="kcopy/%n", MODE="0666"
3. reboot the machine
After these changes, the openmpi job ran successfully.
Document Information
Modified date: 16 September 2018
UID: isg3T1019235