• If you run into new problems during configuration, record them in this document. Help improving the document is even more welcome :)

IB NIC Configuration

First, check whether the machine has an InfiniBand controller:

$ lspci | grep Mell
86:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

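If the command returns output like the above, the IB NIC is seated correctly. If it returns nothing, the NIC was not detected and may need to be reseated.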
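The check above can be wrapped in a small helper for use in setup scripts. This is only a sketch: the function name `has_ib_controller` is hypothetical, and it reads `lspci` output from standard input so it can be tried without the hardware.

```shell
# Hypothetical helper: succeeds if the lspci output on stdin lists a
# Mellanox InfiniBand controller, and fails otherwise.
has_ib_controller() {
    grep -qi 'infiniband controller.*mellanox'
}
```

Typical use would be `lspci | has_ib_controller && echo "IB NIC detected"`.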

Check the interface name of the IB NIC:

$ ip a | grep ib
4: ibs43: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:08:7d:fe:80:00:00:00:00:00:00:08:c0:eb:03:00:2c:48:9c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.10.10.2/24 brd 10.10.10.255 scope global ibs43

This command only checks whether the port of the IB NIC has been recognized. If nothing is listed, that is fine; continue with the steps below.
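Because `grep ib` can also match unrelated lines, a stricter filter may be useful. The sketch below assumes the one-line-per-interface format printed by `ip -o link show`; the function name `ib_interfaces` is hypothetical.

```shell
# Hypothetical helper: reads `ip -o link show` output on stdin and prints
# the names of interfaces whose link type is InfiniBand.
ib_interfaces() {
    awk -F': ' '/link\/infiniband/ { print $2 }'
}
```

Typical use would be `ip -o link show | ib_interfaces`.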

Check the system configuration:

wujintian@haslab4:/etc/netplan$ uname -r
5.4.0-152-generic
wujintian@haslab4:/etc/netplan$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.5 LTS
Release:        20.04
Codename:       focal

Choose the IB NIC driver that matches the Linux release and the IB NIC model from the following site:

https://downloaders.azurewebsites.net/downloaders/mlnx_ofed_downloader/downloader3.html#

The driver cannot be downloaded with wget. After selecting the matching version in a browser, click through and remember to choose "I Accept".
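To pick the right entry on the download page, it can help to know which suffix to look for. The sketch below builds the `<distro><release>-<arch>` suffix that appears in the archive names used in this document (for example `ubuntu20.04-x86_64`). The function name `ofed_suffix` is hypothetical, and the naming pattern is an assumption inferred from those file names; other distributions may be named differently.

```shell
# Hypothetical helper: builds the "<distro><release>-<arch>" suffix seen in
# the archive names in this document, e.g. ubuntu20.04-x86_64.
# $1 = distributor ID, $2 = release, $3 = machine architecture
ofed_suffix() {
    distro=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    printf '%s%s-%s\n' "$distro" "$2" "$3"
}
```

Typical use would be `ofed_suffix "$(lsb_release -si)" "$(lsb_release -sr)" "$(uname -m)"`.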

Install the required files:

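  1. Extract the archive you just downloaded (the file name below matches the Ubuntu 20.04 system shown above; substitute the name of the file you actually downloaded):
tar -zxvf MLNX_OFED_LINUX-5.4-3.7.5.0-ubuntu20.04-x86_64.tgz
  2. Install the IB driver:

Change to the extracted directory:

cd MLNX_OFED_LINUX-5.4-3.7.5.0-ubuntu20.04-x86_64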

Run the installer:

wujintian@haslab4:~/MLNX_OFED_LINUX-5.4-3.7.5.0-ubuntu20.04-x86_64$ sudo ./mlnxofedinstall --force
	.... (the installation output ends with)
Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart
  3. Start the services:
sudo /etc/init.d/openibd restart

sudo /etc/init.d/opensmd restart
  4. Check the IB NIC status:
$ ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.31.2006
        Hardware version: 0
        Node GUID: 0x08c0eb03002c489c
        System image GUID: 0x08c0eb03002c489c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 5
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e84a
                Port GUID: 0x08c0eb03002c489c
                Link layer: InfiniBand

If State is Active and Physical state is LinkUp, the NIC is up. If not, try rebooting.
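The same check can be scripted. The following is a minimal sketch that assumes a single-port adapter like the one shown above; the function name `ib_port_ready` is hypothetical, and it reads `ibstat` output from standard input.

```shell
# Hypothetical helper: reads ibstat output on stdin and succeeds only if it
# contains both "State: Active" and "Physical state: LinkUp".
# Assumes a single-port adapter; it does not pair the two lines per port.
ib_port_ready() {
    awk '
        /^[ \t]*State:[ \t]*Active[ \t]*$/          { active = 1 }
        /^[ \t]*Physical state:[ \t]*LinkUp[ \t]*$/ { linkup = 1 }
        END { exit !(active && linkup) }
    '
}
```

Typical use would be `ibstat | ib_port_ready && echo "port is up"`.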

NIC network configuration:

  1. Use the following command to find the network interface of the IB NIC:
wujintian@haslab4:/etc/netplan$ ibdev2netdev
mlx5_0 port 1 => ibs43 (Down)
  2. The IB interface name is ibs43. Assign the IP address 10.10.10.n/24 to the NIC on machine number n:
$ cd /etc/netplan/
$ ls
00-installer-config.yaml

# Configure the IP address
$ vim 00-installer-config.yaml
# This is the network config written by 'subiquity'
network:
  ethernets:
    ens31f0:
      dhcp4: true
    ens31f1:
      dhcp4: true
    ibs43: # change this to the IB device name on this server
      dhcp4: no
      dhcp6: no
      addresses: [10.10.10.2/24]
  version: 2
  3. Apply the configuration and try pinging another NIC:
$ sudo netplan apply
$ ping 10.10.10.5
PING 10.10.10.5 (10.10.10.5) 56(84) bytes of data.
64 bytes from 10.10.10.5: icmp_seq=1 ttl=64 time=0.129 ms
64 bytes from 10.10.10.5: icmp_seq=2 ttl=64 time=0.101 ms
64 bytes from 10.10.10.5: icmp_seq=3 ttl=64 time=0.102 ms
64 bytes from 10.10.10.5: icmp_seq=4 ttl=64 time=0.100 ms
  4. Test with a program that uses the rdma_cm API (see the test code listed at the end of this document).
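When several machines need the same setup, the netplan stanza for the 10.10.10.n/24 convention can be generated instead of typed by hand. This is a sketch only: the function name `ib_netplan_stanza` is hypothetical, and the output still has to be merged under `ethernets:` in the netplan file manually.

```shell
# Hypothetical helper: prints the netplan stanza for one IB interface.
# $1 = interface name (e.g. ibs43), $2 = machine number n (1..254)
ib_netplan_stanza() {
    # Reject empty or non-numeric machine numbers.
    case "$2" in
        ''|*[!0-9]*) echo "machine number must be an integer" >&2; return 1 ;;
    esac
    # Keep the host part inside the usable range of a /24 network.
    if [ "$2" -lt 1 ] || [ "$2" -gt 254 ]; then
        echo "machine number must be between 1 and 254" >&2
        return 1
    fi
    printf '    %s:\n      dhcp4: no\n      dhcp6: no\n      addresses: [10.10.10.%s/24]\n' "$1" "$2"
}
```

For example, `ib_netplan_stanza ibs43 2` prints the same four lines used for ibs43 in the configuration above.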

Possible problems

1. Unloading mlx_compat [FAILED]

$ sudo /etc/init.d/openibd restart
Unloading mlx_compat [FAILED]
rmmod: ERROR: Module mlx_compat is in use by: rpcrdma svcrdma

Run /etc/init.d/openibd force-stop, then start the service again:

sudo /etc/init.d/openibd force-stop
sudo /etc/init.d/openibd restart
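Before forcing a stop, it may be useful to see which modules are holding mlx_compat. The sketch below is for diagnosis only and does not replace the fix above; the function name `module_users` is hypothetical, and it reads `lsmod` output from standard input.

```shell
# Hypothetical helper: reads lsmod output on stdin and prints the modules
# listed in the "Used by" column for the module named in $1, one per line.
module_users() {
    awk -v mod="$1" '
        $1 == mod && NF >= 4 {
            n = split($4, users, ",")
            for (i = 1; i <= n; i++) print users[i]
        }
    '
}
```

Typical use would be `lsmod | module_users mlx_compat`.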

2. State: Initializing

If ibstat reports the Initializing state shown below:

$ ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.31.2006
        Hardware version: 0
        Node GUID: 0x08c0eb03002c489c
        System image GUID: 0x08c0eb03002c489c
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 100
                Base lid: 5
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e84a
                Port GUID: 0x08c0eb03002c489c
                Link layer: InfiniBand

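This state usually means that no subnet manager is managing the port. Restarting OpenSM normally restores it:

sudo systemctl restart opensm

Or, if the service is named opensmd: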

sudo systemctl start opensmd 
sudo systemctl enable opensmd

Run the IB NIC status command again:

$ ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.31.2006
        Hardware version: 0
        Node GUID: 0x08c0eb03002c489c
        System image GUID: 0x08c0eb03002c489c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 5
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e84a
                Port GUID: 0x08c0eb03002c489c
                Link layer: InfiniBand
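The port may take a moment to become Active after the subnet manager starts, so a script might poll rather than check once. This is a rough sketch: the function name `wait_for_active` and its argument order are hypothetical, and the status command is passed in as an argument so the loop can be tried without hardware.

```shell
# Hypothetical helper: runs a status command repeatedly until its output
# contains "State: Active", or gives up after a number of attempts.
# $1 = attempts, $2 = seconds between attempts, remaining args = command
wait_for_active() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" 2>/dev/null | grep -Eq '^[[:space:]]*State:[[:space:]]*Active'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Typical use would be `wait_for_active 10 1 ibstat || echo "port did not become Active"`.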

Reference links

Fixing State: Initializing or State: Down

RDMA programming example (rdma_cm API)

Test code