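- If new problems come up during configuration, record them in this document. Help improving the document is even more welcome :)
IB NIC configuration
First check whether the machine has an Infiniband controller: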
$ lspci | grep Mell
86:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
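If the command returns a line like the one above, the IB NIC is seated correctly. If it returns nothing, the card was not detected and may need to be reseated.
Check the IB NIC interface name: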
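The detection step above can be wrapped in a small helper for use in a setup script. This is a minimal sketch: the function name `has_mellanox_ib` is hypothetical, and it assumes the card shows up in `lspci` as an "Infiniband controller" from Mellanox, as in the output shown above.

```shell
#!/bin/sh
# Reads `lspci` output on stdin and succeeds if any line reports a
# Mellanox InfiniBand controller. The function name is hypothetical.
has_mellanox_ib() {
    grep -qi 'Infiniband controller.*Mellanox'
}

# Usage on a real host:
#   if lspci | has_mellanox_ib; then
#       echo "IB card detected"
#   else
#       echo "no IB card found, try reseating the card"
#   fi
```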
$ ip a | grep ib
4: ibs43: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 00:00:08:7d:fe:80:00:00:00:00:00:00:08:c0:eb:03:00:2c:48:9c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.10.10.2/24 brd 10.10.10.255 scope global ibs43
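This command only checks whether the IB port has been recognized. If nothing is listed, that is fine; continue with the following steps.
Check the system configuration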
wujintian@haslab4:/etc/netplan$ uname -r
5.4.0-152-generic
wujintian@haslab4:/etc/netplan$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.5 LTS
Release: 20.04
Codename: focal
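Choose an IB NIC driver that matches the Linux release and the IB NIC model from the following site:
https://downloaders.azurewebsites.net/downloaders/mlnx_ofed_downloader/downloader3.html#
The package cannot be downloaded with wget. After selecting the matching version, click through to the download page and remember to choose "I Accept".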
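To help pick the right entry on the download page, the expected archive name can be derived from the system information gathered earlier. This is a sketch that assumes the naming pattern seen in this document (`MLNX_OFED_LINUX-<version>-<distro><release>-<arch>.tgz`); the function name `ofed_archive_name` is hypothetical, and the driver version still has to be chosen by hand.

```shell
#!/bin/sh
# Builds the expected MLNX_OFED archive name from its parts.
# $1 = driver version (e.g. 5.4-3.7.5.0)
# $2 = ID from /etc/os-release (e.g. ubuntu)
# $3 = VERSION_ID from /etc/os-release (e.g. 20.04)
# $4 = architecture as printed by `uname -m` (e.g. x86_64)
ofed_archive_name() {
    printf 'MLNX_OFED_LINUX-%s-%s%s-%s.tgz\n' "$1" "$2" "$3" "$4"
}

# Usage on a real host (the driver version is chosen by hand):
#   . /etc/os-release
#   ofed_archive_name 5.4-3.7.5.0 "$ID" "$VERSION_ID" "$(uname -m)"
```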
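Install the required files:
- Extract the downloaded archive (use the file name of the version you downloaded; it must match the system, here Ubuntu 20.04):
tar -zxvf MLNX_OFED_LINUX-5.4-3.7.5.0-ubuntu20.04-x86_64.tgz
- Install the IB driver:
Change to the extracted directory:
cd MLNX_OFED_LINUX-5.4-3.7.5.0-ubuntu20.04-x86_64
Run the installer:
wujintian@haslab4:~/MLNX_OFED_LINUX-5.4-3.7.5.0-ubuntu20.04-x86_64$ sudo ./mlnxofedinstall --force
... (the installer output ends with)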
Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart
- Start the services:
sudo /etc/init.d/openibd restart
sudo /etc/init.d/opensmd restart
- Check the IB NIC status:
$ ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.31.2006
Hardware version: 0
Node GUID: 0x08c0eb03002c489c
System image GUID: 0x08c0eb03002c489c
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 5
LMC: 0
SM lid: 1
Capability mask: 0x2651e84a
Port GUID: 0x08c0eb03002c489c
Link layer: InfiniBand
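If State is Active and Physical state is LinkUp, the NIC is up. If not, try rebooting.
NIC network configuration:
- Use the following command to list the IB NIC's network interface:
wujintian@haslab4:/etc/netplan$ ibdev2netdev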
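The check described above can also be automated. The sketch below parses `ibstat`-style output with `awk`; the function name `ib_port_is_up` is hypothetical, and it assumes the field labels `State:` and `Physical state:` appear exactly as in the output shown in this document.

```shell
#!/bin/sh
# Reads `ibstat` output on stdin and succeeds only if at least one port
# reports State "Active" together with Physical state "LinkUp".
ib_port_is_up() {
    awk '
        # Remember the most recent logical port state.
        /^[[:space:]]*State:/          { state = $2 }
        # The physical state line follows it for the same port.
        /^[[:space:]]*Physical state:/ { if (state == "Active" && $3 == "LinkUp") ok = 1 }
        END { exit(ok ? 0 : 1) }
    '
}

# Usage on a real host:
#   if ibstat | ib_port_is_up; then echo "IB port is up"; else echo "IB port is not up"; fi
```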
mlx5_0 port 1 => ibs43 (Down)
- The IB interface name is ibs43. Assign the IB NIC on machine n the IP address 10.10.10.n/24:
$ cd /etc/netplan/
$ ls
00-installer-config.yaml
# Configure the IP address
$ vim 00-installer-config.yaml
# This is the network config written by 'subiquity'
network:
  ethernets:
    ens31f0:
      dhcp4: true
    ens31f1:
      dhcp4: true
    ibs43: # change this to the IB device name on this server
      dhcp4: no
      dhcp6: no
      addresses: [10.10.10.2/24]
  version: 2
- Apply the configuration and try to ping another NIC:
$ sudo netplan apply
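Because YAML indentation is easy to get wrong when editing by hand, the IB stanza can be generated instead. This is only a sketch: the function name `ib_netplan_stanza` is hypothetical, it prints to stdout rather than touching `/etc/netplan/`, and it assumes the 10.10.10.n/24 addressing scheme described above.

```shell
#!/bin/sh
# Prints a netplan stanza for one IB interface.
# $1 = IB interface name (e.g. ibs43), $2 = machine number n (1..254)
ib_netplan_stanza() {
    # Reject empty or non-numeric machine numbers.
    case "$2" in
        ''|*[!0-9]*) echo "machine number must be an integer" >&2; return 1 ;;
    esac
    if [ "$2" -lt 1 ] || [ "$2" -gt 254 ]; then
        echo "machine number must be in 1..254" >&2
        return 1
    fi
    cat <<EOF
network:
  version: 2
  ethernets:
    $1:
      dhcp4: no
      dhcp6: no
      addresses: [10.10.10.$2/24]
EOF
}

# Usage: print the stanza for machine 2 and compare it with the edited file.
#   ib_netplan_stanza ibs43 2
```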
$ ping 10.10.10.5
PING 10.10.10.5 (10.10.10.5) 56(84) bytes of data.
64 bytes from 10.10.10.5: icmp_seq=1 ttl=64 time=0.129 ms
64 bytes from 10.10.10.5: icmp_seq=2 ttl=64 time=0.101 ms
64 bytes from 10.10.10.5: icmp_seq=3 ttl=64 time=0.102 ms
64 bytes from 10.10.10.5: icmp_seq=4 ttl=64 time=0.100 ms
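With several machines, it can be convenient to ping every peer in one pass. The sketch below assumes the machines are numbered contiguously from 1 and follow the 10.10.10.n scheme; the function name `ib_peer_addrs` and the cluster size in the usage example are hypothetical.

```shell
#!/bin/sh
# Prints the IB addresses of all peers, one per line.
# $1 = this machine's number, $2 = highest machine number in the cluster
ib_peer_addrs() {
    i=1
    while [ "$i" -le "$2" ]; do
        if [ "$i" -ne "$1" ]; then
            echo "10.10.10.$i"
        fi
        i=$((i + 1))
    done
}

# Usage on machine 2 of an assumed 5-machine cluster:
#   for ip in $(ib_peer_addrs 2 5); do
#       if ping -c 1 -W 1 "$ip" >/dev/null; then echo "$ip ok"; else echo "$ip unreachable"; fi
#   done
```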
- Test with a program that uses the rdma_cm API (test code)
Possible problems
1. Unloading mlx_compat [FAILED]
$ sudo /etc/init.d/openibd restart
Unloading mlx_compat [FAILED]
rmmod: ERROR: Module mlx_compat is in use by: rpcrdma svcrdma
Run /etc/init.d/openibd force-stop, then start the service again:
sudo /etc/init.d/openibd force-stop
sudo /etc/init.d/openibd restart
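To see which modules are holding `mlx_compat` before resorting to force-stop, the names can be pulled out of the `rmmod` error line. A minimal sketch, assuming the error message has the same wording as shown above; the function name `modules_blocking_unload` is hypothetical.

```shell
#!/bin/sh
# Reads the restart output on stdin and prints the modules named after
# "is in use by:" in the rmmod error line.
modules_blocking_unload() {
    sed -n 's/^rmmod: ERROR: Module [^ ]* is in use by: //p'
}

# Usage on a real host:
#   sudo /etc/init.d/openibd restart 2>&1 | modules_blocking_unload
```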
2. State: Initializing
If ibstat reports the Initializing state, as shown below:
$ ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.31.2006
Hardware version: 0
Node GUID: 0x08c0eb03002c489c
System image GUID: 0x08c0eb03002c489c
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 100
Base lid: 5
LMC: 0
SM lid: 1
Capability mask: 0x2651e84a
Port GUID: 0x08c0eb03002c489c
Link layer: InfiniBand
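This state usually means that no subnet manager is running on the fabric. Restarting the subnet manager restores the port:
sudo systemctl restart opensm
Or, if the service is named opensmd (as with the /etc/init.d/opensmd script used during installation), start it and enable it at boot:
sudo systemctl start opensmd
sudo systemctl enable opensmd
Run ibstat again to check the IB NIC: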
$ ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.31.2006
Hardware version: 0
Node GUID: 0x08c0eb03002c489c
System image GUID: 0x08c0eb03002c489c
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 5
LMC: 0
SM lid: 1
Capability mask: 0x2651e84a
Port GUID: 0x08c0eb03002c489c
Link layer: InfiniBand
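The port may take a moment to become Active after the subnet manager starts, so a short polling loop can replace re-running `ibstat` by hand. This is a sketch: the function name `wait_for_active`, the attempt count, and the delay are hypothetical choices, and the command to poll is passed in as an argument.

```shell
#!/bin/sh
# Polls a command until its output contains a "State: Active" line.
# $1 = maximum attempts, $2 = delay between attempts in seconds,
# remaining arguments = command that prints ibstat-style output.
wait_for_active() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@" | grep -Eq '^[[:space:]]*State:[[:space:]]*Active'; then
            return 0
        fi
        # Sleep only if another attempt will follow.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}

# Usage on a real host: up to 10 checks, 2 seconds apart.
#   wait_for_active 10 2 ibstat || echo "port did not become Active"
```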