Open
Description
Steps
1.1 In the SimCCL folder, I first had to run chmod +x src/device/generate.py; otherwise I get a "permission denied" error.
1.2 I ran make -j src.build, which succeeded.
1.3 I ran cd test, then make.
1.4 I ran bash run.sh.
==> Error!
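For anyone trying to reproduce this, the steps above can be collected into one script. This is only a minimal sketch, assuming it is run from the root of the SimCCL folder; the helper names ensure_executable and reproduce are hypothetical and not part of SimCCL.

```shell
#!/bin/sh
# Sketch of steps 1.1-1.4. Assumes the current directory is the SimCCL root.
set -eu

# Hypothetical helper: add the execute bit only if it is missing (step 1.1).
ensure_executable() {
  target_path="$1"
  if [ ! -e "$target_path" ]; then
    echo "missing file: $target_path" >&2
    return 1
  fi
  if [ ! -x "$target_path" ]; then
    chmod +x "$target_path"
  fi
}

# Hypothetical wrapper around the commands listed in the steps.
reproduce() {
  ensure_executable src/device/generate.py   # step 1.1
  make -j src.build                          # step 1.2
  (cd test && make && bash run.sh)           # steps 1.3 and 1.4
}
```

Calling reproduce after these definitions runs the same commands in the same order, so the log below should appear at the last step if the environment matches.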
Log
NCCL version 2.20.5-MockNCCL+cuda12.6
inspur-gpu-server-15:2563676:2563678 [0] misc/cudawrap.cc:34 NCCL WARN Cuda failure 'no CUDA-capable device is detected'
inspur-gpu-server-15:2563676:2563678 [0] init.cc:270 NCCL WARN Cuda failure 3 'initialization error'
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Mocking start: mock_nNodes(8), mock_nRanks(64), mock_nRanksPerNode(8).
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO info_index(0): 0x7f9f08096a30, rank: 0, cudaDev: 0, nvmlDev: 0, gdrSupport: 1,hostHash: 2935421449793392623, pidHash: 12347059415262710392, busId: ffffffffffffffff:ff:ff.f, cudaCompCap: -1.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm_index(0): 0x55c7f6f30c00, node: 0, nNodes: 0, localRank: 0, localRanks: 0, maxLocalRanks: 0, intraRank: 0, intraRanks: 0, ncclCollNet: 0.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO nNodes: 64.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm_index(0): 0x55c7f6f30c00, node: 0, nNodes: 0, localRank: 0, localRanks: 0, maxLocalRanks: 0, intraRank: 0, intraRanks: 1, ncclCollNet: 0.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO After ncclTopoGetSystem, system_info{nodes[NET].count: 5, nodes[GPU].count: 8, maxBW: 0.000000, totalBW: 0.000000}.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[0].id: 0000:13:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[1].id: 0000:19:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[2].id: 0000:48:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[3].id: 0000:4d:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[4].id: 0000:89:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[5].id: 0000:8e:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[6].id: 0000:ad:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[7].id: 0000:b3:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO After ncclTopoGetSystem, system_info{nodes[NET].count: 5, nodes[GPU].count: 8, maxBW: 0.000000, totalBW: 0.000000}.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[0].id: 0000:13:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[1].id: 0000:19:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[2].id: 0000:48:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[3].id: 0000:4d:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[4].id: 0000:89:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[5].id: 0000:8e:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[6].id: 0000:ad:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[7].id: 0000:b3:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO After ncclTopoSearchInit, system_info{nodes[NET].count: 5, nodes[GPU].count: 8, maxBW: 24.000000, totalBW: 240.000000}.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/13000 :GPU/13000 (0/5000.000000/LOC) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (4/24.000000/PXB) NET/2 (6/24.000000/PXN) NET/3 (6/24.000000/PXN) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/19000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (0/5000.000000/LOC) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (4/24.000000/PXB) NET/2 (6/24.000000/PXN) NET/3 (6/24.000000/PXN) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/48000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (0/5000.000000/LOC) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PXN) NET/2 (4/24.000000/PXB) NET/3 (6/24.000000/PXN) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/4D000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (0/5000.000000/LOC) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PXN) NET/2 (4/24.000000/PXB) NET/3 (6/24.000000/PXN) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/89000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (0/5000.000000/LOC) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (6/24.000000/PXN) NET/2 (6/24.000000/PXN) NET/3 (4/24.000000/PXB) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/8E000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (0/5000.000000/LOC) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (6/24.000000/PXN) NET/2 (6/24.000000/PXN) NET/3 (4/24.000000/PXB) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/AD000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (0/5000.000000/LOC) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (6/24.000000/PXN) NET/2 (6/24.000000/PXN) NET/3 (6/24.000000/PXN) NET/4 (4/24.000000/PXB)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/B3000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (0/5000.000000/LOC) NVS/0 (1/240.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (6/24.000000/PXN) NET/2 (6/24.000000/PXN) NET/3 (6/24.000000/PXN) NET/4 (4/24.000000/PXB)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/0 :GPU/13000 (5/24.000000/PHB) GPU/19000 (5/24.000000/PHB) GPU/48000 (5/24.000000/PHB) GPU/4D000 (5/24.000000/PHB) GPU/89000 (6/10.000000/SYS) GPU/8E000 (6/10.000000/SYS) GPU/AD000 (6/10.000000/SYS) GPU/B3000 (6/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/1 :GPU/13000 (4/24.000000/PXB) GPU/19000 (4/24.000000/PXB) GPU/48000 (6/24.000000/PHB) GPU/4D000 (6/24.000000/PHB) GPU/89000 (7/10.000000/SYS) GPU/8E000 (7/10.000000/SYS) GPU/AD000 (7/10.000000/SYS) GPU/B3000 (7/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (0/5000.000000/LOC) NET/2 (6/24.000000/PHB) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/2 :GPU/13000 (6/24.000000/PHB) GPU/19000 (6/24.000000/PHB) GPU/48000 (4/24.000000/PXB) GPU/4D000 (4/24.000000/PXB) GPU/89000 (7/10.000000/SYS) GPU/8E000 (7/10.000000/SYS) GPU/AD000 (7/10.000000/SYS) GPU/B3000 (7/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PHB) NET/2 (0/5000.000000/LOC) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/3 :GPU/13000 (7/10.000000/SYS) GPU/19000 (7/10.000000/SYS) GPU/48000 (7/10.000000/SYS) GPU/4D000 (7/10.000000/SYS) GPU/89000 (4/24.000000/PXB) GPU/8E000 (4/24.000000/PXB) GPU/AD000 (6/24.000000/PHB) GPU/B3000 (6/24.000000/PHB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (0/5000.000000/LOC) NET/4 (6/24.000000/PHB)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/4 :GPU/13000 (7/10.000000/SYS) GPU/19000 (7/10.000000/SYS) GPU/48000 (7/10.000000/SYS) GPU/4D000 (7/10.000000/SYS) GPU/89000 (6/24.000000/PHB) GPU/8E000 (6/24.000000/PHB) GPU/AD000 (4/24.000000/PXB) GPU/B3000 (4/24.000000/PXB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (6/24.000000/PHB) NET/4 (0/5000.000000/LOC)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[0].gpu.rank: 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[1].gpu.rank: 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[2].gpu.rank: 2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[3].gpu.rank: 3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[4].gpu.rank: 4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[5].gpu.rank: 5
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[6].gpu.rank: 6
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[7].gpu.rank: 7
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->compCap: 80, comm->minCompCap: 80, comm->maxCompCap: 80
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO mock_comm[0].compCap: 80, mock_comm[0].minCompCap: 80, mock_comm[0].maxCompCap: 80
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/0 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/1 GPU/2 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/2 GPU/1 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/0 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/4 GPU/3 GPU/2 GPU/1 GPU/0 GPU/7 GPU/5 GPU/6 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/6 GPU/5 GPU/2 GPU/1 GPU/0 GPU/7 GPU/3 GPU/4 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 GPU/2 GPU/3 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/6 GPU/7 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 0]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/0 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/1 GPU/2 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/2 GPU/1 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/0 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/4 GPU/3 GPU/2 GPU/1 GPU/0 GPU/7 GPU/5 GPU/6 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/6 GPU/5 GPU/2 GPU/1 GPU/0 GPU/7 GPU/3 GPU/4 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 0]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 GPU/2 GPU/3 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/6 GPU/7 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 1]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/8 GPU/15 GPU/14 GPU/13 GPU/12 GPU/11 GPU/9 GPU/10 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/10 GPU/9 GPU/15 GPU/14 GPU/13 GPU/12 GPU/11 GPU/8 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/12 GPU/11 GPU/10 GPU/9 GPU/8 GPU/15 GPU/13 GPU/14 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/14 GPU/13 GPU/10 GPU/9 GPU/8 GPU/15 GPU/11 GPU/12 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 1]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/8 GPU/9 GPU/10 GPU/11 GPU/12 GPU/13 GPU/14 GPU/15 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/10 GPU/11 GPU/12 GPU/13 GPU/14 GPU/15 GPU/8 GPU/9 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/12 GPU/13 GPU/14 GPU/15 GPU/8 GPU/9 GPU/10 GPU/11 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/14 GPU/15 GPU/8 GPU/9 GPU/10 GPU/11 GPU/12 GPU/13 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 2]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/16 GPU/23 GPU/22 GPU/21 GPU/20 GPU/19 GPU/17 GPU/18 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/18 GPU/17 GPU/23 GPU/22 GPU/21 GPU/20 GPU/19 GPU/16 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/20 GPU/19 GPU/18 GPU/17 GPU/16 GPU/23 GPU/21 GPU/22 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/22 GPU/21 GPU/18 GPU/17 GPU/16 GPU/23 GPU/19 GPU/20 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 2]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/16 GPU/17 GPU/18 GPU/19 GPU/20 GPU/21 GPU/22 GPU/23 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/18 GPU/19 GPU/20 GPU/21 GPU/22 GPU/23 GPU/16 GPU/17 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/20 GPU/21 GPU/22 GPU/23 GPU/16 GPU/17 GPU/18 GPU/19 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/22 GPU/23 GPU/16 GPU/17 GPU/18 GPU/19 GPU/20 GPU/21 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 3]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/24 GPU/31 GPU/30 GPU/29 GPU/28 GPU/27 GPU/25 GPU/26 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/26 GPU/25 GPU/31 GPU/30 GPU/29 GPU/28 GPU/27 GPU/24 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/28 GPU/27 GPU/26 GPU/25 GPU/24 GPU/31 GPU/29 GPU/30 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/30 GPU/29 GPU/26 GPU/25 GPU/24 GPU/31 GPU/27 GPU/28 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 3]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/24 GPU/25 GPU/26 GPU/27 GPU/28 GPU/29 GPU/30 GPU/31 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/26 GPU/27 GPU/28 GPU/29 GPU/30 GPU/31 GPU/24 GPU/25 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/28 GPU/29 GPU/30 GPU/31 GPU/24 GPU/25 GPU/26 GPU/27 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/30 GPU/31 GPU/24 GPU/25 GPU/26 GPU/27 GPU/28 GPU/29 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 4]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/32 GPU/39 GPU/38 GPU/37 GPU/36 GPU/35 GPU/33 GPU/34 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/34 GPU/33 GPU/39 GPU/38 GPU/37 GPU/36 GPU/35 GPU/32 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/36 GPU/35 GPU/34 GPU/33 GPU/32 GPU/39 GPU/37 GPU/38 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/38 GPU/37 GPU/34 GPU/33 GPU/32 GPU/39 GPU/35 GPU/36 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 4]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/32 GPU/33 GPU/34 GPU/35 GPU/36 GPU/37 GPU/38 GPU/39 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/34 GPU/35 GPU/36 GPU/37 GPU/38 GPU/39 GPU/32 GPU/33 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/36 GPU/37 GPU/38 GPU/39 GPU/32 GPU/33 GPU/34 GPU/35 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/38 GPU/39 GPU/32 GPU/33 GPU/34 GPU/35 GPU/36 GPU/37 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 5]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/40 GPU/47 GPU/46 GPU/45 GPU/44 GPU/43 GPU/41 GPU/42 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/42 GPU/41 GPU/47 GPU/46 GPU/45 GPU/44 GPU/43 GPU/40 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/44 GPU/43 GPU/42 GPU/41 GPU/40 GPU/47 GPU/45 GPU/46 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/46 GPU/45 GPU/42 GPU/41 GPU/40 GPU/47 GPU/43 GPU/44 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 5]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/40 GPU/41 GPU/42 GPU/43 GPU/44 GPU/45 GPU/46 GPU/47 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/42 GPU/43 GPU/44 GPU/45 GPU/46 GPU/47 GPU/40 GPU/41 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/44 GPU/45 GPU/46 GPU/47 GPU/40 GPU/41 GPU/42 GPU/43 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/46 GPU/47 GPU/40 GPU/41 GPU/42 GPU/43 GPU/44 GPU/45 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 6]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/48 GPU/55 GPU/54 GPU/53 GPU/52 GPU/51 GPU/49 GPU/50 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/50 GPU/49 GPU/55 GPU/54 GPU/53 GPU/52 GPU/51 GPU/48 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/52 GPU/51 GPU/50 GPU/49 GPU/48 GPU/55 GPU/53 GPU/54 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/54 GPU/53 GPU/50 GPU/49 GPU/48 GPU/55 GPU/51 GPU/52 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 6]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/48 GPU/49 GPU/50 GPU/51 GPU/52 GPU/53 GPU/54 GPU/55 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/50 GPU/51 GPU/52 GPU/53 GPU/54 GPU/55 GPU/48 GPU/49 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/52 GPU/53 GPU/54 GPU/55 GPU/48 GPU/49 GPU/50 GPU/51 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/54 GPU/55 GPU/48 GPU/49 GPU/50 GPU/51 GPU/52 GPU/53 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 7]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/56 GPU/63 GPU/62 GPU/61 GPU/60 GPU/59 GPU/57 GPU/58 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/58 GPU/57 GPU/63 GPU/62 GPU/61 GPU/60 GPU/59 GPU/56 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/60 GPU/59 GPU/58 GPU/57 GPU/56 GPU/63 GPU/61 GPU/62 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/62 GPU/61 GPU/58 GPU/57 GPU/56 GPU/63 GPU/59 GPU/60 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 7]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 0 : NET/1 GPU/56 GPU/57 GPU/58 GPU/59 GPU/60 GPU/61 GPU/62 GPU/63 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 1 : NET/2 GPU/58 GPU/59 GPU/60 GPU/61 GPU/62 GPU/63 GPU/56 GPU/57 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 2 : NET/3 GPU/60 GPU/61 GPU/62 GPU/63 GPU/56 GPU/57 GPU/58 GPU/59 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 3 : NET/4 GPU/62 GPU/63 GPU/56 GPU/57 GPU/58 GPU/59 GPU/60 GPU/61 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->nNodes: 8
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO treeGraph.nChannels: 4, ringGraph.nChannels: 4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 00/08 : 0 7 6 5 4 3 1 2 10 9 15 14 13 12 11 8 16 23 22 21 20 19 17 18 26 25 31 30 29 28 27 24 32 39 38 37 36 35 33 34 42 41 47 46 45 44 43 40 48 55 54 53 52 51 49 50 58 57 63 62 61 60 59 56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 01/08 : 0 8 15 14 13 12 11 9 10 18 17 23 22 21 20 19 16 24 31 30 29 28 27 25 26 34 33 39 38 37 36 35 32 40 47 46 45 44 43 41 42 50 49 55 54 53 52 51 48 56 63 62 61 60 59 57 58 2 1 7 6 5 4 3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 02/08 : 0 7 5 6 14 13 10 9 8 15 11 12 20 19 18 17 16 23 21 22 30 29 26 25 24 31 27 28 36 35 34 33 32 39 37 38 46 45 42 41 40 47 43 44 52 51 50 49 48 55 53 54 62 61 58 57 56 63 59 60 4 3 2 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 03/08 : 0 7 3 4 12 11 10 9 8 15 13 14 22 21 18 17 16 23 19 20 28 27 26 25 24 31 29 30 38 37 34 33 32 39 35 36 44 43 42 41 40 47 45 46 54 53 50 49 48 55 51 52 60 59 58 57 56 63 61 62 6 5 2 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 04/08 : 0 7 6 5 4 3 1 2 10 9 15 14 13 12 11 8 16 23 22 21 20 19 17 18 26 25 31 30 29 28 27 24 32 39 38 37 36 35 33 34 42 41 47 46 45 44 43 40 48 55 54 53 52 51 49 50 58 57 63 62 61 60 59 56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 05/08 : 0 8 15 14 13 12 11 9 10 18 17 23 22 21 20 19 16 24 31 30 29 28 27 25 26 34 33 39 38 37 36 35 32 40 47 46 45 44 43 41 42 50 49 55 54 53 52 51 48 56 63 62 61 60 59 57 58 2 1 7 6 5 4 3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 06/08 : 0 7 5 6 14 13 10 9 8 15 11 12 20 19 18 17 16 23 21 22 30 29 26 25 24 31 27 28 36 35 34 33 32 39 37 38 46 45 42 41 40 47 43 44 52 51 50 49 48 55 53 54 62 61 58 57 56 63 59 60 4 3 2 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 07/08 : 0 7 3 4 12 11 10 9 8 15 13 14 22 21 18 17 16 23 19 20 28 27 26 25 24 31 29 30 38 37 34 33 32 39 35 36 44 43 42 41 40 47 45 46 54 53 50 49 48 55 51 52 60 59 58 57 56 63 61 62 6 5 2 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO mock_comm->config.maxCTAs: 32, mock_comm->config.minCTAs: 1, mock_comm->sharedRes->tpNChannels: 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 00/08 : 0 7 6 5 4 3 1 2 10 9 15 14 13 12 11 8 16 23 22 21 20 19 17 18 26 25 31 30 29 28 27 24 32 39 38 37 36 35 33 34 42 41 47 46 45 44 43 40 48 55 54 53 52 51 49 50 58 57 63 62 61 60 59 56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 01/08 : 0 8 15 14 13 12 11 9 10 18 17 23 22 21 20 19 16 24 31 30 29 28 27 25 26 34 33 39 38 37 36 35 32 40 47 46 45 44 43 41 42 50 49 55 54 53 52 51 48 56 63 62 61 60 59 57 58 2 1 7 6 5 4 3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 02/08 : 0 7 5 6 14 13 10 9 8 15 11 12 20 19 18 17 16 23 21 22 30 29 26 25 24 31 27 28 36 35 34 33 32 39 37 38 46 45 42 41 40 47 43 44 52 51 50 49 48 55 53 54 62 61 58 57 56 63 59 60 4 3 2 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 03/08 : 0 7 3 4 12 11 10 9 8 15 13 14 22 21 18 17 16 23 19 20 28 27 26 25 24 31 29 30 38 37 34 33 32 39 35 36 44 43 42 41 40 47 45 46 54 53 50 49 48 55 51 52 60 59 58 57 56 63 61 62 6 5 2 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 04/08 : 0 7 6 5 4 3 1 2 10 9 15 14 13 12 11 8 16 23 22 21 20 19 17 18 26 25 31 30 29 28 27 24 32 39 38 37 36 35 33 34 42 41 47 46 45 44 43 40 48 55 54 53 52 51 49 50 58 57 63 62 61 60 59 56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 05/08 : 0 8 15 14 13 12 11 9 10 18 17 23 22 21 20 19 16 24 31 30 29 28 27 25 26 34 33 39 38 37 36 35 32 40 47 46 45 44 43 41 42 50 49 55 54 53 52 51 48 56 63 62 61 60 59 57 58 2 1 7 6 5 4 3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 06/08 : 0 7 5 6 14 13 10 9 8 15 11 12 20 19 18 17 16 23 21 22 30 29 26 25 24 31 27 28 36 35 34 33 32 39 37 38 46 45 42 41 40 47 43 44 52 51 50 49 48 55 53 54 62 61 58 57 56 63 59 60 4 3 2 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 07/08 : 0 7 3 4 12 11 10 9 8 15 13 14 22 21 18 17 16 23 19 20 28 27 26 25 24 31 29 30 38 37 34 33 32 39 35 36 44 43 42 41 40 47 45 46 54 53 50 49 48 55 51 52 60 59 58 57 56 63 61 62 6 5 2 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_0 Trees [0] 1/32/-1->0->-1 [1] 1/-1/-1->0->7 [2] 1/-1/-1->0->7 [3] 1/-1/-1->0->7 [4] 1/-1/-1->0->8 [5] 1/-1/-1->0->7 [6] 1/-1/-1->0->7 [7] 1/-1/-1->0->7
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_1 Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_2 Trees [0] 3/-1/-1->2->1 [1] 3/34/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->10 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_3 Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] 4/-1/-1->3->2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_4 Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/36/-1->4->-1 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->12 [7] 5/-1/-1->4->3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_5 Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] -1/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] -1/-1/-1->5->4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_6 Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/38/-1->6->-1 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->14
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_7 Trees [0] -1/-1/-1->7->6 [1] 0/-1/-1->7->6 [2] 0/-1/-1->7->6 [3] 0/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] 0/-1/-1->7->6 [6] 0/-1/-1->7->6 [7] 0/-1/-1->7->6
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_8 Trees [0] 9/-1/-1->8->17 [1] 9/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->15 [4] 9/0/-1->8->24 [5] 9/-1/-1->8->15 [6] 9/-1/-1->8->15 [7] 9/-1/-1->8->15
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_9 Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/16/-1->9->8 [5] -1/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_10 Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->19 [2] 11/-1/-1->10->9 [3] 11/-1/-1->10->9 [4] 11/-1/-1->10->9 [5] 11/2/-1->10->26 [6] 11/-1/-1->10->9 [7] 11/-1/-1->10->9
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_11 Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] -1/-1/-1->11->10 [3] 12/-1/-1->11->10 [4] 12/-1/-1->11->10 [5] 12/18/-1->11->10 [6] -1/-1/-1->11->10 [7] 12/-1/-1->11->10
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_12 Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->21 [3] 13/-1/-1->12->11 [4] 13/-1/-1->12->11 [5] 13/-1/-1->12->11 [6] 13/4/-1->12->28 [7] 13/-1/-1->12->11
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_13 Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] -1/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] 14/-1/-1->13->12 [6] 14/20/-1->13->12 [7] -1/-1/-1->13->12
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_14 Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->23 [4] 15/-1/-1->14->13 [5] 15/-1/-1->14->13 [6] 15/-1/-1->14->13 [7] 15/6/-1->14->30
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_15 Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] 8/-1/-1->15->14 [3] 8/-1/-1->15->14 [4] -1/-1/-1->15->14 [5] 8/-1/-1->15->14 [6] 8/-1/-1->15->14 [7] 8/22/-1->15->14
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_16 Trees [0] 17/24/-1->16->33 [1] 17/-1/-1->16->23 [2] 17/-1/-1->16->23 [3] 17/-1/-1->16->23 [4] 17/-1/-1->16->9 [5] 17/-1/-1->16->23 [6] 17/-1/-1->16->23 [7] 17/-1/-1->16->23
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_17 Trees [0] 18/8/-1->17->16 [1] -1/-1/-1->17->16 [2] 18/-1/-1->17->16 [3] 18/-1/-1->17->16 [4] 18/-1/-1->17->16 [5] -1/-1/-1->17->16 [6] 18/-1/-1->17->16 [7] 18/-1/-1->17->16
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_18 Trees [0] 19/-1/-1->18->17 [1] 19/26/-1->18->35 [2] 19/-1/-1->18->17 [3] 19/-1/-1->18->17 [4] 19/-1/-1->18->17 [5] 19/-1/-1->18->11 [6] 19/-1/-1->18->17 [7] 19/-1/-1->18->17
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_19 Trees [0] 20/-1/-1->19->18 [1] 20/10/-1->19->18 [2] -1/-1/-1->19->18 [3] 20/-1/-1->19->18 [4] 20/-1/-1->19->18 [5] 20/-1/-1->19->18 [6] -1/-1/-1->19->18 [7] 20/-1/-1->19->18
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_20 Trees [0] 21/-1/-1->20->19 [1] 21/-1/-1->20->19 [2] 21/28/-1->20->37 [3] 21/-1/-1->20->19 [4] 21/-1/-1->20->19 [5] 21/-1/-1->20->19 [6] 21/-1/-1->20->13 [7] 21/-1/-1->20->19
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_21 Trees [0] 22/-1/-1->21->20 [1] 22/-1/-1->21->20 [2] 22/12/-1->21->20 [3] -1/-1/-1->21->20 [4] 22/-1/-1->21->20 [5] 22/-1/-1->21->20 [6] 22/-1/-1->21->20 [7] -1/-1/-1->21->20
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_22 Trees [0] 23/-1/-1->22->21 [1] 23/-1/-1->22->21 [2] 23/-1/-1->22->21 [3] 23/30/-1->22->39 [4] 23/-1/-1->22->21 [5] 23/-1/-1->22->21 [6] 23/-1/-1->22->21 [7] 23/-1/-1->22->15
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_23 Trees [0] -1/-1/-1->23->22 [1] 16/-1/-1->23->22 [2] 16/-1/-1->23->22 [3] 16/14/-1->23->22 [4] -1/-1/-1->23->22 [5] 16/-1/-1->23->22 [6] 16/-1/-1->23->22 [7] 16/-1/-1->23->22
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_24 Trees [0] 25/-1/-1->24->16 [1] 25/-1/-1->24->31 [2] 25/-1/-1->24->31 [3] 25/-1/-1->24->31 [4] 25/8/-1->24->56 [5] 25/-1/-1->24->31 [6] 25/-1/-1->24->31 [7] 25/-1/-1->24->31
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_25 Trees [0] 26/-1/-1->25->24 [1] -1/-1/-1->25->24 [2] 26/-1/-1->25->24 [3] 26/-1/-1->25->24 [4] 26/40/-1->25->24 [5] -1/-1/-1->25->24 [6] 26/-1/-1->25->24 [7] 26/-1/-1->25->24
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_26 Trees [0] 27/-1/-1->26->25 [1] 27/-1/-1->26->18 [2] 27/-1/-1->26->25 [3] 27/-1/-1->26->25 [4] 27/-1/-1->26->25 [5] 27/10/-1->26->58 [6] 27/-1/-1->26->25 [7] 27/-1/-1->26->25
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_27 Trees [0] 28/-1/-1->27->26 [1] 28/-1/-1->27->26 [2] -1/-1/-1->27->26 [3] 28/-1/-1->27->26 [4] 28/-1/-1->27->26 [5] 28/42/-1->27->26 [6] -1/-1/-1->27->26 [7] 28/-1/-1->27->26
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_28 Trees [0] 29/-1/-1->28->27 [1] 29/-1/-1->28->27 [2] 29/-1/-1->28->20 [3] 29/-1/-1->28->27 [4] 29/-1/-1->28->27 [5] 29/-1/-1->28->27 [6] 29/12/-1->28->60 [7] 29/-1/-1->28->27
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_29 Trees [0] 30/-1/-1->29->28 [1] 30/-1/-1->29->28 [2] 30/-1/-1->29->28 [3] -1/-1/-1->29->28 [4] 30/-1/-1->29->28 [5] 30/-1/-1->29->28 [6] 30/44/-1->29->28 [7] -1/-1/-1->29->28
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_30 Trees [0] 31/-1/-1->30->29 [1] 31/-1/-1->30->29 [2] 31/-1/-1->30->29 [3] 31/-1/-1->30->22 [4] 31/-1/-1->30->29 [5] 31/-1/-1->30->29 [6] 31/-1/-1->30->29 [7] 31/14/-1->30->62
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_31 Trees [0] -1/-1/-1->31->30 [1] 24/-1/-1->31->30 [2] 24/-1/-1->31->30 [3] 24/-1/-1->31->30 [4] -1/-1/-1->31->30 [5] 24/-1/-1->31->30 [6] 24/-1/-1->31->30 [7] 24/46/-1->31->30
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_32 Trees [0] 33/48/-1->32->0 [1] 33/-1/-1->32->39 [2] 33/-1/-1->32->39 [3] 33/-1/-1->32->39 [4] 33/-1/-1->32->40 [5] 33/-1/-1->32->39 [6] 33/-1/-1->32->39 [7] 33/-1/-1->32->39
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_33 Trees [0] 34/16/-1->33->32 [1] -1/-1/-1->33->32 [2] 34/-1/-1->33->32 [3] 34/-1/-1->33->32 [4] 34/-1/-1->33->32 [5] -1/-1/-1->33->32 [6] 34/-1/-1->33->32 [7] 34/-1/-1->33->32
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_34 Trees [0] 35/-1/-1->34->33 [1] 35/50/-1->34->2 [2] 35/-1/-1->34->33 [3] 35/-1/-1->34->33 [4] 35/-1/-1->34->33 [5] 35/-1/-1->34->42 [6] 35/-1/-1->34->33 [7] 35/-1/-1->34->33
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_35 Trees [0] 36/-1/-1->35->34 [1] 36/18/-1->35->34 [2] -1/-1/-1->35->34 [3] 36/-1/-1->35->34 [4] 36/-1/-1->35->34 [5] 36/-1/-1->35->34 [6] -1/-1/-1->35->34 [7] 36/-1/-1->35->34
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_36 Trees [0] 37/-1/-1->36->35 [1] 37/-1/-1->36->35 [2] 37/52/-1->36->4 [3] 37/-1/-1->36->35 [4] 37/-1/-1->36->35 [5] 37/-1/-1->36->35 [6] 37/-1/-1->36->44 [7] 37/-1/-1->36->35
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_37 Trees [0] 38/-1/-1->37->36 [1] 38/-1/-1->37->36 [2] 38/20/-1->37->36 [3] -1/-1/-1->37->36 [4] 38/-1/-1->37->36 [5] 38/-1/-1->37->36 [6] 38/-1/-1->37->36 [7] -1/-1/-1->37->36
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_38 Trees [0] 39/-1/-1->38->37 [1] 39/-1/-1->38->37 [2] 39/-1/-1->38->37 [3] 39/54/-1->38->6 [4] 39/-1/-1->38->37 [5] 39/-1/-1->38->37 [6] 39/-1/-1->38->37 [7] 39/-1/-1->38->46
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_39 Trees [0] -1/-1/-1->39->38 [1] 32/-1/-1->39->38 [2] 32/-1/-1->39->38 [3] 32/22/-1->39->38 [4] -1/-1/-1->39->38 [5] 32/-1/-1->39->38 [6] 32/-1/-1->39->38 [7] 32/-1/-1->39->38
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_40 Trees [0] 41/-1/-1->40->49 [1] 41/-1/-1->40->47 [2] 41/-1/-1->40->47 [3] 41/-1/-1->40->47 [4] 41/32/-1->40->25 [5] 41/-1/-1->40->47 [6] 41/-1/-1->40->47 [7] 41/-1/-1->40->47
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_41 Trees [0] 42/-1/-1->41->40 [1] -1/-1/-1->41->40 [2] 42/-1/-1->41->40 [3] 42/-1/-1->41->40 [4] 42/48/-1->41->40 [5] -1/-1/-1->41->40 [6] 42/-1/-1->41->40 [7] 42/-1/-1->41->40
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_42 Trees [0] 43/-1/-1->42->41 [1] 43/-1/-1->42->51 [2] 43/-1/-1->42->41 [3] 43/-1/-1->42->41 [4] 43/-1/-1->42->41 [5] 43/34/-1->42->27 [6] 43/-1/-1->42->41 [7] 43/-1/-1->42->41
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_43 Trees [0] 44/-1/-1->43->42 [1] 44/-1/-1->43->42 [2] -1/-1/-1->43->42 [3] 44/-1/-1->43->42 [4] 44/-1/-1->43->42 [5] 44/50/-1->43->42 [6] -1/-1/-1->43->42 [7] 44/-1/-1->43->42
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_44 Trees [0] 45/-1/-1->44->43 [1] 45/-1/-1->44->43 [2] 45/-1/-1->44->53 [3] 45/-1/-1->44->43 [4] 45/-1/-1->44->43 [5] 45/-1/-1->44->43 [6] 45/36/-1->44->29 [7] 45/-1/-1->44->43
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_45 Trees [0] 46/-1/-1->45->44 [1] 46/-1/-1->45->44 [2] 46/-1/-1->45->44 [3] -1/-1/-1->45->44 [4] 46/-1/-1->45->44 [5] 46/-1/-1->45->44 [6] 46/52/-1->45->44 [7] -1/-1/-1->45->44
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_46 Trees [0] 47/-1/-1->46->45 [1] 47/-1/-1->46->45 [2] 47/-1/-1->46->45 [3] 47/-1/-1->46->55 [4] 47/-1/-1->46->45 [5] 47/-1/-1->46->45 [6] 47/-1/-1->46->45 [7] 47/38/-1->46->31
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_47 Trees [0] -1/-1/-1->47->46 [1] 40/-1/-1->47->46 [2] 40/-1/-1->47->46 [3] 40/-1/-1->47->46 [4] -1/-1/-1->47->46 [5] 40/-1/-1->47->46 [6] 40/-1/-1->47->46 [7] 40/54/-1->47->46
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_48 Trees [0] 49/56/-1->48->32 [1] 49/-1/-1->48->55 [2] 49/-1/-1->48->55 [3] 49/-1/-1->48->55 [4] 49/-1/-1->48->41 [5] 49/-1/-1->48->55 [6] 49/-1/-1->48->55 [7] 49/-1/-1->48->55
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_49 Trees [0] 50/40/-1->49->48 [1] -1/-1/-1->49->48 [2] 50/-1/-1->49->48 [3] 50/-1/-1->49->48 [4] 50/-1/-1->49->48 [5] -1/-1/-1->49->48 [6] 50/-1/-1->49->48 [7] 50/-1/-1->49->48
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_50 Trees [0] 51/-1/-1->50->49 [1] 51/58/-1->50->34 [2] 51/-1/-1->50->49 [3] 51/-1/-1->50->49 [4] 51/-1/-1->50->49 [5] 51/-1/-1->50->43 [6] 51/-1/-1->50->49 [7] 51/-1/-1->50->49
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_51 Trees [0] 52/-1/-1->51->50 [1] 52/42/-1->51->50 [2] -1/-1/-1->51->50 [3] 52/-1/-1->51->50 [4] 52/-1/-1->51->50 [5] 52/-1/-1->51->50 [6] -1/-1/-1->51->50 [7] 52/-1/-1->51->50
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_52 Trees [0] 53/-1/-1->52->51 [1] 53/-1/-1->52->51 [2] 53/60/-1->52->36 [3] 53/-1/-1->52->51 [4] 53/-1/-1->52->51 [5] 53/-1/-1->52->51 [6] 53/-1/-1->52->45 [7] 53/-1/-1->52->51
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_53 Trees [0] 54/-1/-1->53->52 [1] 54/-1/-1->53->52 [2] 54/44/-1->53->52 [3] -1/-1/-1->53->52 [4] 54/-1/-1->53->52 [5] 54/-1/-1->53->52 [6] 54/-1/-1->53->52 [7] -1/-1/-1->53->52
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_54 Trees [0] 55/-1/-1->54->53 [1] 55/-1/-1->54->53 [2] 55/-1/-1->54->53 [3] 55/62/-1->54->38 [4] 55/-1/-1->54->53 [5] 55/-1/-1->54->53 [6] 55/-1/-1->54->53 [7] 55/-1/-1->54->47
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_55 Trees [0] -1/-1/-1->55->54 [1] 48/-1/-1->55->54 [2] 48/-1/-1->55->54 [3] 48/46/-1->55->54 [4] -1/-1/-1->55->54 [5] 48/-1/-1->55->54 [6] 48/-1/-1->55->54 [7] 48/-1/-1->55->54
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_56 Trees [0] 57/-1/-1->56->48 [1] 57/-1/-1->56->63 [2] 57/-1/-1->56->63 [3] 57/-1/-1->56->63 [4] 57/24/-1->56->-1 [5] 57/-1/-1->56->63 [6] 57/-1/-1->56->63 [7] 57/-1/-1->56->63
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_57 Trees [0] 58/-1/-1->57->56 [1] -1/-1/-1->57->56 [2] 58/-1/-1->57->56 [3] 58/-1/-1->57->56 [4] 58/-1/-1->57->56 [5] -1/-1/-1->57->56 [6] 58/-1/-1->57->56 [7] 58/-1/-1->57->56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_58 Trees [0] 59/-1/-1->58->57 [1] 59/-1/-1->58->50 [2] 59/-1/-1->58->57 [3] 59/-1/-1->58->57 [4] 59/-1/-1->58->57 [5] 59/26/-1->58->-1 [6] 59/-1/-1->58->57 [7] 59/-1/-1->58->57
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_59 Trees [0] 60/-1/-1->59->58 [1] 60/-1/-1->59->58 [2] -1/-1/-1->59->58 [3] 60/-1/-1->59->58 [4] 60/-1/-1->59->58 [5] 60/-1/-1->59->58 [6] -1/-1/-1->59->58 [7] 60/-1/-1->59->58
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_60 Trees [0] 61/-1/-1->60->59 [1] 61/-1/-1->60->59 [2] 61/-1/-1->60->52 [3] 61/-1/-1->60->59 [4] 61/-1/-1->60->59 [5] 61/-1/-1->60->59 [6] 61/28/-1->60->-1 [7] 61/-1/-1->60->59
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_61 Trees [0] 62/-1/-1->61->60 [1] 62/-1/-1->61->60 [2] 62/-1/-1->61->60 [3] -1/-1/-1->61->60 [4] 62/-1/-1->61->60 [5] 62/-1/-1->61->60 [6] 62/-1/-1->61->60 [7] -1/-1/-1->61->60
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_62 Trees [0] 63/-1/-1->62->61 [1] 63/-1/-1->62->61 [2] 63/-1/-1->62->61 [3] 63/-1/-1->62->54 [4] 63/-1/-1->62->61 [5] 63/-1/-1->62->61 [6] 63/-1/-1->62->61 [7] 63/30/-1->62->-1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_63 Trees [0] -1/-1/-1->63->62 [1] 56/-1/-1->63->62 [2] 56/-1/-1->63->62 [3] 56/-1/-1->63->62 [4] -1/-1/-1->63->62 [5] 56/-1/-1->63->62 [6] 56/-1/-1->63->62 [7] 56/-1/-1->63->62
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->buffSizes[NCCL_PROTO_LL]: 524288
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->buffSizes[NCCL_PROTO_LL128]: 4915200
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->buffSizes[NCCL_PROTO_SIMPLE]: 4194304
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->p2pChunkSize: 131072
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->p2pnChannels: 8, comm->p2pnChannelsPerPeer: 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 8 coll channels, 0 collnet channels, 0 nvls channels, 8 p2p channels, 1 p2p channels per peer
run.sh: line 11: 2563676 Segmentation fault (core dumped) LD_LIBRARY_PATH="$SCRIPT_DIR/../build/lib/" NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=MOCK_ROOT,MOCK NCCL_CROSS_NIC=1 NCCL_TOPO_FILE=$SCRIPT_DIR/topo_8A100_4CX6.xml NCCL_NVLS_ENABLE=0 NCCL_NUM_MOCK_GPU=64 NCCL_NUM_MOCK_NODE=8 $1 ./MultiDevOneP $CORE_FILE
If I run the script with a different topology XML file (e.g. topo_4A100_2CX6.xml), it fails with a different error:
NCCL version 2.20.5-MockNCCL+cuda12.6
inspur-gpu-server-15:2570998:2571000 [0] misc/cudawrap.cc:34 NCCL WARN Cuda failure 'no CUDA-capable device is detected'
inspur-gpu-server-15:2570998:2571000 [0] init.cc:270 NCCL WARN Cuda failure 3 'initialization error'
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Mocking start: mock_nNodes(8), mock_nRanks(64), mock_nRanksPerNode(8).
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO info_index(0): 0x7f08f0099440, rank: 0, cudaDev: 0, nvmlDev: 0, gdrSupport: 1,hostHash: 10710501475456838185, pidHash: 13668843625570461005, busId: ffffffffffffffff:ff:ff.f, cudaCompCap: -1.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO comm_index(0): 0x55eaf7ae4c00, node: 0, nNodes: 0, localRank: 0, localRanks: 0, maxLocalRanks: 0, intraRank: 0, intraRanks: 0, ncclCollNet: 0.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO nNodes: 64.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO comm_index(0): 0x55eaf7ae4c00, node: 0, nNodes: 0, localRank: 0, localRanks: 0, maxLocalRanks: 0, intraRank: 0, intraRanks: 1, ncclCollNet: 0.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO After ncclTopoGetSystem, system_info{nodes[NET].count: 3, nodes[GPU].count: 4, maxBW: 0.000000, totalBW: 0.000000}.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[0].id: 0000:13:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[1].id: 0000:19:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[2].id: 0000:48:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[3].id: 0000:4d:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO After ncclTopoGetSystem, system_info{nodes[NET].count: 3, nodes[GPU].count: 4, maxBW: 0.000000, totalBW: 0.000000}.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[0].id: 0000:13:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[1].id: 0000:19:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[2].id: 0000:48:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[3].id: 0000:4d:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO After ncclTopoSearchInit, system_info{nodes[NET].count: 3, nodes[GPU].count: 4, maxBW: 24.000000, totalBW: 240.000000}.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO GPU/13000 :GPU/13000 (0/5000.000000/LOC) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (4/24.000000/PXB) NET/2 (6/24.000000/PXN)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO GPU/19000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (0/5000.000000/LOC) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (4/24.000000/PXB) NET/2 (6/24.000000/PXN)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO GPU/48000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (0/5000.000000/LOC) GPU/4D000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PXN) NET/2 (4/24.000000/PXB)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO GPU/4D000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (0/5000.000000/LOC) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PXN) NET/2 (4/24.000000/PXB)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO NET/0 :GPU/13000 (5/24.000000/PHB) GPU/19000 (5/24.000000/PHB) GPU/48000 (5/24.000000/PHB) GPU/4D000 (5/24.000000/PHB) CPU/0 (2/24.000000/PHB) NET/0 (0/5000.000000/LOC) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO NET/1 :GPU/13000 (4/24.000000/PXB) GPU/19000 (4/24.000000/PXB) GPU/48000 (6/24.000000/PHB) GPU/4D000 (6/24.000000/PHB) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (0/5000.000000/LOC) NET/2 (6/24.000000/PHB)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO NET/2 :GPU/13000 (6/24.000000/PHB) GPU/19000 (6/24.000000/PHB) GPU/48000 (4/24.000000/PXB) GPU/4D000 (4/24.000000/PXB) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PHB) NET/2 (0/5000.000000/LOC)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO s->nodes[GPU].nodes[0].gpu.rank: 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO s->nodes[GPU].nodes[1].gpu.rank: 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO s->nodes[GPU].nodes[2].gpu.rank: 2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO s->nodes[GPU].nodes[3].gpu.rank: 3
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO comm->compCap: 80, comm->minCompCap: 80, comm->maxCompCap: 80
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO mock_comm[0].compCap: 80, mock_comm[0].minCompCap: 80, mock_comm[0].maxCompCap: 80
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/0 GPU/3 GPU/1 GPU/2 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/2 GPU/1 GPU/3 GPU/0 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/2 GPU/3 GPU/0 GPU/1 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 0]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/0 GPU/3 GPU/1 GPU/2 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/2 GPU/1 GPU/3 GPU/0 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 0]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/2 GPU/3 GPU/0 GPU/1 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 1]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/8 GPU/11 GPU/9 GPU/10 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/10 GPU/9 GPU/11 GPU/8 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 1]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/8 GPU/9 GPU/10 GPU/11 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/10 GPU/11 GPU/8 GPU/9 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 2]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/16 GPU/19 GPU/17 GPU/18 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/18 GPU/17 GPU/19 GPU/16 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 2]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/16 GPU/17 GPU/18 GPU/19 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/18 GPU/19 GPU/16 GPU/17 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 3]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/24 GPU/27 GPU/25 GPU/26 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/26 GPU/25 GPU/27 GPU/24 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 3]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/24 GPU/25 GPU/26 GPU/27 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/26 GPU/27 GPU/24 GPU/25 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 4]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/32 GPU/35 GPU/33 GPU/34 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/34 GPU/33 GPU/35 GPU/32 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 4]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/32 GPU/33 GPU/34 GPU/35 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/34 GPU/35 GPU/32 GPU/33 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 5]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/40 GPU/43 GPU/41 GPU/42 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/42 GPU/41 GPU/43 GPU/40 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 5]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/40 GPU/41 GPU/42 GPU/43 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/42 GPU/43 GPU/40 GPU/41 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 6]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/48 GPU/51 GPU/49 GPU/50 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/50 GPU/49 GPU/51 GPU/48 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 6]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/48 GPU/49 GPU/50 GPU/51 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/50 GPU/51 GPU/48 GPU/49 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 7]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/56 GPU/59 GPU/57 GPU/58 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/58 GPU/57 GPU/59 GPU/56 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 7]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 0 : NET/1 GPU/56 GPU/57 GPU/58 GPU/59 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO 1 : NET/2 GPU/58 GPU/59 GPU/56 GPU/57 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO comm->nNodes: 8
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO treeGraph.nChannels: 2, ringGraph.nChannels: 2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Channel 00/04 : 0 3 1 2 10 9 11 8 16 19 17 18 26 25 27 24 32 35 33 34 42 41 43 40 48 51 49 50 58 57 59 56 0 3 1 2 10 9 11 8 16 19 17 18 26 25 27 24 32 35 33 34 42 41 43 40 48 51 49 50 58 57 59 56
inspur-gpu-server-15:2570998:2571000 [0] graph/rings.cc:58 NCCL WARN Error : ring 0 does not contain rank 4
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO graph/connect.cc:507 -> 3
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO init.cc:1492 -> 3
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO init.cc:1913 -> 3
inspur-gpu-server-15:2570998:2570998 [0] NCCL INFO group.cc:418 -> 3
inspur-gpu-server-15:2570998:2570998 [0] NCCL INFO group.cc:95 -> 3
inspur-gpu-server-15:2570998:2570998 [0] NCCL INFO init.cc:2254 -> 3
Failed, NCCL error MultiDevOneP.cc:134 'internal error - please report this issue to the NCCL developers'
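One possible reading of this log, not confirmed by the maintainers: the run mocks 8 ranks per node (`mock_nRanksPerNode(8)`) while topo_4A100_2CX6.xml reports only 4 GPUs (`nodes[GPU].count: 4`), so each node contributes just 4 of its 8 ranks to the ring. The sketch below is a hypothetical helper (`missing_ranks` is not part of SimCCL or NCCL) that parses the `Channel 00/04` line above and lists the ranks that never appear in it.

```python
import re

def missing_ranks(channel_line, n_ranks):
    """Return ranks in [0, n_ranks) that never appear in a 'Channel xx/yy : ...' log line."""
    match = re.search(r"Channel \d+/\d+ :\s*(.*)$", channel_line)
    if match is None:
        raise ValueError("not a Channel line")
    # The ring order may repeat ranks, so collect them into a set.
    seen = {int(tok) for tok in match.group(1).split()}
    return [r for r in range(n_ranks) if r not in seen]

# Ring order copied from the 'Channel 00/04' line in the log above.
ring = ("0 3 1 2 10 9 11 8 16 19 17 18 26 25 27 24 "
        "32 35 33 34 42 41 43 40 48 51 49 50 58 57 59 56")
line = "NCCL INFO Channel 00/04 : " + ring + " " + ring

missing = missing_ranks(line, 64)  # 64 = NCCL_NUM_MOCK_GPU in run.sh
print(missing[:4], len(missing))
```

If this reading is right, the first missing rank is 4, which matches the `ring 0 does not contain rank 4` warning. Choosing `NCCL_NUM_MOCK_GPU` and `NCCL_NUM_MOCK_NODE` so that ranks per node equals the GPU count in the topology file might avoid this particular failure, though that is untested here.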
In addition, when I run it in a VM, I get yet another error:
NCCL version 2.20.5-MockNCCL+cuda12.9
localhost:799:801 [0] NCCL INFO Mocking start: mock_nNodes(1), mock_nRanks(1), mock_nRanksPerNode(1).
localhost:799:801 [0] NCCL INFO info_index(0): 0x7fa4fc0038d0, rank: 0, cudaDev: 0, nvmlDev: 0, gdrSupport: 1,hostHash: 8982114052393301127, pidHash: 7559370754094648493, busId: ffffffffffffffff:ff:ff.f, cudaCompCap: -1.
run.sh: line 11: 799 Bus error LD_LIBRARY_PATH="$SCRIPT_DIR/../build/lib/" NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=MOCK_ROOT,MOCK NCCL_CROSS_NIC=1 NCCL_TOPO_FILE=$SCRIPT_DIR/topo_8A100_4CX6.xml NCCL_NVLS_ENABLE=0 NCCL_NUM_MOCK_GPU=1 NCCL_NUM_MOCK_NODE=1 $1 ./MultiDevOneP $CORE_FILE
Expected Behavior
bash run.sh should run to completion without errors.
Actual Behavior
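The run fails with a segmentation fault, an NCCL internal error ("ring 0 does not contain rank 4"), or a bus error, depending on the topology file and environment (see the logs above).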
Environment
CUDA: 12.6, installed in /usr/local/cuda
OS: 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux