You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[ Upstream commit 4dab26b ]
In workloads where there are many processes establishing connections using
RDMA CM in parallel (large scale MPI), there can be heavy contention for
mad_agent_lock in cm_alloc_msg.
This contention can occur while inside of a spin_lock_irq region, leading
to interrupts being disabled for extended durations on many
cores. Furthermore, it leads to the serialization of rdma_create_ah calls,
which has negative performance impacts for NICs which are capable of
processing multiple address handle creations in parallel.
The end result is the machine becoming unresponsive, hung task warnings,
netdev TX timeouts, etc.
Since the lock appears to be only for protection from cm_remove_one, it
can be changed to a rwlock to resolve these issues.
Reproducer:
Server:
for i in $(seq 1 512); do
ucmatose -c 32 -p $((i + 5000)) &
done
Client:
for i in $(seq 1 512); do
ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
done
Fixes: 76039ac ("IB/cm: Protect cm_dev, cm_ports and mad_agent with kref and lock")
Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Jacob Moroni <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Reviewed-by: Zhu Yanjun <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
0 commit comments