Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEST: add error checking and exit in gtest #1083

Merged
merged 7 commits into from
Feb 28, 2025

Conversation

samnordmann
Copy link
Collaborator

@samnordmann samnordmann commented Feb 26, 2025

When a gtest unit is skipped or failed, using gtest macros like GTEST_ASSERT_EQ, or GTEST_SKIP, the program only exits the current function, but not the whole test. Therefore, the error/skip must be propagated through all frames of the unit test. Bottom line, each time we call a function that is susceptible to call one of this macro, we need to check on return the test status and return if relevant.

Related issue:

@samnordmann
Copy link
Collaborator Author

samnordmann commented Feb 27, 2025

Conclusion: there was a bad handling of error when there is no port available in the device. With this patch, we can see that all gtest are successfully skipped, with the message

../../../test/gtest/tl/mlx5/test_tl_mlx5.cc:43: SkippedSkipped
no ib devices

For the record, the stack detecting that there is no port open on the ib_device is the following:

#0  ucc_tl_mlx5_check_port_active (ctx=0x555556c181c0, port_num=1) at ../../../../../src/components/tl/mlx5/tl_mlx5_ib.c:100
#1  0x00007ffff6cf46a2 in ucc_tl_mlx5_get_active_port (ctx=0x555556c181c0) at ../../../../../src/components/tl/mlx5/tl_mlx5_ib.c:110
#2  0x00007ffff6cf430a in create_ctx_for_dev (ib_dev=0x555556c13c00, ib_ctx=0x555556c0cfa0) at ../../../../../src/components/tl/mlx5/tl_mlx5_ib.c:32
#3  0x00007ffff6cf4407 in ucc_tl_mlx5_create_ibv_ctx (ib_devname=0x7fffffffd7d8, ctx=0x555556c0cfa0, lib=0x555556c0cf28) at ../../../../../src/components/tl/mlx5/tl_mlx5_ib.c:59
#4  0x00005555561095a9 in test_tl_mlx5::SetUp (this=0x555556c0cf10) at ../../../test/gtest/tl/mlx5/test_tl_mlx5.cc:41
#5  0x000055555610b9c4 in test_tl_mlx5_qp::SetUp (this=0x555556c0cf10) at ../../../test/gtest/tl/mlx5/test_tl_mlx5_qps.h:14
#6  0x000055555610baba in test_tl_mlx5_rc_qp::SetUp (this=0x555556c0cf10) at ../../../test/gtest/tl/mlx5/test_tl_mlx5_qps.h:55

@samnordmann
Copy link
Collaborator Author

samnordmann commented Feb 28, 2025

The CI is now segfault-free, so this patch should unblock the other PRs

One error remains:

12:22:23  [ RUN      ] test_tl_mlx5_rdma_write.RdmaWriteWqe/0
12:22:23  /opt/nvidia/src/ucc/test/gtest/tl/mlx5/test_tl_mlx5_qps.h:90: Failure
12:22:23  Expected equality of these values:
12:22:23    create_rc_umr_qp(ctx, pd, cq, port, &umr_qp.qp, &umr_qp_conf, &lib)
12:22:23      Which is: -6
12:22:23    UCC_OK
12:22:23      Which is: 0
12:22:23  [  FAILED  ] test_tl_mlx5_rdma_write.RdmaWriteWqe/0, where GetParam() = 1 (7 ms)

which is probably due to a legit error in the node. If this error persist we could decide to skip and not assert in case of error status

@Sergei-Lebedev Sergei-Lebedev merged commit 8892c73 into openucx:master Feb 28, 2025
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants