@mini-1235
Contributor

@mini-1235 mini-1235 commented Sep 8, 2025


Basic Info

| Info | Please fill out this column |
| --- | --- |
| Ticket(s) this addresses | closes #5509 |
| Primary OS tested on | Ubuntu |
| Robotic platform tested on | |
| Does this PR contain AI generated software? | No |
| Was this PR description generated by AI software? | Out of respect for maintainers, AI for human-to-human communications are banned |

Description of contribution in a few bullet points

  • Add Zenoh to the CI matrix
  • Fix tests so they pass with Zenoh

Description of documentation updates required from your changes

Description of how this change was tested


Future work that may be required in bullet points

For Maintainers:

  • Check that any new parameters added are updated in docs.nav2.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins is added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists
  • Should this be backported to current distributions? If so, tag with backport-*.

@mini-1235
Contributor Author

Two of the tests are still failing, so this PR is not ready for review yet.

Member

@SteveMacenski SteveMacenski left a comment

Please leave the default as is before merging, but I understand it’s useful to test in CI for now.

Also, why did the docking test need to be updated? Did you find root cause or is that a workaround for zenoh to work? If a workaround, that might indicate a problem in the RMW if there’s not a logical error in the test.

@mini-1235
Contributor Author

mini-1235 commented Sep 11, 2025

> Please leave the default as is before merging, but I understand it’s useful to test in CI for now.

Will do after all tests pass

> Also, why did the docking test need to be updated? Did you find root cause or is that a workaround for zenoh to work? If a workaround, that might indicate a problem in the RMW if there’s not a logical error in the test.

I will explain the changes I made to the test here:

  • test_distance_controller: originally failed with "Input t_sec is too large or too small for tf2::Duration". The cause was a missing declare_parameter for the transform tolerance.

  • test_astar: a thread panicked; this is related to https://github.com/ros2/rmw_zenoh?tab=readme-ov-file#crash-when-program-terminates

  • test_controller: flaky. It fails when the subscriber does not receive the latest costmap message published here:

collision_tester->publishCostmap();

I can also reproduce this with Cyclone DDS when running the test 1000 times in a row.

We could decrease initial_transform_timeout, or increase queries_default_timeout and override the default Zenoh config, but I am not sure these are the best solutions.

@SteveMacenski
Member

Ok sounds good to me. I might prefer spinning for those 10ms rather than having a background spin & 10ms sleeps, but I generally don’t nitpick over tests. Are you working on the last failure in CI? The rest LGTM

@mini-1235
Contributor Author

> Are you working on the last failure in CI?

Yes. It looks like we need to override the Zenoh config, at least for CI (see ros2/rmw_zenoh#783 (comment)). What do you think about that?

@SteveMacenski
Member

It seems to me like something that should be addressed in the RMW / Zenoh layers. If this is a quirk of ROS 2, then either that should change, RMW Zenoh should adjust, or the default configurations should handle it.

@mini-1235
Contributor Author

> It seems to me like something that should be addressed in the RMW / Zenoh layers. If this is a quirk of ROS 2, then either that should change, RMW Zenoh should adjust, or the default configurations should handle it.

I share the same opinion. Admittedly, I am not a zenoh expert, so I think I might need to leave this PR open until we can find a solution together with the zenoh maintainers

@SteveMacenski
Member

OK - did you open a ticket on this with the RMW? Might be good and then link this PR

@mini-1235
Contributor Author

> OK - did you open a ticket on this with the RMW? Might be good and then link this PR

Good idea. I have opened a new issue in the RMW repo.

@mergify
Contributor

mergify bot commented Sep 18, 2025

This pull request is in conflict. Could you fix it @mini-1235?

Signed-off-by: mini-1235 <[email protected]>
Signed-off-by: mini-1235 <[email protected]>
Signed-off-by: mini-1235 <[email protected]>
@mini-1235 mini-1235 force-pushed the feat/zenoh branch 3 times, most recently from f60a807 to 723ac34 Compare September 27, 2025 07:49
@mini-1235
Contributor Author

@SteveMacenski, I have just rebased this. I am now waiting for the next rmw_zenoh release so that all tests pass.

@SteveMacenski
Member

OK! Ping me when we can take a look again!

@mini-1235
Contributor Author

> I might prefer spinning for those 10ms rather than having a background spin & 10ms sleeps

I tried using executor.spin_all(std::chrono::milliseconds(100)), but the tests were still flaky. Running the executor in a background thread and sleeping resolves the issue across all RMW implementations.
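For readers unfamiliar with the "background spin plus short sleeps" pattern, here is a ROS-free analogue in plain C++. It is only a sketch under stated assumptions: the background thread stands in for the spinning executor, the atomic flag stands in for "the subscriber callback ran", and the names waitFor and backgroundWorkCompletes are hypothetical and do not appear in the Nav2 tests.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Polls `done` every `period` until it becomes true or `timeout` elapses.
// Returns the value of `done` observed when polling stops.
bool waitFor(
  const std::atomic<bool> & done,
  std::chrono::milliseconds timeout,
  std::chrono::milliseconds period = std::chrono::milliseconds(10))
{
  assert(period.count() > 0);  // a zero period would turn this into a busy loop
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!done.load() && std::chrono::steady_clock::now() < deadline) {
    std::this_thread::sleep_for(period);
  }
  return done.load();
}

// Hypothetical stand-in for "spin in the background, then wait for the callback".
// `work_delay` models how long the message takes to arrive.
bool backgroundWorkCompletes(
  std::chrono::milliseconds work_delay,
  std::chrono::milliseconds timeout)
{
  std::atomic<bool> received{false};

  // The background thread plays the role of the spinning executor: it keeps
  // making progress while the test thread only sleeps and checks the flag.
  std::thread worker([&received, work_delay]() {
      std::this_thread::sleep_for(work_delay);
      received.store(true);
    });

  const bool ok = waitFor(received, timeout);
  worker.join();  // always join so the thread never outlives `received`
  return ok;
}
```

The point of the design is that the test thread never depends on a single fixed spin window: it waits up to a deadline and returns as soon as the flag is set.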

[amcl-6] *** stack smashing detected ***: terminated
[amcl-6] Stack trace (most recent call last):
[amcl-6] #26 Object "", at 0xffffffffffffffff, in
[amcl-6] #25 Object "/opt/overlay_ws/build/nav2_amcl/amcl", at 0x55c088dcfea4, in _start
[amcl-6] #24 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f6e4859d28a, in __libc_start_main
[amcl-6] #23 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f6e4859d1c9, in
[amcl-6] #22 Object "/opt/overlay_ws/build/nav2_amcl/amcl", at 0x55c088dd118d, in main
[amcl-6] #21 Object "/opt/ros/rolling/lib/librclcpp.so", at 0x7f6e48b5ebb8, in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>)
[amcl-6] #20 Object "/opt/ros/rolling/lib/librclcpp.so", at 0x7f6e48b65ae3, in rclcpp::executors::SingleThreadedExecutor::spin()
[amcl-6] #19 Object "/opt/ros/rolling/lib/librclcpp.so", at 0x7f6e48b54c19, in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&)
[amcl-6] #18 Object "/opt/ros/rolling/lib/librclcpp.so", at 0x7f6e48b50483, in rclcpp::Executor::execute_service(std::shared_ptr<rclcpp::ServiceBase>)
[amcl-6] #17 Object "/opt/ros/rolling/lib/librclcpp.so", at 0x7f6e48c23f39, in
[amcl-6] #16 Object "/opt/ros/rolling/lib/librclcpp_lifecycle.so", at 0x7f6e48cc8d34, in
[amcl-6] #15 Object "/opt/ros/rolling/lib/librclcpp_lifecycle.so", at 0x7f6e48cc28ab, in std::_Function_handler<void (std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >), std::_Bind<void (rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::*(rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >&&)
[amcl-6] #14 Object "/opt/ros/rolling/lib/librclcpp_lifecycle.so", at 0x7f6e48cc1914, in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::on_change_state(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >)
[amcl-6] #13 Object "/opt/ros/rolling/lib/librclcpp_lifecycle.so", at 0x7f6e48cc0d39, in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::change_state(unsigned char, rclcpp_lifecycle::node_interfaces::LifecycleNodeInterface::CallbackReturn&)
[amcl-6] #12 Object "/opt/ros/rolling/lib/librclcpp_lifecycle.so", at 0x7f6e48cc09fe, in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::execute_callback(unsigned int, rclcpp_lifecycle::State const&) const
[amcl-6] #11 Object "/opt/overlay_ws/install/nav2_amcl/lib/libamcl_core.so", at 0x7f6e48ea3ba4, in nav2_amcl::AmclNode::on_configure(rclcpp_lifecycle::State const&)
[amcl-6] #10 Object "/opt/overlay_ws/install/nav2_amcl/lib/libamcl_core.so", at 0x7f6e48e85b29, in nav2_amcl::AmclNode::initParticleFilter()
[amcl-6] #9 Object "/opt/overlay_ws/install/nav2_amcl/lib/libpf_lib.so", at 0x7f6e4811b7de, in pf_init
[amcl-6] #8 Object "/opt/overlay_ws/install/nav2_amcl/lib/libpf_lib.so", at 0x7f6e4811c98c, in pf_pdf_gaussian_alloc
[amcl-6] #7 Object "/opt/overlay_ws/install/nav2_amcl/lib/libpf_lib.so", at 0x7f6e4811cf7b, in pf_matrix_unitary
[amcl-6] #6 Object "/opt/overlay_ws/install/nav2_amcl/lib/libpf_lib.so", at 0x7f6e4811e243, in eigen_decomposition
[amcl-6] #5 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f6e486aaed3, in __stack_chk_fail
[amcl-6] #4 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f6e486a9c48, in __fortify_fail
[amcl-6] #3 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f6e4859c7b5, in
[amcl-6] #2 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f6e4859b8fe, in abort
[amcl-6] #1 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f6e485b827d, in raise
[amcl-6] #0 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f6e48611b2c, in pthread_kill
[amcl-6] Aborted (Signal sent by tkill() 2450 0)

A new test is failing with exactly the same error reported in #5470. I will try to reproduce it locally.

@mini-1235
Contributor Author

I cannot reproduce this locally, but I think I understand the cause.

During the configure stage, AMCL calls initParticleFilter:

initParticleFilter();

That function uses the variables init_pose_ and init_cov_:

pf_init_pose_mean.v[0] = init_pose_[0];
pf_init_pose_mean.v[1] = init_pose_[1];
pf_init_pose_mean.v[2] = init_pose_[2];
pf_matrix_t pf_init_pose_cov = pf_matrix_zero();
pf_init_pose_cov.m[0][0] = init_cov_[0];
pf_init_pose_cov.m[1][1] = init_cov_[1];
pf_init_pose_cov.m[2][2] = init_cov_[2];

However, these two variables are only initialized later, when initOdometry is called.

That is probably why we are seeing NaNs in #5470 (comment).

I think we need to move initOdometry before initParticleFilter.
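The ordering dependency described above can be illustrated with a small ROS-free sketch. This is not the nav2_amcl code: the class ConfigureSketch, the odometry_initialized_ flag, and the helper configureSucceeds are hypothetical names introduced for the example, and throwing an exception is just one possible way to make a wrong call order fail loudly instead of silently reading garbage.

```cpp
#include <array>
#include <cassert>
#include <stdexcept>

// Hypothetical miniature of the configure sequence; not the real AmclNode.
class ConfigureSketch
{
public:
  // Stand-in for initOdometry(): the only place the initial pose is set.
  void initOdometry()
  {
    init_pose_ = {0.0, 0.0, 0.0};
    odometry_initialized_ = true;
  }

  // Stand-in for initParticleFilter(): consumes the initial pose and
  // refuses to run if initOdometry() has not been called yet.
  std::array<double, 3> initParticleFilter() const
  {
    if (!odometry_initialized_) {
      throw std::logic_error("initOdometry() must run before initParticleFilter()");
    }
    return init_pose_;
  }

private:
  std::array<double, 3> init_pose_{};
  bool odometry_initialized_ = false;
};

// Returns true if running the two steps in the requested order succeeds.
bool configureSucceeds(bool odometry_first)
{
  ConfigureSketch node;
  try {
    if (odometry_first) {
      node.initOdometry();
      const auto pose = node.initParticleFilter();
      assert(pose[0] == 0.0);  // the filter sees the pose set by initOdometry()
    } else {
      node.initParticleFilter();  // throws: pose not initialized yet
      node.initOdometry();
    }
  } catch (const std::logic_error &) {
    return false;
  }
  return true;
}
```

With this guard, configureSucceeds(true) returns true and configureSucceeds(false) returns false, which makes the required order explicit rather than implicit.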

Signed-off-by: mini-1235 <[email protected]>
@codecov

codecov bot commented Oct 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Files with missing lines | Coverage Δ |
| --- | --- |
| nav2_amcl/src/amcl_node.cpp | 86.31% <100.00%> (ø) |

... and 7 files with indirect coverage changes

@mini-1235
Contributor Author

mini-1235 commented Oct 23, 2025

@SteveMacenski All tests are passing now! If you agree with all my changes here, I can switch the default back to rmw_cyclonedds_cpp.

@mini-1235 mini-1235 changed the title from "Add rmw zenoh cpp to ci" to "Add rmw zenoh to nightly ci builds" on Oct 23, 2025

rclcpp::CallbackGroupType::MutuallyExclusive, false);
initParameters();
initTransforms();
initOdometry();
Member

I think this needs to come after init pubsub and init services so that last_published_pose_ is usable.

Contributor Author

@mini-1235 mini-1235 Oct 24, 2025

Looking at the ROS 1 code (https://github.com/ros-planning/navigation/blob/f44bb1fc2810399165115cc98b530fe4b9397c18/amcl/src/amcl_node.cpp#L793-L833), I think init_pose_ is either [0,0,0] or, when set_initial_pose_ is enabled, [initial_pose_x, y, yaw].

However, I cannot find the parameters initial_cov_xx and initial_cov_yy in our code. Maybe they were removed at some point in Nav2?

So, I think

if (set_initial_pose_) {
  auto msg = std::make_shared<geometry_msgs::msg::PoseWithCovarianceStamped>();
  msg->header.stamp = now();
  msg->header.frame_id = global_frame_id_;
  msg->pose.pose.position.x = initial_pose_x_;
  msg->pose.pose.position.y = initial_pose_y_;
  msg->pose.pose.position.z = initial_pose_z_;
  msg->pose.pose.orientation = orientationAroundZAxis(initial_pose_yaw_);
  initialPoseReceived(msg);
} else if (init_pose_received_on_inactive) {
  handleInitialPose(last_published_pose_);
}

and initOdometry should be called before initParticleFilter.

Contributor Author

I am not familiar with the AMCL codebase, so I may have missed something. A simpler option would be to initialize init_pose_ to [0,0,0] in the constructor so that there are no garbage values when configuring AMCL.
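As a rough illustration of that simpler option, the sketch below zero-initializes the members at construction so that an early read never sees indeterminate values. It is a minimal, ROS-free example and not the actual nav2_amcl class: AmclLikeNode, the std::array member types, the accessors, and allFinite are assumptions made for the example, and the covariance values set in initOdometry are purely illustrative. Only the names init_pose_ and init_cov_ come from the discussion.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the AMCL node; not the real nav2_amcl class.
class AmclLikeNode
{
public:
  // Stand-in for initOdometry(): overwrites the zero defaults later on.
  void initOdometry(double x, double y, double yaw)
  {
    assert(std::isfinite(x) && std::isfinite(y) && std::isfinite(yaw));
    init_pose_ = {x, y, yaw};
    init_cov_ = {0.25, 0.25, 0.25};  // illustrative values only
  }

  const std::array<double, 3> & initPose() const {return init_pose_;}
  const std::array<double, 3> & initCov() const {return init_cov_;}

private:
  // The empty braces value-initialize the arrays, so every element is 0.0
  // from construction onward, even if initOdometry() has not run yet.
  std::array<double, 3> init_pose_{};
  std::array<double, 3> init_cov_{};
};

// Returns true when every component is finite (no NaN or infinity).
bool allFinite(const std::array<double, 3> & v)
{
  for (double c : v) {
    if (!std::isfinite(c)) {
      return false;
    }
  }
  return true;
}
```

A freshly constructed AmclLikeNode then reports an all-zero pose and covariance, so a particle filter initialized before initOdometry would read zeros rather than stack garbage. Whether zero is an appropriate default for the covariance is a separate question for the AMCL maintainers.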

# Start zenohd daemon only if using rmw_zenoh_cpp
if [ "$RMW_IMPLEMENTATION" = "rmw_zenoh_cpp" ]; then
. /opt/ros/$ROS_DISTRO/setup.sh
ros2 run rmw_zenoh_cpp rmw_zenohd &
Member

@SteveMacenski SteveMacenski Oct 23, 2025

Is doing this in the CI config the prevailing thought from the threads you brought up for how to handle this? I was kind of hoping for ros2 launch to deal with this so it was in the application code -- i.e. StartZenohRouter() method that we can do conditionally when the RMW implementation is set to zenoh.

... Actually, can we do that instead? Create that ros2 launch action in nav2_common and use a condition that checks that environment variable?

Contributor Author

@mini-1235 mini-1235 Oct 24, 2025

> Is doing this in the CI config the prevailing thought from the threads you brought up for how to handle this?

They provide CMake functions like ament_add_ros_isolated_xxxx to help when testing using RMW Zenoh

What I was thinking is that my solution would be more straightforward, especially since manually starting the Zenoh router will not be required in the future according to their README.

If you prefer to go with ament_add_ros_isolated I can add it :)

Signed-off-by: mini-1235 <[email protected]>
Development

Successfully merging this pull request may close these issues.

Add rmw zenoh cpp to CI

2 participants