
Conversation

@Flova
Contributor

@Flova Flova commented Dec 10, 2025

As described in ros2/rclpy#1563, rclpy currently contains a memory issue. This leads to undefined behavior, which is especially noticeable in the RoboStack builds. I would suggest that we add this as a patch now, since it is currently broken and upstream backporting etc. can take a while.

Do I need to bump the build number or something like that?

@traversaro
Member

Do I need to bump the build number or something like that?

Yes, you need to add:

rclpy:
  build_number: 13

at the end of https://github.com/Flova/ros-jazzy/blob/fix/events_executor/pkg_additional_info.yaml#L159 .

Can you also provide a reproducer or some more hints on the problem that this fixes?

@traversaro
Member

Another thing to understand is whether this changes the ABI of the rclpy package (if there is any ABI reused by downstream libraries). If that is the case, introducing the change without a full rebuild would require making it ABI-compatible.

@Flova
Contributor Author

Flova commented Dec 10, 2025

Another thing to understand is whether this changes the ABI of the rclpy package (if there is any ABI reused by downstream libraries). If that is the case, introducing the change without a full rebuild would require making it ABI-compatible.

As far as I know, the changed ABI is used exclusively inside the rclpy package. All other packages use it indirectly via a Python API that is exposed using pybind11. So it should still be ABI-compatible with all other packages.

@Flova
Contributor Author

Flova commented Dec 10, 2025

We could also do a "hack" and set the default time point to a magic value that is smaller than chrono's max(). That would keep the existing API. But I would argue against this and would just use the upstream PR as a patch.

@Flova
Contributor Author

Flova commented Dec 10, 2025

Do I need to bump the build number or something like that?

Yes, you need to add:

rclpy:
  build_number: 13

Okay. I will do that 👍

@Flova
Contributor Author

Flova commented Dec 10, 2025

Can you also provide a reproducer or some more hints on the problem that this fixes?

The default executor (which is responsible for scheduling callbacks etc.) is famously quite slow. Therefore, recently a new implementation based on an events queue was added. This implementation is significantly more efficient and similar to the events executor in rclcpp.

When trying out Pixi and RoboStack with our software, which uses the events executor, I noticed what looked like deadlock issues. I first thought it might be an incompatible dependency, but after a lot of investigation using gdb and valgrind it turned out to be a memory issue that was present in the implementation all along, leading to undefined behavior. In the past the undefined behavior went unnoticed and didn't cause any problems, but due to the slightly different compilation setup used by RoboStack a more violent failure occurs more frequently.

The issue is as follows: the events queue dispatches events from its queue one after the other. If there are no new events, it waits on a condition variable. This wait has an optional timeout. If no timeout is given, the latest representable time point is used, which is too far in the future for the underlying syscall, leading to the crash.

Solutions:

  • Reduce the deadline time from max() to a smaller value
  • Use the appropriate syscall for cv waits without a deadline (this PR)

@Flova
Contributor Author

Flova commented Dec 10, 2025

The following example does not print the String messages from /chatter. After the fix, it does.

To publish messages to the topic, use: ros2 run demo_nodes_cpp talker

from rclpy.experimental.events_executor import EventsExecutor
from rclpy.node import Node
import rclpy
from std_msgs.msg import String

rclpy.init()
node = Node("test")

def callback(msg):
    # Print every message received on /chatter
    print(msg)

node.create_subscription(String, "chatter", callback, 1)

# Spin the node with the (previously broken) events executor
ex = EventsExecutor()
ex.add_node(node)
ex.spin()

@Flova Flova force-pushed the fix/events_executor branch from 1f0646f to 5523559 on December 10, 2025 13:14
@Flova
Contributor Author

Flova commented Dec 10, 2025

The page is a bit buggy for me right now, but the failure of the macOS action seems to be unrelated, right?

The macOS-13 based runner images are now retired

@traversaro
Member

traversaro commented Dec 10, 2025

The page is a bit buggy for me right now, but the failure of the macOS action seems to be unrelated, right?

The macOS-13 based runner images are now retired

Yes, changing to macos-15-intel as done in RoboStack/ros-humble@fc5cbdb should fix it.

@Flova
Contributor Author

Flova commented Dec 10, 2025

Yes, changing to macos-15-intel as done in RoboStack/ros-humble@fc5cbdb should fix it.

Done

@traversaro
Member

traversaro commented Dec 10, 2025

If no timeout is given, the latest representable time point is used, which is too far in the future for the underlying syscall, leading to the crash.

Thanks, this is clear. So this is basically a mitigation for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58931, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113327, and other related compiler/libstdc++ issues?

@Flova
Contributor Author

Flova commented Dec 10, 2025

Thanks, this is clear. So this is basically a mitigation for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58931, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113327, and other related compiler/libstdc++ issues?

Exactly. pthread_cond_timedwait is the underlying call that fails in that case on POSIX systems. But this only happens because libstdc++ allows you to pass overly large time points to it.

@Flova
Contributor Author

Flova commented Dec 10, 2025

Everything seems to pass 🎉

@Flova
Contributor Author

Flova commented Dec 10, 2025

Btw. this will be relevant for kilted too. Should I open another PR there?

@traversaro
Member

Btw. this will be relevant for kilted too. Should I open another PR there?

Sure! I still want to double check the ABI thing, but if you need it in kilted feel free to open a PR.

@Flova
Contributor Author

Flova commented Dec 10, 2025

A quick github search shows that the events queue header is only referenced by rclpy itself:

https://grep.app/search?q=events_queue.hpp

The other hits are all for the rclcpp implementation, which is a different thing. IMO it also doesn't really make sense to use this API anywhere else. Or did I misunderstand your concerns, and are they about the internals of the rclpy build?

@traversaro
Member

A quick github search shows that the events queue header is only referenced by rclpy itself:

https://grep.app/search?q=events_queue.hpp

The other hits are all for the rclcpp implementation, which is a different thing. IMO it also doesn't really make sense to use this API anywhere else. Or did I misunderstand your concerns, and are they about the internals of the rclpy build?

That sounds reasonable, I just want to double-check in detail, as ABI issues are quite nasty.

Flova added a commit to Flova/ros-kilted that referenced this pull request Dec 10, 2025
@traversaro
Member

The modified header is private and not installed, so we should be safe to merge. I will merge soon unless @Tobias-Fischer or @wolfv object.

@Tobias-Fischer Tobias-Fischer merged commit 8555471 into RoboStack:main Dec 11, 2025
5 checks passed
@Flova Flova deleted the fix/events_executor branch December 11, 2025 14:09
@Flova
Contributor Author

Flova commented Dec 11, 2025

Thanks for the quick merge!

@Flova
Contributor Author

Flova commented Dec 11, 2025

Quick question: when I install rclpy as a standalone package, I get the new version. But the ros-jazzy-desktop package uses the ros mutex 0.11, so the new build is not installed. Is this intended behavior?

@traversaro
Member

Quick question: when I install rclpy as a standalone package, I get the new version. But the ros-jazzy-desktop package uses the ros mutex 0.11, so the new build is not installed. Is this intended behavior?

I am afraid this is due to #125. We are working on a new full rebuild in #135.
