
Conversation

Contributor

@BramOtte BramOtte commented Dec 6, 2025

This puts the forward links at the end of Node; when the forward links don't fit in the Node, extra nodes are simply inserted, which are skipped over when iterating.
As a result, looking up the forward links no longer has to go to a separate area of memory, and when there are fewer than 6 forward links no extra cache line needs to be loaded at all.
The disadvantage is that iterating becomes a bit more complicated.
Additionally, this requires unsafe and is a bit error-prone; I tried to minimize the potential for errors by avoiding magic numbers and adding asserts.
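Roughly, the layout described above looks like the sketch below (a minimal illustration only; the names, the value of LINKS_IN_NODE, and the overflow mechanism are assumptions, not the actual MCHPRS definitions):

const LINKS_IN_NODE: usize = 6; // forward links that fit inline (assumed value)

struct ForwardLink(u32); // placeholder for the real link type
type LinkBuffer = [ForwardLink; LINKS_IN_NODE];

struct Node {
    // ... the regular per-node state ...
    fwd_link_len: u16,     // total number of forward links for this node
    fwd_links: LinkBuffer, // inline links; the PR asserts this buffer sits at the end of Node
}

// Links that don't fit in `fwd_links` spill into extra Node-sized slots in the
// node array, and iteration skips over those slots, so the common case needs no
// separate allocation or pointer chase.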

Benchmarks:
rtps for tickn(50_000) with 1000 samples for the iris mandelbrot benchmark (run twice in a row):
From this we can see roughly a 5.6% speedup.

master (https://github.com/MCHPR/MCHPRS/tree/a179ab1be44dd44fa3f20ba67195a7f147ab06f2)

avg= 3259104 dev=   31748 q0= 3100066 q1= 3241976 q2= 3246659 q3= 3291493 q4= 3302341
avg= 3252497 dev=   26361 q0= 3098181 q1= 3228507 q2= 3248133 q3= 3278679 q4= 3304033

old (https://github.com/BramOtte/MCHPRS/tree/refactor-forward-links-unsafe):
(this seems to be within the margin of error of master)

avg= 3274858 dev=   29929 q0= 2998197 q1= 3274732 q2= 3281709 q3= 3284870 q4= 3352242
avg= 3280807 dev=   27363 q0= 3156344 q1= 3279137 q2= 3283034 q3= 3286403 q4= 3349798

This PR:

avg= 3459374 dev=   15655 q0= 3297415 q1= 3455197 q2= 3459280 q3= 3463341 q4= 3483251
avg= 3465641 dev=   20777 q0= 3198885 q1= 3457385 q2= 3473685 q3= 3478063 q4= 3482424

@BramOtte BramOtte force-pushed the interleaved-forward-links branch from 500b45a to 37f7db9 on December 7, 2025 at 10:45
assert!(offset_of!(Node, fwd_links) + size_of::<LinkBuffer>() == size_of::<Node>())
}

(fwd_link_len + BLOCK_SIZE - 1 - LINKS_IN_NODE) / BLOCK_SIZE
Contributor

Suggested change
(fwd_link_len + BLOCK_SIZE - 1 - LINKS_IN_NODE) / BLOCK_SIZE
(fwd_link_len - LINKS_IN_NODE).div_ceil(BLOCK_SIZE)

Contributor Author

Wouldn't that result in undefined behaviour when LINKS_IN_NODE is larger than fwd_link_len?
I guess combined with saturating_sub it would work out.

Contributor

Sorry, I missed that. Yeah, adding saturating_sub will work. The branch is fine as this section isn't performance-critical.

Contributor

@KonaeAkira KonaeAkira Dec 7, 2025

I must point out that integer overflow/underflow is not UB in Rust. Even in C and C++, only signed integer overflow is UB. Rust integer overflow is well-defined: it panics in debug builds and wraps around in release builds. A core design goal of Rust is to make UB impossible unless you're using unsafe.
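For what it's worth, a small standalone illustration of that point (the values are made up for the example):

fn main() {
    let fwd_link_len: usize = 2;
    let links_in_node: usize = 6;

    // Plain `fwd_link_len - links_in_node` would panic in a debug build and
    // wrap around in a release build; neither is undefined behaviour.
    let wrapped = fwd_link_len.wrapping_sub(links_in_node);   // huge value, but well-defined
    let clamped = fwd_link_len.saturating_sub(links_in_node); // 0
    let checked = fwd_link_len.checked_sub(links_in_node);    // None

    println!("{wrapped} {clamped} {checked:?}");
}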

Contributor Author

@BramOtte BramOtte Dec 7, 2025

Uhm, ok, interesting, but overflow is still not desired here, and like you said saturating_sub is fine because it's not performance-critical, so I used that approach.
Even without it, it would likely work out, since it would likely compile to basically the same code I had before, but someone reading the code can't really see that, since they only see the div_ceil and not what's inside.
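For concreteness, a sketch of the form adopted here next to the original expression (the helper name and constant values are assumptions for illustration):

const LINKS_IN_NODE: usize = 6; // assumed value
const BLOCK_SIZE: usize = 8;    // assumed value

// Hypothetical helper: how many extra overflow blocks a node needs.
fn extra_blocks(fwd_link_len: usize) -> usize {
    // Original form, manual round-up division; with these constants it cannot
    // underflow because BLOCK_SIZE - 1 >= LINKS_IN_NODE:
    //     (fwd_link_len + BLOCK_SIZE - 1 - LINKS_IN_NODE) / BLOCK_SIZE

    // Form used after this discussion: clamp at zero, then round up. It reads
    // clearly and cannot underflow, though the codegen differs from the manual
    // version (see the godbolt link in the next comment).
    fwd_link_len.saturating_sub(LINKS_IN_NODE).div_ceil(BLOCK_SIZE)
}

fn main() {
    assert_eq!(extra_blocks(4), 0);  // everything fits inline
    assert_eq!(extra_blocks(7), 1);  // one link spills into one overflow block
    assert_eq!(extra_blocks(15), 2); // nine spilled links need two blocks of eight
}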

Contributor Author

For performance-critical code, however, we should use the code I had before: even without the saturating_sub, the div_ceil does in fact not compile to the same code.
https://godbolt.org/z/he4GE693z

Contributor Author

BramOtte commented Dec 8, 2025

Doing some more benchmarks, this change seems to pretty significantly decrease the performance of flush, from ~2.5 ms to ~4 ms.
So this should probably only get merged if #206 gets merged too, as that should nullify this cost.

@KonaeAkira
Contributor

Doing some more benchmarks, this change seems to pretty significantly decrease the performance of flush, from ~2.5 ms to ~4 ms.

Would storing the valid node indices instead of computing them on the fly on every flush solve this regression?

Contributor Author

BramOtte commented Dec 8, 2025

Doing some more benchmarks, this change seems to pretty significantly decrease the performance of flush, from ~2.5 ms to ~4 ms.

Would storing the valid node indices instead of computing them on the fly on every flush solve this regression?

Doing this reduces the overhead; it now takes ~3 ms instead, which is a more acceptable amount.
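A rough sketch of what that could look like (hypothetical type and field names; the real backend keeps far more state than shown):

struct Node { /* node state, inline forward links, ... */ }

struct Backend {
    nodes: Vec<Node>,
    // Indices of real nodes, skipping the overflow slots; filled once when the
    // graph is built instead of being recomputed on every flush.
    valid_node_indices: Vec<u32>,
}

impl Backend {
    fn flush(&mut self) {
        let Self { nodes, valid_node_indices } = self;
        for &idx in valid_node_indices.iter() {
            let node = &mut nodes[idx as usize];
            // ... push this node's pending block updates out to the world ...
            let _ = node;
        }
    }
}

Maintaining the index list is a bit of extra bookkeeping at graph-build time, but flush runs every tick, so trading that for a straight iteration seems consistent with the ~3 ms figure above.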
