-
Notifications
You must be signed in to change notification settings - Fork 14.1k
Optimized implementation for uN::{gather,scatter}_bits #149663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
9549004 to
2c752af
Compare
This comment has been minimized.
This comment has been minimized.
2c752af to
ac7e6c6
Compare
|
Do take this with a grain of salt, since I authored the benchmarks with this in mind. In particular, since the new implementation doesn't have any input-dependent control-flow, it is easily vectorized which all of these benchmarks allow for. Benchmarked locally on an
|
|
rustbot has assigned @Mark-Simulacrum. Use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work.
I noticed spot-checking some random constant masks that this implementation and the HD implementation flip-flop on which has more instructions, but it's only a small difference count. On simpler masks, they seem to optimize similarly.
The dynamic mask output is a drastic improvement with this by a quarter/third reduction of instructions for gather and scatter.
alive2 showing that this implementation and the current implementation appear to be equivalent functions
gather - https://alive2.llvm.org/ce/z/wKxY2Z
scatter - https://alive2.llvm.org/ce/z/_98DbY
scratchpad for the LLVM IR output - https://rust.godbolt.org/z/brbePx44T
| //! ........ABCDEFGH | ||
| //! ``` | ||
| macro_rules! uint_impl { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maintainability/readability: There are quite a few single character variable names in these functions: m, n, k, j, q, x.
If possible, it might be good to give the variables/counters a slightly more descriptive name or leave a comment above them for their function if it doesn't reduce readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'm thinking at least
k -> stage
m -> mask
n is so heavily referenced in the comments that it probably needs to stay a single letter, but could be w for width.
Feature gate: #![feature(uint_gather_scatter_bits)]
Tracking issue: #149069
Accepted ACP: rust-lang/libs-team#695
Implements the methods using the parallel suffix strategy mentioned in the ACP discussion. The referenced source material provides C implementations, though this PR makes improvements over those, cutting the instruction count by a third:
https://rust.godbolt.org/z/rn5naYnK4 (this PR)
https://c.godbolt.org/z/WzYd5WbsY (Hacker's delight)
This was initially based on the code for
gather_bitsthat @okaneco provided in rust-lang/libs-team#695 (comment) . I wanted to understand how it worked, and later on noticed some opportunities for improvement, which eventually led to this PR.