Commit efd7240

Made ScanPrefixes the default accumulate algorithm
1 parent ae73e6d commit efd7240

2 files changed

+14
-26
lines changed

src/accumulate/accumulate.jl

Lines changed: 12 additions & 18 deletions
@@ -40,7 +40,7 @@ include("accumulate_nd.jl")
 min_elems::Int=2,
 
 # Algorithm choice
-alg::AccumulateAlgorithm=DecoupledLookback(),
+alg::AccumulateAlgorithm=ScanPrefixes(),
 
 # GPU settings
 block_size::Int=256,
@@ -60,7 +60,7 @@ include("accumulate_nd.jl")
 min_elems::Int=2,
 
 # Algorithm choice
-alg::AccumulateAlgorithm=DecoupledLookback(),
+alg::AccumulateAlgorithm=ScanPrefixes(),
 
 # GPU settings
 block_size::Int=256,
@@ -89,13 +89,13 @@ becomes faster if it is a more compute-heavy operation to hide memory latency -
 
 ## GPU
 For the 1D case (`dims=nothing`), the `alg` can be one of the following:
-- `DecoupledLookback()`: the default algorithm, using opportunistic lookback to reuse earlier
-blocks' results; requires device-level memory consistency guarantees, which Apple Metal does not
-provide.
-- `ScanPrefixes()`: a simpler algorithm that scans the prefixes of each block, with no lookback; it
-has similar performance as `DecoupledLookback()` for large block sizes, and small to medium arrays,
+- `ScanPrefixes()`: the default algorithm that scans the prefixes of each block, with no lookback; it
+has better performance than `DecoupledLookback()` for large block sizes, and small to medium arrays,
 but poorer scaling for many blocks; there is no performance degradation below `block_size^2`
-elements.
+elements, but it remains fast well into millions of elements.
+- `DecoupledLookback()`: a more complex algorithm using opportunistic lookback to reuse earlier
+blocks' results; requires device-level memory consistency guarantees (which Apple Metal does not
+provide) and atomic orderings; theoretically more scalable for many blocks.
 
 A different, unique algorithm is used for the multi-dimensional case (`dims` is an integer).
 
@@ -105,13 +105,7 @@ The temporaries are only used for the 1D case (`dims=nothing`): `temp` stores pe
 `temp_flags` is only used for the `DecoupledLookback()` algorithm for flagging if blocks are ready;
 they should both have at least `(length(v) + 2 * block_size - 1) ÷ (2 * block_size)` elements; also,
 `eltype(v) === eltype(temp)` is required; the elements in `temp_flags` can be any integers, but
-`Int8` is used by default to reduce memory usage.
-
-# Platform-Specific Notes
-On Metal, the `alg=ScanPrefixes()` algorithm is used by default, as Apple Metal GPUs do not have
-strong enough memory consistency guarantees for the `DecoupledLookback()` algorithm - which
-produces incorrect results about 0.38% of the time (the beauty of parallel algorithms, ey). Also,
-`block_size=1024` is used here by default to reduce the number of coupled lookbacks.
+`UInt8` is used by default to reduce memory usage.
 
 # Examples
 Example computing an inclusive prefix sum (the typical GPU "scan"):
@@ -123,7 +117,7 @@ v = oneAPI.ones(Int32, 100_000)
 AK.accumulate!(+, v, init=0)
 
 # Use a different algorithm
-AK.accumulate!(+, v, alg=AK.ScanPrefixes())
+AK.accumulate!(+, v, alg=AK.DecoupledLookback())
 ```
 """
 function accumulate!(
@@ -160,7 +154,7 @@ function _accumulate_impl!(
 dims::Union{Nothing, Int}=nothing,
 inclusive::Bool=true,
 
-alg::AccumulateAlgorithm=DecoupledLookback(),
+alg::AccumulateAlgorithm=ScanPrefixes(),
 
 # CPU settings
 max_tasks::Int=Threads.nthreads(),
@@ -212,7 +206,7 @@ end
 min_elems::Int=2,
 
 # Algorithm choice
-alg::AccumulateAlgorithm=DecoupledLookback(),
+alg::AccumulateAlgorithm=ScanPrefixes(),
 
 # GPU settings
 block_size::Int=256,
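
The docstring changed in this file describes `ScanPrefixes()` as scanning the prefixes of each block with no lookback, using a `temp` array of `(length(v) + 2 * block_size - 1) ÷ (2 * block_size)` elements. A minimal serial sketch of that general idea follows, under stated assumptions: the function name `blocked_scan!` is hypothetical, each block is assumed to cover `2 * block_size` elements (inferred only from the `temp` length formula), and the loops run on the CPU. It is not the package's GPU kernel.

```julia
# Hypothetical CPU illustration of a three-phase blocked inclusive scan.
# Assumption: each block covers 2 * block_size elements, as the temp-length
# formula in the docstring suggests. Not the AcceleratedKernels implementation.
function blocked_scan!(op, v::AbstractVector; block_size::Int=256)
    elems_per_block = 2 * block_size
    num_blocks = (length(v) + elems_per_block - 1) ÷ elems_per_block
    num_blocks == 0 && return v

    # One slot per block, with the same element type as v
    temp = similar(v, num_blocks)

    # Phase 1: inclusive scan inside each block; record each block's total
    for b in 1:num_blocks
        lo = (b - 1) * elems_per_block + 1
        hi = min(b * elems_per_block, length(v))
        for i in (lo + 1):hi
            v[i] = op(v[i - 1], v[i])
        end
        temp[b] = v[hi]
    end

    # Phase 2: inclusive scan over the per-block totals
    for b in 2:num_blocks
        temp[b] = op(temp[b - 1], temp[b])
    end

    # Phase 3: combine each block with the prefix of all preceding blocks
    for b in 2:num_blocks
        lo = (b - 1) * elems_per_block + 1
        hi = min(b * elems_per_block, length(v))
        for i in lo:hi
            v[i] = op(temp[b - 1], v[i])
        end
    end
    return v
end

# Compare against Base.accumulate on the CPU
v = ones(Int32, 100_000)
@assert blocked_scan!(+, copy(v)) == accumulate(+, v)
```

In this sketch, phase 2 is a serial pass over one value per block, which gives some intuition for the docstring's remark about poorer scaling when there are many blocks. Like any blocked scan, it assumes `op` is associative.
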

src/arithmetics.jl

Lines changed: 2 additions & 8 deletions
@@ -307,7 +307,7 @@ end
 dims::Union{Nothing, Int}=nothing,
 
 # Algorithm choice
-alg::AccumulateAlgorithm=DecoupledLookback(),
+alg::AccumulateAlgorithm=ScanPrefixes(),
 
 # GPU settings
 block_size::Int=256,
@@ -318,9 +318,6 @@ end
 Cumulative sum of elements of an array, with optional `init` and `dims`. Arguments are the same as
 for [`accumulate`](@ref).
 
-## Platform-Specific Notes
-On Apple Metal, the `alg=ScanPrefixes()` algorithm is used by default.
-
 # Examples
 Simple cumulative sum of elements in a vector:
 ```julia
@@ -360,7 +357,7 @@ end
 dims::Union{Nothing, Int}=nothing,
 
 # Algorithm choice
-alg::AccumulateAlgorithm=DecoupledLookback(),
+alg::AccumulateAlgorithm=ScanPrefixes(),
 
 # GPU settings
 block_size::Int=256,
@@ -371,9 +368,6 @@ end
 Cumulative product of elements of an array, with optional `init` and `dims`. Arguments are the same
 as for [`accumulate`](@ref).
 
-## Platform-Specific Notes
-On Apple Metal, the `alg=ScanPrefixes()` algorithm is used by default.
-
 # Examples
 Simple cumulative product of elements in a vector:
 ```julia