
Commit 24299a4

mhoemmen authored and Ruyman committed
d0796r1: Fix Listing 1 to have correct syntax
Listing 1 in d0796r1 had incorrect syntax. Fix the example so readers won't object. Reword surrounding text for correctness and clarity.
1 parent 79ebbf0 · commit 24299a4


affinity/cpp-20/d0796r1.md

Lines changed: 15 additions & 16 deletions
@@ -37,28 +37,27 @@ The affinity problem is especially challenging for applications whose behavior c
 
 Frequently, data is initialized at the beginning of the program by the initial thread and is used by multiple threads. While automatic thread migration has been implemented in some OSes, migration may have high overhead. In an optimal case, the OS may automatically detect which thread accesses which data most frequently, or it may replicate data which is read by multiple threads, or migrate data which is modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
 
-Consider a code example using the C++ STL container `valarray` and the latest C++17 parallel STL algorithm `for_each`. The example applies a loop body in a lambda to container entry in the iterator range `[begin, end)`, using a parallel execution policy such that the workload is distributed in parallel across multiple cores on the CPU. We might expect the work to be fast, but since `valarray` containers are initialized automatically and automatically allocated on the master threads memory, we find that it is actually quite slow even when we have more than one thread.
+Consider a code example (Listing 1) that uses the C++17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`. The example applies a loop body in a lambda to each entry of `a`, using a parallel execution policy that distributes the work across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are initialized automatically and allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
 
 ```cpp
-// C++ valarray STL containers are initialized
-// automatically and allocated on the master's memory
-valarray<double> a(N), b(N), c(N);
-//saxpying is slow
-//Parallel foreach
+// C++ valarray STL containers are initialized automatically.
+// First-touch allocation thus places all of a on the master.
+std::valarray<double> a(N);
+
+// Data placement is wrong, so parallel update is slow.
 std::for_each(par, std::begin(a), std::end(a),
-[=](double b, double c){b[i]+scalar*c[i]});
-// if we can migrate data at next usage and move pages close to next accessing thread
-//using the affinity interface in future
+  [=] (double& a_i) { a_i *= scalar; });
+
+// Use future affinity interface to migrate data at next
+// use and move pages closer to next accessing thread.
 ...
-//now faster, because data is local now
+// Faster, because data are local now.
 std::for_each(par, std::begin(a), std::end(a),
-[=](double b, double c){b[i]+scalar*c[i]});
+  [=] (double& a_i) { a_i *= scalar; });
 ```
-*Listing 1: Motivational example*
-
-Now with the affinity interface we propose below and in future, we will hopefully find that there is significant increase in memory bandwidth when we have multiple threads.
+*Listing 1: Parallel vector update example*
 
-The goal was that this would enable scaling up for heterogeneous and distributed computing in future. Indeed OpenMP [14] where one of the author participated in the design of its affinity model, has plans to integrate its affinity model with its heterogeneous model.[21]
+The affinity interface we propose should help applications achieve a much higher fraction of peak memory bandwidth when using parallel algorithms. In the future, we plan to extend this to heterogeneous and distributed computing. This follows the lead of OpenMP [14], which has plans to integrate its affinity model with its heterogeneous model [21]. (One of the authors of this document participated in the design of OpenMP's affinity model.)
 
 # Background Research: State of the Art
 
@@ -592,4 +591,4 @@ https://www.open-mpi.org/projects/hwloc/lstopo/
 [mpi]: http://mpi-forum.org/docs/
 [mpi-fault-tolerance]: http://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf
 [mpi-post-failure-recovery]: http://journals.sagepub.com/doi/10.1177/1094342013488238
-[movidius]: https://developer.movidius.com/
+[movidius]: https://developer.movidius.com/
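
For readers who want to check the corrected listing end to end, here is a minimal self-contained sketch of the same pattern. It is not part of the commit; the values of `N` and `scalar` are illustrative, and `std::execution::par` is assumed to be what the listing abbreviates as `par`.

```cpp
// Self-contained sketch of the corrected Listing 1 pattern.
// Assumptions (not in the diff): N and scalar are placeholder
// values, and `par` stands for std::execution::par.
#include <algorithm>
#include <cstddef>
#include <execution>
#include <iostream>
#include <valarray>

int main() {
  const std::size_t N = 1 << 20;
  const double scalar = 2.0;

  // valarray initializes its entries on construction; with
  // first-touch allocation, all pages therefore land in the
  // initial thread's memory.
  std::valarray<double> a(1.0, N);  // N copies of 1.0

  // Parallel in-place update: each lambda call scales one entry.
  std::for_each(std::execution::par, std::begin(a), std::end(a),
                [=](double& a_i) { a_i *= scalar; });

  std::cout << a[0] << '\n';  // prints 2
}
```

Whether the policy actually runs in parallel, and where the OS places the pages, are implementation details that the listing in the paper deliberately elides.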
