affinity/cpp-20/d0796r1.md
Frequently, data is initialized at the beginning of the program by the initial thread and is then used by multiple threads. While automatic thread migration has been implemented in some OSes, migration may have high overhead. In an optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which is read by multiple threads, or migrate data which is modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
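To illustrate why first touch matters, the following sketch (not part of this proposal) initializes data with the same parallel structure that later consumes it. It assumes a first-touch NUMA page-placement policy and a runtime that partitions the range over threads similarly in both passes, which `std::execution::par` does not guarantee; the sizes and values are arbitrary.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <memory>

int main() {
  const std::size_t N = 1 << 26;

  // new double[N] default-initializes, so the allocation itself writes
  // nothing: no page has been "touched" (and therefore placed) yet.
  std::unique_ptr<double[]> a(new double[N]);

  // First touch happens here, in parallel. Under a first-touch policy,
  // each page tends to be placed on the NUMA node of the thread that
  // writes it first.
  std::for_each(std::execution::par, a.get(), a.get() + N,
                [](double& x) { x = 1.0; });

  // A later parallel pass over the same range is then more likely to find
  // its data in local memory, assuming a similar thread/data partitioning.
  std::for_each(std::execution::par, a.get(), a.get() + N,
                [](double& x) { x *= 2.0; });
}
```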
Consider a code example (Listing 1) that uses the C++17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`. The example applies a loop body in a lambda to each entry of the `valarray` `a`, using a parallel execution policy that distributes the work across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are initialized automatically and allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
```cpp
#include <algorithm>   // std::for_each
#include <execution>   // std::execution::par
#include <valarray>    // std::valarray

// N and scalar are assumed to be defined elsewhere.

// C++ valarray STL containers are initialized automatically.
// First-touch allocation thus places all of a on the master thread's memory.
std::valarray<double> a(N);

// Data placement is wrong, so parallel update is slow.
std::for_each(std::execution::par, std::begin(a), std::end(a),
              [=] (double& a_i) { a_i *= scalar; });
// Use future affinity interface to migrate data at next
// use and move pages closer to next accessing thread.
...
// Faster, because data are local now.
std::for_each(std::execution::par, std::begin(a), std::end(a),
              [=] (double& a_i) { a_i *= scalar; });
```
*Listing 1: Parallel vector update example*
The affinity interface we propose should help programs achieve a much higher fraction of peak memory bandwidth when using parallel algorithms. In the future, we plan to extend this to heterogeneous and distributed computing. This follows the lead of OpenMP [14], which plans to integrate its affinity model with its heterogeneous model [21]. (One of the authors of this document participated in the design of OpenMP's affinity model.)