
Commit d34ec28

Author: Gordon Brown
Merge branch 'master' into d0796r1-document-changes
2 parents 51c2486 + e89da54

1 file changed

affinity/cpp-20/d0796r1.md

Lines changed: 30 additions & 24 deletions
@@ -6,7 +6,7 @@

**Authors: Gordon Brown, Ruyman Reyes, Michael Wong, H. Carter Edwards, Thomas Rodgers**

-**Contributors: Patrice Roy, Jeff Hammond**
+**Contributors: Patrice Roy, Jeff Hammond, Mark Hoemmen**

**Emails: gordon@codeplay.com, ruyman@codeplay.com, michael@codeplay.com, hcedwar@sandia.gov, rodgert@twrodgers.com**

@@ -21,28 +21,23 @@

# Abstract

-This paper provides an initial meta-framework for the drives toward memory affinity for C++, given the direction from Toronto 2017 SG1 meeting that we should look towards defining affinity for C++ before looking at inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
+This paper provides an initial meta-framework for the drive toward memory affinity for C++. It accounts for feedback from the Toronto 2017 SG1 meeting that we should define affinity for C++ first, before considering inaccessible memory as a solution to the separate-memory problem in support of heterogeneous and distributed computing.

# Motivation

-Processor and memory binding, also called 'affinity', can help the performance of an application for many reasons. Keeping a process bound to a specific thread and local memory region optimizes cache affinity and reduces context switching and unnecessary scheduler activity. Since memory accesses to remote locations incur higher latency and lower bandwidth, control of thread placement to enforce affinity within parallel applications is crucial to fuel all the cores and to exploit the full performance of the memory subsystem on Non-Uniform Memory Architectures (NUMA).
+**Affinity** refers to the "closeness", in terms of memory access performance, between running code, the hardware execution resource on which the code runs, and the data that the code accesses. A hardware execution resource has "more affinity" to a part of memory or to some data if it has lower latency and/or higher bandwidth when accessing that memory or those data.

-Traditional homogeneous designs where memory is accessible at the same cost from all threads are difficult to scale up to the current computing needs. Current architectural trends move towards Non-Uniform Memory Access (NUMA) architectures where, although there is a coherent view of the memory, the cost to access it is not uniform. Memory affinity is especially useful in these systems. Using memory that is located on the same node as the processing unit helps to ensure that the application can access the data as quickly as possible.
+On almost all computer architectures, the cost of accessing different data may differ. Most computers have caches that are associated with specific processing units. If the operating system moves a thread or process from one processing unit to another, the thread or process will no longer have its data in the cache of the new processing unit, which may make its next accesses to those data slower. Many computers also have a Non-Uniform Memory Architecture (NUMA), which means that even though all processing units see a single memory in terms of the programming model, different processing units may still have more affinity to some parts of that memory than to others. NUMA architectures exist because it is difficult to scale non-NUMA memory systems to the performance needed by today's highly parallel computers and applications.

-In terms of traditional operating system behaviour, all processing elements of a CPU are threads, and they are placed using high-level policies that do not necessarily match the optimal usage pattern for a given application.
+One strategy to improve application performance, given the importance of affinity, is processor and memory **binding**. Keeping a thread bound to a specific processing unit and a local memory region optimizes cache affinity. It also reduces context switching and unnecessary scheduler activity. Since memory accesses to remote locations incur higher latency and/or lower bandwidth, control of thread placement to enforce affinity within parallel applications is crucial for fueling all the cores and for exploiting the full performance of the memory subsystem on Non-Uniform Memory Architectures (NUMA).
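
Today such binding is typically done through OS-specific interfaces. As a minimal sketch, assuming Linux and glibc's pthreads extension (this is not part of any interface proposed here; the function name is illustrative):

```cpp
#define _GNU_SOURCE  // needed for pthread_setaffinity_np with glibc
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to CPU 0 so that its cached data and its
// first-touch allocations stay local to that core's locality domain.
void bind_this_thread_to_cpu0() {
  cpu_set_t set;
  CPU_ZERO(&set);    // start from an empty CPU mask
  CPU_SET(0, &set);  // permit execution on CPU 0 only
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() { bind_this_thread_to_cpu0(); }
```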

-However, application developers must leverage the placement of memory and **placement of threads** in order to obtain maximum performance on current and future architecture.
-For C++ developers to achieve this, native support for **placement of threads and memory** is critical for application portability. We will refer to this as the **affinity problem**.
+Operating systems (OSes) traditionally take responsibility for assigning threads or processes to run on processing units. However, OSes may use high-level assignment policies that do not necessarily match the optimal usage pattern for a given application. Application developers must leverage the placement of memory and **placement of threads** for best performance on current and future architectures. For C++ developers to achieve this, native support for **placement of threads and memory** is critical for application portability. We will refer to this as the **affinity problem**.

-**Affinity** is defined as maintaining or improving the locality of threads and the most frequently used data, especially if the program behaviour is unpredictable or changes over time, or the machine is overloaded such that multiple programs interfere with each other.
+The affinity problem is especially challenging for applications whose behavior changes over time or is hard to predict, or when different applications interfere with each other's performance. Today, most OSes can already group processing units according to their locality and distribute processes accordingly, while keeping threads close to the initial thread, or even avoiding thread migration in order to maintain a first-touch policy. Nevertheless, most programs can change their work distribution, especially in the presence of nested parallelism.

-Today, most OSes already can group processors according to their locality and distribute processes, while keeping threads close to the initial thread, or even avoid migrating threads and maintain first touch policy. But the fact is most programs can change their work distribution, especially in the presence of nested parallelism.
+Frequently, data is initialized at the beginning of the program by the initial thread and is then used by multiple threads. While automatic thread migration has been implemented in some OSes, migration may have high overhead. In the optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which is read by multiple threads, or migrate data which is modified and used by threads residing on remote locality groups. However, the OS only does a reasonable job if the machine is not overloaded, if the application has carefully used first-touch allocation, and if the program does not change its behavior with respect to locality.

-Frequently, data is initialized at the beginning of the program by the initial thread and is used by multiple threads. While automatic thread migration has been implemented in some OSes, the reality is that this has migration can cause high overhead. In an optimal case the operating system may automatically detect which thread access which data most frequently, or it may replicate data which is read by multiple threads, or migrate data which is modified and used by threads residing on remote locality groups.
-
-The fact of it is that the OS may do a reasonable job, if the machine is not overloaded, and the first touch policy has been carefully used, and the program does not change its behaviour with respect to locality.
-
-Imagine we have a code example using C++ STL container valarray using the latest C++17 parallel STL algorithm for_each, which applies the lambda to elements in the iterator range [begin, end) but using a parallel execution policy such that the workload is distributed in parallel across multiple cores on the CPU. We might expect the work to be fast, but because the containers of valarray are initialized automatically and automatically allocated on the master thread’s memory, we find that it is actually quite slow even when we have more than one thread.
+Consider a code example using the C++ STL container `valarray` and the C++17 parallel STL algorithm `for_each`. The example applies a loop body, written as a lambda, to each container entry in the iterator range `[begin, end)`, using a parallel execution policy such that the workload is distributed across multiple cores on the CPU. We might expect the work to be fast, but since `valarray` containers are initialized automatically and allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.

```cpp
// C++ valarray STL containers are initialized
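// A plausible shape for the elided remainder of this listing (assumed,
// not necessarily the paper's exact code; requires <valarray>,
// <algorithm> and <execution>):
std::valarray<double> a(1u << 24);  // pages are first-touched here by the
                                    // master thread, so they all land in
                                    // its NUMA node
// The parallel for_each distributes the loop across multiple cores, yet
// every core accesses memory local to the master thread's node.
std::for_each(std::execution::par, std::begin(a), std::end(a),
              [](double& x) { x += 1.0; });
```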
@@ -410,21 +405,27 @@ for (int i = 0; i < resource.partition_size(); i++) {
| Should the interface provide a way of creating an execution context from an execution resource? |
| *Is what is defined here a suitable solution?* |

-## Importance of topology discovery
+### Topology Discovery & Fault Tolerance
+
+In traditional single-CPU systems, the execution resources can be reasoned about using standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C++ machine model requires that a system have **at least one thread of execution, some memory and some I/O capabilities**. For these systems, some assumptions about the system resource topology can therefore be made as part of the language and supporting standard library: for example, developers can always query the available hardware concurrency, since there is always at least one thread, and can always use thread-local storage.
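
A minimal sketch of these baseline guarantees, using only standard C++ (the names shown are illustrative):

```cpp
#include <iostream>
#include <thread>

// Thread-local storage is always available: every thread gets its own copy.
thread_local int per_thread_counter = 0;

int main() {
  // Can always be queried; may return 0 if the value is not computable.
  unsigned n = std::thread::hardware_concurrency();
  std::cout << "hardware concurrency: " << n << "\n";
  ++per_thread_counter;  // increments this thread's own copy only
}
```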
+
+This assumption, however, does not hold on newer, more complex systems, and is particularly false in heterogeneous systems. In these systems, even the availability of high-level resources in a particular **system** (the type and number of resources) is not known until the physical hardware attached to that system has been identified by the program. This often happens as part of a runtime initialisation API [19] [20] which exposes the available resources through some software abstraction. Furthermore, the resources which are identified often have different levels of parallel and concurrent execution capability. This process of identifying resources and their capabilities is often referred to as **topology discovery**, and the point at which it occurs as the **point of discovery**.
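
As a minimal sketch of such a discovery step, assuming an OpenCL runtime as one plausible instance of the runtime initialisation APIs cited above (error handling omitted):

```cpp
#include <cstdio>
#include <vector>
#include <CL/cl.h>

int main() {
  // Nothing about the attached hardware is known until these calls return:
  // this is the program's point of discovery.
  cl_uint num_platforms = 0;
  clGetPlatformIDs(0, nullptr, &num_platforms);
  std::vector<cl_platform_id> platforms(num_platforms);
  clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

  for (cl_platform_id p : platforms) {
    cl_uint num_devices = 0;
    clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
    std::printf("platform with %u device(s)\n", num_devices);
  }
}
```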

-For traditional single CPU systems the execution resources reasoned about using standard constructs such as std::thread, std::this_thread and thread local storage. This is because the C++ memory model requires that a system have **at least one thread of execution, some memory and some I/O capabilities**. This means that for these systems some assumptions can be made about the topology could be made during at compile-time, for example the fact that developers can query always the hardware concurrency available as there is always at least 1 thread or the fact that you can always use thread local storage.
+An interesting question which arises here is whether the **system resource topology** should be fixed at the **point of discovery** or allowed to be dynamic, altering during the course of the program. We can identify two main reasons for allowing the **system resource topology** to be dynamic after the **point of discovery**: **online resource discovery** and **fault tolerance**.

-This assumption, however, does not hold on newer more complex systems, and is particularly false in heterogeneous systems. In these systems, the even the available high level resources such as the number and type of devices available in a particular **system** is not known until the **system’s resource topology** has been discovered which often happens as part of a runtime API [19] [20]. Furthermore the level of support these for querying the resource topology these devices may vary. This means the previous assumption that you can query thread concurrency at any stage of the program or the availability of a **std::thread** with local storage is no longer valid: Different devices may have different capabilities.
+In some systems, hardware can be attached while the program is executing: for example, a [USB-compute device][movidius] that can be plugged in while the application is running to add additional computational power, or remote hardware connected over a network that can be enabled for specific periods of time. Support for **online resource discovery** allows programs to target these situations natively and react to changes in the resources available to a system.

-An interesting question which arises here is whether the system topology of an execution resource should be fixed on initialisation or allowed to be dynamic. Allowing a dynamic system topology allows components to go offline and become unavailable at runtime. If we do allow the system topology to be dynamic then we will need to provide a mechanism by which users can be notified of a topology change. However, providing this interface is out of the scope of this initial document.
+Other applications, such as those designed for safety-critical environments, require the ability to recover from hardware failures. This requires that the resources available within a system can be queried, and can be expected to change, at any point during the execution of a program. For example, a GPU may encounter exceptional behaviour or overheat and need to be disabled, yet the program must continue at all costs. **Fault tolerance** allows programs to query the availability of resources and handle failures, which could facilitate reliable programming of heterogeneous and distributed systems.

-Note that this is different from devices that go online or offline during execution: The devices themselves are online, they have not been found (or used) by the program until the appropriate discovery stage has been executed.
+From a historical perspective, many programming models have tackled the problem of **dynamic resource discovery** in various ways. [MPI (Message Passing Interface)][mpi] originally (in MPI-1) did not support **dynamic resource discovery**: all processes capable of communicating with each other were identified and fixed at the **point of discovery**. [PVM (Parallel Virtual Machine)][pvm] has enabled resources to be discovered at runtime since its conception, using an alternative execution model of manually spawning processes from the main process. This led MPI to introduce the feature later, in MPI-2. However, as far as we know, despite being available this feature is not widely used in HPC environments, and the execution model of having all processes fixed on initialisation is generally still the preferred approach. Other programming models for HPC environments, such as SHMEM, Fortran coarrays and UPC++, support only a fixed set of processes determined at library initialisation time.
+
+Some of these programming models also address **fault tolerance**. In particular, PVM has native support for this, providing a [mechanism][pvm-callback] which can notify a program when a resource is added to or removed from a system. MPI does not have native support for a PVM-like **fault tolerance** mechanism, but one can be [implemented on top of MPI][mpi-post-failure-recovery] or provided via [extensions][mpi-fault-tolerance].
+
+Due to the complexity involved in standardising **dynamic resource discovery** and **fault tolerance**, these are currently outside the scope of this paper.

| Straw Poll |
|------------|
-| Should the interface allow a system’s resource topology to be updated dynamically after initial initialisation? |
-| *When do we enable the device discovery process? Can we change the system topology after executors have been created?* |
-| *Should be provide an interface for providing a call-back on topology change?* |
+| Should the interface support **dynamic resource discovery**? |

## Lifetime considerations

@@ -565,7 +566,6 @@ Euro-Par 2011 Parallel Processing: 17th International
[22] Portable Hardware Locality lstopo
https://www.open-mpi.org/projects/hwloc/lstopo/

-
[//]: Links

[hwloc]: https://www.open-mpi.org/projects/hwloc/
@@ -586,4 +586,10 @@ https://www.open-mpi.org/projects/hwloc/lstopo/
[tbb]: https://www.threadingbuildingblocks.org/
[hpx]: https://github.com/STEllAR-GROUP/hpx
[madness]: https://github.com/m-a-d-n-e-s-s/madness
-[maddness-journal]: http://dx.doi.org/10.1137/15M1026171
+[maddness-journal]: http://dx.doi.org/10.1137/15M1026171
+[pvm]: http://www.csm.ornl.gov/pvm/
+[pvm-callback]: http://etutorials.org/Linux+systems/cluster+computing+with+linux/Part+II+Parallel+Programming/Chapter+11+Fault-Tolerant+and+Adaptive+Programs+with+PVM/11.2+Building+Fault-Tolerant+Parallel+Applications/
+[mpi]: http://mpi-forum.org/docs/
+[mpi-fault-tolerance]: http://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf
+[mpi-post-failure-recovery]: http://journals.sagepub.com/doi/10.1177/1094342013488238
+[movidius]: https://developer.movidius.com/
