Commit cb9ba67

Gordon authored and committed
CP013: Add additional changes missing from previous commit.
1 parent 1df2134 commit cb9ba67

1 file changed, +3 -2 lines changed

affinity/cpp-20/d0796r1.md

Lines changed: 3 additions & 2 deletions
@@ -257,9 +257,9 @@ In some systems, hardware can be attached to the system while the program is exe

Other applications, such as those designed for safety-critical environments, require the ability to recover from hardware failures. This requires that the resources available within a system can be queried and can be expected to change at any point during the execution of a program. For example, a GPU may encounter exceptional behaviour or overheat and need to be disabled, yet the program must continue at all costs. **Fault tolerance** allows programs to query the availability of resources and handle failures, which could facilitate reliable programming of heterogeneous and distributed systems.

- From a historical perspective, many different programming models have tackled the problem of **dynamic resource discovery** following various approaches. [MPI (Message Passing Interface)][mpi] originally (in MPI-1) did not support it, all processes which were capable of communicating with each other would be identified and fixed at the **point of discovery**. [PVM (Parallel Virtual Machine)][pvm] enabled resources to be discovered at runtime since its conception, using an alternative execution model of manually spawning processes from the main process. This led MPI to introduce the feature later, in MPI-2. However, as far as we know, despite being available, this feature is not widely used in HPC environments, and the execution model of having all processes fixed on initialisation is generally still the preferred approach. Other programming models for HPC environments, such as SHMEM, Fortran coarrays and UPC++, support a fixed set of processors at library initialisation time.
+ From a historical perspective, many different programming models have tackled the problem of **dynamic resource discovery** following various approaches. [MPI (Message Passing Interface)][mpi] originally (in MPI-1) did not support **dynamic resource discovery**. All processes which were capable of communicating with each other would be identified and fixed at the **point of discovery**. [PVM (Parallel Virtual Machine)][pvm] enabled resources to be discovered at runtime since its conception, using an alternative execution model of manually spawning processes from the main process. This led MPI to introduce the feature later, in MPI-2. However, as far as we know, despite being available, this feature is not widely used in HPC environments, and the execution model of having all processes fixed on initialisation is generally still the preferred approach. Other programming models for HPC environments, such as SHMEM, Fortran coarrays and UPC++, support a fixed set of processors at library initialisation time.

- Some of these programming models also address **fault tolerance**. In particular, PVM has native support for this. Although MPI does not have native support for it, a PVM-like **fault tolerance** mechanism can be [implemented on top of MPI][mpi-post-failure-recovery] or provided via [extensions][mpi-fault-tolerance].
+ Some of these programming models also address **fault tolerance**. In particular, PVM has native support for this, providing a [mechanism][pvm-callback] which can notify a program when a resource is added to or removed from a system. Although MPI does not have native support for it, a PVM-like **fault tolerance** mechanism can be [implemented on top of MPI][mpi-post-failure-recovery] or provided via [extensions][mpi-fault-tolerance].

Due to the complexity involved in standardising **dynamic resource discovery** and **fault tolerance**, these are currently out of the scope of this paper.

@@ -407,6 +407,7 @@ Euro-Par 2011 Parallel Processing: 17th International
https://www.open-mpi.org/projects/hwloc/lstopo/
[pvm]: http://www.csm.ornl.gov/pvm/
+ [pvm-callback]: http://etutorials.org/Linux+systems/cluster+computing+with+linux/Part+II+Parallel+Programming/Chapter+11+Fault-Tolerant+and+Adaptive+Programs+with+PVM/11.2+Building+Fault-Tolerant+Parallel+Applications/
[mpi]: http://mpi-forum.org/docs/
[mpi-fault-tolerance]: http://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf
[mpi-post-failure-recovery]: http://journals.sagepub.com/doi/10.1177/1094342013488238
