Commit 9a7d7c1

Gordon authored and committed
CP013: Add further proposed wording and minor interface changes.
* Add structure for proposed wording on existing interface.
* Change execution_context::resource() and this_system::resource() to return a const reference.
1 parent d9f10e2 commit 9a7d7c1

1 file changed

+145
-52
lines changed

affinity/cpp-20/d0796r1.md

Lines changed: 145 additions & 52 deletions
Original file line number | Diff line number | Diff line change
@@ -17,12 +17,13 @@
1717
### Revision 1
1818

1919
* Introduce proposed wording.
20+
* Have `execution_context::resource()` and `this_system::resource()` return a `const execution_resource &`.
2021

2122
# Abstract
2223

2324
This paper provides an initial meta-framework for the drives toward memory affinity for C++, given the direction from Toronto 2017 SG1 meeting that we should look towards defining affinity for C++ before looking at inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
2425

25-
## Affinity Matters
26+
# Motivation
2627

2728
Processor and memory binding, also called 'affinity', can help the performance of an application for many reasons. Keeping a process bound to a specific thread and local memory region optimizes cache affinity and reduces context switching and unnecessary scheduler activity. Since memory accesses to remote locations incur higher latency and lower bandwidth, control of thread placement to enforce affinity within parallel applications is crucial to fuel all the cores and to exploit the full performance of the memory subsystem on Non-Uniform Memory Architectures (NUMA).
2829

@@ -60,13 +61,13 @@ std::for_each(par, std::begin(a), std::end(a),
6061
```
6162
*Listing 1: Motivational example*
6263
63-
Now with the affinity interface we propose below and in future, we will hopefully find that there is significant increase in memory bandwidth when we have multiple threads by as much as 2x GB/s as thread count increases (using system call madvise on Sun systems to implement next touch policy to migrate the data close to the next executing thread).
64+
Now with the affinity interface we propose below and in future, we will hopefully find that there is significant increase in memory bandwidth when we have multiple threads by as much as 2x GB/s as thread count increases (using system call madvise on Sun systems to implement next touch policy to migrate the data close to the next executing thread).
6465
6566
The goal was that this would enable scaling up for heterogeneous and distributed computing in future. Indeed OpenMP [14], where one of the authors participated in the design of its affinity model, has plans to integrate its affinity model with its heterogeneous model [21].
6667
67-
## Background Research: State of the Art
68+
# Background Research: State of the Art
6869
69-
The problem of effectively partitioning a system’s topology is one which has been so for some time, and there are a range of third party libraries / standards which provides APIs to solve the problem. In order to standardise this process for the C++ standard we must carefully look at all of these. Below is a list of the libraries and standards which define an interface for affinity:
70+
The problem of effectively partitioning a system’s topology has existed for some time, and there is a range of third-party libraries and standards which provide APIs to solve it. In order to standardise this process for C++ we must carefully look at all of these approaches and identify which we wish to adopt. Below is a list of the libraries and standards from which this proposal will draw:
7071
7172
* [Portable Hardware Locality][hwloc]
7273
* [SYCL 1.2][sycl-1-2-1]
@@ -87,7 +88,7 @@ Libraries such as the Portable Hardware Locality (hwloc) [9] provide a low level
8788
8889
Some systems provide additional user control through explicit binding of threads to processors, via environment variables consumed by various compilers, system commands (e.g. Linux: taskset, numactl; Windows: start /affinity), or system calls: for example, Solaris has `pbind()`, Linux has `sched_setaffinity()` and Windows has `SetThreadAffinityMask()`.
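As an illustration of the system-call route, the following is a minimal Linux-only sketch built on `sched_getaffinity()` and `sched_setaffinity()` from glibc. The helper names `first_allowed_cpu` and `pin_calling_thread_to` are hypothetical and are not part of this proposal or of any library.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, cpu_set_t (Linux/glibc)
#include <cassert>

// Hypothetical helper: returns the lowest-numbered CPU the calling thread
// is currently allowed to run on, or -1 if the mask cannot be read.
int first_allowed_cpu() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
    if (CPU_ISSET(cpu, &mask)) return cpu;
  }
  return -1;
}

// Hypothetical helper: restricts the calling thread to a single CPU.
// Returns true on success.
bool pin_calling_thread_to(int cpu) {
  if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(cpu, &mask);
  // A pid of 0 addresses the calling thread.
  return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```

Calling `pin_calling_thread_to(first_allowed_cpu())` narrows the calling thread to one CPU it was already permitted to use. Code like this is platform-specific, which is the portability gap a standard affinity interface is meant to close.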
8990
90-
# Problem Space
91+
## Problem Space
9192
9293
In this paper we describe the problem space of affinity for C++, the various challenges which need to be addressed in defining a partitioning and affinity interface for C++ and some suggested solutions:
9394
@@ -102,79 +103,171 @@ There are some additional challenges which we have been investigating but are no
102103
* Migrating data from memory allocated in one partition to another
103104
* Defining memory placement algorithms or policies
104105
106+
# Proposed Wording
105107
106-
## Proposed Wording
108+
## Header `<execution>` synopsis
107109
108-
### Header synopsis
110+
namespace std {
111+
namespace experimental {
112+
namespace execution {
109113
110-
```cpp
111-
namespace std {
112-
namespace experimental {
113-
namespace execution {
114+
/* Execution resource */
114115
115-
/* Execution resource */
116+
struct execution_resource {
116117
117-
struct execution_resource {
118+
execution_resource() = delete;
119+
execution_resource(const execution_resource &) = delete;
120+
execution_resource(execution_resource &&) = delete;
121+
execution_resource &operator=(const execution_resource &) = delete;
122+
execution_resource &operator=(execution_resource &&) = delete;
123+
~execution_resource() = delete;
118124
119-
execution_resource() = delete;
120-
execution_resource(const execution_resource &) = delete;
121-
execution_resource(execution_resource &&) = delete;
122-
execution_resource &operator=(const execution_resource &) = delete;
123-
execution_resource &operator=(execution_resource &&) = delete;
125+
size_t concurrency() const noexcept;
126+
size_t partition_size() const noexcept;
124127
125-
size_t concurrency() const noexcept;
126-
size_t partition_size() const noexcept;
128+
const execution_resource &partition(size_t i) const noexcept;
129+
const execution_resource &member_of() const noexcept;
127130
128-
const execution_resource &partition(size_t i) const noexcept;
129-
const execution_resource &member_of() const noexcept;
131+
std::string name() const noexcept;
130132
131-
std::string name() const noexcept;
133+
bool can_place_memory() const noexcept;
134+
bool can_place_agent() const noexcept;
132135
133-
bool can_place_memory() const noexcept;
134-
bool can_place_agent() const noexcept;
136+
};
135137
136-
};
138+
/* Execution context */
137139
138-
/* Execution context */
140+
struct execution_context {
139141
140-
struct execution_context {
142+
using executor_type = __unspecified__;
141143
142-
using executor_type = __unspecfied__;
144+
template <typename ExecutionResource>
145+
execution_context(ExecutionResource &&execResource);
143146
144-
template <typename ExecutionResource>
145-
execution_context(ExecutionResource &&execResource);
147+
~execution_context();
146148
147-
execution_resource &resource();
149+
const execution_resource &resource() const noexcept;
148150
149-
executor_type executor() noexcept;
151+
executor_type executor() noexcept;
150152
151-
};
153+
};
152154
153-
/* This system */
155+
/* This system */
154156
155-
namespace this_system {
156-
execution_resource &resource();
157-
}
157+
namespace this_system {
158+
const execution_resource &resource();
159+
}
160+
161+
} // execution
162+
} // experimental
163+
} // std
158164
159-
} // execution
160-
} // experimental
161-
} // std
162-
```
163165
*Listing 2: Header synopsis*
164166
165-
### Querying a System’s Topology
167+
## Class `execution_resource`
168+
169+
The `execution_resource` class provides an abstraction over a software or hardware resource capable of memory allocation, execution of lightweight execution agents, or both.
170+
171+
### `execution_resource` constructors
172+
173+
execution_resource() = delete;
174+
175+
176+
[*Note:* An implementation of `execution_resource` is permitted to provide non-public constructors to allow other objects to construct them. *--end note*]
177+
178+
### `execution_resource` assignment
179+
180+
The `execution_resource` class is not `CopyConstructible` (C++Std [copyconstructible]).
181+
182+
execution_resource(const execution_resource &) = delete;
183+
execution_resource(execution_resource &&) = delete;
184+
execution_resource &operator=(const execution_resource &) = delete;
185+
execution_resource &operator=(execution_resource &&) = delete;
186+
187+
### `execution_resource` destructor
188+
189+
The `execution_resource` class is not `Destructible` (C++Std [destructible]).
190+
191+
~execution_resource() = delete;
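To make the consequences of the deleted special members concrete, here is a minimal sketch using a hypothetical stand-in type, `mock_execution_resource`. It assumes an implementation that uses a private constructor (as the note above permits) and a static accessor playing the role of `this_system::resource()`. None of these names come from the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the proposed execution_resource.
class mock_execution_resource {
 public:
  mock_execution_resource() = delete;
  mock_execution_resource(const mock_execution_resource&) = delete;
  mock_execution_resource(mock_execution_resource&&) = delete;
  mock_execution_resource& operator=(const mock_execution_resource&) = delete;
  mock_execution_resource& operator=(mock_execution_resource&&) = delete;
  ~mock_execution_resource() = delete;

  std::size_t concurrency() const noexcept { return concurrency_; }
  std::string name() const { return name_; }

  // Assumed accessor in the spirit of this_system::resource(). The object is
  // created once with new and never destroyed, because the destructor is deleted.
  static const mock_execution_resource& system() {
    static const mock_execution_resource* const root =
        new mock_execution_resource("system", 4);  // illustrative values
    return *root;
  }

 private:
  mock_execution_resource(std::string name, std::size_t concurrency)
      : name_(std::move(name)), concurrency_(concurrency) {}

  std::string name_;
  std::size_t concurrency_;
};

static_assert(!std::is_destructible<mock_execution_resource>::value,
              "cannot be declared as a local, member or temporary object");
static_assert(!std::is_copy_constructible<mock_execution_resource>::value,
              "cannot be copied");
static_assert(!std::is_copy_assignable<mock_execution_resource>::value,
              "cannot be assigned");
```

With these constraints user code can only ever hold a reference, for example `const auto& r = mock_execution_resource::system();`, which matches the change to have `resource()` return a `const execution_resource &`.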
192+
193+
### `execution_resource` operations
194+
195+
size_t concurrency() const noexcept;
196+
197+
*Returns:*
198+
199+
size_t partition_size() const noexcept;
200+
201+
*Returns:*
202+
203+
const execution_resource &partition(size_t i) const noexcept;
204+
205+
*Returns:*
206+
207+
const execution_resource &member_of() const noexcept;
208+
209+
*Returns:*
210+
211+
std::string name() const noexcept;
212+
213+
*Returns:*
214+
215+
bool can_place_memory() const noexcept;
216+
217+
*Returns:*
218+
219+
bool can_place_agent() const noexcept;
220+
221+
*Returns:*
222+
223+
## Class `execution_context`
224+
225+
The `execution_context` class provides an abstraction for managing a number of lightweight execution agents executing work on one or more `execution_resource`s.
226+
227+
### `execution_context` member aliases
228+
229+
using executor_type = __unspecified__;
230+
231+
*Requires:*
232+
233+
### `execution_context` constructors
234+
235+
template <typename ExecutionResource>
236+
execution_context(ExecutionResource &&execResource);
237+
238+
### `execution_context` destructor
239+
240+
~execution_context();
241+
242+
### `execution_context` operators
243+
244+
const execution_resource &resource() const noexcept;
245+
246+
*Returns:*
247+
248+
executor_type executor() noexcept;
249+
250+
*Returns:*
251+
252+
## Free functions
253+
254+
const execution_resource &this_system::resource();
255+
256+
*Returns:*
257+
258+
## Querying a System’s Topology
166259
167260
The first task in allowing C++ applications to leverage memory locality is to provide the ability to query a **system** for its **resource topology** (commonly represented as a tree or graph) and traverse its **execution resources**.
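The traversal described above can be sketched against a hypothetical in-memory tree. `mock_resource`, `count_leaves` and `make_two_socket_machine` are illustrative names only. The mock mirrors the `partition_size()`, `partition(i)` and `member_of()` members from the synopsis, and it assumes that `member_of()` on the root returns the root itself, which the proposal does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree-shaped resource mirroring part of the proposed interface.
class mock_resource {
 public:
  mock_resource(std::string name, std::size_t concurrency)
      : name_(std::move(name)), concurrency_(concurrency) {}
  mock_resource(const mock_resource&) = delete;
  mock_resource& operator=(const mock_resource&) = delete;

  // Builder used only to set up the mock topology.
  mock_resource& add_partition(std::string name, std::size_t concurrency) {
    children_.push_back(std::unique_ptr<mock_resource>(
        new mock_resource(std::move(name), concurrency)));
    children_.back()->parent_ = this;
    return *children_.back();
  }

  std::size_t concurrency() const noexcept { return concurrency_; }
  std::size_t partition_size() const noexcept { return children_.size(); }
  const mock_resource& partition(std::size_t i) const noexcept {
    assert(i < children_.size());
    return *children_[i];
  }
  // Assumption: the root reports itself as its own parent.
  const mock_resource& member_of() const noexcept {
    return parent_ ? *parent_ : *this;
  }
  const std::string& name() const noexcept { return name_; }

 private:
  std::string name_;
  std::size_t concurrency_;
  const mock_resource* parent_ = nullptr;
  std::vector<std::unique_ptr<mock_resource>> children_;
};

// Depth-first walk that counts the resources with no further partitions.
template <typename Resource>
std::size_t count_leaves(const Resource& r) {
  if (r.partition_size() == 0) return 1;
  std::size_t leaves = 0;
  for (std::size_t i = 0; i < r.partition_size(); ++i) {
    leaves += count_leaves(r.partition(i));
  }
  return leaves;
}

// Illustrative topology: one machine, two sockets, four cores per socket.
std::unique_ptr<mock_resource> make_two_socket_machine() {
  std::unique_ptr<mock_resource> machine(new mock_resource("machine", 8));
  for (int s = 0; s < 2; ++s) {
    mock_resource& socket =
        machine->add_partition("socket" + std::to_string(s), 4);
    for (int c = 0; c < 4; ++c) {
      socket.add_partition("core" + std::to_string(c), 1);
    }
  }
  return machine;
}
```

Because `count_leaves` is a template over anything exposing `partition_size()` and `partition(i)`, the same walk would apply to a real `execution_resource` if the interface is adopted in this form.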
168261
169-
### Execution resource
262+
## Execution resource
170263
171264
The capability of querying underlying **execution resources** of a given **system** is particularly important towards supporting affinity control in C++. The current proposal for executors [5] leaves the **execution resource** largely unspecified. This is intentional: **execution resources** will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define those.
172265
173266
There is current work on extending the executors proposal to describe a typical interface for an **execution context** [8]. In this paper a typical **execution context** is defined with an interface for construction and comparison, and for retrieving an **executor**, waiting on submitted work to complete and querying the underlying **execution resource**.
174267
175268
Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.
176269
177-
### Level of abstraction
270+
## Level of abstraction
178271
179272
An important consideration when defining a unified interface for querying the **resource topology** of a **system** is what level of abstraction such an interface should have and at what granularity the **execution resources** of the topology should be described.
180273
@@ -186,7 +279,7 @@ As both the level of abstraction of an **execution resource** and the granularit
186279
|------------|
187280
| Should the interface for querying a system’s resource topology be completely abstract or should it provide specific components of the hardware architecture? |
188281
189-
### Representation
282+
## Representation
190283
191284
Nowadays, there are various APIs and libraries that enable this functionality. One of the most commonly used is the Portable Hardware Locality (hwloc) [9]. Hwloc presents the hardware as a tree, where the root node represents the whole machine and subsequent levels represent different partitions depending on different hardware characteristics. The picture below shows the output of the hwloc visualization tool (lstopo) on a 2-socket Xeon E5300 server. Note that each socket is represented by a package in the graph. Each socket contains its own cache memories, but both share the same NUMA memory region. Note also that different I/O units are visible underneath: placement of these units with respect to memory and threads can be critical to performance. The ability to place threads and/or allocate memory appropriately on the different components of this system is an important part of the process of application development, especially as hardware architectures get more complex. The documentation of lstopo [22] shows more interesting examples of topologies that can be encountered on today's systems.
192285
@@ -199,7 +292,7 @@ However, systems are becoming increasingly non-hierarchical and a traditional tr
199292
| Should the interface for querying a system’s resource topology support non-hierarchical architectures? |
200293
| *What kind of shape do we want for expressing the topology abstraction?* |
201294
202-
### Extended Execution Resource Interface
295+
## Extended Execution Resource Interface
203296
204297
Below is a proposed interface for the generalization of the **execution resource** based on the definition of `thread_execution_resource_t` [8] with some extensions.
205298
@@ -313,7 +406,7 @@ for (int i = 0; i < resource.partition_size(); i++) {
313406
| Should the interface provide a way of creating an execution context from an execution resource? |
314407
| *Is what is defined here a suitable solution?* |
315408

316-
### Importance of topology discovery
409+
## Importance of topology discovery
317410

318411
For traditional single-CPU systems, the execution resources are reasoned about using standard constructs such as std::thread, std::this_thread and thread local storage. This is because the C++ memory model requires that a system have **at least one thread of execution, some memory and some I/O capabilities**. This means that for these systems some assumptions about the topology can be made at compile time, for example that developers can always query the available hardware concurrency, as there is always at least one thread, or that thread local storage can always be used.
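As a small sketch of the hardware-concurrency query mentioned above: `std::thread::hardware_concurrency()` is only a hint and is permitted to return 0 when the value is not computable, so code that relies on "at least one thread" usually clamps the result. The helper name `usable_concurrency` is hypothetical.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: the reported hardware concurrency, clamped so that
// callers can always rely on at least one thread of execution.
unsigned usable_concurrency() noexcept {
  const unsigned reported = std::thread::hardware_concurrency();
  return reported == 0 ? 1u : reported;
}
```

The result is a single number for the whole system and carries no information about how those threads relate to memory, which is the gap topology discovery is meant to fill.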
319412

@@ -329,7 +422,7 @@ Note that this is different from devices that go online or offline during execut
329422
| *When do we enable the device discovery process? Can we change the system topology after executors have been created?* |
330423
| *Should we provide an interface for providing a call-back on topology change?* |
331424

332-
### Lifetime considerations
425+
## Lifetime considerations
333426

334427
As the execution context would provide a partitioning interface which returns objects describing the components of the system topology of an execution resource it’s important to consider the lifetime of these objects.
335428

@@ -339,7 +432,7 @@ For these reasons **resources** must always outlive any **execution context** wh
339432

340433
### Scaling to heterogeneous and distributed systems
341434

342-
The initial solution should target systems with a single addressable memory region, i.e. a system which does not have discrete non-accessible memory regions such as a discrete GPU or FPGA. However in the interest of maintaining a unified interface going forward the initial solution should be designed with the latter in mind and should be scalable to support these systems in the future. In particular to support heterogeneous systems it’s important that the abstraction allows the interface for querying the **resource topology** of the **system** in order to perform device discovery.
435+
The initial solution should target systems with a single addressable memory region, i.e. a system which does not have discrete non-accessible memory regions such as a discrete GPU or FPGA. However in the interest of maintaining a unified interface going forward the initial solution should be designed with the latter in mind and should be scalable to support these systems in the future. In particular to support heterogeneous systems it’s important that the abstraction allows the interface for querying the **resource topology** of the **system** in order to perform device discovery.
343436

344437
## Querying the Relative Affinity of Partitions
345438

@@ -384,15 +477,15 @@ If a particular policy or algorithm requires to access placement information, th
384477
385478
# Future Work
386479
387-
### Migrating data from memory allocated in one partition to another
480+
## Migrating data from memory allocated in one partition to another
388481
389482
In some cases it is important for performance to bind a memory allocation to a memory region for the duration of a task's execution; in other cases it’s important to be able to migrate the data from one memory region to another. This is outside the scope of this paper, however we would like to investigate this in a future paper.
390483
391484
| Straw Poll |
392485
|------------|
393486
| Should the interface provide a way of migrating data between partitions? |
394487
395-
### Defining memory placement algorithms or policies
488+
## Defining memory placement algorithms or policies
396489
397490
With the ability to place memory with affinity comes the ability to define algorithms or memory policies which describe at a higher level how memory is distributed across large systems. Some examples of these are pinned, first touch and scatter. This is outside the scope of this paper, however we would like to investigate this in a future paper.
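To give a feel for the first-touch policy named above, the following sketch has each worker thread write its own contiguous slice of a freshly allocated array. It assumes an operating system that places a page near the thread that first writes to it, which depends entirely on platform and configuration and is not guaranteed by C++. The function name `first_touch_fill` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical helper: allocates n doubles and lets each worker thread perform
// the first write to its own contiguous slice.
std::unique_ptr<double[]> first_touch_fill(std::size_t n, unsigned workers,
                                           double value) {
  assert(workers > 0);
  // new double[n] default-initializes, so the allocation itself writes nothing.
  std::unique_ptr<double[]> data(new double[n]);
  double* const base = data.get();
  const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division

  std::vector<std::thread> pool;
  pool.reserve(workers);
  for (unsigned w = 0; w < workers; ++w) {
    const std::size_t begin = std::min(n, w * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    pool.emplace_back([base, begin, end, value] {
      for (std::size_t i = begin; i < end; ++i) base[i] = value;  // first touch
    });
  }
  for (std::thread& t : pool) t.join();
  return data;
}
```

Any locality benefit only holds if later work on a slice runs on the thread that touched it, or on one close to it, which is why placement policies and thread affinity need to be considered together.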
398491
