CP013: Add further proposed wording and minor interface changes.

* Add structure for proposed wording on the existing interface.
* Have `execution_context::resource()` and `this_system::resource()` return a `const execution_resource &`.
# Abstract
This paper provides an initial meta-framework for the drive toward memory affinity for C++. It follows the direction given at the 2017 SG1 meeting in Toronto that we should look toward defining affinity for C++ before considering inaccessible memory as a solution to the separate-memory problem in support of heterogeneous and distributed computing.
# Motivation
Processor and memory binding, also called 'affinity', can improve the performance of an application for many reasons. Keeping a thread bound to a specific core and local memory region optimizes cache affinity and reduces context switching and unnecessary scheduler activity. Since memory accesses to remote locations incur higher latency and lower bandwidth, control of thread placement to enforce affinity within parallel applications is crucial for keeping all the cores fed and for exploiting the full performance of the memory subsystem on Non-Uniform Memory Architecture (NUMA) systems.
With the affinity interface we propose below, we hope to see significant increases in memory bandwidth, by as much as 2x, as thread count increases (for example, by using the `madvise` system call on Sun systems to implement a next-touch policy that migrates data close to the next executing thread).
The goal is that this would enable scaling up to heterogeneous and distributed computing in the future. Indeed OpenMP [14], whose affinity model one of the authors helped design, plans to integrate that affinity model with OpenMP's heterogeneous model [21].
# Background Research: State of the Art
The problem of effectively partitioning a system’s topology is one that has existed for some time, and there is a range of third-party libraries and standards which provide APIs to solve it. In order to standardise this process for C++ we must carefully examine all of these approaches and identify which we wish to adopt. Below is a list of the libraries and standards this proposal will draw from:
* [Portable Hardware Locality][hwloc]
* [SYCL 1.2][sycl-1-2-1]
Some systems provide additional user control through explicit binding of threads to processors: environment variables consumed by various compilers, system commands (e.g. Linux: `taskset`, `numactl`; Windows: `start /affinity`), or system calls (e.g. Solaris has `pbind()`, Linux has `sched_setaffinity()` and Windows has `SetThreadAffinityMask()`).
## Problem Space
In this paper we describe the problem space of affinity for C++, the various challenges which need to be addressed in defining a partitioning and affinity interface for C++, and some suggested solutions:
There are some additional challenges which we have been investigating but are not addressed in this paper:
* Migrating data from memory allocated in one partition to another
* Defining memory placement algorithms or policies
The `execution_resource` class provides an abstraction over a software or hardware resource capable of memory allocation, execution of lightweight execution agents, or both.
### `execution_resource` constructors
```cpp
execution_resource() = delete;
```
[*Note:* An implementation of `execution_resource` is permitted to provide non-public constructors to allow other objects to construct them. *--end note*]
### `execution_resource` assignment
The `execution_resource` class is not `CopyConstructible` (C++Std [copyconstructible]).
The `execution_context` class provides an abstraction for managing a number of lightweight execution agents executing work on one or more `execution_resource`s.
The first task in allowing C++ applications to leverage memory locality is to provide the ability to query a **system** for its **resource topology** (commonly represented as a tree or graph) and traverse its **execution resources**.
## Execution resource
The capability of querying the underlying **execution resources** of a given **system** is particularly important for supporting affinity control in C++. The current proposal for executors [5] leaves the **execution resource** largely unspecified. This is intentional: **execution resources** will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define them.
There is current work on extending the executors proposal to describe a typical interface for an **execution context** [8]. There, a typical **execution context** is defined with an interface for construction and comparison, and for retrieving an **executor**, waiting on submitted work to complete, and querying the underlying **execution resource**.
Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.
## Level of abstraction
An important consideration when defining a unified interface for querying the **resource topology** of a **system** is what level of abstraction such an interface should have, and at what granularity the **execution resources** of the topology should be described.
| Straw Poll |
|------------|
| Should the interface for querying a system’s resource topology be completely abstract or should it provide specific components of the hardware architecture? |
## Representation
Nowadays, various APIs and libraries enable this functionality; one of the most commonly used is the Portable Hardware Locality (hwloc) library [9]. Hwloc presents the hardware as a tree, where the root node represents the whole machine and subsequent levels represent different partitions depending on different hardware characteristics. The picture below shows the output of the hwloc visualization tool (lstopo) on a 2-socket Xeon E5300 server. Note that each socket is represented by a package in the graph. Each socket contains its own cache memories, but both share the same NUMA memory region. Note also that different I/O units are visible underneath: placement of these units with respect to memory and threads can be critical to performance. The ability to place threads and/or allocate memory appropriately on the different components of this system is an important part of the process of application development, especially as hardware architectures get more complex. The documentation of lstopo [22] shows more interesting examples of topologies that can be encountered on today's systems.
| Straw Poll |
|------------|
| Should the interface for querying a system’s resource topology support non-hierarchical architectures? |
| *What kind of shape do we want for expressing the topology abstraction?* |
## Extended Execution Resource Interface
Below is a proposed interface for the generalization of the **execution resource**, based on the definition of `thread_execution_resource_t` [8], with some extensions.
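The full synopsis is not reproduced in this excerpt. As a hedged sketch, one possible shape consistent with the members referenced elsewhere in this paper (a deleted default constructor, no copy construction, and `partition_size()` for traversal) might be:

```cpp
// Hypothetical sketch only: a possible shape for the generalized
// execution_resource. The capability queries and partition() accessor are
// illustrative names, not the paper's actual proposed synopsis.
#include <cstddef>
#include <string>
#include <vector>

class execution_resource {
 public:
  execution_resource() = delete;                             // no default construction
  execution_resource(const execution_resource &) = delete;   // not CopyConstructible
  execution_resource &operator=(const execution_resource &) = delete;

  // Traverse the partitions (child resources) of this resource.
  std::size_t partition_size() const noexcept { return partitions_.size(); }
  const execution_resource &partition(std::size_t i) const { return *partitions_[i]; }

  // Capability queries: a resource may run execution agents, allocate
  // memory, or both.
  bool can_place_agents() const noexcept { return canExecute_; }
  bool can_place_memory() const noexcept { return canAllocate_; }

  const std::string &name() const noexcept { return name_; }

 private:
  // Implementations may provide non-public constructors so that other
  // objects (e.g. the system topology) can construct resources.
  execution_resource(std::string name, bool canExecute, bool canAllocate)
      : name_(std::move(name)), canExecute_(canExecute), canAllocate_(canAllocate) {}

  std::string name_;
  bool canExecute_;
  bool canAllocate_;
  std::vector<const execution_resource *> partitions_;
};
```

The capability queries mirror the earlier definition of a resource as capable of memory allocation, execution of agents, or both; the non-public constructor mirrors the note permitting implementations to construct resources internally.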
| Straw Poll |
|------------|
| Should the interface provide a way of creating an execution context from an execution resource? |
| *Is what is defined here a suitable solution?* |
## Importance of topology discovery
For traditional single-CPU systems, the execution resources can be reasoned about using standard constructs such as `std::thread`, `std::this_thread` and thread-local storage. This is because the C++ memory model requires that a system have **at least one thread of execution, some memory and some I/O capabilities**. For these systems, some assumptions about the topology can therefore be made at compile time: for example, developers can always query the available hardware concurrency, as there is always at least one thread, and can always use thread-local storage.
| Straw Poll |
|------------|
| *When do we enable the device discovery process? Can we change the system topology after executors have been created?* |
| *Should we provide an interface for providing a call-back on topology change?* |
## Lifetime considerations
As the execution context would provide a partitioning interface which returns objects describing the components of the system topology of an execution resource, it is important to consider the lifetime of these objects.
### Scaling to heterogeneous and distributed systems
The initial solution should target systems with a single addressable memory region, i.e. systems which do not have discrete non-accessible memory regions such as a discrete GPU or FPGA. However, in the interest of maintaining a unified interface going forward, the initial solution should be designed with the latter in mind and should be scalable to support these systems in the future. In particular, to support heterogeneous systems it is important that the abstraction allows querying the **resource topology** of the **system** in order to perform device discovery.
## Querying the Relative Affinity of Partitions
# Future Work
## Migrating data from memory allocated in one partition to another
In some cases it is important for performance to bind a memory allocation to a memory region for the duration of a task's execution; however, in other cases it is important to be able to migrate the data from one memory region to another. This is outside the scope of this paper, but we would like to investigate it in a future paper.
| Straw Poll |
|------------|
| Should the interface provide a way of migrating data between partitions? |
394
487
395
-
### Defining memory placement algorithms or policies
488
+
## Defining memory placement algorithms or policies
With the ability to place memory with affinity comes the ability to define algorithms or memory policies which describe, at a higher level, how memory is distributed across large systems. Some examples are pinned, first-touch and scatter. This is outside the scope of this paper, but we would like to investigate it in a future paper.