Commit 48b4eea

add gomp envs (#1572)
1 parent 78ec388 commit 48b4eea

4 files changed

+49
-28
lines changed

docs/tutorials/performance_tuning/launch_script.md

Lines changed: 19 additions & 17 deletions
@@ -25,7 +25,7 @@ Available option settings (knobs) are listed below:
 | `-h`, `--help` | - | - | show this help message and exit |
 | `-m`, `--module` | - | False | Changes each process to interpret the launch script as a python module, executing with the same behavior as 'python -m'. |
 | `--no-python` | - | False | Avoid applying `python` to execute `program`. |
-| `--log-path` | str | '' | The log file directory. Setting it to empty ('') disables logging to files. |
+| `--log-dir` | str | '' | The log file directory. Setting it to empty ('') disables logging to files. |
 | `--log-file-prefix` | str | 'run' | log file name prefix |

 Launcher Common Arguments:
@@ -78,7 +78,7 @@ Distributed Training Arguments With oneCCL backend:

 The *launch* script respects existing environment variables when it get launched, except for *LD_PRELOAD*. If you have your favorite values for certain environment variables, you can set them before running the *launch* script. Intel OpenMP library uses an environment variable *KMP_AFFINITY* to control its behavior. Different settings result in different performance numbers. By default, if you enable Intel OpenMP library, the *launch* script will set *KMP_AFFINITY* to `granularity=fine,compact,1,0`. If you want to try with other values, you can use `export` command on Linux to set *KMP_AFFINITY* before you run the *launch* script. In this case, the script will not set the default value but take the existing value of *KMP_AFFINITY*, and print a message to stdout.

-Execution via the *launch* script can dump logs into files under a designated log directory so you can do some investigations afterward. By default, it is disabled to avoid undesired log files. You can enable logging by setting knob `--log-path` to be:
+Execution via the *launch* script can dump logs into files under a designated log directory so you can do some investigations afterward. By default, it is disabled to avoid undesired log files. You can enable logging by setting knob `--log-dir` to be:

 - directory to store log files. It can be an absolute path or relative path.
 - types of log files to generate. One file (`<prefix>_timestamp_instances.log`) contains command and information when the script was launched. Another type of file (`<prefix>_timestamp_instance_#_core#-core#....log`) contain stdout print of each instance.
@@ -119,7 +119,7 @@ __Note:__ GIF files below illustrate CPU usage ONLY. Do NOT infer performance nu
 #### I. Use all physical cores

 ```
-ipexrun --log-path ./logs resnet50.py
+ipexrun --log-dir ./logs resnet50.py
 ```

 CPU usage is shown as below. 1 main worker thread was launched, then it launched physical core number of threads on all physical cores.
@@ -153,7 +153,7 @@ $ cat logs/run_20210712212258_instances.log
 #### II. Use all cores including logical cores

 ```
-ipexrun --use-logical-core --log-path ./logs resnet50.py
+ipexrun --use-logical-core --log-dir ./logs resnet50.py
 ```

 CPU usage is shown as below. 1 main worker thread was launched, then it launched threads on all cores, including logical cores.
@@ -187,7 +187,7 @@ $ cat logs/run_20210712223308_instances.log
 #### III. Use physical cores on designated nodes

 ```
-ipexrun --nodes-list 1 --log-path ./logs resnet50.py
+ipexrun --nodes-list 1 --log-dir ./logs resnet50.py
 ```

 CPU usage is shown as below. 1 main worker thread was launched, then it launched threads on all other cores on the same numa node.
@@ -221,7 +221,7 @@ $ cat logs/run_20210712214504_instances.log
 #### IV. Use your designated number of cores

 ```
-ipexrun --ninstances 1 --ncores-per-instance 10 --log-path ./logs resnet50.py
+ipexrun --ninstances 1 --ncores-per-instance 10 --log-dir ./logs resnet50.py
 ```

 CPU usage is shown as below. 1 main worker thread was launched, then it launched threads on other 9 physical cores.
@@ -254,7 +254,7 @@ $ cat logs/run_20210712220928_instances.log
 You can also specify the cores to be utilized using `--cores-list` argument. For example, if core id 11-20 are desired instead of the first 10 cores, the launch command would be as below.

 ```
-ipexrun --ncores-per-instance 10 --cores-list "11-20" --log-path ./logs resnet50.py
+ipexrun --ncores-per-instance 10 --cores-list "11-20" --log-dir ./logs resnet50.py
 ```

 Please notice that when specifying `--cores-list`, a correspondant `--ncores-per-instance` argument is required for instance number deduction.
@@ -286,7 +286,7 @@ $ cat logs/run_20210712221615_instances.log
 #### V. Throughput mode

 ```
-ipexrun --throughput-mode --log-path ./logs resnet50.py
+ipexrun --throughput-mode --log-dir ./logs resnet50.py
 ```

 CPU usage is shown as below. 2 main worker threads were launched on 2 numa nodes respectively, then they launched threads on other physical cores.
@@ -321,7 +321,7 @@ $ cat logs/run_20210712221150_instances.log
 #### VI. Latency mode

 ```
-ipexrun --latency-mode --log-path ./logs resnet50.py
+ipexrun --latency-mode --log-dir ./logs resnet50.py
 ```

 CPU usage is shown as below. 4 cores are used for each instance.
@@ -375,7 +375,7 @@ $ cat logs/run_20210712221415_instances.log
 #### VII. Your designated number of instances

 ```
-ipexrun --ninstances 4 --log-path ./logs resnet50.py
+ipexrun --ninstances 4 --log-dir ./logs resnet50.py
 ```

 CPU usage is shown as below. 4 main worker thread were launched, then they launched threads on all other physical cores.
@@ -416,7 +416,7 @@ $ cat logs/run_20210712221305_instances.log
 Launcher by default runs all `ninstances` for multi-instance inference/training as shown above. You can specify `instance_idx` to independently run that instance only among `ninstances`

 ```
-ipexrun --ninstances 4 --instance-idx 0 --log-path ./logs resnet50.py
+ipexrun --ninstances 4 --instance-idx 0 --log-dir ./logs resnet50.py
 ```

 you can confirm usage in log file:
@@ -431,7 +431,7 @@ you can confirm usage in log file:
 ```

 ```
-ipexrun --ninstances 4 --instance-idx 1 --log-path ./logs resnet50.py
+ipexrun --ninstances 4 --instance-idx 1 --log-dir ./logs resnet50.py
 ```

 you can confirm usage in log file:
@@ -454,7 +454,7 @@ Memory allocator influences performance sometime. If users do not designate desi
 __Note:__ You can set your favorite value to *MALLOC_CONF* before running the *launch* script if you do not want to use its default setting.

 ```
-ipexrun --memory-allocator jemalloc --log-path ./logs resnet50.py
+ipexrun --memory-allocator jemalloc --log-dir ./logs resnet50.py
 ```

 you can confirm usage in log file:
@@ -474,7 +474,7 @@ you can confirm usage in log file:
 #### TCMalloc

 ```
-ipexrun --memory-allocator tcmalloc --log-path ./logs resnet50.py
+ipexrun --memory-allocator tcmalloc --log-dir ./logs resnet50.py
 ```

 you can confirm usage in log file:
@@ -493,7 +493,7 @@ you can confirm usage in log file:
 #### Default memory allocator

 ```
-ipexrun --memory-allocator default --log-path ./logs resnet50.py
+ipexrun --memory-allocator default --log-dir ./logs resnet50.py
 ```

 you can confirm usage in log file:
@@ -516,16 +516,18 @@ Generally, Intel OpenMP library brings better performance. Thus, in the *launch*

 #### GNU OpenMP Library

-It is, however, not always that Intel OpenMP library brings better performance comparing to GNU OpenMP library. In this case, you can use knob `--disable_iomp` to switch active OpenMP library to the GNU one.
+It is, however, not always that Intel OpenMP library brings better performance comparing to GNU OpenMP library. In this case, you can use knob `--omp-runtime default` to switch active OpenMP library to the GNU one. GNU OpenMP specific environment variables, *OMP_SCHEDULE* and *OMP_PROC_BIND*, for setting CPU affinity are set automatically.

 ```
-ipexrun --omp-runtime default --log-path ./logs resnet50.py
+ipexrun --omp-runtime default --log-dir ./logs resnet50.py
 ```

 you can confirm usage in log file:

 ```
 2021-07-13 15:25:00,760 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-13 15:25:00,761 - __main__ - INFO - OMP_SCHEDULE=STATIC
+2021-07-13 15:25:00,761 - __main__ - INFO - OMP_PROC_BIND=CLOSE
 2021-07-13 15:25:00,761 - __main__ - INFO - OMP_NUM_THREADS=44
 2021-07-13 15:25:00,761 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes
 2021-07-13 15:25:00,761 - __main__ - INFO - numactl -C 0-43 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210713152500_instance_0_cores_0-43.log
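
The documentation change above tells users to confirm the OpenMP settings in the log file. One way to do that check in a script is sketched below. The `env_from_log` helper and its regular expression are illustrative assumptions, not part of the launcher; the optional `env: ` prefix is accepted because the launcher code in this commit logs variables as `env: {k}={v}`.

```python
import re

# Matches launcher log lines such as
# "2021-07-13 15:25:00,761 - __main__ - INFO - OMP_SCHEDULE=STATIC".
# The pattern is an assumption based on the sample log in the docs.
ENV_LINE = re.compile(r" - INFO - (?:env: )?([A-Z_][A-Z0-9_]*)=(.*)$")


def env_from_log(lines):
    """Collect NAME=value pairs reported on INFO lines of a launcher log."""
    found = {}
    for line in lines:
        match = ENV_LINE.search(line.rstrip("\n"))
        if match:
            found[match.group(1)] = match.group(2)
    return found


sample = [
    "2021-07-13 15:25:00,761 - __main__ - INFO - OMP_SCHEDULE=STATIC",
    "2021-07-13 15:25:00,761 - __main__ - INFO - OMP_PROC_BIND=CLOSE",
    "2021-07-13 15:25:00,761 - __main__ - INFO - OMP_NUM_THREADS=44",
    "2021-07-13 15:25:00,761 - __main__ - INFO - numactl -C 0-43 python resnet50.py",
]
settings = env_from_log(sample)
```

With a real log, `env_from_log(open(path))` would work the same way, since the function only iterates over lines. The command line on the last sample row is ignored because it does not have the `NAME=value` shape.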

intel_extension_for_pytorch/cpu/launch/launcher_base.py

Lines changed: 5 additions & 0 deletions
@@ -192,6 +192,7 @@ def set_memory_allocator(self, memory_allocator='auto', benchmark=False, skip_li
                 self.add_env('MALLOC_CONF', 'oversize_threshold:1,background_thread:false,metadata_thp:always,dirty_decay_ms:-1,muzzy_decay_ms:-1')
             else:
                 self.add_env('MALLOC_CONF', 'oversize_threshold:1,background_thread:true,metadata_thp:auto')
+        return ma_local

     def set_omp_runtime(self, omp_runtime='auto', set_kmp_affinity=True):
         '''
@@ -203,6 +204,10 @@ def set_omp_runtime(self, omp_runtime='auto', set_kmp_affinity=True):
             if set_kmp_affinity:
                 self.add_env('KMP_AFFINITY', 'granularity=fine,compact,1,0')
             self.add_env('KMP_BLOCKTIME', '1')
+        elif omp_local == 'default':
+            self.add_env('OMP_SCHEDULE', 'STATIC')
+            self.add_env('OMP_PROC_BIND', 'CLOSE')
+        return omp_local

     def parse_list_argument(self, txt):
         ret = []
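
To make the effect of the new `elif` branch easier to see in isolation, here is a minimal standalone sketch of the runtime-to-environment mapping. It assumes the nesting implied by the hunk (indentation is lost in this view, so `KMP_BLOCKTIME` is assumed to sit under the `intel` branch), and it models the documented "existing variables win" behavior with a hypothetical `existing` argument. `omp_env` is not a launcher function, and the real `add_env` implementation is not shown in this diff.

```python
def omp_env(omp_runtime, set_kmp_affinity=True, existing=None):
    """Return the variables the launcher would add for an OpenMP runtime.

    `existing` stands in for variables the user already exported; per the
    launcher docs those are respected, so they are left out of the result.
    """
    existing = {} if existing is None else existing
    wanted = {}
    if omp_runtime == "intel":
        if set_kmp_affinity:
            wanted["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
        wanted["KMP_BLOCKTIME"] = "1"
    elif omp_runtime == "default":
        # GNU OpenMP: the two variables introduced by this commit.
        wanted["OMP_SCHEDULE"] = "STATIC"
        wanted["OMP_PROC_BIND"] = "CLOSE"
    # Only add names that the user has not set already.
    return {k: v for k, v in wanted.items() if k not in existing}


gnu = omp_env("default")
intel = omp_env("intel", existing={"KMP_AFFINITY": "scatter"})
```

Here `gnu` holds the two GNU OpenMP variables, while `intel` holds only `KMP_BLOCKTIME` because the caller already supplied a `KMP_AFFINITY` value.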

intel_extension_for_pytorch/cpu/launch/launcher_distributed.py

Lines changed: 0 additions & 3 deletions
@@ -188,7 +188,6 @@ def launch(self, args):
         for k,v in self.environ_set.items():
             self.verbose('info', f'env: {k}={v}')

-        os.environ['LAUNCH_CMD'] = '#'
         cmd = ['mpiexec.hydra']
         genvs = [f'-genv {k}={v}' for k,v in self.environ_set.items()]
         mpi_config = f"-l -np {args.nnodes * args.nprocs_per_node} -ppn {args.nprocs_per_node} {' '.join(genvs)} "
@@ -228,8 +227,6 @@ def launch(self, args):
                     self.verbose('warning', f'Failed to detect rank id from log file {log_name} at line "{line.strip()}".')
         for fn in log_fns:
             fn.close()
-        os.environ['LAUNCH_CMD'] += f'{" ".join(cmd)},#'
-        os.environ['LAUNCH_CMD'] = os.environ['LAUNCH_CMD'][:-2]

 if __name__ == '__main__':
     pass

intel_extension_for_pytorch/cpu/launch/launcher_multi_instances.py

Lines changed: 25 additions & 8 deletions
@@ -100,9 +100,10 @@ def set_multi_task_manager(self, multi_task_manager='auto', skip_list=[]):
         tm_local = self.set_lib_bin_from_list(multi_task_manager, tm_bin_name, 'multi-task manager', self.tm_supported, self.is_command_available, skip_list)
         return tm_local

-    def execution_command_builder(self, args, task_mgr, cpu_pools, index):
-        cmd = []
+    def execution_command_builder(self, args, omp_runtime, task_mgr, environ, cpu_pools, index):
         assert index > -1 and index <= len(cpu_pools), 'Designated instance index for constructing execution commands is out of range.'
+        cmd = []
+        environ_local = environ
         pool = cpu_pools[index]
         pool_txt = pool.get_pool_txt()
         cores_list_local = pool_txt['cores']
@@ -116,6 +117,19 @@ def execution_command_builder(self, args, task_mgr, cpu_pools, index):
                 params = f'-c {cores_list_local}'
             cmd.append(task_mgr)
             cmd.extend(params.split())
+        else:
+            k = ''
+            v = ''
+            if omp_runtime == 'default':
+                k = 'GOMP_CPU_AFFINITY'
+                v = cores_list_local
+            elif omp_runtime == 'intel':
+                k = 'KMP_AFFINITY'
+                v = f'granularity=fine,proclist=[{cores_list_local}],explicit'
+            if k != '':
+                self.verbose('info', '==========')
+                self.verbose('info', f'env: {k}={v}')
+                environ_local[k] = v

         if not args.no_python:
             cmd.append(sys.executable)
@@ -126,14 +140,13 @@ def execution_command_builder(self, args, task_mgr, cpu_pools, index):
             log_name = f'{args.log_file_prefix}_instance_{index}_cores_{cores_list_local.replace(",", "_")}.log'
             log_name = os.path.join(args.log_dir, log_name)
         cmd.extend(args.program_args)
-        os.environ['LAUNCH_CMD'] += '{" ".join(cmd)},#'
         cmd_s = ' '.join(cmd)
         if args.log_dir:
             cmd_s = f'{cmd_s} 2>&1 | tee {log_name}'
         self.verbose('info', f'cmd: {cmd_s}')
         if len(set([c.node for c in pool])) > 1:
             self.verbose('warning', f'Cross NUMA nodes execution detected: cores [{cores_list_local}] are on different NUMA nodes [{nodes_list_local}]')
-        process = subprocess.Popen(cmd_s, env=os.environ, shell=True)
+        process = subprocess.Popen(cmd_s, env=environ_local, shell=True)
         return {'process': process, 'cmd': cmd_s}

     def launch(self, args):
@@ -177,7 +190,7 @@ def launch(self, args):
             set_kmp_affinity = False

         self.set_memory_allocator(args.memory_allocator, args.benchmark)
-        self.set_omp_runtime(args.omp_runtime, set_kmp_affinity)
+        omp_runtime = self.set_omp_runtime(args.omp_runtime, set_kmp_affinity)
         self.add_env('OMP_NUM_THREADS', str(args.ncores_per_instance))

         skip_list = []
@@ -187,8 +200,12 @@ def launch(self, args):

         # Set environment variables for multi-instance execution
         for k,v in self.environ_set.items():
+            if task_mgr == self.tm_supported[1]:
+                if omp_runtime == 'default' and k == 'GOMP_CPU_AFFINITY':
+                    continue
+                if omp_runtime == 'intel' and k == 'KMP_AFFINITY':
+                    continue
             self.verbose('info', f'env: {k}={v}')
-            os.environ[k] = v

         if args.auto_ipex:
             args.program = auto_ipex.apply_monkey_patch(args.program, args.dtype, args.auto_ipex_verbose, args.disable_ipex_graph_mode)
@@ -203,15 +220,15 @@ def launch(self, args):
             instance_idx = list(set(instance_idx))
             assert set(instance_idx).issubset(set(instances_available)), f'Designated nodes list contains invalid nodes.'
         processes = []
-        os.environ["LAUNCH_CMD"] = "#"
         for i in instance_idx:
             process = self.execution_command_builder(
                 args = args,
+                omp_runtime = omp_runtime,
                 task_mgr = task_mgr,
+                environ = self.environ_set,
                 cpu_pools = self.cpuinfo.pools_ondemand,
                 index = i)
             processes.append(process)
-        os.environ["LAUNCH_CMD"] = os.environ["LAUNCH_CMD"][:-2]
         try:
             for process in processes:
                 p = process['process']
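
Two properties of the per-instance environment handling in this file are worth illustrating. In Python, `environ_local = environ` binds a second name to the same dict, so `environ_local[k] = v` also writes into the dict the caller passed (here `self.environ_set`). Separately, `subprocess.Popen(..., env=...)` replaces the child's whole environment with the mapping it is given. The sketch below shows one way to build an independent per-instance mapping. `instance_env` and its `inherited` parameter are hypothetical names, and whether `self.environ_set` already contains inherited variables such as `PATH` is not visible in this diff.

```python
import os


def instance_env(shared_env, omp_runtime, cores, inherited=None):
    """Build one instance's environment without mutating `shared_env`.

    `cores` is a core list string such as "0-3" or "0,1,2,3".
    """
    inherited = os.environ if inherited is None else inherited
    env = dict(inherited)   # copy, so PATH and friends reach the child
    env.update(shared_env)  # launcher-wide settings, e.g. OMP_NUM_THREADS
    if omp_runtime == "default":
        env["GOMP_CPU_AFFINITY"] = cores
    elif omp_runtime == "intel":
        env["KMP_AFFINITY"] = f"granularity=fine,proclist=[{cores}],explicit"
    return env


shared = {"OMP_NUM_THREADS": "4"}
first = instance_env(shared, "default", "0-3", inherited={"PATH": "/usr/bin"})
second = instance_env(shared, "default", "4-7", inherited={"PATH": "/usr/bin"})
```

Each call returns a fresh dict, so `first` and `second` carry different `GOMP_CPU_AFFINITY` values while `shared` stays unchanged. The affinity strings follow the formats used in the hunk above.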
