
Conversation

@david-cortes-intel (Contributor)

Description

Adds a kernel profiler log for the whole reduction phase in linear regression.

This revealed something fishy: the profiler shows the reduction taking a lot of time, but the sub-operations that already have logs inside it account for only a small fraction of that time. Example after merging #3435:

[screenshot: profiler output after merging #3435]

@Alexandr-Solovev Any comments about whether this is working as it should?


Checklist:

Completeness and readability

  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

@Alexandr-Solovev (Contributor)

> @Alexandr-Solovev Any comments about whether this is working as it should?

Are you sure that we cover everything inside the reduction with the kernel profiler? If yes, maybe there is an issue with the threading tasks' times; I can check it.

@david-cortes-intel (Contributor, Author)

> Are you sure that we cover everything inside the reduction with the kernel profiler? If yes, maybe there is an issue with the threading tasks' times; I can check it.

Yes, everything inside the threading task is profiled already, and the 'delete' call, if profiled, accounts for only a small amount of time.

@Alexandr-Solovev (Contributor)

> Yes, everything inside the threading task is profiled already, and the 'delete' call, if profiled, accounts for only a small amount of time.

Could you maybe try a quick experiment with set_num_threads(1)? Is the picture the same or not?

@david-cortes-intel (Contributor, Author)

> Could you maybe try a quick experiment with set_num_threads(1)? Is the picture the same or not?

If I pin it to a single core with numactl --physcpubind=<number>, then it still shows some gap:

```
Algorithm tree analyzer
|-- computeUpdate time: 918.48ms 94.68% 1 times in a sequential region
|  |-- update.syrkX time: 6.10ms 0.63% 157 times in a parallel region
|  |-- update.gemm1X time: 124.40us 0.01% 157 times in a parallel region
|  |-- update.gemmXY time: 8.22ms 0.85% 157 times in a parallel region
|  |-- update.gemm1Y time: 33.87us 0.00% 157 times in a parallel region
|  |-- reduction time: 4.64ms 0.48% 1 times in a sequential region
|  |  |-- reduce.syrkX time: 4.46ms 0.46% 1 times in a parallel region
|  |  |-- reduce.gemmXY time: 13.04us 0.00% 1 times in a parallel region
|  |  |-- reduction.delete time: 167.01us 0.02% 1 times in a parallel region
|-- computeFinalize time: 51.66ms 5.32% 1 times in a sequential region
|  |-- computeFinalize.betaBufCopy time: 4.74us 0.00% 1 times in a sequential region
|  |-- computeFinalize.xtxCopy time: 6.55ms 0.68% 1 times in a sequential region
|  |-- computeFinalize.computeBetasImpl time: 44.98ms 4.64% 1 times in a sequential region
|  |  |-- solveSymmetricEquationsSystem time: 44.96ms 4.63% 1 times in a sequential region
|  |  |  |-- solveEquationsSystemWithCholesky time: 36.99ms 3.81% 1 times in a sequential region
|  |-- computeFinalize.copyBetaToResult time: 5.81us 0.00% 1 times in a sequential region
|--(end)
DAAL KERNEL_PROFILER: kernels total time 970.14ms
```

If I pass n_jobs=1, then the times more or less agree:

```
Algorithm tree analyzer
|-- computeUpdate time: 946.55ms 94.70% 1 times in a sequential region
|  |-- update.syrkX time: 6.11ms 0.61% 157 times in a parallel region
|  |-- update.gemm1X time: 122.16us 0.01% 157 times in a parallel region
|  |-- update.gemmXY time: 34.14ms 3.42% 157 times in a parallel region
|  |-- update.gemm1Y time: 32.65us 0.00% 157 times in a parallel region
|  |-- reduction time: 4.71ms 0.47% 1 times in a sequential region
|  |  |-- reduce.syrkX time: 4.53ms 0.45% 1 times in a parallel region
|  |  |-- reduce.gemmXY time: 10.18us 0.00% 1 times in a parallel region
|  |  |-- reduction.delete time: 171.92us 0.02% 1 times in a parallel region
|-- computeFinalize time: 52.97ms 5.30% 1 times in a sequential region
|  |-- computeFinalize.betaBufCopy time: 4.77us 0.00% 1 times in a sequential region
|  |-- computeFinalize.xtxCopy time: 6.70ms 0.67% 1 times in a sequential region
|  |-- computeFinalize.computeBetasImpl time: 46.15ms 4.62% 1 times in a sequential region
|  |  |-- solveSymmetricEquationsSystem time: 46.14ms 4.62% 1 times in a sequential region
|  |  |  |-- solveEquationsSystemWithCholesky time: 38.30ms 3.83% 1 times in a sequential region
|  |-- computeFinalize.copyBetaToResult time: 6.08us 0.00% 1 times in a sequential region
|--(end)
DAAL KERNEL_PROFILER: kernels total time 999.52ms
```

But I'm not sure what the percentages refer to in that case. Are they always relative to the top-level total?

```cpp
daal::static_tls<ThreadingTaskType *> tls([=]() -> ThreadingTaskType * { return ThreadingTaskType::create(nBetasIntercept, nResponses); });

SafeStatus safeStat;
daal::static_threader_for(nBlocks, [=, &tls, &xTable, &yTable, &safeStat](int iBlock, size_t tid) {
```
Contributor:

I checked locally, and it looks like this block takes much of the time.

Contributor:

Can you also try wrapping it with the profiler?

Contributor (Author):

Before this PR, I tried doing it with direct calls to std::chrono, and the time was spent during the 'reduce' call. That call also has profiler logs inside it, and they more or less sum up to the total time measured around that call.

Contributor:

On my data, 'reduction_1' is the static_threader_for call:

[screenshot]

Contributor:

By the way, can you rebase your branch onto the latest main?

Contributor (Author):

Could you try it like this on some server with at least 64 cores and see what you get:

```python
import numpy as np
from time import time

from sklearnex.linear_model import Ridge

TRAIN_SIZE = 20000
rng = np.random.default_rng(seed=123)
X_train = rng.random(size=(TRAIN_SIZE, 2000))
y_train = rng.random(size=X_train.shape[0])

estimator = Ridge(fit_intercept=True, alpha=2.0)

tm0 = time()
result = estimator.fit(X_train, y_train)
tm1 = time()
print(tm1 - tm0)
```

Contributor (Author):

> By the way, can you rebase your branch onto the latest main?

By the way, these screenshots were taken after merging the PR that fixes the kernel profiler bugs.

Contributor:

[screenshot]

I used data generated with these parameters:

```python
X, y = generate_dataset(n_rows_X=20000, n_cols_X=2000, n_rows_y=20000, n_cols_y=1, seed=42)
```

Contributor (Author):

In that screenshot:

  • The parts under 'reduction_1' sum to more than their parent's time.
  • The parts under 'reduction' sum to only a small fraction of their parent's time.

Perhaps it would be clearer if you put calls to std::chrono line by line at these points:

```cpp
daal::static_threader_for(nBlocks, [=, &tls, &xTable, &yTable, &safeStat](int iBlock, size_t tid) {
```

```cpp
tls.reduce([=, &st](ThreadingTaskType * tlsLocal) -> void {
```
