
Conversation

@david-cortes-intel (Contributor)

Description

Adds a kernel profiler log for the whole reduction phase in linear regression.

This revealed something fishy: the profiler shows the reduction taking a lot of time, but the sub-operations that already have logs inside it account for only a small fraction of that time. Example after merging #3435:

[screenshot: profiler output after merging #3435]

@Alexandr-Solovev Any comments about whether this is working as it should?


Checklist:

Completeness and readability

  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

@Alexandr-Solovev (Contributor)

> @Alexandr-Solovev Any comments about whether this is working as it should?

Are you sure that we cover everything inside the reduction with the kernel profiler? If yes, maybe there is an issue with the threading tasks' times; I can check it.

@david-cortes-intel (Contributor, Author)

> Are you sure that we cover everything inside the reduction with the kernel profiler? If yes, maybe there is an issue with the threading tasks' times; I can check it.

Yes, everything inside the threading task is profiled already, and the 'delete' call, if profiled, accounts for only a small amount of time.

@Alexandr-Solovev (Contributor)

> Yes, everything inside the threading task is profiled already, and the 'delete' call, if profiled, accounts for only a small amount of time.

Could you maybe try a quick experiment with set_num_threads(1)? Is the picture the same or not?

@david-cortes-intel (Contributor, Author)

> Could you maybe try a quick experiment with set_num_threads(1)? Is the picture the same or not?

If I pin it to a single core with numactl --physcpubind=<number>, then it still shows some gap:

```
Algorithm tree analyzer
|-- computeUpdate time: 918.48ms 94.68% 1 times in a sequential region
|  |-- update.syrkX time: 6.10ms 0.63% 157 times in a parallel region
|  |-- update.gemm1X time: 124.40us 0.01% 157 times in a parallel region
|  |-- update.gemmXY time: 8.22ms 0.85% 157 times in a parallel region
|  |-- update.gemm1Y time: 33.87us 0.00% 157 times in a parallel region
|  |-- reduction time: 4.64ms 0.48% 1 times in a sequential region
|  |  |-- reduce.syrkX time: 4.46ms 0.46% 1 times in a parallel region
|  |  |-- reduce.gemmXY time: 13.04us 0.00% 1 times in a parallel region
|  |  |-- reduction.delete time: 167.01us 0.02% 1 times in a parallel region
|-- computeFinalize time: 51.66ms 5.32% 1 times in a sequential region
|  |-- computeFinalize.betaBufCopy time: 4.74us 0.00% 1 times in a sequential region
|  |-- computeFinalize.xtxCopy time: 6.55ms 0.68% 1 times in a sequential region
|  |-- computeFinalize.computeBetasImpl time: 44.98ms 4.64% 1 times in a sequential region
|  |  |-- solveSymmetricEquationsSystem time: 44.96ms 4.63% 1 times in a sequential region
|  |  |  |-- solveEquationsSystemWithCholesky time: 36.99ms 3.81% 1 times in a sequential region
|  |-- computeFinalize.copyBetaToResult time: 5.81us 0.00% 1 times in a sequential region
|--(end)
DAAL KERNEL_PROFILER: kernels total time 970.14ms
```

If I pass n_jobs=1, then the times more or less agree:

```
Algorithm tree analyzer
|-- computeUpdate time: 946.55ms 94.70% 1 times in a sequential region
|  |-- update.syrkX time: 6.11ms 0.61% 157 times in a parallel region
|  |-- update.gemm1X time: 122.16us 0.01% 157 times in a parallel region
|  |-- update.gemmXY time: 34.14ms 3.42% 157 times in a parallel region
|  |-- update.gemm1Y time: 32.65us 0.00% 157 times in a parallel region
|  |-- reduction time: 4.71ms 0.47% 1 times in a sequential region
|  |  |-- reduce.syrkX time: 4.53ms 0.45% 1 times in a parallel region
|  |  |-- reduce.gemmXY time: 10.18us 0.00% 1 times in a parallel region
|  |  |-- reduction.delete time: 171.92us 0.02% 1 times in a parallel region
|-- computeFinalize time: 52.97ms 5.30% 1 times in a sequential region
|  |-- computeFinalize.betaBufCopy time: 4.77us 0.00% 1 times in a sequential region
|  |-- computeFinalize.xtxCopy time: 6.70ms 0.67% 1 times in a sequential region
|  |-- computeFinalize.computeBetasImpl time: 46.15ms 4.62% 1 times in a sequential region
|  |  |-- solveSymmetricEquationsSystem time: 46.14ms 4.62% 1 times in a sequential region
|  |  |  |-- solveEquationsSystemWithCholesky time: 38.30ms 3.83% 1 times in a sequential region
|  |-- computeFinalize.copyBetaToResult time: 6.08us 0.00% 1 times in a sequential region
|--(end)
DAAL KERNEL_PROFILER: kernels total time 999.52ms
```

But I'm not sure what the percentages refer to in that case. Are they always relative to the top-level total?

```cpp
daal::static_tls<ThreadingTaskType *> tls([=]() -> ThreadingTaskType * { return ThreadingTaskType::create(nBetasIntercept, nResponses); });

SafeStatus safeStat;
daal::static_threader_for(nBlocks, [=, &tls, &xTable, &yTable, &safeStat](int iBlock, size_t tid) {
```
Contributor:

I checked locally, and it looks like this block takes much of the time.

Contributor:

Can you also try wrapping it with the profiler?

Contributor (Author):

Before this PR, I tried doing it with direct calls to std::chrono, and the time was spent during the 'reduce' call. That call also has profiler logs inside it, and they more or less sum up to the total time measured around that call.

Contributor:

On my data, 'reduction_1' is the static_threader_for call:

[screenshot]

Contributor:

By the way, can you rebase your branch onto the latest main?

Contributor (Author):

Could you try it like this on some server with at least 64 cores and see what you get:

```python
import numpy as np
from time import time

from sklearnex.linear_model import Ridge

TRAIN_SIZE = 20000
rng = np.random.default_rng(seed=123)
X_train = rng.random(size=(TRAIN_SIZE, 2000))
y_train = rng.random(size=X_train.shape[0])

estimator = Ridge(fit_intercept=True, alpha=2.0)

tm0 = time()
result = estimator.fit(X_train, y_train)
tm1 = time()
print(tm1 - tm0)
```

Contributor (Author):

> By the way, can you rebase your branch onto the latest main?

By the way, these screenshots were taken after merging the PR that fixes the kernel profiler bugs.

Contributor:

[screenshot]

I used data generated with these parameters:

```python
X, y = generate_dataset(n_rows_X=20000, n_cols_X=2000, n_rows_y=20000, n_cols_y=1, seed=42)
```

Contributor (Author):

In that screenshot:

  • The parts under 'reduction_1' sum to more than their parent's time.
  • The parts under 'reduction' sum to only a small fraction of their parent's time.

Perhaps it would be clearer if you put calls to std::chrono line by line at these points:

```cpp
daal::static_threader_for(nBlocks, [=, &tls, &xTable, &yTable, &safeStat](int iBlock, size_t tid) {
```

```cpp
tls.reduce([=, &st](ThreadingTaskType * tlsLocal) -> void {
```
