ENH: Add kernel profiler log for linear regression reduction #3442
Conversation
cpp/daal/src/algorithms/linear_model/linear_model_train_normeq_update_impl.i
Are you sure that we cover everything inside the reduction with the kernel profiler? If yes, maybe there is an issue with the threading task times; I can check it.

Yes, everything inside the threading task is already profiled, and the 'delete' call, if profiled, accounts for only a small amount of time.

Can you maybe do a tricky experiment with set_num_threads(1)?
If I set it to a single thread, I'm not sure what the percentages are showing in such a case. Are those percentages always meant to be relative to the top entry?
```cpp
daal::static_tls<ThreadingTaskType *> tls([=]() -> ThreadingTaskType * { return ThreadingTaskType::create(nBetasIntercept, nResponses); });

SafeStatus safeStat;
daal::static_threader_for(nBlocks, [=, &tls, &xTable, &yTable, &safeStat](int iBlock, size_t tid) {
```
I checked locally, and it looks like this block takes most of the time.
Can you also try wrapping it with the profiler?
Before this PR, I tried doing it with direct calls to std::chrono, and the time was spent during the 'reduce' call. That call also has profiler logs inside it, and they more or less sum to the total time measured around that call.
By the way, can you rebase your branch onto the latest main?
Could you try it like this on a server with at least 64 cores and see what you get:
```python
import numpy as np
from time import time
import os
from sklearnex.linear_model import Ridge
import joblib

TRAIN_SIZE = 20000
TEST_SIZE = 80000

rng = np.random.default_rng(seed=123)
X_train = rng.random(size=(TRAIN_SIZE, 2000))
y_train = rng.random(size=X_train.shape[0])

estimator = Ridge(fit_intercept=True, alpha=2.0)
tm0 = time()
result = estimator.fit(X_train, y_train)
tm1 = time()
print(tm1 - tm0)
```
> btw, can you rebase your branch with the latest updates in main

These screenshots were taken after merging the PR that fixes the kernel profiler bugs, by the way.
In that screenshot:
- The parts that are under 'reduction_1' sum to more than its parent.
- The parts under 'reduction' are only a small fraction of the parent.
Perhaps it would be clearer if you put calls to std::chrono line-by-line at these points:
- oneDAL/cpp/daal/src/algorithms/linear_model/linear_model_train_normeq_update_impl.i, line 316 in df4033c:
  `daal::static_threader_for(nBlocks, [=, &tls, &xTable, &yTable, &safeStat](int iBlock, size_t tid) {`
- same file, line 336 in df4033c: `Status st = safeStat.detach();`
- same file, line 337 in df4033c: `tls.reduce([=, &st](ThreadingTaskType * tlsLocal) -> void {`
- same file, line 343 in df4033c: `return st;`


Description
Adds a kernel profiler log for the whole reduction phase in linear regression.
This revealed that something fishy is going on: the log shows the operation taking a lot of time, but the sub-operations inside it that already have logs account for only a small fraction. Example after merging #3435:

@Alexandr-Solovev Any comments about whether this is working as it should?
Checklist:
Completeness and readability
Testing