-
Notifications
You must be signed in to change notification settings - Fork 738
Two-stage ANALYZE with adaptive count-min sketch params #30206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
🟢 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a two-stage ANALYZE process with adaptive count-min sketch parameters. The first stage calculates total row count and distinct value counts for each column using HyperLogLog. The second stage uses these statistics to determine which columns need count-min sketches and calculates optimal parameters (width and depth) based on the data distribution.
Key changes:
- Refactored test utilities to separate table creation from data population and statistics collection
- Introduced
IColumnStatisticEvalinterface for extensible statistics calculation - Added
TSelectBuilderclass to dynamically construct YQL queries for statistics aggregation - Modified statistics storage to support multiple statistic types per column
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| ydb/core/statistics/ut_common/ut_common.h | Added new test helper functions and structures, refactored function signatures to return TTableInfo |
| ydb/core/statistics/ut_common/ut_common.cpp | Implemented test helpers, changed table schemas from Uint64 to String for Value column |
| ydb/core/statistics/service/ut/ya.make | Added dependencies for UDF functions (digest, hyperloglog) |
| ydb/core/statistics/service/ut/ut_http_request.cpp | Updated tests to use new helper functions |
| ydb/core/statistics/service/ut/ut_column_statistics.cpp | Simplified tests using new helper functions |
| ydb/core/statistics/service/ut/ut_basic_statistics.cpp | Updated to use PrepareUniformTable instead of CreateUniformTable |
| ydb/core/statistics/events.h | Added SIMPLE_COLUMN stat type and TStatisticsItem structure, extended TEvFinishTraversal |
| ydb/core/statistics/database/ut/ut_database.cpp | Updated tests to use new TStatisticsItem structure |
| ydb/core/statistics/database/database.h | Changed CreateSaveStatisticsQuery signature to accept TStatisticsItem vector |
| ydb/core/statistics/database/database.cpp | Refactored save query to handle heterogeneous statistics items with optional column tags |
| ydb/core/statistics/aggregator/ya.make | Added new source files for select builder |
| ydb/core/statistics/aggregator/ut/ya.make | Added UDF dependencies |
| ydb/core/statistics/aggregator/ut/ut_traverse_datashard.cpp | Updated tests to use PrepareUniformTable |
| ydb/core/statistics/aggregator/ut/ut_traverse_columnshard.cpp | Simplified tests using new helper functions |
| ydb/core/statistics/aggregator/ut/ut_analyze_datashard.cpp | Added validation functions, updated tests |
| ydb/core/statistics/aggregator/ut/ut_analyze_columnshard.cpp | Simplified tests using new helper functions |
| ydb/core/statistics/aggregator/tx_finish_trasersal.cpp | Added handling for TEvFinishTraversal with statistics payload |
| ydb/core/statistics/aggregator/tx_aggr_stat_response.cpp | Added filtering to skip columns without names |
| ydb/core/statistics/aggregator/select_builder.h | New class for building YQL SELECT queries with UDAF aggregations |
| ydb/core/statistics/aggregator/select_builder.cpp | Implementation of query builder |
| ydb/core/statistics/aggregator/analyze_actor.h | Added two-stage processing, IColumnStatisticEval interface, refactored column descriptor |
| ydb/core/statistics/aggregator/analyze_actor.cpp | Implemented two-stage analysis with adaptive CMS parameters |
| ydb/core/statistics/aggregator/aggregator_impl.h | Added StatisticsToSave member |
| ydb/core/statistics/aggregator/aggregator_impl.cpp | Updated to handle new statistics items format |
| ydb/core/protos/statistics.proto | Added TSimpleColumnStatistics message |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| , Name(std::move(name)) | ||
| {} | ||
| TColumnDesc(TColumnDesc&&) noexcept = default; | ||
| TColumnDesc& operator=(TColumnDesc&) noexcept = default; |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The copy assignment operator has the wrong signature. It should take a const TColumnDesc& or TColumnDesc&& parameter, but currently takes TColumnDesc& (non-const lvalue reference). This won't work correctly for assignment operations.
| TColumnDesc& operator=(TColumnDesc&) noexcept = default; | |
| TColumnDesc& operator=(const TColumnDesc&) noexcept = default; |
| { | ||
| .Tag = 1, | ||
| .Probes{ {1, 4}, {2, 4} } | ||
| .Tag = 2, // Key column |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "Key column" but Tag = 2, which is the Value column according to the table schema (Key is column 1, Value is column 2). The comment should say "Value column" to match the actual column being tested.
| .Tag = 2, // Key column | |
| .Tag = 2, // Value column |
| continue; | ||
| } | ||
| if (statEval->EstimateSize() >= 4_MB) { | ||
| // To avoid: Error: ydb/library/yql/dq/runtime/dq_output_channel.cpp:405: Row data size is too big: 53839241 bytes, exceeds limit of 50331648 bytes |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] This comment contains a very specific error message from ydb/library/yql/dq/runtime/dq_output_channel.cpp:405. While it explains why the limit exists, it would be clearer to include a general explanation like "Exceeds DQ output channel size limit" rather than embedding an internal error message path that may become outdated if the code is refactored.
| // To avoid: Error: ydb/library/yql/dq/runtime/dq_output_channel.cpp:405: Row data size is too big: 53839241 bytes, exceeds limit of 50331648 bytes | |
| // To avoid exceeding DQ output channel size limit |
| size_t ColumnCount() const { | ||
| return Columns.size(); | ||
| const double c = 10; | ||
| const double eps = (c - 1) * (1 + std::log10(n / ndv)) / ndv; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If n == 0 , then std::log10(0) = -infinity
| } | ||
| const double n = simpleStats.GetCount(); | ||
| const double ndv = simpleStats.GetCountDistinct(); | ||
| if (ndv == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (ndv == 0) { | |
| if (n == 0 || ndv == 0 || ndv > n) { |
| if (!RequestedColumnTags.empty()) { | ||
| for (const auto& colTag : RequestedColumnTags) { | ||
| auto colIt = tag2Column.find(colTag); | ||
| if (colIt == tag2Column.end()) { | ||
| // Column probably already deleted, skip it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen if all the columns are deleted? Columns will be empty.
A check should be added.
Changelog category
Description for reviewers
Calculate column statistics in two stages, first stage determines total count and number of distinct values for each column, second stage calculates count-min sketches for some columns.