Skip to content

Conversation

@ztlpn
Copy link
Collaborator

@ztlpn ztlpn commented Dec 5, 2025

Changelog category

  • Not for changelog (changelog entry is not required)

Description for reviewers

Calculate column statistics in two stages, first stage determines total count and number of distinct values for each column, second stage calculates count-min sketches for some columns.

@ztlpn ztlpn requested a review from azevaykin December 5, 2025 12:15
@ztlpn ztlpn requested a review from a team as a code owner December 5, 2025 12:15
@ydbot
Copy link
Collaborator

ydbot commented Dec 5, 2025

Run Extra Tests

Run additional tests for this PR. You can customize:

  • Test Size: small, medium, large (default: all)
  • Test Targets: any directory path (default: ydb/)
  • Sanitizers: ASAN, MSAN, TSAN
  • Coredumps: enable for debugging (default: off)
  • Additional args: custom ya make arguments

▶  Run tests

@ztlpn ztlpn self-assigned this Dec 5, 2025
@github-actions
Copy link

github-actions bot commented Dec 5, 2025

2025-12-05 12:17:03 UTC Pre-commit check linux-x86_64-relwithdebinfo for 3100eb4 has started.
2025-12-05 12:17:21 UTC Artifacts will be uploaded here
2025-12-05 12:19:30 UTC ya make is running...
🟡 2025-12-05 14:40:53 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
41828 38890 0 5 2903 30

2025-12-05 14:41:07 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-12-05 14:53:24 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
57 (only retried tests) 44 0 0 0 13

🟢 2025-12-05 14:53:31 UTC Build successful.
🟢 2025-12-05 14:53:53 UTC ydbd size 2.3 GiB changed* by +59.6 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 3707c7f merge: 3100eb4 diff diff %
ydbd size 2 465 899 016 Bytes 2 465 960 064 Bytes +59.6 KiB +0.002%
ydbd stripped size 524 793 600 Bytes 524 808 192 Bytes +14.2 KiB +0.003%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions
Copy link

github-actions bot commented Dec 5, 2025

2025-12-05 12:17:19 UTC Pre-commit check linux-x86_64-release-asan for 3100eb4 has started.
2025-12-05 12:17:39 UTC Artifacts will be uploaded here
2025-12-05 12:19:46 UTC ya make is running...
🟡 2025-12-05 14:08:38 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
13546 13464 0 66 7 9

🟢 2025-12-05 14:08:46 UTC Build successful.
🟡 2025-12-05 14:09:15 UTC ydbd size 3.8 GiB changed* by +102.9 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 3707c7f merge: 3100eb4 diff diff %
ydbd size 4 127 315 488 Bytes 4 127 420 880 Bytes +102.9 KiB +0.003%
ydbd stripped size 1 532 447 576 Bytes 1 532 491 544 Bytes +42.9 KiB +0.003%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions
Copy link

github-actions bot commented Dec 5, 2025

🟢 2025-12-05 12:18:46 UTC The validation of the Pull Request description is successful.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a two-stage ANALYZE process with adaptive count-min sketch parameters. The first stage calculates total row count and distinct value counts for each column using HyperLogLog. The second stage uses these statistics to determine which columns need count-min sketches and calculates optimal parameters (width and depth) based on the data distribution.

Key changes:

  • Refactored test utilities to separate table creation from data population and statistics collection
  • Introduced IColumnStatisticEval interface for extensible statistics calculation
  • Added TSelectBuilder class to dynamically construct YQL queries for statistics aggregation
  • Modified statistics storage to support multiple statistic types per column

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
ydb/core/statistics/ut_common/ut_common.h Added new test helper functions and structures, refactored function signatures to return TTableInfo
ydb/core/statistics/ut_common/ut_common.cpp Implemented test helpers, changed table schemas from Uint64 to String for Value column
ydb/core/statistics/service/ut/ya.make Added dependencies for UDF functions (digest, hyperloglog)
ydb/core/statistics/service/ut/ut_http_request.cpp Updated tests to use new helper functions
ydb/core/statistics/service/ut/ut_column_statistics.cpp Simplified tests using new helper functions
ydb/core/statistics/service/ut/ut_basic_statistics.cpp Updated to use PrepareUniformTable instead of CreateUniformTable
ydb/core/statistics/events.h Added SIMPLE_COLUMN stat type and TStatisticsItem structure, extended TEvFinishTraversal
ydb/core/statistics/database/ut/ut_database.cpp Updated tests to use new TStatisticsItem structure
ydb/core/statistics/database/database.h Changed CreateSaveStatisticsQuery signature to accept TStatisticsItem vector
ydb/core/statistics/database/database.cpp Refactored save query to handle heterogeneous statistics items with optional column tags
ydb/core/statistics/aggregator/ya.make Added new source files for select builder
ydb/core/statistics/aggregator/ut/ya.make Added UDF dependencies
ydb/core/statistics/aggregator/ut/ut_traverse_datashard.cpp Updated tests to use PrepareUniformTable
ydb/core/statistics/aggregator/ut/ut_traverse_columnshard.cpp Simplified tests using new helper functions
ydb/core/statistics/aggregator/ut/ut_analyze_datashard.cpp Added validation functions, updated tests
ydb/core/statistics/aggregator/ut/ut_analyze_columnshard.cpp Simplified tests using new helper functions
ydb/core/statistics/aggregator/tx_finish_trasersal.cpp Added handling for TEvFinishTraversal with statistics payload
ydb/core/statistics/aggregator/tx_aggr_stat_response.cpp Added filtering to skip columns without names
ydb/core/statistics/aggregator/select_builder.h New class for building YQL SELECT queries with UDAF aggregations
ydb/core/statistics/aggregator/select_builder.cpp Implementation of query builder
ydb/core/statistics/aggregator/analyze_actor.h Added two-stage processing, IColumnStatisticEval interface, refactored column descriptor
ydb/core/statistics/aggregator/analyze_actor.cpp Implemented two-stage analysis with adaptive CMS parameters
ydb/core/statistics/aggregator/aggregator_impl.h Added StatisticsToSave member
ydb/core/statistics/aggregator/aggregator_impl.cpp Updated to handle new statistics items format
ydb/core/protos/statistics.proto Added TSimpleColumnStatistics message

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

, Name(std::move(name))
{}
TColumnDesc(TColumnDesc&&) noexcept = default;
TColumnDesc& operator=(TColumnDesc&) noexcept = default;
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copy assignment operator has the wrong signature. It should take a const TColumnDesc& or TColumnDesc&& parameter, but currently takes TColumnDesc& (non-const lvalue reference). This won't work correctly for assignment operations.

Suggested change
TColumnDesc& operator=(TColumnDesc&) noexcept = default;
TColumnDesc& operator=(const TColumnDesc&) noexcept = default;

Copilot uses AI. Check for mistakes.
{
.Tag = 1,
.Probes{ {1, 4}, {2, 4} }
.Tag = 2, // Key column
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "Key column" but Tag = 2, which is the Value column according to the table schema (Key is column 1, Value is column 2). The comment should say "Value column" to match the actual column being tested.

Suggested change
.Tag = 2, // Key column
.Tag = 2, // Value column

Copilot uses AI. Check for mistakes.
continue;
}
if (statEval->EstimateSize() >= 4_MB) {
// To avoid: Error: ydb/library/yql/dq/runtime/dq_output_channel.cpp:405: Row data size is too big: 53839241 bytes, exceeds limit of 50331648 bytes
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] This comment contains a very specific error message from ydb/library/yql/dq/runtime/dq_output_channel.cpp:405. While it explains why the limit exists, it would be clearer to include a general explanation like "Exceeds DQ output channel size limit" rather than embedding an internal error message path that may become outdated if the code is refactored.

Suggested change
// To avoid: Error: ydb/library/yql/dq/runtime/dq_output_channel.cpp:405: Row data size is too big: 53839241 bytes, exceeds limit of 50331648 bytes
// To avoid exceeding DQ output channel size limit

Copilot uses AI. Check for mistakes.
size_t ColumnCount() const {
return Columns.size();
const double c = 10;
const double eps = (c - 1) * (1 + std::log10(n / ndv)) / ndv;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If n == 0 , then std::log10(0) = -infinity

}
const double n = simpleStats.GetCount();
const double ndv = simpleStats.GetCountDistinct();
if (ndv == 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (ndv == 0) {
if (n == 0 || ndv == 0 || ndv > n) {

if (!RequestedColumnTags.empty()) {
for (const auto& colTag : RequestedColumnTags) {
auto colIt = tag2Column.find(colTag);
if (colIt == tag2Column.end()) {
// Column probably already deleted, skip it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if all the columns are deleted? Columns will be empty.
A check should be added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants