Skip to content

Conversation

@Vladilen
Copy link
Collaborator

@Vladilen Vladilen commented Dec 5, 2025

Changelog entry

...

Changelog category

  • Not for changelog (changelog entry is not required)

Description for reviewers

...

Copilot AI review requested due to automatic review settings December 5, 2025 17:51
@Vladilen Vladilen requested review from a team as code owners December 5, 2025 17:51
@Vladilen Vladilen requested review from XJIE6 and dorooleg December 5, 2025 17:52
@ydbot
Copy link
Collaborator

ydbot commented Dec 5, 2025

Run Extra Tests

Run additional tests for this PR. You can customize:

  • Test Size: small, medium, large (default: all)
  • Test Targets: any directory path (default: ydb/)
  • Sanitizers: ASAN, MSAN, TSAN
  • Coredumps: enable for debugging (default: off)
  • Additional args: custom ya make arguments

▶  Run tests

@github-actions
Copy link

github-actions bot commented Dec 5, 2025

2025-12-05 17:55:36 UTC Pre-commit check linux-x86_64-relwithdebinfo for cc62a54 has started.
2025-12-05 17:55:53 UTC Artifacts will be uploaded here
2025-12-05 17:58:05 UTC ya make is running...
🟡 2025-12-05 20:09:18 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
39469 36519 0 3 2919 28

2025-12-05 20:09:30 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-12-05 20:20:50 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
72 (only retried tests) 60 0 0 0 12

🟢 2025-12-05 20:20:57 UTC Build successful.
🟡 2025-12-05 20:21:16 UTC ydbd size 2.3 GiB changed* by +234.4 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: fe2b581 merge: cc62a54 diff diff %
ydbd size 2 466 152 008 Bytes 2 466 392 024 Bytes +234.4 KiB +0.010%
ydbd stripped size 524 843 136 Bytes 524 865 152 Bytes +21.5 KiB +0.004%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions
Copy link

github-actions bot commented Dec 5, 2025

2025-12-05 17:55:43 UTC Pre-commit check linux-x86_64-release-asan for cc62a54 has started.
2025-12-05 17:56:00 UTC Artifacts will be uploaded here
2025-12-05 17:58:06 UTC ya make is running...
🟡 2025-12-05 19:42:09 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
12063 11998 0 48 7 10

🟢 2025-12-05 19:42:19 UTC Build successful.
🟡 2025-12-05 19:42:43 UTC ydbd size 3.8 GiB changed* by +331.0 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: fe2b581 merge: cc62a54 diff diff %
ydbd size 4 127 809 296 Bytes 4 128 148 208 Bytes +331.0 KiB +0.008%
ydbd stripped size 1 532 692 760 Bytes 1 532 767 704 Bytes +73.2 KiB +0.005%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions
Copy link

github-actions bot commented Dec 5, 2025

🟢 2025-12-05 17:55:48 UTC The validation of the Pull Request description is successful.

Copilot finished reviewing on behalf of Vladilen December 5, 2025 17:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes JSON path handling by implementing a proper JSONPath parser and accessor system, replacing the previous workaround that manually stripped $. prefixes from paths. The changes introduce a new TJsonPathAccessor and TJsonPathAccessorTrie infrastructure for correctly parsing and navigating JSON paths according to RFC 9535 standards.

Key Changes:

  • Introduced new JSON path processing infrastructure using YQL's jsonpath parser instead of manual string manipulation
  • Removed the workaround in query optimization that stripped quotes from simple JSON paths
  • Updated JSON subcolumn extraction to properly handle quoted keys and special characters in member names

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
json_value_path.h/cpp New core JSON path parsing and accessor infrastructure
kqp_opt_phy_olap_filter.cpp Removed manual JSON path quote-stripping workaround
kernel_logic.h/cpp Updated to use new JSON path extraction with proper error handling
json_ut.cpp Added comprehensive tests for special characters and path parsing
ut_sub_columns.cpp Added unit tests for JSON path trie and accessor functionality
stats.h Updated key lookup to use JSON path trie instead of string comparison
partial.h/cpp Changed to return TJsonPathAccessor with remaining path support
others_storage.h/cpp Updated accessor methods to use new JSON path infrastructure
columns_storage.h Modified GetPathAccessor to use trie-based lookup
accessor.h/cpp Refactored TJsonRestorer to use proper JSON path parsing
direct_builder.cpp Updated key string building to use QuoteJsonItem for proper escaping
data_extractor.cpp Restricted to only support top-level JSON objects
json_extractors.cpp Added handling for empty objects in JSON extraction
written.cpp Added debug logging (appears to be temporary)
composite_serial/accessor.cpp Fixed incorrect TConclusion boolean check
binary_json_value_view.h/cpp Exposed JsonNumberToString as public static method
ya.make Added jsonpath dependency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +49 to +50
// TODO: return failure ?
return NAccessor::TTrivialArray::BuildEmpty(std::make_shared<arrow::StringType>());
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO comment without context or tracking information. The comment suggests uncertainty about whether returning a failure is appropriate here. This should either be resolved with a proper implementation or tracked with an issue number.

Copilot uses AI. Check for mistakes.
Comment on lines +268 to +269
// TODO: IDK why it is needed, re-check
// return std::make_shared<TSparsedArray>(nullptr, arrow::binary(), recordsCount);
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO comment with uncertainty and lacking tracking information. The comment "IDK why it is needed, re-check" suggests the author is unsure about the commented-out code's purpose. Either remove the commented code if it's not needed, or properly document why it might be needed in the future.

Suggested change
// TODO: IDK why it is needed, re-check
// return std::make_shared<TSparsedArray>(nullptr, arrow::binary(), recordsCount);

Copilot uses AI. Check for mistakes.
Comment on lines +58 to +59
// TODO: return failure ?
return NAccessor::TTrivialArray::BuildEmpty(std::make_shared<arrow::StringType>());
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate TODO comment with identical text and uncertainty. This appears in multiple locations with the same issue - lack of clarity about whether returning a failure is appropriate.

Copilot uses AI. Check for mistakes.
}
const auto& nodes = result.GetNodes();
if (nodes.size() != 1) {
// TODO: Find case when it is needed and maybe support
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO comment lacking context. What specific case needs to be found? When would nodes.size() != 1 be needed and how should it be supported? Add tracking information or remove if not needed.

Suggested change
// TODO: Find case when it is needed and maybe support

Copilot uses AI. Check for mistakes.
Comment on lines +184 to +190
visitor(nodes[i].GetBool() ? "true" : "false");
break;
case NYql::NJsonPath::EValueType::Number:
visitor(::ToString(nodes[i].GetNumber()));
break;
case NYql::NJsonPath::EValueType::String:
visitor(nodes[i].GetString());
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Variable name mismatch - using i instead of 0 in array index. The loop variable is not defined in this scope; should use index 0 since nodes.size() is checked to be 1 on line 177.

Lines 184, 187, 190 should use nodes[0] instead of nodes[i].

Suggested change
visitor(nodes[i].GetBool() ? "true" : "false");
break;
case NYql::NJsonPath::EValueType::Number:
visitor(::ToString(nodes[i].GetNumber()));
break;
case NYql::NJsonPath::EValueType::String:
visitor(nodes[i].GetString());
visitor(nodes[0].GetBool() ? "true" : "false");
break;
case NYql::NJsonPath::EValueType::Number:
visitor(::ToString(nodes[0].GetNumber()));
break;
case NYql::NJsonPath::EValueType::String:
visitor(nodes[0].GetString());

Copilot uses AI. Check for mistakes.
Comment on lines +48 to +64
auto jsonPathAccessorTrie = std::make_shared<NKikimr::NArrow::NAccessor::NSubColumns::TJsonPathAccessorTrie>();
for (ui32 i = 0; i < DataNames->length(); ++i) {
const auto arrView = DataNames->GetView(i);
if (std::string_view(arrView.data(), arrView.size()) == keyName) {
return i;
}
auto insertResult = jsonPathAccessorTrie->Insert(ToJsonPath(TStringBuf(arrView.data(), arrView.size())), nullptr, i);
AFL_VERIFY(insertResult.IsSuccess())("error", insertResult.GetErrorMessage());
}
return std::nullopt;
auto accessorResult = jsonPathAccessorTrie->GetAccessor(ToJsonPath(keyName));
if (accessorResult.IsFail()) {
return std::nullopt;
}

auto accessor = accessorResult.DetachResult();
if (!accessor) {
return std::nullopt;
}

return accessor->GetCookie();
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name 'i' in the loop should be consistent with the convention used elsewhere in the codebase. More importantly, this creates inefficiency - a new TJsonPathAccessorTrie is created on every call to GetKeyIndexOptional, and all data names are inserted into it each time. This operation is O(n*m) where n is the number of data names and m is the average path length. Consider caching the trie or using a more efficient lookup structure.

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +28
TConclusion<std::shared_ptr<TJsonPathAccessor>> GetPathAccessor(const std::string_view path) const {
auto jsonPathAccessorTrie = std::make_shared<NKikimr::NArrow::NAccessor::NSubColumns::TJsonPathAccessorTrie>();
for (ui32 i = 0; i < Stats.GetColumnsCount(); ++i) {
auto insertResult = jsonPathAccessorTrie->Insert(ToSubcolumnName(Stats.GetColumnName(i)), Records->GetColumnVerified(i));
AFL_VERIFY(insertResult.IsSuccess())("error", insertResult.GetErrorMessage());
}
return jsonPathAccessorTrie->GetAccessor(path);
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar performance issue: A new TJsonPathAccessorTrie is created and populated on every call to GetPathAccessor. This is inefficient for repeated lookups. Consider caching the trie as a member variable or using a more efficient lookup structure.

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +40
auto jsonPathAccessorTrie = std::make_shared<NKikimr::NArrow::NAccessor::NSubColumns::TJsonPathAccessorTrie>();
auto headerStats = Header.GetColumnStats();
for (ui32 i = 0; i < headerStats.GetDataNamesCount(); ++i) {
if (PartialColumnsData.HasColumn(i)) {
auto insertResult = jsonPathAccessorTrie->Insert(NSubColumns::ToJsonPath(headerStats.GetColumnName(i)), PartialColumnsData.GetAccessorVerified(i));
AFL_VERIFY(insertResult.IsSuccess())("error", insertResult.GetErrorMessage());
}
}

auto accessorResult = jsonPathAccessorTrie->GetAccessor(svPath);
if (accessorResult.IsSuccess() && accessorResult.GetResult()->IsValid()) {
return accessorResult;
}

if (OthersData) {
return OthersData->GetPathAccessor(svPath, recordsCount);
} else {
AFL_VERIFY(!Header.GetOtherStats().GetKeyIndexOptional(svPath));
return std::make_shared<TTrivialArray>(TThreadSimpleArraysCache::GetNull(arrow::binary(), recordsCount));
}

return std::shared_ptr<NSubColumns::TJsonPathAccessor>{};
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar performance issue: A new TJsonPathAccessorTrie is created and populated on every call. This repeats the inefficiency pattern seen in other files.

Copilot uses AI. Check for mistakes.
Variator::ToExecutor(Variator::SingleScript(Sprintf(__SCRIPT_CONTENT.c_str(), injection.c_str()))).Execute();
}

// TODO: fix if top-level arrays are needed
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete TODO comment lacks context. What specific scenario needs fixing? What are "top-level arrays"? This should either be completed with a proper explanation and tracking information (e.g., issue number) or removed if the feature is definitively not needed.

Suggested change
// TODO: fix if top-level arrays are needed

Copilot uses AI. Check for mistakes.
@github-actions
Copy link

github-actions bot commented Dec 5, 2025

2025-12-05 23:29:55 UTC Pre-commit check linux-x86_64-release-asan for fb907a4 has started.
2025-12-05 23:30:04 UTC Artifacts will be uploaded here
2025-12-05 23:31:46 UTC ya make is running...
🟡 2025-12-06 00:49:18 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
11924 11869 0 39 7 9

🟢 2025-12-06 00:49:27 UTC Build successful.
🟡 2025-12-06 00:49:50 UTC ydbd size 3.8 GiB changed* by +327.5 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 435a374 merge: fb907a4 diff diff %
ydbd size 4 127 839 576 Bytes 4 128 174 928 Bytes +327.5 KiB +0.008%
ydbd stripped size 1 532 698 520 Bytes 1 532 774 872 Bytes +74.6 KiB +0.005%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions
Copy link

github-actions bot commented Dec 5, 2025

2025-12-05 23:34:40 UTC Pre-commit check linux-x86_64-relwithdebinfo for fb907a4 has started.
2025-12-05 23:34:53 UTC Artifacts will be uploaded here
2025-12-05 23:36:23 UTC ya make is running...
🟡 2025-12-06 01:17:42 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
39330 36378 0 3 2928 21

2025-12-06 01:17:53 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-12-06 01:30:29 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
75 (only retried tests) 59 0 0 0 16

🟢 2025-12-06 01:30:35 UTC Build successful.
🟡 2025-12-06 01:30:57 UTC ydbd size 2.3 GiB changed* by +229.4 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 435a374 merge: fb907a4 diff diff %
ydbd size 2 466 176 648 Bytes 2 466 411 552 Bytes +229.4 KiB +0.010%
ydbd stripped size 524 848 192 Bytes 524 869 376 Bytes +20.7 KiB +0.004%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants