-
Notifications
You must be signed in to change notification settings - Fork 738
Fix JSON paths #30243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix JSON paths #30243
Conversation
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
🟢 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes JSON path handling by implementing a proper JSONPath parser and accessor system, replacing the previous workaround that manually stripped $. prefixes from paths. The changes introduce a new TJsonPathAccessor and TJsonPathAccessorTrie infrastructure for correctly parsing and navigating JSON paths according to RFC 9535 standards.
Key Changes:
- Introduced new JSON path processing infrastructure using YQL's jsonpath parser instead of manual string manipulation
- Removed the workaround in query optimization that stripped quotes from simple JSON paths
- Updated JSON subcolumn extraction to properly handle quoted keys and special characters in member names
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| json_value_path.h/cpp | New core JSON path parsing and accessor infrastructure |
| kqp_opt_phy_olap_filter.cpp | Removed manual JSON path quote-stripping workaround |
| kernel_logic.h/cpp | Updated to use new JSON path extraction with proper error handling |
| json_ut.cpp | Added comprehensive tests for special characters and path parsing |
| ut_sub_columns.cpp | Added unit tests for JSON path trie and accessor functionality |
| stats.h | Updated key lookup to use JSON path trie instead of string comparison |
| partial.h/cpp | Changed to return TJsonPathAccessor with remaining path support |
| others_storage.h/cpp | Updated accessor methods to use new JSON path infrastructure |
| columns_storage.h | Modified GetPathAccessor to use trie-based lookup |
| accessor.h/cpp | Refactored TJsonRestorer to use proper JSON path parsing |
| direct_builder.cpp | Updated key string building to use QuoteJsonItem for proper escaping |
| data_extractor.cpp | Restricted to only support top-level JSON objects |
| json_extractors.cpp | Added handling for empty objects in JSON extraction |
| written.cpp | Added debug logging (appears to be temporary) |
| composite_serial/accessor.cpp | Fixed incorrect TConclusion boolean check |
| binary_json_value_view.h/cpp | Exposed JsonNumberToString as public static method |
| ya.make | Added jsonpath dependency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // TODO: return failure ? | ||
| return NAccessor::TTrivialArray::BuildEmpty(std::make_shared<arrow::StringType>()); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO comment without context or tracking information. The comment suggests uncertainty about whether returning a failure is appropriate here. This should either be resolved with a proper implementation or tracked with an issue number.
| // TODO: IDK why it is needed, re-check | ||
| // return std::make_shared<TSparsedArray>(nullptr, arrow::binary(), recordsCount); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO comment with uncertainty and lacking tracking information. The comment "IDK why it is needed, re-check" suggests the author is unsure about the commented-out code's purpose. Either remove the commented code if it's not needed, or properly document why it might be needed in the future.
| // TODO: IDK why it is needed, re-check | |
| // return std::make_shared<TSparsedArray>(nullptr, arrow::binary(), recordsCount); |
| // TODO: return failure ? | ||
| return NAccessor::TTrivialArray::BuildEmpty(std::make_shared<arrow::StringType>()); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate TODO comment with identical text and uncertainty. This appears in multiple locations with the same issue - lack of clarity about whether returning a failure is appropriate.
| } | ||
| const auto& nodes = result.GetNodes(); | ||
| if (nodes.size() != 1) { | ||
| // TODO: Find case when it is needed and maybe support |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO comment lacking context. What specific case needs to be found? When would nodes.size() != 1 be needed and how should it be supported? Add tracking information or remove if not needed.
| // TODO: Find case when it is needed and maybe support |
| visitor(nodes[i].GetBool() ? "true" : "false"); | ||
| break; | ||
| case NYql::NJsonPath::EValueType::Number: | ||
| visitor(::ToString(nodes[i].GetNumber())); | ||
| break; | ||
| case NYql::NJsonPath::EValueType::String: | ||
| visitor(nodes[i].GetString()); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Variable name mismatch - using i instead of 0 in array index. The loop variable is not defined in this scope; should use index 0 since nodes.size() is checked to be 1 on line 177.
Lines 184, 187, 190 should use nodes[0] instead of nodes[i].
| visitor(nodes[i].GetBool() ? "true" : "false"); | |
| break; | |
| case NYql::NJsonPath::EValueType::Number: | |
| visitor(::ToString(nodes[i].GetNumber())); | |
| break; | |
| case NYql::NJsonPath::EValueType::String: | |
| visitor(nodes[i].GetString()); | |
| visitor(nodes[0].GetBool() ? "true" : "false"); | |
| break; | |
| case NYql::NJsonPath::EValueType::Number: | |
| visitor(::ToString(nodes[0].GetNumber())); | |
| break; | |
| case NYql::NJsonPath::EValueType::String: | |
| visitor(nodes[0].GetString()); |
| auto jsonPathAccessorTrie = std::make_shared<NKikimr::NArrow::NAccessor::NSubColumns::TJsonPathAccessorTrie>(); | ||
| for (ui32 i = 0; i < DataNames->length(); ++i) { | ||
| const auto arrView = DataNames->GetView(i); | ||
| if (std::string_view(arrView.data(), arrView.size()) == keyName) { | ||
| return i; | ||
| } | ||
| auto insertResult = jsonPathAccessorTrie->Insert(ToJsonPath(TStringBuf(arrView.data(), arrView.size())), nullptr, i); | ||
| AFL_VERIFY(insertResult.IsSuccess())("error", insertResult.GetErrorMessage()); | ||
| } | ||
| return std::nullopt; | ||
| auto accessorResult = jsonPathAccessorTrie->GetAccessor(ToJsonPath(keyName)); | ||
| if (accessorResult.IsFail()) { | ||
| return std::nullopt; | ||
| } | ||
|
|
||
| auto accessor = accessorResult.DetachResult(); | ||
| if (!accessor) { | ||
| return std::nullopt; | ||
| } | ||
|
|
||
| return accessor->GetCookie(); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name 'i' in the loop should be consistent with the convention used elsewhere in the codebase. More importantly, this creates inefficiency - a new TJsonPathAccessorTrie is created on every call to GetKeyIndexOptional, and all data names are inserted into it each time. This operation is O(n*m) where n is the number of data names and m is the average path length. Consider caching the trie or using a more efficient lookup structure.
| TConclusion<std::shared_ptr<TJsonPathAccessor>> GetPathAccessor(const std::string_view path) const { | ||
| auto jsonPathAccessorTrie = std::make_shared<NKikimr::NArrow::NAccessor::NSubColumns::TJsonPathAccessorTrie>(); | ||
| for (ui32 i = 0; i < Stats.GetColumnsCount(); ++i) { | ||
| auto insertResult = jsonPathAccessorTrie->Insert(ToSubcolumnName(Stats.GetColumnName(i)), Records->GetColumnVerified(i)); | ||
| AFL_VERIFY(insertResult.IsSuccess())("error", insertResult.GetErrorMessage()); | ||
| } | ||
| return jsonPathAccessorTrie->GetAccessor(path); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar performance issue: A new TJsonPathAccessorTrie is created and populated on every call to GetPathAccessor. This is inefficient for repeated lookups. Consider caching the trie as a member variable or using a more efficient lookup structure.
| auto jsonPathAccessorTrie = std::make_shared<NKikimr::NArrow::NAccessor::NSubColumns::TJsonPathAccessorTrie>(); | ||
| auto headerStats = Header.GetColumnStats(); | ||
| for (ui32 i = 0; i < headerStats.GetDataNamesCount(); ++i) { | ||
| if (PartialColumnsData.HasColumn(i)) { | ||
| auto insertResult = jsonPathAccessorTrie->Insert(NSubColumns::ToJsonPath(headerStats.GetColumnName(i)), PartialColumnsData.GetAccessorVerified(i)); | ||
| AFL_VERIFY(insertResult.IsSuccess())("error", insertResult.GetErrorMessage()); | ||
| } | ||
| } | ||
|
|
||
| auto accessorResult = jsonPathAccessorTrie->GetAccessor(svPath); | ||
| if (accessorResult.IsSuccess() && accessorResult.GetResult()->IsValid()) { | ||
| return accessorResult; | ||
| } | ||
|
|
||
| if (OthersData) { | ||
| return OthersData->GetPathAccessor(svPath, recordsCount); | ||
| } else { | ||
| AFL_VERIFY(!Header.GetOtherStats().GetKeyIndexOptional(svPath)); | ||
| return std::make_shared<TTrivialArray>(TThreadSimpleArraysCache::GetNull(arrow::binary(), recordsCount)); | ||
| } | ||
|
|
||
| return std::shared_ptr<NSubColumns::TJsonPathAccessor>{}; |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar performance issue: A new TJsonPathAccessorTrie is created and populated on every call. This repeats the inefficiency pattern seen in other files.
| Variator::ToExecutor(Variator::SingleScript(Sprintf(__SCRIPT_CONTENT.c_str(), injection.c_str()))).Execute(); | ||
| } | ||
|
|
||
| // TODO: fix if top-level arrays are needed |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete TODO comment lacks context. What specific scenario needs fixing? What are "top-level arrays"? This should either be completed with a proper explanation and tracking information (e.g., issue number) or removed if the feature is definitively not needed.
| // TODO: fix if top-level arrays are needed |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
Changelog entry
...
Changelog category
Description for reviewers
...