Add a debug option to check if memory allocation is covered by retry framework #13995
Conversation
…framework Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Greptile Summary
This PR adds a debug feature to detect memory allocations that are not protected by the retry framework (`withRetry`/`withRetryNoSplit` blocks).
Key Design Decisions:
The feature helps developers identify OOM-prone code paths that should be wrapped in retry logic, addressing issue #13672.
Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant App as Application Code
    participant RRI as RmmRapidsRetryIterator
    participant RST as RetryStateTracker
    participant Alloc as Allocator (Host/Device)
    participant ARTC as AllocationRetryCoverageTracker
    participant RMM as RMM (Debug Mode)
    participant CSV as CSV Output File

    Note over App,CSV: Enable tracking: export SPARK_RAPIDS_RETRY_COVERAGE_TRACKING=true
    App->>RRI: withRetryNoSplit(data)(fn)
    activate RRI
    RRI->>RST: enterRetryBlock()
    activate RST
    RST->>RST: increment localRetryBlockDepth
    deactivate RST
    RRI->>App: execute fn(data)
    activate App
    App->>Alloc: allocate memory
    activate Alloc
    alt Device Memory Allocation
        Alloc->>RMM: allocation occurs
        RMM->>ARTC: onAllocated() callback
    else Host Memory Allocation
        Alloc->>ARTC: checkHostAllocation()
    end
    activate ARTC
    ARTC->>RST: isInRetryBlock()
    RST-->>ARTC: return depth > 0 (true)
    Note over ARTC: Allocation is COVERED - no logging
    deactivate ARTC
    Alloc-->>App: memory allocated
    deactivate Alloc
    App-->>RRI: result
    deactivate App
    RRI->>RST: exitRetryBlock()
    activate RST
    RST->>RST: decrement localRetryBlockDepth
    deactivate RST
    deactivate RRI

    Note over App,CSV: Scenario: Uncovered Allocation
    App->>Alloc: direct allocation (no retry wrapper)
    activate Alloc
    alt Device Memory
        Alloc->>RMM: allocation occurs
        RMM->>ARTC: onAllocated() callback
    else Host Memory
        Alloc->>ARTC: checkHostAllocation()
    end
    activate ARTC
    ARTC->>RST: isInRetryBlock()
    RST-->>ARTC: return depth == 0 (false)
    ARTC->>ARTC: capture stack trace
    ARTC->>ARTC: filter to Rapids classes
    ARTC->>ARTC: deduplicate by stackKey
    ARTC->>CSV: write uncovered allocation record
    Note over CSV: kind,call_stack
    deactivate ARTC
    Alloc-->>App: memory allocated
    deactivate Alloc
```
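The enter/exit depth tracking in the diagram can be sketched as follows. This is a minimal Java illustration of the mechanism only; the PR's actual code is Scala, and the class and method bodies here are assumptions beyond the names shown in the diagram.

```java
import java.util.function.Supplier;

// Minimal sketch of the depth-based tracking shown in the diagram: a
// thread-local counter is incremented on enterRetryBlock() and decremented
// on exitRetryBlock(), so isInRetryBlock() is simply depth > 0.
final class RetryBlockTracker {
    private static final ThreadLocal<Integer> depth = ThreadLocal.withInitial(() -> 0);

    static void enterRetryBlock() { depth.set(depth.get() + 1); }
    static void exitRetryBlock()  { depth.set(depth.get() - 1); }
    static boolean isInRetryBlock() { return depth.get() > 0; }

    // The exit must run even when fn throws, hence the finally block.
    static <T> T withRetryBlock(Supplier<T> fn) {
        enterRetryBlock();
        try {
            return fn.get();
        } finally {
            exitRetryBlock();
        }
    }
}
```

Because the counter is a depth rather than a boolean, nested retry blocks remain correctly marked as covered until the outermost block exits.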
5 files reviewed, no comments
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Resulting files generated by the Python script, which can be opened at https://www.speedscope.app/: uncovered_allocations_DEVICE.speedscope.json, uncovered_allocations_HOST.speedscope.json
Contributes to NVIDIA/spark-rapids#13672. For debugging, I'd like to expose an `enableDebug` parameter in `RmmSpark.setEventHandler` to check if every memory allocation is covered by `withRetry` in spark-rapids. See NVIDIA/spark-rapids#13995.
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
5 files reviewed, no comments
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Additional Comments (2)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, lines 122-127: logic: the detection logic will incorrectly flag allocations inside `OOMRetryState` class methods as covered, even though `OOMRetryState` isn't a retry framework method; it's just a helper class that happens to have "Retry" in its name
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, lines 146-153: logic: check-then-act race condition: between checking `!loggedStacks.contains(stackKey)` on line 146 and calling `loggedStacks.add(stackKey)` on line 147, another thread could add the same key, causing both threads to log duplicate entries to the file. Is duplicate logging acceptable in this debugging tool, or should we ensure strictly unique entries?
5 files reviewed, 2 comments
Should be ok.
It's acceptable.
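If strictly unique entries were ever wanted, the check-then-act pair can collapse into a single atomic step, because a concurrent set's `add` reports whether the element was newly inserted. A hedged Java sketch of that alternative (the PR keeps the simpler check, since duplicate logging is acceptable for this debugging tool; the class and method names here are illustrative):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// ConcurrentHashMap.newKeySet() gives a thread-safe set whose add() returns
// true only for the first thread that inserts a given key, eliminating the
// contains()/add() race described above.
final class DedupLog {
    private final Set<String> loggedStacks = ConcurrentHashMap.newKeySet();

    // Returns true exactly once per distinct stack key, across all threads.
    boolean shouldLog(String stackKey) {
        return loggedStacks.add(stackKey);
    }
}
```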
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
abellina left a comment:
Note we need to upgrade copyrights (2026)
I like the idea, had one comment.
```scala
// Note: Scala often generates methods like `$anonfun$withRetryNoSplit$1`, so we match by
// substring rather than exact method name.
// Also exclude AllocationRetryCoverageTracker itself since it's always in the stack.
val hasCoverage = stackTrace.exists { element =>
```
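A hedged reconstruction of this substring-matching check, for context only: the review below moves away from it in favor of `RetryStateTracker`, and the exact predicate is an assumption beyond what the quoted comment states.

```java
import java.util.Arrays;

// Scan the stack for retry-framework frames by substring, since Scala emits
// synthetic names like $anonfun$withRetryNoSplit$1; skip the tracker's own
// frames, which are always present at the check site.
final class StackCoverageCheck {
    static boolean hasCoverage(StackTraceElement[] stackTrace) {
        return Arrays.stream(stackTrace)
            .filter(e -> !e.getClassName().contains("AllocationRetryCoverageTracker"))
            .anyMatch(e -> e.getMethodName().contains("withRetry"));
    }
}
```

The drawback this illustrates: coverage is inferred from frame names, so any helper whose name happens to contain the substring is misclassified, which is what motivates tracking state explicitly instead.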
Should we, instead of pattern matching, use the RetryStateTracker? It presently tracks whether the current thread is retrying, but could it also track when it is in a retry block? We could also gate that on whether we have enabled the feature.
Yes that's a better solution, thanks. Updated.
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Pull request overview
This PR introduces a debugging feature to track memory allocations that are not covered by the retry framework (withRetry/withRetryNoSplit blocks), helping developers identify code paths that could potentially cause OOM errors without proper retry handling.
Key changes:
- Adds AllocationRetryCoverageTracker to detect and log uncovered host and device memory allocations
- Enhances RetryStateTracker with depth-based tracking to detect if code is executing inside a retry block
- Integrates tracking hooks into HostAlloc for host memory and DeviceMemoryEventHandler for device memory
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala | New utility class that tracks memory allocations not covered by retry framework, logs findings to CSV file |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala | Adds retry block depth tracking to RetryStateTracker and wraps iterator next() with enter/exit calls |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostAlloc.scala | Integrates coverage tracker check after successful host memory allocation |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala | Enables RMM debug mode when retry coverage tracking is enabled to get allocation callbacks |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/DeviceMemoryEventHandler.scala | Implements onAllocated callback to check device memory allocation coverage |
| integration_tests/run_pyspark_from_build.sh | Propagates SPARK_RAPIDS_RETRY_COVERAGE_TRACKING environment variable to test executors |
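The "filter to Rapids classes" and CSV-record steps from the sequence diagram could look roughly like this. The package prefix and the `kind,call_stack` record shape follow the diagram; the class name, frame separator, and everything else are assumptions, not the PR's actual code.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Build one "kind,call_stack" CSV record for an uncovered allocation,
// keeping only frames from the plugin's own packages so records stay readable.
final class UncoveredAllocationRecord {
    static String toCsvLine(String kind, StackTraceElement[] stack) {
        String callStack = Arrays.stream(stack)
            .filter(e -> e.getClassName().startsWith("com.nvidia.spark.rapids"))
            .map(e -> e.getClassName() + "." + e.getMethodName())
            .collect(Collectors.joining(";"));
        return kind + "," + callStack;
    }
}
```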
Additional Comments (1)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala, lines 585-587: logic: `clearCurThreadRetrying()` called here may conflict with the same call at line 755 in the parent class `RmmRapidsRetryIterator.next()`. The parent's `next()` already handles clearing retry state after completing all retry attempts. Calling it here in the finally block means it will execute even on successful paths, potentially clearing state prematurely or redundantly.
6 files reviewed, 1 comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Additional Comments (1)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala, lines 586-587: logic: `clearCurThreadRetrying()` should be called before `exitRetryBlock()` only on the exception path. On the success path, the parent's `next()` already manages the `setCurThreadRetrying` state. Calling clear here may incorrectly reset retry state for nested retry blocks.
6 files reviewed, 1 comment
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Overall it looks good to me.
Can you add some example lines from the CSV file generated by your local tests to the description?
Better to have more reviews from others.
Thanks. I added the results CSV file to the description.
abellina left a comment:
I ran this tool locally and I think this is a must. Thanks for doing this.

Contributes to #13672
Depends on NVIDIA/spark-rapids-jni#4061

Description
This PR adds a debug option to check whether memory allocations are covered by the retry framework, so we can find code paths that perform a host/device memory allocation without retry coverage, which can potentially cause OOM.
The option is off by default and should only be useful for developers. We can run it from time to time to find uncovered cases, or set up a CI job to run it periodically.
Here is the Python script to convert the CSV to a flamegraph: https://gist.github.com/thirtiseven/f72e14c1a860dc40a75f1351c3616a68
CSV result file generated by this PR:
uncovered_allocations.csv
with the following command: