
Conversation

@thirtiseven thirtiseven commented Dec 11, 2025

Contributes to #13672
Depends on NVIDIA/spark-rapids-jni#4061

Description

This PR adds a debug option to check whether memory allocations are covered by the retry framework, so we can find code that performs a host/device memory allocation without retry coverage, which can potentially cause OOM.

The option is off by default and should only be useful to developers. We can run it from time to time to find uncovered cases, or set up a CI job to run it periodically.

Here is the Python script to convert the CSV to a flamegraph: https://gist.github.com/thirtiseven/f72e14c1a860dc40a75f1351c3616a68

CSV result file generated by this PR:

uncovered_allocations.csv

with the following command:
SPARK_RAPIDS_RETRY_COVERAGE_TRACKING=true RANDOM_SELECT=0.05 ./integration_tests/run_pyspark_from_build.sh
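The linked gist handles the CSV-to-flamegraph conversion; as a rough illustration of the idea (the `kind,call_stack` column layout matches what this PR emits, but the `;` frame separator and root-first frame order here are assumptions, not the gist's exact behavior), a script like this folds the CSV into the collapsed-stack format that flamegraph tooling consumes:

```python
import csv
from collections import Counter
from io import StringIO

def to_collapsed_stacks(csv_text):
    """Fold a kind,call_stack CSV into collapsed-stack lines
    ("frame;frame;frame count") for flamegraph tooling.

    Assumes each call_stack cell already joins frames with ';',
    root frame first; adjust if the real file differs.
    """
    counts = Counter()
    for row in csv.DictReader(StringIO(csv_text)):
        # Prefix each stack with the allocation kind (host/device)
        # so the two show up as separate towers in the flamegraph.
        counts[f"{row['kind']};{row['call_stack']}"] += 1
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

if __name__ == "__main__":
    sample = (
        "kind,call_stack\n"
        "host,HostAlloc.alloc;GpuSomething.doIt\n"
        "host,HostAlloc.alloc;GpuSomething.doIt\n"
        "device,DeviceMemoryEventHandler.onAllocated;GpuOther.run\n"
    )
    for line in to_collapsed_stacks(sample):
        print(line)
```

Duplicate rows collapse into a single line with a count, which is exactly the weight a flamegraph uses for frame width.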

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
…framework

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven added the reliability Features to improve reliability or bugs that severely impact the reliability of the plugin label Dec 11, 2025
@thirtiseven thirtiseven marked this pull request as draft December 11, 2025 08:44

greptile-apps bot commented Dec 11, 2025

Greptile Summary

This PR adds a debug feature to detect memory allocations that are not protected by the retry framework (withRetry/withRetryNoSplit). The implementation introduces:

  • AllocationRetryCoverageTracker: A new tracking utility that captures stack traces of uncovered allocations and logs them to a CSV file. Enabled via the SPARK_RAPIDS_RETRY_COVERAGE_TRACKING environment variable (disabled by default).

  • RetryStateTracker enhancements: Added isInRetryBlock() method with a thread-local depth counter to properly track nested retry blocks. The depth tracking ensures correct behavior when retry blocks are nested.

  • Integration points:

    • Device allocations are tracked via RMM's onAllocated() callback (requires RMM debug mode)
    • Host allocations are tracked in HostAlloc after successful allocation
    • The shell script propagates the environment variable to executors

Key Design Decisions:

  • Off by default to avoid performance overhead in production
  • Uses thread-local depth counters to handle nested retry blocks correctly
  • Deduplicates logged stacks using ConcurrentHashMap to reduce noise
  • Conditionally enables RMM debug mode only when tracking is enabled
  • CSV output appends across multiple JVM processes for integration test runs

The feature helps developers identify OOM-prone code paths that should be wrapped in retry logic, addressing issue #13672.
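The actual implementation is Scala inside the plugin; a minimal Python sketch of the thread-local depth-counter idea described above (names here are illustrative stand-ins, not the plugin's API) looks like:

```python
import threading

class RetryStateTracker:
    """Tracks, per thread, how deeply nested we are inside retry blocks."""
    _local = threading.local()

    @classmethod
    def _depth(cls):
        return getattr(cls._local, "depth", 0)

    @classmethod
    def enter_retry_block(cls):
        cls._local.depth = cls._depth() + 1

    @classmethod
    def exit_retry_block(cls):
        cls._local.depth = cls._depth() - 1

    @classmethod
    def is_in_retry_block(cls):
        # depth > 0 means some enclosing withRetry-style block is active,
        # which stays correct even when retry blocks are nested.
        return cls._depth() > 0

def with_retry_no_split(fn):
    """Toy stand-in for withRetryNoSplit: marks the block on this thread."""
    RetryStateTracker.enter_retry_block()
    try:
        return fn()
    finally:
        # finally keeps the counter balanced even when fn raises.
        RetryStateTracker.exit_retry_block()
```

An allocation hook then only needs `is_in_retry_block()` to decide whether to report the call stack, with no pattern matching on stack frames.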

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-designed with proper synchronization, gated behind an environment variable, and disabled by default. The changes are additive and don't modify existing behavior when the feature is disabled. Thread-local depth tracking correctly handles nested retry blocks. No logic or syntax issues found.
  • No files require special attention

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala New file adds debug tracking for memory allocations outside retry blocks; well-structured with proper synchronization and CSV output
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala Adds RetryStateTracker with thread-local depth counter to track nested retry blocks; properly integrates with iterator lifecycle
sql-plugin/src/main/scala/com/nvidia/spark/rapids/DeviceMemoryEventHandler.scala Adds onAllocated callback to check device allocation coverage when RMM debug mode is enabled

Sequence Diagram

sequenceDiagram
    participant App as Application Code
    participant RRI as RmmRapidsRetryIterator
    participant RST as RetryStateTracker
    participant Alloc as Allocator (Host/Device)
    participant ARTC as AllocationRetryCoverageTracker
    participant RMM as RMM (Debug Mode)
    participant CSV as CSV Output File

    Note over App,CSV: Enable tracking: export SPARK_RAPIDS_RETRY_COVERAGE_TRACKING=true

    App->>RRI: withRetryNoSplit(data)(fn)
    activate RRI
    RRI->>RST: enterRetryBlock()
    activate RST
    RST->>RST: increment localRetryBlockDepth
    deactivate RST
    
    RRI->>App: execute fn(data)
    activate App
    App->>Alloc: allocate memory
    activate Alloc
    
    alt Device Memory Allocation
        Alloc->>RMM: allocation occurs
        RMM->>ARTC: onAllocated() callback
    else Host Memory Allocation
        Alloc->>ARTC: checkHostAllocation()
    end
    
    activate ARTC
    ARTC->>RST: isInRetryBlock()
    RST-->>ARTC: return depth > 0 (true)
    Note over ARTC: Allocation is COVERED - no logging
    deactivate ARTC
    
    Alloc-->>App: memory allocated
    deactivate Alloc
    App-->>RRI: result
    deactivate App
    
    RRI->>RST: exitRetryBlock()
    activate RST
    RST->>RST: decrement localRetryBlockDepth
    deactivate RST
    deactivate RRI

    Note over App,CSV: Scenario: Uncovered Allocation

    App->>Alloc: direct allocation (no retry wrapper)
    activate Alloc
    
    alt Device Memory
        Alloc->>RMM: allocation occurs
        RMM->>ARTC: onAllocated() callback
    else Host Memory
        Alloc->>ARTC: checkHostAllocation()
    end
    
    activate ARTC
    ARTC->>RST: isInRetryBlock()
    RST-->>ARTC: return depth == 0 (false)
    ARTC->>ARTC: capture stack trace
    ARTC->>ARTC: filter to Rapids classes
    ARTC->>ARTC: deduplicate by stackKey
    ARTC->>CSV: write uncovered allocation record
    Note over CSV: kind,call_stack
    deactivate ARTC
    
    Alloc-->>App: memory allocated
    deactivate Alloc
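The uncovered-allocation path in the diagram (check coverage, capture and filter the stack, deduplicate, append a `kind,call_stack` row) can be sketched as follows. This is a Python stand-in for the Scala tracker; the frame filter, constructor parameters, and method names are assumptions based on the diagram, not the plugin's real API:

```python
import threading
import traceback

class AllocationRetryCoverageTracker:
    """Logs each distinct uncovered-allocation stack once as a kind,call_stack row."""

    def __init__(self, out_path, is_in_retry_block, frame_filter=None):
        self._out_path = out_path
        self._is_in_retry_block = is_in_retry_block  # e.g. a RetryStateTracker check
        self._frame_filter = frame_filter            # e.g. "rapids" to keep plugin frames
        self._logged = set()                         # dedupe: one row per stack key
        self._lock = threading.Lock()                # guards the set and the file

    def check_allocation(self, kind):
        if self._is_in_retry_block():
            return  # covered: an enclosing retry block will handle OOM for us
        stack = traceback.extract_stack()[:-1]       # drop this frame itself
        if self._frame_filter:
            stack = [f for f in stack if self._frame_filter in f.filename]
        key = (kind, ";".join(f.name for f in stack))
        with self._lock:
            if key in self._logged:
                return                               # already reported this stack
            self._logged.add(key)
            with open(self._out_path, "a") as out:   # append mode: rows accumulate
                out.write(f"{key[0]},{key[1]}\n")
```

Appending under a lock per process matches the diagram's behavior of accumulating rows across an integration-test run.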


@greptile-apps greptile-apps bot left a comment


5 files reviewed, no comments


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven changed the title [Do not review] Add a debug option to check if memory allocation is covered by retry framework Add a debug option to check if memory allocation is covered by retry framework Dec 12, 2025

@thirtiseven thirtiseven self-assigned this Dec 12, 2025
thirtiseven added a commit to NVIDIA/spark-rapids-jni that referenced this pull request Dec 16, 2025
Contributes to NVIDIA/spark-rapids#13672

For debugging, I'd like to expose an `enableDebug` parameter in
`RmmSpark.setEventHandler` to check if every memory allocation is
covered by withRetry in spark-rapids. See
NVIDIA/spark-rapids#13995


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven marked this pull request as ready for review December 16, 2025 09:52

@greptile-apps greptile-apps bot left a comment


5 files reviewed, no comments


@thirtiseven thirtiseven requested a review from a team December 17, 2025 10:12
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

@greptile-apps greptile-apps bot left a comment


Additional Comments (2)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, line 122-127 (link)

    logic: the detection logic will incorrectly flag allocations inside OOMRetryState class methods as covered, even though OOMRetryState isn't a retry framework method - it's just a helper class that happens to have "Retry" in its name

  2. sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, line 146-153 (link)

    logic: check-then-act race condition - between checking !loggedStacks.contains(stackKey) on line 146 and calling loggedStacks.add(stackKey) on line 147, another thread could add the same key, causing both threads to log duplicate entries to the file. is duplicate logging acceptable in this debugging tool, or should we ensure strictly unique entries?

5 files reviewed, 2 comments


@thirtiseven
Collaborator Author

Additional Comments (2)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, line 122-127 (link)

    logic: the detection logic will incorrectly flag allocations inside OOMRetryState class methods as covered, even though OOMRetryState isn't a retry framework method - it's just a helper class that happens to have "Retry" in its name

Should be ok.

  2. sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, line 146-153 (link)

    logic: check-then-act race condition - between checking !loggedStacks.contains(stackKey) on line 146 and calling loggedStacks.add(stackKey) on line 147, another thread could add the same key, causing both threads to log duplicate entries to the file. is duplicate logging acceptable in this debugging tool, or should we ensure strictly unique entries?

It's acceptable.
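The race the bot describes is benign for a debugging tool (worst case, a stack is logged twice). If strictly unique rows were ever wanted, the check and the insert can be collapsed into one atomic step; a Python sketch of that pattern (the Scala/Java equivalent would be something like `ConcurrentHashMap.newKeySet().add`, whose boolean result is the atomic "was it new?" answer):

```python
import threading

class OnceSet:
    """add() atomically answers 'was this key new?', closing the
    contains()-then-add() window the review comment points at."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add(self, key):
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
            return True

def log_once(once, sink, key):
    # Only the single thread that wins the atomic add writes the row.
    if once.add(key):
        sink.append(key)
```

With this shape there is no separate contains-check for another thread to slip between, so each stack key is written exactly once.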

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

@abellina abellina left a comment


Note we need to upgrade copyrights (2026)

I like the idea, had one comment.

// Note: Scala often generates methods like `$anonfun$withRetryNoSplit$1`, so we match by
// substring rather than exact method name.
// Also exclude AllocationRetryCoverageTracker itself since it's always in the stack.
val hasCoverage = stackTrace.exists { element =>

should we, instead of pattern matching, use the RetryStateTracker? It presently tracks if the current thread is retrying, but could it also track when it is in a retry block? We could also gate that on whether we have enabled the feature.

Collaborator Author

Yes that's a better solution, thanks. Updated.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Copilot AI left a comment


Pull request overview

This PR introduces a debugging feature to track memory allocations that are not covered by the retry framework (withRetry/withRetryNoSplit blocks), helping developers identify code paths that could potentially cause OOM errors without proper retry handling.

Key changes:

  • Adds AllocationRetryCoverageTracker to detect and log uncovered host and device memory allocations
  • Enhances RetryStateTracker with depth-based tracking to detect if code is executing inside a retry block
  • Integrates tracking hooks into HostAlloc for host memory and DeviceMemoryEventHandler for device memory

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala New utility class that tracks memory allocations not covered by retry framework, logs findings to CSV file
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala Adds retry block depth tracking to RetryStateTracker and wraps iterator next() with enter/exit calls
sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostAlloc.scala Integrates coverage tracker check after successful host memory allocation
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala Enables RMM debug mode when retry coverage tracking is enabled to get allocation callbacks
sql-plugin/src/main/scala/com/nvidia/spark/rapids/DeviceMemoryEventHandler.scala Implements onAllocated callback to check device memory allocation coverage
integration_tests/run_pyspark_from_build.sh Propagates SPARK_RAPIDS_RETRY_COVERAGE_TRACKING environment variable to test executors



@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala, line 585-587 (link)

    logic: clearCurThreadRetrying() called here may conflict with the same call at line 755 in parent class RmmRapidsRetryIterator.next(). The parent's next() already handles clearing retry state after completing all retry attempts. Calling it here in the finally block means it will execute even on successful paths, potentially clearing state prematurely or redundantly.

6 files reviewed, 1 comment


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala, line 586-587 (link)

    logic: clearCurThreadRetrying() should be called before exitRetryBlock() only on exception path. On success path, the parent's next() already manages setCurThreadRetrying state. Calling clear here may incorrectly reset retry state for nested retry blocks.

6 files reviewed, 1 comment


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.



Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

@firestarman firestarman left a comment


Overall it looks good to me.
Can you add some example lines from the CSV file generated by your local tests to the description?

@firestarman
Collaborator

Better have more reviews from others

@thirtiseven
Collaborator Author

Overall it looks good to me. Can you add some example lines from the CSV file generated by your local tests to the description?

Thanks. I added the results CSV file to the description.

@thirtiseven thirtiseven requested a review from zpuller January 7, 2026 06:50

@abellina abellina left a comment


I ran this tool locally and I think this is a must. Thanks for doing this.


Labels

reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants