Conversation

@arrdel arrdel commented Dec 3, 2025

What does this PR do?

Fixes #21389

This PR fixes the overly strict CUDA fork check in the ddp_notebook strategy that caused false positives in notebook environments such as Kaggle.

Problem

The previous implementation used torch.cuda.is_initialized(), which returns True even when CUDA has only been passively initialized (e.g., during library imports, device availability checks, or model loading). This caused the error:

RuntimeError: Lightning can't create new processes if CUDA is already initialized.

This happened even when users didn't explicitly call any CUDA functions, making it impossible to use ddp_notebook in many legitimate scenarios.

Solution

This fix uses PyTorch's internal torch.cuda._is_in_bad_fork() function, which more accurately detects an actual bad fork state, i.e., that a CUDA context was created and the process was then forked.

The implementation includes a fallback to the old check for older PyTorch versions that don't have _is_in_bad_fork.
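
As a minimal sketch of the difference (assuming a CUDA-capable machine and a PyTorch version that exposes _is_in_bad_fork; the device query below is just an illustrative stand-in for whatever passive initialization a library performs):

import torch

if torch.cuda.is_available():
    # Passive initialization: a plain device query is enough to lazily
    # initialize CUDA in the main process, even though nothing was forked.
    _ = torch.cuda.get_device_name(0)

print(torch.cuda.is_initialized())    # True  -> the old check would raise
print(torch.cuda._is_in_bad_fork())   # False -> the new check does not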

Testing

  • Code follows style guidelines
  • Changes preserve backward compatibility
  • Fallback exists for older PyTorch versions

📚 Documentation preview 📚: https://pytorch-lightning--21402.org.readthedocs.build/en/21402/

The previous implementation used torch.cuda.is_initialized() which returns
True even when CUDA is passively initialized (e.g., during library imports
or device availability checks). This caused false positives in environments
like Kaggle notebooks where libraries may query CUDA without creating a
context.

This fix uses PyTorch's internal torch.cuda._is_in_bad_fork() function,
which more accurately detects when we're in an actual bad fork state (i.e.,
CUDA was initialized with a context and then the process was forked).

The change allows passive CUDA initialization while still catching genuine
problematic cases. Falls back to the old check for older PyTorch versions
that don't have _is_in_bad_fork.

Fixes Lightning-AI#21389
Copilot AI review requested due to automatic review settings December 3, 2025 20:47
@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Dec 3, 2025
Copilot finished reviewing on behalf of arrdel December 3, 2025 20:49

Copilot AI left a comment

Pull request overview

This PR fixes an overly strict CUDA fork check in the ddp_notebook strategy that was causing false positives in notebook environments. The implementation now uses PyTorch's internal _is_in_bad_fork() function to more accurately detect actual bad fork states, allowing passive CUDA initialization while still catching problematic cases.

Key Changes

  • Replaced torch.cuda.is_initialized() check with torch.cuda._is_in_bad_fork() for more accurate detection
  • Added fallback to the old check for backward compatibility with older PyTorch versions
  • Updated error messages to be more informative

if _IS_INTERACTIVE:
    message += " You will have to restart the Python kernel."
raise RuntimeError(message)

Copilot AI Dec 3, 2025

Trailing whitespace found at the end of the line. Remove the extra whitespace.

Comment on lines +198 to +220
# Use PyTorch's internal check for bad fork state, which is more accurate than just checking if CUDA
# is initialized. This allows passive CUDA initialization (e.g., from library imports or device queries)
# while still catching actual problematic cases where CUDA context was created before forking.
_is_in_bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
if _is_in_bad_fork is not None and _is_in_bad_fork():
    message = (
        "Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, "
        "you must use the 'spawn' start method or avoid CUDA initialization in the main process."
    )
    if _IS_INTERACTIVE:
        message += " You will have to restart the Python kernel."
    raise RuntimeError(message)

# Fallback to the old check if _is_in_bad_fork is not available (older PyTorch versions)
if _is_in_bad_fork is None and torch.cuda.is_initialized():
    message = (
        "Lightning can't create new processes if CUDA is already initialized. Did you manually call"
        " `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any"
        " other way? Please remove any such calls, or change the selected strategy."
    )
    if _IS_INTERACTIVE:
        message += " You will have to restart the Python kernel."
    raise RuntimeError(message)

Copilot AI Dec 3, 2025

The existing test test_check_for_bad_cuda_fork mocks torch.cuda.is_initialized() to return True, which will only test the fallback path in the new implementation (when _is_in_bad_fork is None). This test should be updated to also verify the new behavior when torch.cuda._is_in_bad_fork is available and returns True.
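
A minimal sketch of such a test, assuming the check lives in _check_bad_cuda_fork() under lightning.fabric.strategies.launchers.multiprocessing (the import path, function name, and use of monkeypatch are assumptions based on the existing test, not confirmed here):

import pytest
import torch

from lightning.fabric.strategies.launchers.multiprocessing import _check_bad_cuda_fork  # assumed location


def test_check_for_bad_cuda_fork_bad_fork_state(monkeypatch):
    # Simulate a PyTorch build that exposes `_is_in_bad_fork` and reports a bad fork state.
    monkeypatch.setattr(torch.cuda, "_is_in_bad_fork", lambda: True, raising=False)
    with pytest.raises(RuntimeError, match="Cannot re-initialize CUDA"):
        _check_bad_cuda_fork()


def test_check_for_bad_cuda_fork_allows_passive_init(monkeypatch):
    # Passive CUDA initialization alone should no longer raise when the fork state is healthy.
    monkeypatch.setattr(torch.cuda, "_is_in_bad_fork", lambda: False, raising=False)
    monkeypatch.setattr(torch.cuda, "is_initialized", lambda: True)
    _check_bad_cuda_fork()  # should not raise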

# is initialized. This allows passive CUDA initialization (e.g., from library imports or device queries)
# while still catching actual problematic cases where CUDA context was created before forking.
_is_in_bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
if _is_in_bad_fork is not None and _is_in_bad_fork():

Copilot AI Dec 3, 2025

The code uses getattr to check if _is_in_bad_fork exists, but it doesn't verify if the returned value is callable. While PyTorch's _is_in_bad_fork is indeed a function, it's better practice to verify callability when using getattr on potentially undefined attributes, especially for internal/private APIs that could change.

Consider adding a callable check:

_is_in_bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
if _is_in_bad_fork is not None and callable(_is_in_bad_fork) and _is_in_bad_fork():

Suggested change:
- if _is_in_bad_fork is not None and _is_in_bad_fork():
+ if _is_in_bad_fork is not None and callable(_is_in_bad_fork) and _is_in_bad_fork():

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

codecov bot commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (79ffe50) to head (fc8a8ec).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (79ffe50) and HEAD (fc8a8ec).

HEAD has 2005 uploads less than BASE
Flag                 BASE (79ffe50)   HEAD (fc8a8ec)
cpu                             479               30
lightning_fabric                120                0
pytest                          240                0
python3.12                      143                9
python3.12.7                    144                9
lightning                       240               15
python3.11                       96                6
python3.10                       48                3
python                           48                3
pytorch2.1                       48                6
pytest-full                     239               30
pytorch_lightning               119               15
pytorch2.6                       24                3
pytorch2.4.1                     24                3
pytorch2.3                       24                3
pytorch2.2.2                     24                3
pytorch2.5.1                     24                3
pytorch2.9                       24                3
pytorch2.7                       24                3
pytorch2.8                       23                3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21402     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         269      266      -3     
  Lines       23804    23754     -50     
=========================================
- Hits        20626    18725   -1901     
- Misses       3178     5029   +1851     


Labels

fabric lightning.fabric.Fabric

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ddp_notebook on kaggle having CUDA issues

1 participant