Conversation

@arrdel arrdel commented Dec 3, 2025

What does this PR do?

Fixes #21389

This PR fixes the overly strict CUDA fork check in the ddp_notebook strategy that caused false positives in notebook environments such as Kaggle.

Problem

The previous implementation used torch.cuda.is_initialized(), which returns True even when CUDA has only been passively initialized (e.g., during library imports, device availability checks, or model loading). This caused the error:

RuntimeError: Lightning can't create new processes if CUDA is already initialized.

This happened even when users didn't explicitly call any CUDA functions, making it impossible to use ddp_notebook in many legitimate scenarios.

Solution

This fix uses PyTorch's internal torch.cuda._is_in_bad_fork() function, which more accurately detects an actual bad fork state, i.e., that a CUDA context was created and the process was then forked.

The implementation includes a fallback to the old check for older PyTorch versions that don't have _is_in_bad_fork.
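
As a minimal sketch of the difference (assuming a CUDA-capable machine and a PyTorch version that exposes _is_in_bad_fork; the device query below is just an illustrative stand-in for whatever passive initialization a library performs):

import torch

if torch.cuda.is_available():
    # Passive initialization: a plain device query is enough to lazily
    # initialize CUDA in the main process, even though nothing was forked.
    _ = torch.cuda.get_device_name(0)

print(torch.cuda.is_initialized())    # True  -> the old check would raise
print(torch.cuda._is_in_bad_fork())   # False -> the new check does not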

Testing

  • Code follows style guidelines
  • Changes preserve backward compatibility
  • Fallback exists for older PyTorch versions

📚 Documentation preview 📚: https://pytorch-lightning--21402.org.readthedocs.build/en/21402/

The previous implementation used torch.cuda.is_initialized() which returns
True even when CUDA is passively initialized (e.g., during library imports
or device availability checks). This caused false positives in environments
like Kaggle notebooks where libraries may query CUDA without creating a
context.

This fix uses PyTorch's internal torch.cuda._is_in_bad_fork() function,
which more accurately detects when we're in an actual bad fork state (i.e.,
CUDA was initialized with a context and then the process was forked).

The change allows passive CUDA initialization while still catching genuine
problematic cases. Falls back to the old check for older PyTorch versions
that don't have _is_in_bad_fork.

Fixes Lightning-AI#21389
Copilot AI review requested due to automatic review settings December 3, 2025 20:47
@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Dec 3, 2025
Copilot finished reviewing on behalf of arrdel December 3, 2025 20:49

Copilot AI left a comment

Pull request overview

This PR fixes an overly strict CUDA fork check in the ddp_notebook strategy that was causing false positives in notebook environments. The implementation now uses PyTorch's internal _is_in_bad_fork() function to more accurately detect actual bad fork states, allowing passive CUDA initialization while still catching problematic cases.

Key Changes

  • Replaced torch.cuda.is_initialized() check with torch.cuda._is_in_bad_fork() for more accurate detection
  • Added fallback to the old check for backward compatibility with older PyTorch versions
  • Updated error messages to be more informative

if _IS_INTERACTIVE:
    message += " You will have to restart the Python kernel."
raise RuntimeError(message)

Copilot AI Dec 3, 2025

Trailing whitespace found at the end of the line. Remove the extra whitespace.

Comment on lines +198 to +220
# Use PyTorch's internal check for bad fork state, which is more accurate than just checking if CUDA
# is initialized. This allows passive CUDA initialization (e.g., from library imports or device queries)
# while still catching actual problematic cases where CUDA context was created before forking.
_is_in_bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
if _is_in_bad_fork is not None and _is_in_bad_fork():
    message = (
        "Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, "
        "you must use the 'spawn' start method or avoid CUDA initialization in the main process."
    )
    if _IS_INTERACTIVE:
        message += " You will have to restart the Python kernel."
    raise RuntimeError(message)

# Fallback to the old check if _is_in_bad_fork is not available (older PyTorch versions)
if _is_in_bad_fork is None and torch.cuda.is_initialized():
    message = (
        "Lightning can't create new processes if CUDA is already initialized. Did you manually call"
        " `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any"
        " other way? Please remove any such calls, or change the selected strategy."
    )
    if _IS_INTERACTIVE:
        message += " You will have to restart the Python kernel."
    raise RuntimeError(message)

Copilot AI Dec 3, 2025

The existing test test_check_for_bad_cuda_fork mocks torch.cuda.is_initialized() to return True, which will only test the fallback path in the new implementation (when _is_in_bad_fork is None). This test should be updated to also verify the new behavior when torch.cuda._is_in_bad_fork is available and returns True.
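
A minimal sketch of such a test, assuming the check lives in _check_bad_cuda_fork() under lightning.fabric.strategies.launchers.multiprocessing (the import path, function name, and use of monkeypatch are assumptions based on the existing test, not confirmed here):

import pytest
import torch

from lightning.fabric.strategies.launchers.multiprocessing import _check_bad_cuda_fork  # assumed location


def test_check_for_bad_cuda_fork_bad_fork_state(monkeypatch):
    # Simulate a PyTorch build that exposes `_is_in_bad_fork` and reports a bad fork state.
    monkeypatch.setattr(torch.cuda, "_is_in_bad_fork", lambda: True, raising=False)
    with pytest.raises(RuntimeError, match="Cannot re-initialize CUDA"):
        _check_bad_cuda_fork()


def test_check_for_bad_cuda_fork_allows_passive_init(monkeypatch):
    # Passive CUDA initialization alone should no longer raise when the fork state is healthy.
    monkeypatch.setattr(torch.cuda, "_is_in_bad_fork", lambda: False, raising=False)
    monkeypatch.setattr(torch.cuda, "is_initialized", lambda: True)
    _check_bad_cuda_fork()  # should not raise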

# is initialized. This allows passive CUDA initialization (e.g., from library imports or device queries)
# while still catching actual problematic cases where CUDA context was created before forking.
_is_in_bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
if _is_in_bad_fork is not None and _is_in_bad_fork():

Copilot AI Dec 3, 2025

The code uses getattr to check if _is_in_bad_fork exists, but it doesn't verify if the returned value is callable. While PyTorch's _is_in_bad_fork is indeed a function, it's better practice to verify callability when using getattr on potentially undefined attributes, especially for internal/private APIs that could change.

Consider adding a callable check:

_is_in_bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
if _is_in_bad_fork is not None and callable(_is_in_bad_fork) and _is_in_bad_fork():

Suggested change:
- if _is_in_bad_fork is not None and _is_in_bad_fork():
+ if _is_in_bad_fork is not None and callable(_is_in_bad_fork) and _is_in_bad_fork():

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

codecov bot commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (79ffe50) to head (fc8a8ec).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (79ffe50) and HEAD (fc8a8ec).

HEAD has 2005 uploads less than BASE
Flag                 BASE (79ffe50)   HEAD (fc8a8ec)
cpu                             479               30
lightning_fabric                120                0
pytest                          240                0
python3.12                      143                9
python3.12.7                    144                9
lightning                       240               15
python3.11                       96                6
python3.10                       48                3
python                           48                3
pytorch2.1                       48                6
pytest-full                     239               30
pytorch_lightning               119               15
pytorch2.6                       24                3
pytorch2.4.1                     24                3
pytorch2.3                       24                3
pytorch2.2.2                     24                3
pytorch2.5.1                     24                3
pytorch2.9                       24                3
pytorch2.7                       24                3
pytorch2.8                       23                3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21402     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         269      266      -3     
  Lines       23804    23754     -50     
=========================================
- Hits        20626    18725   -1901     
- Misses       3178     5029   +1851     


Labels

fabric lightning.fabric.Fabric

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ddp_notebook on kaggle having CUDA issues

1 participant