Fix ddp_notebook CUDA fork check to allow passive initialization #21402
Conversation
The previous implementation used `torch.cuda.is_initialized()`, which returns `True` even when CUDA is passively initialized (e.g., during library imports or device availability checks). This caused false positives in environments like Kaggle notebooks, where libraries may query CUDA without creating a context.

This fix uses PyTorch's internal `torch.cuda._is_in_bad_fork()` function, which more accurately detects an actual bad fork state (i.e., CUDA was initialized with a context and the process was then forked). The change allows passive CUDA initialization while still catching genuinely problematic cases. It falls back to the old check for older PyTorch versions that don't have `_is_in_bad_fork`.

Fixes Lightning-AI#21389
Pull request overview
This PR fixes an overly strict CUDA fork check in the ddp_notebook strategy that was causing false positives in notebook environments. The implementation now uses PyTorch's internal _is_in_bad_fork() function to more accurately detect actual bad fork states, allowing passive CUDA initialization while still catching problematic cases.
Key Changes
- Replaced the `torch.cuda.is_initialized()` check with `torch.cuda._is_in_bad_fork()` for more accurate detection
- Added a fallback to the old check for backward compatibility with older PyTorch versions
- Updated error messages to be more informative
```python
    if _IS_INTERACTIVE:
        message += " You will have to restart the Python kernel."
    raise RuntimeError(message)
```
Copilot AI (Dec 3, 2025)
Trailing whitespace found at the end of the line. Remove the extra whitespace.
```python
# Use PyTorch's internal check for bad fork state, which is more accurate than just checking if CUDA
# is initialized. This allows passive CUDA initialization (e.g., from library imports or device queries)
# while still catching actual problematic cases where CUDA context was created before forking.
_is_in_bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
if _is_in_bad_fork is not None and _is_in_bad_fork():
    message = (
        "Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, "
        "you must use the 'spawn' start method or avoid CUDA initialization in the main process."
    )
    if _IS_INTERACTIVE:
        message += " You will have to restart the Python kernel."
    raise RuntimeError(message)

# Fallback to the old check if _is_in_bad_fork is not available (older PyTorch versions)
if _is_in_bad_fork is None and torch.cuda.is_initialized():
    message = (
        "Lightning can't create new processes if CUDA is already initialized. Did you manually call"
        " `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any"
        " other way? Please remove any such calls, or change the selected strategy."
    )
    if _IS_INTERACTIVE:
        message += " You will have to restart the Python kernel."
    raise RuntimeError(message)
```
Copilot AI (Dec 3, 2025)
The existing test `test_check_for_bad_cuda_fork` mocks `torch.cuda.is_initialized()` to return `True`, which will only exercise the fallback path in the new implementation (when `_is_in_bad_fork` is `None`). This test should be updated to also verify the new behavior when `torch.cuda._is_in_bad_fork` is available and returns `True`.
```python
# is initialized. This allows passive CUDA initialization (e.g., from library imports or device queries)
# while still catching actual problematic cases where CUDA context was created before forking.
_is_in_bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
if _is_in_bad_fork is not None and _is_in_bad_fork():
```
Copilot AI (Dec 3, 2025)
The code uses `getattr` to check whether `_is_in_bad_fork` exists, but it doesn't verify that the returned value is callable. While PyTorch's `_is_in_bad_fork` is indeed a function, it's better practice to verify callability when using `getattr` on potentially undefined attributes, especially for internal/private APIs that could change.
Consider adding a callable check:
```python
_is_in_bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
if _is_in_bad_fork is not None and callable(_is_in_bad_fork) and _is_in_bad_fork():
```
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master   #21402   +/-  ##
=========================================
- Coverage      87%      79%      -8%
=========================================
  Files         269      266       -3
  Lines       23804    23754      -50
=========================================
- Hits        20626    18725    -1901
- Misses       3178     5029    +1851
```
What does this PR do?
Fixes #21389
This PR fixes the overly strict CUDA fork check in the `ddp_notebook` strategy that was causing false positives in notebook environments like Kaggle.

Problem

The previous implementation used `torch.cuda.is_initialized()`, which returns `True` even when CUDA is passively initialized (e.g., during library imports, device availability checks, or model loading). This caused the error:

This happened even when users didn't explicitly call any CUDA functions, making it impossible to use `ddp_notebook` in many legitimate scenarios.

Solution

This fix uses PyTorch's internal `torch.cuda._is_in_bad_fork()` function, which more accurately detects when we're in an actual bad fork state.

The implementation includes a fallback to the old check for older PyTorch versions that don't have `_is_in_bad_fork`.

Testing
📚 Documentation preview 📚: https://pytorch-lightning--21402.org.readthedocs.build/en/21402/