Description
Describe the issue:
I have a dask-based application that is currently too complicated to reduce to a small reproducer, but I can work on one if necessary. A few key steps in my program:
- I create one dask.distributed `LocalCluster`
- I create one dask `Client` and attach it to the `LocalCluster`
- I have a conditional `import dask_jobqueue` in my import statements. I don't use `dask_jobqueue` in this particular instance of my program, but the user can optionally enable it via the right configuration parameter, provided the package is installed in the Python environment.
- I create N distinct dask computation graphs
- I submit the N graphs concurrently via a Python thread pool, something akin to:

```python
with concurrent.futures.ThreadPoolExecutor(max_workers=len(rootnodes)) as executor:
    futures = [executor.submit(execute_graph, rootnode) for rootnode in rootnodes]
    concurrent.futures.wait(futures)
```

I hadn't used this program for a while; when I came back to it recently, I found that it breaks. After many hours of debugging, I traced the breakage down to whether the `dask_jobqueue` package is installed in the Python environment or not.
Then I ran `git bisect` and found that the offending commit breaking my program is 3a00196.
I then noticed that this commit introduces an unconditional override of the signal handler for `SIGINT`. It landed in version 0.9.0, i.e. the latest released version.
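To illustrate why an unconditional override is a problem in threaded programs: CPython only permits `signal.signal` from the main thread of the main interpreter, so the same call executed from a worker thread raises `ValueError`. A minimal sketch of the failure mode (`naive_install` is a hypothetical stand-in, not the actual `dask_jobqueue` code):

```python
import signal
import threading

def naive_install():
    # Unconditionally install a SIGINT handler, wherever this code
    # happens to run (sketch of the pattern in the offending commit).
    signal.signal(signal.SIGINT, signal.SIG_DFL)

errors = []

def worker():
    # CPython restricts signal.signal to the main thread of the main
    # interpreter; in any other thread it raises ValueError.
    try:
        naive_install()
    except ValueError as exc:
        errors.append(exc)

t = threading.Thread(target=worker)
t.start()
t.join()

assert len(errors) == 1
assert isinstance(errors[0], ValueError)
```

This matches the symptom I observed: the failure only appears once the import path that installs the handler runs in a thread-pool context.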
I then noticed that the main branch has a different commit, introduced by #668. It has not yet landed in any officially released version.
I installed a modified version of `dask_jobqueue` 0.9.0 that only adds the changes of #668, i.e. the signal handler is now overridden only in the main thread.
This fixes my program, so I would like to ask that you at least publish a version of `dask_jobqueue` that is not broken.
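For reference, the guard amounts to checking the current thread before touching the handler. A sketch under the assumption that this is the intent of #668 (the function and handler names here are hypothetical, not the library's actual API):

```python
import signal
import threading

def install_sigint_handler(handler):
    # Only install the handler when running in the main thread; in any
    # other thread signal.signal would raise ValueError, so the override
    # is skipped entirely and the existing handler is left in place.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGINT, handler)

def on_sigint(signum, frame):
    raise KeyboardInterrupt

previous = signal.getsignal(signal.SIGINT)
install_sigint_handler(on_sigint)
assert signal.getsignal(signal.SIGINT) is on_sigint

# Restore the original handler so the rest of the process is unaffected.
signal.signal(signal.SIGINT, previous)
```

Note that this guard only avoids the `ValueError`; it does not address the broader concern below about silently replacing a handler the application may rely on.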
Anything else we need to know?:
On a more general note, I have serious doubts about the validity of this piece of code. Even with the guard that restricts it to the main thread, there can in general be only one handler installed per signal at a time. So `dask_jobqueue` is silently taking over responsibility for `SIGINT` handling for the entire duration of the program. Is this specified anywhere in the documentation? Furthermore, I can easily imagine many downstream use cases that want to handle `SIGINT` in their own way, which leads to two possible cases:
- The call to `signal.signal` made by the downstream library happens after the call made by `dask_jobqueue`: in this case the downstream user is possibly happy, but the assumption made by `dask_jobqueue` is invalidated, so this may become a problem for this library.
- The last call to `signal.signal` is made by `dask_jobqueue`: in this case the user is unhappy, because their assumptions are broken without any good way to understand what happened.
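Both cases follow from `signal.signal` being last-writer-wins: each call returns the previous handler and silently discards it unless the caller deliberately chains to it. A minimal sketch (both handlers are hypothetical):

```python
import signal

original = signal.getsignal(signal.SIGINT)

def user_handler(signum, frame):
    print("user cleanup")

def library_handler(signum, frame):
    print("library cleanup")

# The downstream application installs its handler first...
signal.signal(signal.SIGINT, user_handler)

# ...then a library import silently replaces it. signal.signal returns
# the previous handler, but nothing is chained unless the caller does so.
previous = signal.signal(signal.SIGINT, library_handler)

assert previous is user_handler                          # displaced silently
assert signal.getsignal(signal.SIGINT) is library_handler

# Restore the handler that was active before this sketch ran.
signal.signal(signal.SIGINT, original)
```

A cooperative library would at least save `previous` and invoke it from its own handler, but even that changes observable behavior for the application.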
Could you please clarify your position on this matter, and could you foresee a more secure and isolated way to reach the desired result, without silently breaking user code?
Environment:
- Dask version: 2025.12.0
- Python version: 3.13
- Operating System: Fedora Linux
- Install method (conda, pip, source): pip, source