
Overriding the SIGINT signal handler in dask-jobqueue==0.9.0 breaks programs with no way to diagnose the failure #696

@vepadulano

Description


Describe the issue:

I have a dask-based application, currently too complicated to reduce to a small reproducer, but I can work on one if necessary. The key steps in my program:

  • I create one dask.distributed LocalCluster
  • I create one dask Client and I attach it to the LocalCluster
  • I conditionally import dask_jobqueue in my import statements. I don't use dask_jobqueue in this particular run of my program, but the user can opt into it via a configuration parameter, provided the package is installed in the Python environment.
  • I create N distinct dask computation graphs
  • I submit the N graphs concurrently via a Python thread pool, something akin to
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(rootnodes)) as executor:
        futures = [executor.submit(execute_graph, rootnode) for rootnode in rootnodes]
        concurrent.futures.wait(futures)
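Stripped of the dask specifics, the submission pattern in the steps above can be sketched with only the standard library (`execute_graph` and `rootnodes` are stand-ins for the real graph-execution function and graph roots, which are not shown in the report):

```python
import concurrent.futures

def execute_graph(rootnode):
    # Stand-in for triggering one dask computation graph; the real
    # function would run the graph (e.g. via dask.compute).
    return rootnode * 2

# Stand-ins for the N distinct computation graphs.
rootnodes = [1, 2, 3]

# One worker thread per graph, as described in the report.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(rootnodes)) as executor:
    futures = [executor.submit(execute_graph, rootnode) for rootnode in rootnodes]
    concurrent.futures.wait(futures)

results = [f.result() for f in futures]
print(results)  # [2, 4, 6]
```

The relevant detail for this issue is that `execute_graph` runs on worker threads, not on the main thread.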

I hadn't used this program for a while; I came back to it recently and found that it breaks. After many hours of debugging, I traced the failure to whether or not the dask_jobqueue package is installed in the Python environment.

I then ran git bisect and identified 3a00196 as the offending commit that breaks my program.

That commit introduces an unconditional override of the signal handler for SIGINT. It landed in version 0.9.0, i.e. the latest released version.

I also noticed that the main branch carries a different change, introduced by #668, which has not yet landed in any officially released version.

I installed a modified dask_jobqueue 0.9.0 with just the changes from #668 applied, so that the signal handler is only overridden in the main thread.

This fixes my program, so I would like to ask that, at minimum, a version of dask_jobqueue with this fix be published.
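For context, here is a minimal sketch of why the guard from #668 matters (this is not the library's actual code, just an illustration of the pattern): in CPython, `signal.signal` may only be called from the main thread of the main interpreter, so an unconditional override that ends up executing on a worker thread raises `ValueError`, while the guarded version simply skips installation:

```python
import signal
import threading

def install_sigint_handler():
    # Sketch of the main-thread guard from #668; the real handler
    # installed by dask_jobqueue does more than this.
    if threading.current_thread() is not threading.main_thread():
        return False  # guarded version: skip off the main thread
    signal.signal(signal.SIGINT, signal.default_int_handler)
    return True

# On the main thread, installation succeeds.
assert install_sigint_handler() is True

outcome = {}

def worker():
    # Guarded version: installation is skipped, no error.
    outcome["guarded"] = install_sigint_handler()
    # Unguarded version: signal.signal itself raises ValueError here.
    try:
        signal.signal(signal.SIGINT, signal.default_int_handler)
        outcome["unguarded"] = "ok"
    except ValueError:
        outcome["unguarded"] = "ValueError"

t = threading.Thread(target=worker)
t.start()
t.join()
print(outcome)  # {'guarded': False, 'unguarded': 'ValueError'}
```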

Anything else we need to know?:

On a more general note, I have serious doubts about the validity of this piece of code. Even with the guard restricting it to the main thread, a process can only have one handler per signal, so dask_jobqueue is silently taking over responsibility for SIGINT handling for the entire duration of the program. Is this documented anywhere? Furthermore, I can easily imagine downstream use cases that want their own SIGINT handling, which leads to two possible cases:

  1. The downstream library's call to signal.signal happens after dask_jobqueue's: the downstream user is probably happy, but the assumption made by dask_jobqueue is invalidated, which may be a problem for this library.
  2. dask_jobqueue's call to signal.signal comes last: the user is unhappy because their handler has been silently replaced, with no good way to understand what happened.
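Both cases follow from the last-writer-wins semantics of `signal.signal`; a library that wants to cooperate would have to save the previous handler via `signal.getsignal` and delegate to it. A small stdlib-only illustration (the handler names are hypothetical):

```python
import signal

calls = []

def downstream_handler(signum, frame):
    calls.append("downstream")

def library_handler(signum, frame):
    calls.append("library")

# Case 2 above: the library's call comes last and silently
# replaces the downstream handler.
signal.signal(signal.SIGINT, downstream_handler)
signal.signal(signal.SIGINT, library_handler)
assert signal.getsignal(signal.SIGINT) is library_handler

# Cooperative alternative: save the previous handler and chain to it.
previous = signal.getsignal(signal.SIGINT)

def chaining_handler(signum, frame):
    calls.append("chaining")
    if callable(previous):
        previous(signum, frame)

signal.signal(signal.SIGINT, chaining_handler)

# Invoke the installed handler directly rather than raising a real SIGINT.
signal.getsignal(signal.SIGINT)(signal.SIGINT, None)
print(calls)  # ['chaining', 'library']
```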

Could you please share your opinion on this matter, and do you see a safer, more isolated way to achieve the desired result without silently breaking user code?

Environment:

  • Dask version: 2025.12.0
  • Python version: 3.13
  • Operating System: Fedora Linux
  • Install method (conda, pip, source): pip, source
