Do parallel runs always need N+1 tasks? #945

@Sbte

If I want to run a code (in my case POP from OMUSE) on N cores (number_of_workers), say 128, it seems I need to request N+1 tasks (129) in Slurm: AMUSE will call MPI.Spawn 128 times, and the original process already occupies 1 task, which brings the total to 129.

  • I cannot use 127 workers for the code I run through AMUSE, because that would break the domain partitioning.
  • If I use 129 tasks, I have to use two supercomputer nodes (each node has 128 cores), so that is not an option either.
  • If I oversubscribe, I cannot pin each process to a physical core, which is probably bad for performance (it could double the computational time, because two processes would share a core).
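To make the arithmetic concrete, here is a minimal sketch of the task and node count as I understand it. The function name `slurm_request` and its parameters are made up for illustration and are not part of AMUSE or Slurm. It assumes one extra task for the original (driver) process and one task per core.

```python
import math


def slurm_request(number_of_workers, cores_per_node, driver_tasks=1):
    """Return (total_tasks, nodes) for a run with one task per core.

    Hypothetical helper: assumes every spawned worker needs its own task
    and the original process occupies `driver_tasks` additional tasks.
    """
    total_tasks = number_of_workers + driver_tasks
    # Round up: a partially filled node still has to be requested.
    nodes = math.ceil(total_tasks / cores_per_node)
    return total_tasks, nodes


tasks, nodes = slurm_request(number_of_workers=128, cores_per_node=128)
print(tasks, nodes)  # prints "129 2"
```

With 127 workers the same calculation gives 128 tasks on one node, but that is the worker count that breaks the partitioning.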

What's the proper way to solve this? Is it really not possible to reuse the original process for a worker?
