CE: Added pages with guidelines for images on Alps #272
Open

Madeeks wants to merge 31 commits into eth-cscs:main from Madeeks:ce-image-guidelines
Changes from all commits (31 commits)
* 77c5f2f CE: Added pages with guidelines for images on Alps (Madeeks)
* 5d22873 Fixed mkdocs table of contents (Madeeks)
* 9504522 Fixed typos (Madeeks)
* 713f8b4 Fixed typo (Madeeks)
* b10b99f Improved content organization for CE image guidelines (Madeeks)
* 5d74d59 Fixed code blocks (Madeeks)
* 0bab4b2 Updated allowed words in spelling checker (Madeeks)
* 7236858 Apply suggestions from code review (Madeeks)
* ca271ed CE image guidelines: add links to subpages (Madeeks)
* 1e91952 CE image guidelines: add code block notes for PMIx settings (Madeeks)
* 1dd25cb Merge branch 'main' into ce-image-guidelines (bcumming)
* 797f947 Merge branch 'main' into ce-image-guidelines (bcumming)
* b0f1dc3 Merge branch 'main' into ce-image-guidelines (bcumming)
* a69d336 wip (bcumming)
* 5df469c refactor comms index; integrated base image into libfabric (bcumming)
* fc5c463 wip (bcumming)
* e993f95 wip (bcumming)
* 1d3dd28 Merge remote-tracking branch 'bcumming/ce-image-guidelines' into ce-i… (bcumming)
* 87ad5a9 fix spelling (bcumming)
* 1171ce6 spelling (bcumming)
* 43d52c2 moving dockerfile material into the communication library docs (bcumming)
* 6f39d91 move nccl, openmpi and nvshmem docs fully to communication (bcumming)
* df6c449 Merge branch 'main' into ce-image-guidelines (bcumming)
* d6324c3 spelling (bcumming)
* 635f4d7 remove original guides, consolidate all the material (bcumming)
* 09c6fca spelling (bcumming)
* 28b4ca7 clean up link errors (bcumming)
* beb2c37 remove RCCL docs; clean up more links (bcumming)
* f1f0ab5 Merge branch 'main' into ce-image-guidelines (bcumming)
* 317d767 Rocco PR review fixes (bcumming)
* 9cb94e6 remove explicit nvshmem version from docs (bcumming)
docs/software/communication/dockerfiles/base (new file, @@ -0,0 +1,36 @@)

```Dockerfile
ARG ubuntu_version=24.04
ARG cuda_version=12.8.1
FROM docker.io/nvidia/cuda:${cuda_version}-cudnn-devel-ubuntu${ubuntu_version}

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive \
       apt-get install -y \
       build-essential \
       ca-certificates \
       pkg-config \
       automake \
       autoconf \
       libtool \
       cmake \
       gdb \
       strace \
       wget \
       git \
       bzip2 \
       python3 \
       gfortran \
       rdma-core \
       numactl \
       libconfig-dev \
       libuv1-dev \
       libfuse-dev \
       libfuse3-dev \
       libyaml-dev \
       libnl-3-dev \
       libnuma-dev \
       libsensors-dev \
       libcurl4-openssl-dev \
       libjson-c-dev \
       libibverbs-dev \
       --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*
```
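These Dockerfile fragments are assembled into a single Containerfile by the documentation further below (via the `--8<--` snippet includes). As a hedged sketch only, such a Containerfile could be built with a standard OCI builder; the tag is a placeholder, not a name used in this PR:

```bash
# Illustrative only: build the assembled Containerfile; "my-gh200-base" is a placeholder tag.
podman build -f Containerfile -t my-gh200-base:latest .
```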
docs/software/communication/dockerfiles/libfabric (new file, @@ -0,0 +1,20 @@)

```Dockerfile
ARG gdrcopy_version=2.5.1
RUN git clone --depth 1 --branch v${gdrcopy_version} https://github.com/NVIDIA/gdrcopy.git \
    && cd gdrcopy \
    && export CUDA_PATH=/usr/local/cuda \
    && make CC=gcc CUDA=$CUDA_PATH lib \
    && make lib_install \
    && cd ../ && rm -rf gdrcopy

# Install libfabric
ARG libfabric_version=1.22.0
RUN git clone --branch v${libfabric_version} --depth 1 https://github.com/ofiwg/libfabric.git \
    && cd libfabric \
    && ./autogen.sh \
    && ./configure --prefix=/usr --with-cuda=/usr/local/cuda --enable-cuda-dlopen \
       --enable-gdrcopy-dlopen --enable-efa \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -rf libfabric
```
New file (@@ -0,0 +1,7 @@)

```Dockerfile
ARG nccl_tests_version=2.17.1
RUN wget -O nccl-tests-${nccl_tests_version}.tar.gz https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v${nccl_tests_version}.tar.gz \
    && tar xf nccl-tests-${nccl_tests_version}.tar.gz \
    && cd nccl-tests-${nccl_tests_version} \
    && MPI=1 make -j$(nproc) \
    && cd .. \
    && rm -rf nccl-tests-${nccl_tests_version}.tar.gz
```
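A hedged usage sketch (not part of the PR): once the image runs on Alps with the CXI hook enabled, the NCCL benchmarks built above are typically launched with one task per GPU. Paths and Slurm options below are assumptions:

```bash
# Illustrative only: 2 nodes, 4 GPUs per node, one MPI rank per GPU.
srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 \
    ./nccl-tests-2.17.1/build/all_reduce_perf -b 8 -e 512M -f 2 -g 1
```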
New file (@@ -0,0 +1,54 @@)

```Dockerfile
RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive \
       apt-get install -y \
       python3-venv \
       python3-dev \
       --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && rm /usr/lib/python3.12/EXTERNALLY-MANAGED

# Build NVSHMEM from source
ARG nvshmem_version=3.4.5
RUN wget -q https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/source/nvshmem_src_cuda12-all-all-${nvshmem_version}.tar.gz \
    && tar -xvf nvshmem_src_cuda12-all-all-${nvshmem_version}.tar.gz \
    && cd nvshmem_src \
    && NVSHMEM_BUILD_EXAMPLES=0 \
       NVSHMEM_BUILD_TESTS=1 \
       NVSHMEM_DEBUG=0 \
       NVSHMEM_DEVEL=0 \
       NVSHMEM_DEFAULT_PMI2=0 \
       NVSHMEM_DEFAULT_PMIX=1 \
       NVSHMEM_DISABLE_COLL_POLL=1 \
       NVSHMEM_ENABLE_ALL_DEVICE_INLINING=0 \
       NVSHMEM_GPU_COLL_USE_LDST=0 \
       NVSHMEM_LIBFABRIC_SUPPORT=1 \
       NVSHMEM_MPI_SUPPORT=1 \
       NVSHMEM_MPI_IS_OMPI=1 \
       NVSHMEM_NVTX=1 \
       NVSHMEM_PMIX_SUPPORT=1 \
       NVSHMEM_SHMEM_SUPPORT=1 \
       NVSHMEM_TEST_STATIC_LIB=0 \
       NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
       NVSHMEM_TRACE=0 \
       NVSHMEM_USE_DLMALLOC=0 \
       NVSHMEM_USE_NCCL=1 \
       NVSHMEM_USE_GDRCOPY=1 \
       NVSHMEM_VERBOSE=0 \
       NVSHMEM_DEFAULT_UCX=0 \
       NVSHMEM_UCX_SUPPORT=0 \
       NVSHMEM_IBGDA_SUPPORT=0 \
       NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=0 \
       NVSHMEM_IBDEVX_SUPPORT=0 \
       NVSHMEM_IBRC_SUPPORT=0 \
       LIBFABRIC_HOME=/usr \
       NCCL_HOME=/usr \
       GDRCOPY_HOME=/usr/local \
       MPI_HOME=/usr \
       SHMEM_HOME=/usr \
       NVSHMEM_HOME=/usr \
       cmake . \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -r nvshmem_src nvshmem_src_cuda12-all-all-${nvshmem_version}.tar.gz
```
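The build above selects PMIx as the default bootstrap (`NVSHMEM_DEFAULT_PMIX=1`, `NVSHMEM_PMIX_SUPPORT=1`), so the launcher has to provide PMIx. A hedged launch sketch, assuming Slurm's PMIx plugin is available on the system; the executable name is a placeholder:

```bash
# Illustrative only: request PMIx from Slurm so NVSHMEM can bootstrap its processes.
srun --mpi=pmix -N 2 --ntasks-per-node=4 ./my_nvshmem_app
```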
New file (@@ -0,0 +1,12 @@)

```Dockerfile
ARG OMPI_VER=5.0.8
RUN wget -q https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${OMPI_VER}.tar.gz \
    && tar xf openmpi-${OMPI_VER}.tar.gz \
    && cd openmpi-${OMPI_VER} \
    && ./configure --prefix=/usr --with-ofi=/usr --with-ucx=/usr \
       --enable-oshmem --with-cuda=/usr/local/cuda \
       --with-cuda-libdir=/usr/local/cuda/lib64/stubs \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -rf openmpi-${OMPI_VER}.tar.gz openmpi-${OMPI_VER}
```
New file (@@ -0,0 +1,16 @@)

```Dockerfile
ARG omb_version=7.5.1
RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-${omb_version}.tar.gz \
    && tar xf osu-micro-benchmarks-${omb_version}.tar.gz \
    && cd osu-micro-benchmarks-${omb_version} \
    && ldconfig /usr/local/cuda/targets/sbsa-linux/lib/stubs \
    && ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS="-O3 -lcuda -lnvidia-ml" \
       --enable-cuda --with-cuda-include=/usr/local/cuda/include \
       --with-cuda-libpath=/usr/local/cuda/lib64 \
       CXXFLAGS="-lmpi -lcuda" \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -rf osu-micro-benchmarks-${omb_version} osu-micro-benchmarks-${omb_version}.tar.gz

WORKDIR /usr/local/libexec/osu-micro-benchmarks/mpi
```
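A hedged usage sketch (not part of the PR): with the `WORKDIR` set as above, the point-to-point bandwidth test could be run between two nodes using GPU buffers, assuming the container runs with the CXI hook enabled:

```bash
# Illustrative only: one task per node, device-to-device (D D) transfers through CUDA.
srun -N 2 --ntasks-per-node=1 ./pt2pt/osu_bw -d cuda D D
```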
docs/software/communication/dockerfiles/ucx (new file, @@ -0,0 +1,13 @@)

```Dockerfile
# Install UCX
ARG UCX_VERSION=1.19.0
RUN wget https://github.com/openucx/ucx/releases/download/v${UCX_VERSION}/ucx-${UCX_VERSION}.tar.gz \
    && tar xzf ucx-${UCX_VERSION}.tar.gz \
    && cd ucx-${UCX_VERSION} \
    && mkdir build \
    && cd build \
    && ../configure --prefix=/usr --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local \
       --enable-mt --enable-devel-headers \
    && make -j$(nproc) \
    && make install \
    && cd ../.. \
    && rm -rf ucx-${UCX_VERSION}.tar.gz ucx-${UCX_VERSION}
```
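A hedged verification sketch (not part of the PR): inside the built image, UCX's own `ucx_info` utility can confirm the installed version and which transports were compiled in:

```bash
# Illustrative only: print the UCX version, then the available transports and devices.
ucx_info -v
ucx_info -d
```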
Communication Libraries index page (modified, @@ -1,20 +1,67 @@)

[](){#ref-software-communication}
# Communication Libraries

CSCS provides common communication libraries optimized for the [Slingshot 11 network on Alps][ref-alps-hsn].
Communication libraries, like MPI and NCCL, are among the building blocks of high-performance scientific and ML workloads.
Broadly speaking, there are two levels of communication:

* **Intra-node** communication between two processes on the same node.
* **Inter-node** communication between different nodes, over the [Slingshot 11 network][ref-alps-hsn] that connects nodes on Alps.

To get the best inter-node performance on Alps, these libraries need to be configured to use the [libfabric][ref-communication-libfabric] library, which has an optimised back end for the Slingshot 11 network.

As such, communication libraries are part of the "base layer" of libraries and tools used by all workloads to fully utilize the hardware on Alps.
They comprise the *network* layer in the following stack:

* **CPU**: compilers with support for building applications optimized for the CPU architecture on the node.
* **GPU**: CUDA and ROCm provide compilers and runtime libraries for NVIDIA and AMD GPUs respectively.
* **Network**: libfabric, MPI, NCCL, and NVSHMEM need to be configured for the Slingshot network.

CSCS provides communication libraries optimised for libfabric and Slingshot in uenv, and guidance on how to create container images that use them.
This section of the documentation provides advice on how to build and install software to use these libraries, and how to deploy them.

For most scientific applications relying on MPI, [Cray MPICH][ref-communication-cray-mpich] is recommended.
[MPICH][ref-communication-mpich] and [OpenMPI][ref-communication-openmpi] may also be used, with limitations.
Cray MPICH, MPICH, and OpenMPI make use of [libfabric][ref-communication-libfabric] to interact with the underlying network.

Most machine learning applications rely on [NCCL][ref-communication-nccl] for high-performance implementations of collectives.
NCCL has to be configured with a plugin using [libfabric][ref-communication-libfabric] to make full use of the Slingshot network.
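As a hedged illustration only (this plugin is not part of the PR's Dockerfiles), such a libfabric plugin for NCCL is commonly built from the aws-ofi-nccl project; the version, paths, and flags below are assumptions:

```Dockerfile
# Illustrative sketch only: build a libfabric network plugin for NCCL.
# Version, prefix, and configure flags are assumptions, not part of this PR.
ARG aws_ofi_nccl_version=1.14.0
RUN git clone --depth 1 --branch v${aws_ofi_nccl_version} https://github.com/aws/aws-ofi-nccl.git \
    && cd aws-ofi-nccl \
    && ./autogen.sh \
    && ./configure --prefix=/usr --with-libfabric=/usr --with-cuda=/usr/local/cuda --disable-tests \
    && make -j$(nproc) \
    && make install \
    && cd .. && rm -rf aws-ofi-nccl
# At run time NCCL loads the resulting libnccl-net.so from the library search path.
```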

See the individual pages for each library for information on how to use and best configure the libraries.

<div class="grid cards" markdown>

- __Low Level__

    Learn about the low-level networking library libfabric, and how to use it in uenv and containers.

    [:octicons-arrow-right-24: libfabric][ref-communication-libfabric]

</div>

<div class="grid cards" markdown>

- __MPI__

    Cray MPICH is the most optimized and best tested MPI implementation on Alps, and is used by uenv.

    [:octicons-arrow-right-24: Cray MPICH][ref-communication-cray-mpich]

    For compatibility in containers:

    [:octicons-arrow-right-24: MPICH][ref-communication-mpich]

    OpenMPI can also be built in containers or in uenv:

    [:octicons-arrow-right-24: OpenMPI][ref-communication-openmpi]

</div>

<div class="grid cards" markdown>

- __Machine Learning__

    Communication libraries used by ML tools like Torch, and some simulation codes.

    [:octicons-arrow-right-24: NCCL][ref-communication-nccl]

    [:octicons-arrow-right-24: NVSHMEM][ref-communication-nvshmem]

</div>
Libfabric page (modified, @@ -1,24 +1,77 @@)

[](){#ref-communication-libfabric}
# Libfabric

[Libfabric](https://ofiwg.github.io/libfabric/), or Open Fabrics Interfaces (OFI), is a low-level networking library that provides an abstract interface for networks.
Libfabric has backends for different network types, and is the interface chosen by HPE for the [Slingshot network on Alps][ref-alps-hsn], and by AWS for their [EFA network interface](https://aws.amazon.com/hpc/efa/).

To fully take advantage of the network on Alps:

* libfabric and its dependencies must be available in your environment (uenv or container);
* and communication libraries in your environment, like Cray MPICH, OpenMPI, NCCL, and NVSHMEM, have to be built or configured to use libfabric (a quick run-time check is sketched below).
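As a hedged sketch, the `fi_info` utility shipped with libfabric can be used to check which providers are visible in a given environment; on an Alps compute node the Slingshot (`cxi`) provider should be listed:

```bash
# Illustrative only: list the available libfabric providers, filtered to the cxi provider.
fi_info -p cxi
```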

!!! question "What about UCX?"
    [Unified Communication X (UCX)](https://openucx.org/) is a low-level library that targets the same layer as libfabric.
    Specifically, it provides an open, standards-based, networking API.
    By targeting UCX and libfabric, MPI and NCCL do not need to implement low-level support for each type of network hardware.

    **There is no UCX back end for the Slingshot network on Alps**, and pre-built software (for example conda packages and containers) often provides versions of MPI built for UCX only.
    Running these images and packages on Alps will lead to very poor network performance or errors.

[](){#ref-communication-libfabric-using}
## Using libfabric

[](){#ref-communication-libfabric-uenv}
### uenv

If you are using a uenv provided by CSCS, such as [prgenv-gnu][ref-uenv-prgenv-gnu], [Cray MPICH][ref-communication-cray-mpich] is linked to libfabric and the high-speed network will be used.
No changes are required in applications.
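A hedged usage sketch; the uenv image name, version, and view below are placeholders, not taken from this PR:

```bash
# Illustrative only: start a uenv session and launch an MPI application from it;
# "prgenv-gnu/25.6:v1" is a placeholder version string.
uenv start prgenv-gnu/25.6:v1 --view=default
srun -N 2 --ntasks-per-node=4 ./my_mpi_app
```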

[](){#ref-communication-libfabric-ce}
### Containers

The approach is to install libfabric inside the container, along with MPI and NCCL implementations linked against it.
At runtime, the [container engine][ref-container-engine] [CXI hook][ref-ce-cxi-hook] will replace the libfabric libraries inside the container with the corresponding libraries on the host system.
This ensures access to the Slingshot interconnect.
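A hedged sketch of enabling the hook at run time with the Container Engine; the EDF file name, image reference, and annotation key follow the usual Container Engine conventions and are assumptions here, not taken from this PR:

```toml
# ~/.edf/comm-test.toml (illustrative only; image reference is a placeholder)
image = "registry.example.org/my-gh200-base:latest"

[annotations]
com.hooks.cxi.enabled = "true"
```

The job would then be launched with `srun --environment=comm-test ...`, at which point the hook swaps in the host libfabric.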

!!! note "Use NVIDIA containers for the gh200 nodes"
    Container images provided by NVIDIA, which come with CUDA, NCCL, and other commonly used libraries, are recommended as the base layer for building a container environment on the [gh200][ref-alps-gh200-node] and [a100][ref-alps-a100-node] nodes.

    The version of CUDA, NCCL, and compilers in the container can be used once libfabric has been installed.
    Other communication libraries, like MPI and NVSHMEM, provided in the containers can't be used directly.
    Instead, they have to be installed in the container and linked against libfabric.

!!! example "Installing libfabric in a container for NVIDIA nodes"
    The following lines demonstrate how to configure and install libfabric in a Containerfile.
    Communication frameworks are built with explicit support for CUDA and GDRCopy.

    Some additional features are enabled to increase the portability of the container to non-Alps systems:

    - The libfabric [EFA](https://aws.amazon.com/hpc/efa/) provider is configured with the `--enable-efa` flag, for compatibility with AWS infrastructure.
    - The UCX communication framework is added to facilitate building a broader set of software (e.g. some OpenSHMEM implementations) and for optimized InfiniBand network support.

    Note that it is assumed that CUDA has already been installed on the system.

    ```Dockerfile
    --8<-- "docs/software/communication/dockerfiles/libfabric"
    --8<-- "docs/software/communication/dockerfiles/ucx"
    ```

An example Containerfile that installs libfabric in an NVIDIA container can be expanded below:

??? note "The full Containerfile for GH200"
    The Containerfile below is based on an NVIDIA CUDA image, which provides a complete CUDA installation and NCCL.

    ```
    --8<-- "docs/software/communication/dockerfiles/base"
    --8<-- "docs/software/communication/dockerfiles/libfabric"
    --8<-- "docs/software/communication/dockerfiles/ucx"
    ```

[](){#ref-communication-libfabric-performance}
## Tuning libfabric

Tuning libfabric (particularly together with [Cray MPICH][ref-communication-cray-mpich], [OpenMPI][ref-communication-openmpi], and [NCCL][ref-communication-nccl]) depends on many factors, including the application, workload, and system.
For a comprehensive overview of libfabric options for the CXI provider (the provider for the Slingshot network), see the [`fi_cxi` man pages](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_cxi.7.html).
Note that the exact version deployed on Alps may differ, and not all options may be applicable on Alps.
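As a hedged example, CXI provider options are exposed as `FI_CXI_*` environment variables that can be set in a job script; the variable names below come from the `fi_cxi` man page, but the values are illustrative only and not recommendations:

```bash
# Illustrative only: adjust receive-side message matching and the default completion queue size.
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_DEFAULT_CQ_SIZE=131072
srun ./my_app
```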

See the [Cray MPICH known issues page][ref-communication-cray-mpich-known-issues] for issues when using Cray MPICH together with libfabric.

!!! todo
    More options?