Using the HPC
OSU's High Performance Compute Cluster (HPC) is a large compute server maintained by the College of Engineering. Our cluster maintains many nodes for CPU and GPU-intensive workflows, with partitions specifically optimized for deep learning workflows, such as Nvidia's dgx2 servers.
In order to use OSU's High-Performance Compute Cluster (HPC), you first need the appropriate permissions added to your engineering account.
If you do not yet have an engineering account with OSU, please register for one: https://teach.engr.oregonstate.edu.
A tutorial for creating TEACH accounts can be found here: https://it.engineering.oregonstate.edu/get-engr-account
After creating an engineering account, you will need to enable HPC access. From TEACH, under Account Tools, select High Performance Computing (HPC). If your HPC account is not already enabled, click Create to add HPC permissions to your account.
The HPC is configured with three submit nodes, which allow HPC users to run interactive sessions and dispatch jobs to one or more compute nodes without directly connecting to each specific node. To access HPC resources, first you will need to create an SSH connection to one of these submit nodes. You will connect to one of the following hosts:
- submit-a.hpc.engr.oregonstate.edu
- submit-b.hpc.engr.oregonstate.edu
- submit-c.hpc.engr.oregonstate.edu
To connect to one of these nodes, simply run ssh {ONID}@submit-{a,b,c}.hpc.engr.oregonstate.edu, where your ONID is the username assigned to your email (e.g. for Aidan Beery, the email would be beerya@oregonstate.edu, so the ONID would be beerya). Note that this connection will only work if you are on campus and connected to OSU's network (either wired or via eduroam). If you are off-campus, see Connecting Outside of Campus.
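For example, using the ONID from above, the full command would be:
ssh beerya@submit-a.hpc.engr.oregonstate.edu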
On login, you will be prompted for your ONID password and for Duo authentication. This will be required for every SSH connection you make, which can become very tedious. To eliminate the need for password entry and streamline the logon process, you can set up SSH keys between your client and OSU's servers. These SSH keys will work for any server managed by the College of Engineering (flip, pelican, babylon, submit, os1, etc.).
To add SSH keys, you will first need to generate an SSH key (or use an existing one). We assume for this tutorial that you have never created an SSH key before. On Windows 10 or later, macOS, or Linux, run ssh-keygen -t ed25519. We recommend saving the key in the default location, which will be ~/.ssh/id_ed25519 for macOS and Linux users and C:\Users\{USERNAME}\.ssh\id_ed25519 for Windows users. An SSH key passphrase is optional, but can improve security for sensitive applications.
After this, you will want to add your host's new SSH public key to the target server. For Linux and macOS, the process is relatively simple:
Use ssh-copy-id -i ~/.ssh/id_ed25519.pub {ONID}@access.engr.oregonstate.edu. The specific host we copy our SSH key to does not matter, as the engineering filespace is shared via nfs across all engineering-managed nodes. So, copying to access will enable passwordless login to submit and more. You will be prompted for your ONID password, as well as for Duo authentication. After this, test your new SSH key with ssh {ONID}@access.engr.oregonstate.edu. You should automatically connect to access without requiring a password or Duo prompt.
Windows does not ship with ssh-copy-id by default. At this point, we would encourage you to look into using WSL2 for your development workflow. However, without ssh-copy-id, the following is the most straightforward way to copy an SSH key:
- Run cat ~/.ssh/id_ed25519.pub and copy the resulting public key to your system clipboard
- Connect to access with ssh {ONID}@access.engr.oregonstate.edu, providing password and Duo authorization
- Open ~/.ssh/authorized_keys in your preferred editor (e.g. nano, vi, emacs)
- Paste your copied public key into the authorized keys file
- Save the file (:x for vi/vim)
- Disconnect from your SSH connection and verify that new connections don't require a password or Duo authorization
For security reasons, OSU's HPC is only accessible from devices within OSU's network. However, if you are working from home or operating remotely, there are procedures to follow to connect your device to a publicly accessible gateway hosted on OSU's network first, and then connect to the HPC via this proxy.
The most commonly supported way of remotely connecting to OSU's subnet is to use the OSU VPN. Please see the following article for a step-by-step guide on installing the Cisco AnyConnect VPN client and authenticating your device with OSU's VPN server: https://oregonstate.teamdynamix.com/TDClient/1935/Portal/KB/ArticleDet?ID=51154
Though it is the most common method to route your device's traffic through OSU's network, using the VPN is not the recommended method for connecting to the HPC remotely. We strongly encourage lab members to set up an SSH ProxyJump to connect to the HPC instead, as the VPN can cause issues with:
- Local web development (when hosting a web server on your LAN)
- Local embedded/on-device development (when connected to a device on your LAN)
- Third-party Linux package repositories (which can sometimes be blocked by OSU's firewall)
The VPN also increases overall resource utilization on your device. OSU EECS maintains multiple servers with permissive firewalls, allowing students to connect to EECS compute resources without needing to be on OSU's subnet. In principle, we can connect our remote client to one of these access nodes, and then from that server, SSH to a restricted node inside OSU's network, such as the HPC submit nodes. To connect this way, use the following command:
ssh -J {ONID}@access.engr.oregonstate.edu {ONID}@submit-{a,b,c}.hpc.engr.oregonstate.edu
The -J option for ssh "jumps" our connection to submit by using access as a proxy. Since access is accessible by devices not on OSU's subnet, we indirectly connect to the HPC without the use of a VPN.
To simplify SSH-based workflows, it is advised to set up an ssh config (~/.ssh/config). Here is an example ssh config entry for the HPC.
Host flip
Hostname flip.engr.oregonstate.edu
User {ONID}
LogLevel error
Port 22
Host hpc
Hostname submit-b.hpc.engr.oregonstate.edu
User {ONID}
LogLevel error
Port 22
After configuring your SSH config, you can connect to submit-b simply with ssh hpc.
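If you primarily work off campus, the ProxyJump described in Connecting Outside of Campus can also be written into your config. The entry below is a sketch (the host alias hpc-remote is arbitrary); adjust the submit node letter as desired:
Host hpc-remote
Hostname submit-b.hpc.engr.oregonstate.edu
User {ONID}
ProxyJump {ONID}@access.engr.oregonstate.edu
LogLevel error
Port 22
With this entry, ssh hpc-remote connects through access automatically, even when you are not on OSU's network.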
Slurm is a workload management and job scheduling system. It allows for compute resources to be shared by a variety of users. Instead of provisioning each user a virtual machine, Slurm uses a job queue system to allow users to dispatch a job to one or many compute nodes within the cluster, and execute that user's job once there is available capacity on the node(s). A user can interact with the HPC via Slurm with either an interactive session (running on a specified node) or by submitting a batch job to run in the background (functionally similar to a detached process).
Our lab has exclusive ownership of two nodes on OSU's HPC: cn-m-1 and cn-m-2.
cn-m-1
- CPU: 2x Intel Xeon Gold 5222, 4 cores @ 3.10 GHz
- GPU: 6x Nvidia Tesla T4, 15GB VRAM
- RAM: 192GB
cn-m-2
- CPU: 2x Intel Xeon Silver 4215, 8 cores @ 2.50 GHz
- GPU: 2x Nvidia Quadro RTX 6000, 24GB VRAM
- RAM: 192GB
In addition, OSU research students have access to the dgx nodes. Nvidia's DGX servers are nodes designed specifically with AI/ML workloads in mind, and feature very powerful GPU configurations with very large pools of VRAM for storing and computing large datasets or models. Each user account can issue a job onto the dgx cluster using a maximum of 4 V100 GPUs.
As of 01/29/2024, submit nodes, cn-m-1, and cn-m-2 are running CentOS Linux 7.9. In the future, these will be upgraded to Rocky Linux 9 to match the current images deployed on flip and other student-facing OSU EECS compute resources.
You can find more information on OSU's HPC performance capabilities here: https://it.engineering.oregonstate.edu/hpc/about-cluster
Before scheduling a job, it is often useful to know the current status of your selected nodes so you can select nodes with lower utilization.
To see what jobs are currently running on our lab's partition, use squeue -p soundbendor. squeue lists all active jobs in a given partition. A partition is a group of nodes in the cluster with specific properties or permissions. soundbendor is the partition consisting of the nodes cn-m-1 and cn-m-2. Only members of Soundbendor Lab have access to these nodes. Other partitions include dgx2 and dgxh.
Sample output of squeue -p soundbendor
If you would like to know how saturated an individual node or partition is, you can use nodestat. This will tell you the total number of CPUs, GPUs, and amount of RAM a node (or group of nodes) has, as well as how much of each is currently in use. This is particularly useful when trying to use the DGX nodes, as these are frequently under heavy utilization, and it is desirable to select the node in the dgx2 partition that currently has the most GPUs available.
Sample output of nodestat dgx2
For more information about retrieving the status of a given node or partition, see here: https://slurm.schedmd.com/sinfo.html
If you would like to interact with a terminal on a node in order to run, test, and monitor a given process, you can do so by creating an interactive session. Slurm defines an interactive session as a job with a tty. Functionally, for the user, this emulates making an ssh connection to a specified compute resource or node. The advantage of using interactive sessions over direct ssh connections (besides security) is that each interactive session is only allocated a fixed set of resources, allowing multiple concurrent jobs to be managed by a given node.
To create an interactive session, we use the following command:
srun -w {NODE_NAME} -p {PARTITION} -A {ACCOUNT} -t {TIMESTAMP} -c {NUM_CORE} --mem={RAM} --gres=gpu:{NUM_GPU} --pty {SHELL}
- -w: Specify which node(s) you would like to run your session on. For example, cn-m-1.
- -p: Partition name. Common partitions you will use:
  - soundbendor: For accessing cn-m-1 and cn-m-2
  - dgx2: For GPU-intensive workloads
  - dgxh: Experimental as of 01/29/2024. Equipped with significantly more powerful H100 GPUs.
- -A: Account or user group.
  - For the soundbendor partition, this is soundbendor
  - For dgx2, this is eecs.
  - You may be added to new accounts for specific courses; these accounts may have restrictions which vary from the default eecs user group.
- -t: Specify the time for which your interactive session should stay alive. Optional. Maximum allowable time on cn-m-1 is 3-00:00:00 (72 hrs).
- -c: Specify the number of cores you would like your job to have. Optional. Default is 1 core.
- --mem: Specify the amount of RAM you would like to provision for your job. Optional. For example, --mem=40G
- --gres=gpu: Specify the number of GPUs you would like to request on your node. Optional. Default is 0.
  - dgx2 limits provisioning to 4 GPUs per user.
- --pty: Specify your preferred shell. For most users, this is bash.
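Putting the flags together, a typical interactive session on cn-m-1 might be started as follows (the resource values below are only an example):
srun -w cn-m-1 -p soundbendor -A soundbendor -t 12:00:00 -c 4 --mem=40G --gres=gpu:1 --pty bash
Once the resources are allocated, you will be dropped into a bash shell running on cn-m-1; type exit to end the session and release the resources.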
To get more information about creating interactive sessions, see here: https://slurm.schedmd.com/srun.html
Interactive sessions provide a familiar terminal interface for interacting with nodes on the HPC, but they only persist for the duration of your ssh connection to the submit node. For many ML experiments, you may need to train a model for multiple days, during which time you may want to use your terminal for other tasks. Furthermore, it is challenging to manage multiple interactive sessions at a time, which limits one's ability to run multiple experiments simultaneously.
To address this, it is preferable to use sbatch to schedule jobs to run in the background. sbatch files are used to specify the parameters of a job (e.g. what nodes, resources, and account details to use for your job) and allow a job to join a queue. Once the requested resources are available, your job will be executed on the requested node(s), and the details of your program will be stored in a log file in the directory from which you execute your sbatch command.
Sbatch files are shell scripts. Conventionally, they are named with .sbatch extensions. The parameters for an sbatch file are similar to the command line flags for an srun command.
An example sbatch file, for running a Python program on cn-m-1 with 1 GPU
#!/bin/bash
#SBATCH -w cn-m-1
#SBATCH -p soundbendor
#SBATCH -A soundbendor
#SBATCH --job-name=<COOL_RESEARCH_NAME>
#SBATCH -t 3-00:00:00
#SBATCH -c 4
#SBATCH --mem=40G
#SBATCH --gres=gpu:1
#SBATCH --export=ALL
source <MY_CONDA_INSTALL>/bin/activate
source activate <CONDA_ENV_NAME>
python3 <MY_COOL_RESEARCH_PROJECT>
Sbatch files begin with a series of sbatch directives, which define the parameters of our job's execution. Here, we will follow the parameter standards described in Starting an Interactive Session.
Once the desired execution parameters for the job have been specified, the execution context will then enter the environment in which the program code will run. From this point, any arbitrary bash command can be executed to prepare your experiment. As demonstrated above, we activate our package manager, Conda. Conda is typically installed to your user directory (/nfs/stak/users/{ONID}/miniconda) or, if you follow the steps in Configuring your Development Environment, your user directory in the Soundbendor filespace (/nfs/guille/eecs_research/soundbendor/{ONID}).
source <MY_CONDA_INSTALL>/bin/activate
Because our sbatch job does not run in the same environment as our user's shell, we need to forward our environment variables to the job (done using #SBATCH --export=ALL) and we must set up and configure our Conda installation (which is typically handled by our user's .bashrc when launching a new login shell or an interactive session launched with srun).
source activate <CONDA_ENV_NAME>
After initializing Conda, you will want to activate the Conda environment associated with your experiment, so that your source code has access to its prerequisite dependencies.
python3 <MY_COOL_RESEARCH_PROJECT>
This is the command to run your source code. Here, you should add any command line arguments or other parameters necessary to your experiment. The exact parameters and environmental setup for your experiment will depend on your specific setup and requirements.
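To actually queue the job, pass your sbatch file to sbatch from the directory where you want the log written, and use squeue/scancel to monitor or cancel it (the filename below is hypothetical):
sbatch cool_research.sbatch
squeue -u $(whoami)    # check the status of your queued and running jobs
scancel {JOB_ID}       # cancel a job using the ID shown by squeue
By default, output from your program is written to slurm-{JOB_ID}.out in the submission directory.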
For more information about using sbatch, see the following guide: https://slurm.schedmd.com/sbatch.html
Setting up a remote development environment differs considerably from using a local machine for both writing and running your experiments. Because this system is not managed by you, you must consider a workflow for deploying and managing both source code and environments on the HPC. A few options are available:
- Develop remotely using the VSCode SSH Extension [Not Recommended]
  - The VSCode SSH extension is unreliable, and the VSCode Server which runs remotely on the HPC is prone to leaving zombie processes running in your user account, requiring occasional force-kills of all of your user processes via the TEACH interface.
- Develop remotely using a terminal-native editor (e.g. vim, emacs)
  - Recommended for those with experience with vim or emacs, or those willing to learn.
- Develop locally using your preferred editor and push your code to the HPC for deployment (see the sketch after this list).
  - This option affords the most flexibility; however, it requires you to maintain all of your projects in a Github repository and deploy your environments to the server after building them locally (which can be accomplished using environment.yml files in Conda).
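For the third option, a minimal deployment loop might look like the following, run from within the Soundbendor filespace described in the next section (the repository URL and project name are placeholders):
cd /nfs/guille/eecs_research/soundbendor/{ONID}
git clone https://github.com/{GITHUB_USERNAME}/{COOL_RESEARCH_PROJECT}.git
Then, after pushing changes from your local machine, deploy them on the HPC with:
cd /nfs/guille/eecs_research/soundbendor/{ONID}/{COOL_RESEARCH_PROJECT}
git pull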
Your user directory is shared across all EECS servers via the NFS file share. This means that any files you upload to, for example, flip will also be visible on submit-c and cn-m-1. The directory associated with your user account, /nfs/stak/users/{ONID}, is limited to 15 GB of file storage. For many of the tasks you will encounter in your ML workflow, such as building experiments with large numbers of external libraries, working with tensor processing frameworks such as Pytorch or TensorFlow, or storing datasets on the HPC, this userspace storage restriction will not be adequate.
In order to circumvent this issue, Soundbendor Lab maintains a storage share on the compute cluster which is specific to our lab. This file directory can be found at /nfs/guille/eecs_research/soundbendor. In this directory, there are many folders used for collaborative projects, dataset storage, archives of previous students' source code, and other lab assets. We recommend creating a new folder for your own use, which you should name using your ONID/Unix account name. In this directory, we advise you to host all large binaries, datasets specific to your research, experiment source code, and results.
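For example (the symlink is optional and purely for convenience):
mkdir /nfs/guille/eecs_research/soundbendor/{ONID}
ln -s /nfs/guille/eecs_research/soundbendor/{ONID} ~/soundbendor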
Conda is a package management tool commonly used for data science and machine learning workflows in Python. Unlike Python's default package manager, pip, Conda allows you to create project-specific environments, eliminating conflicting requirements caused by global package installations and increasing portability by conveniently tracking dependency versions in a central location per project.
It is recommended that you install Miniconda, a slim version of Conda, to reduce overhead from the installation of unnecessary packages. The following installation procedure will automatically install Miniconda3 to your ONID directory in the Soundbendor file storage. Failing to install Miniconda to the Soundbendor directory will eventually result in exceeding the storage limit imposed on your personal user directory, as Conda environments can often be multiple gigabytes per project.
mkdir -p /nfs/guille/eecs_research/soundbendor/{ONID}/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /nfs/guille/eecs_research/soundbendor/{ONID}/miniconda3/miniconda.sh
bash /nfs/guille/eecs_research/soundbendor/{ONID}/miniconda3/miniconda.sh -b -u -p /nfs/guille/eecs_research/soundbendor/{ONID}/miniconda3
rm -rf /nfs/guille/eecs_research/soundbendor/{ONID}/miniconda3/miniconda.sh
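After the installer finishes, you will likely want to initialize Conda for your shell so that conda is available on future logins (assuming the install path used above):
source /nfs/guille/eecs_research/soundbendor/{ONID}/miniconda3/bin/activate
conda init bash
conda init bash appends the necessary setup to your ~/.bashrc; log out and back in (or source ~/.bashrc) for it to take effect.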
A few basic Conda commands:
- conda create --name {COOL_PROJECT_NAME} python=3.10: Creates a new Conda environment using Python 3.10
- conda activate {COOL_PROJECT_NAME}: Activate your recently created environment
- conda install {PACKAGE_NAME}: Install a new package to your local environment
- conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia: Install the current Pytorch dependencies to your environment
- conda env export --from-history > environment.yml: Export a cross-platform list of dependencies from your currently activated Conda environment and save them to an environment.yml file
- conda env create -n {COOL_PROJECT_NAME} --file environment.yml: Create a new environment using the environment.yml file exported from a previously created Conda environment
- conda env list or conda info --envs: See a list of all your Conda environments.
While working with Conda, you will find situations where you want to maintain an identical environment on your local development environment and on the HPC. To do this, we recommend adding an environment.yml file to your Github-tracked project repository which lists all of the dependencies required for your project. Even if you are executing all of your experiments on the HPC, it is very beneficial to keep a live Conda environment on your local development machine for linting and type checking with various Python development tools.
To export your current environment, use conda env export --from-history > environment.yml. This will take all of the dependencies your current project requires and store them in a YAML file. From your development machine, you can then commit this environment file to version control.
On the HPC, pull your updated source code, including the environment.yml file. From there, ensuring you have no other Conda environments active at the time, run conda env create -n {COOL_PROJECT_NAME} --file environment.yml. This will rebuild your project environment on the HPC.
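Putting both halves together, the round trip looks roughly like this (the project name is a placeholder):
conda env export --from-history > environment.yml    # on your local machine, with your project environment active
git add environment.yml && git commit -m "Update environment" && git push
git pull                                              # on the HPC, with no Conda environment active
conda env create -n {COOL_PROJECT_NAME} --file environment.yml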
Conda is a powerful package management tool, but it is limited by a slow, single-threaded dependency resolution algorithm. Mamba is an improved package manager which uses the same package distribution channels as Conda, but features much faster dependency resolution and installation. It is recommended for users who are comfortable troubleshooting issues in a remote development environment, as the available documentation for Mamba is less extensive than that of the widely-supported Conda.
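If you would like to try it, Mamba can be installed into your base Conda environment and then used as a drop-in replacement for most conda commands (a sketch, assuming the Miniconda setup described above):
conda install -n base -c conda-forge mamba
mamba create -n {COOL_PROJECT_NAME} python=3.10
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia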
If you are unable to use the slurm binary:
- Verify that you are connected to one of the submit nodes.
- Load the slurm module using module load slurm.
- HPC Manager Rob Yelle hosts frequent Intro to HPC training sessions, which are mandatory.
- An explanation of Slurm: https://slurm.schedmd.com/quickstart.html
- Slurm Docs: sinfo: https://slurm.schedmd.com/sinfo.html
- Slurm Docs: srun: https://slurm.schedmd.com/srun.html
- Slurm Docs: sbatch: https://slurm.schedmd.com/sbatch.html
- OSU's HPC Onboarding: https://it.engineering.oregonstate.edu/hpc
- OSU's Slurm how-to: https://it.engineering.oregonstate.edu/hpc/slurm-howto
- Hardware available on the HPC: https://it.engineering.oregonstate.edu/hpc/about-cluster
- Github guide to SSH keys: https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent
- Conda Cheatsheet: https://docs.conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf
- Installing Conda: https://conda.io/projects/conda/en/latest/user-guide/install/index.html
- Mamba, a fast Conda alternative: https://github.com/mamba-org/mamba
- Mamba documentation: https://mamba.readthedocs.io/en/latest/
- Managing Conda Environments: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#activating-an-environment

