Getting Started with SLURM and Clariden - CSCS KB
Instructions for using the Clariden cluster at CSCS, working with SLURM, and creating and running containers in this environment. Tutorials: the CSCS Knowledge Base (https://confluence.cscs.ch/spaces/KB/pages/272793684/CSCS+Knowledge+Base) and the Alps Clariden User Guide (https://confluence.cscs.ch/spaces/KB/pages/750223402/Alps+Clariden+User+Guide)
You should send your GitHub username to your supervisor so they can add you to the group repository
NOTE: For support visit https://support.cscs.ch where you can find tutorials, join the CSCS Slack for questions, and submit tickets if things are not working
[1/7] Set Up Your Access to Clariden
Clariden is the supercomputer from CSCS that we mainly use
- Check your e-mail for an invite, "Invitation to Join CSCS", and complete the registration
- Wait until your account is manually confirmed; you should receive a second e-mail to set up your password and OTP (One-Time Password, using an authenticator app). Confirm your account by logging into https://portal.cscs.ch
- Run the script for SSH key setup and connection to Clariden
  - Pull the setup and configuration script
    ```
    curl -sL https://raw.githubusercontent.com/swiss-ai/reasoning_getting-started/main/{cscs-cl_setup.sh,user.env} -OO && chmod +x cscs-cl_setup.sh && ./cscs-cl_setup.sh
    ```
- Add your `WANDB_API_KEY`, `HF_TOKEN`, Git credentials, and any other env variables to `user.env`
  - `LOCAL_GIT_SSH_KEYPATH` is the path to your local private Git SSH key (not .pub), e.g. `$HOME/.ssh/GitKey`. If you haven't generated one yet, see https://www.youtube.com/watch?v=DuMcXyQkj5g, then add the public key at https://github.com/settings/ssh/new. You can test that it works with `ssh -T git@github.com`
  - You may need to add the key to your ssh config (replacing `<key_name>`, not .pub)
    ```
    echo -e "\nIdentityFile $HOME/.ssh/<key_name>" >> $HOME/.ssh/config
    ```
  - You can find your Git email at https://github.com/settings/emails; if you want a private email, select 'Keep my email addresses private' and use the email in the format `<ID>+<username>@users.noreply.github.com`
  - NOTE: If you want to move `user.env`, make sure to run `./cscs-cl_setup.sh` again in the new directory
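  - If you have never created a Git SSH key, a minimal sketch (assuming the ed25519 key type and the filename `GitKey`, both of which you can change) is:
    ```
    # Generate a new key pair; use your Git email from https://github.com/settings/emails
    ssh-keygen -t ed25519 -C "<ID>+<username>@users.noreply.github.com" -f $HOME/.ssh/GitKey
    # Print the public key and paste it at https://github.com/settings/ssh/new
    cat $HOME/.ssh/GitKey.pub
    # Verify that GitHub accepts the key
    ssh -T git@github.com
    ```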
- To connect to Clariden, simply run `cscs-cl`
  - NOTE: Cluster SSH keys are valid for 24h, after which running `cscs-cl` will automatically generate new keys
  - If you were able to log in but suddenly get `Too many authentication failures` when logging into Clariden, you might have some deprecated keys in your ssh-agent. Try removing all ssh-agent identities (keys) and connecting again
    ```
    ssh-add -D
    cscs-cl
    ```
- The preinstalled packages, like Python, can be outdated and limiting. It's a good idea to work with your own miniconda environment
  - Install miniconda by running the following commands. Clariden nodes use the `aarch64` (ARM 64-bit) architecture, meaning we can't use `x86_64` builds, which is likely what your personal machine runs (you can check on Linux using `uname -m`)
    NOTE: Answer "no" when prompted "Do you wish to update your shell profile to automatically initialize conda?"
    ```
    cd && wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh
    bash ./Miniconda3-latest-Linux-aarch64.sh
    rm ./Miniconda3-latest-Linux-aarch64.sh
    ```
  - To make conda available in your shell, run the following to add it to your shell rc, e.g. `$HOME/{.bashrc, .zshrc}`. If you picked a different path for conda than the default `~/miniconda3`, change the path accordingly
    ```
    echo -e "\nsource ~/miniconda3/etc/profile.d/conda.sh" >> $HOME/.${SHELL##*/}rc
    source $HOME/.${SHELL##*/}rc
    ```
  - You can manually enable and disable the conda env using `conda activate` and `conda deactivate`
  - VS Code will usually handle auto-activation of conda envs. If you want a specific conda env automatically activated on the CLI, run (replace `base` with `<env_name>` for another env)
    ```
    echo -e "\nconda activate base" >> $HOME/.${SHELL##*/}rc
    ```
  - If your conda env (e.g. `base`) is activated, you should see it in the context indicator of your terminal: `(base) [clariden][<user>@clariden-ln001 ~]$`
  - You can now install any packages you need
    ```
    pip install --upgrade pip setuptools ...
    ```
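  - If you prefer a separate env over `base`, a minimal sketch (the name `myenv` and the Python version are just examples) is:
    ```
    conda create -n myenv python=3.11 -y   # create a new named environment
    conda activate myenv                   # switch to it
    pip install --upgrade pip setuptools   # install what you need inside it
    conda deactivate                       # switch back off when done
    ```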
[2/7] Persistent Storage
Just connecting to Clariden via `cscs-cl` will give you a login node on `/users/$USER`, which has only 50GB of storage and should only be used for configuration files. Any files created during execution on a compute node (discussed later) will be lost once the session ends. For persistent storage, the Clariden cluster has two mounted storage partitions:
- `/iopsstor` is smaller and intended for faster, short-term access (3PB shared across all users)
  - Your personal scratch partition is on `/iopsstor/scratch/cscs/$USER`; for easy access you can add a symbolic link to your home directory
    ```
    ln -s /iopsstor/scratch/cscs/$USER/ $HOME/scratch
    ```
- `/capstor` is slower but larger and intended for large files (150TB and 1M inodes (files) per user)
  - Your personal storage partition is on `/capstor/scratch/cscs/$USER`; you can link it the same way
    ```
    ln -s /capstor/scratch/cscs/$USER/ $HOME/store
    ```
  - DO NOT write to capstor from compute nodes during a job; always write to iopsstor. capstor is not meant for quick reading and writing of many files
IMPORTANT: Files on `/iopsstor/scratch` and `/capstor/scratch` are cleaned after 30 days. Remove temporary files and transfer important data to the group capstor (NOT the personal capstor discussed above; the group storage is covered in 'Reasoning Projects Framework')
You can check your usage quota by logging into ela.cscs.ch (it currently doesn't work on Clariden)
```
ssh ela "quota"
```
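To see what is about to be cleaned up, a small sketch (the 25-day threshold is an arbitrary example) run from the login node:
```
# List files on your scratch that have not been modified for more than 25 days
find /iopsstor/scratch/cscs/$USER -type f -mtime +25
```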
[3/7] SLURM Basics
Clariden uses SLURM to allocate and schedule compute resources across the cluster for efficient and fair usage among users. Example SLURM commands:
- `sinfo -a -l`
  Check available partitions (queues for jobs to run) and their nodes. `-a` shows all partitions, `-l` uses the long output format
- NOTE: Never run compute-intensive jobs on the login node!
  For quick jobs, use
  ```
  srun --account=a-a06 --time=01:00 -p debug --pty bash -c '<command>'
  ```
  - `--account` is mandatory and can be checked in CSCS Projects (a06 for LLMs) or with `id -Gn`
  - `--time=01:00` specifies the runtime (here 1 minute; shorter jobs get priority)
  - `-p` specifies the partition (`debug` is usually for quick tests, max. 30min; otherwise `normal`, max. 24h)
  - `--pty` starts an interactive session
  - `bash -c '<command>'` will run the subsequent command with bash

  You can get an interactive compute node for 30min (such as to process data) with
  ```
  srun --account=a-a06 -p debug --pty bash
  ```
  For experiments you should use `sbatch` (see [5/7]); a short end-to-end interactive example follows at the end of this command list
- To replace `srun --account=a-$(id -Gn) -p debug --pty` with a shorthand `sdebug` command, run the following (`--container-writable` allows you to write when inside a container, discussed in [4/7])
  ```
  echo -e "\nalias sdebug='srun --account=a-\$(id -Gn) -p debug --pty --container-writable \"\$@\"'" >> $HOME/.${SHELL##*/}rc && source $HOME/.${SHELL##*/}rc
  ```
  Now you can simply run
  - `sdebug bash` to get an interactive compute node
  - `sdebug <options>` to add options such as `-t <MM:SS>` for the time limit, or `bash -c '<command>'` for commands
for commands -
squeue --me
Show your own jobs, their<JOBID>
, and<NODELIST>
. You can prependwatch -n <interval>
to refresh the command every<interval>
seconds -
- `scancel <JOBID>`
  Cancel an individual `<JOBID>`
- `scancel --me`
  Cancel all your jobs
- `scontrol show job <JOBID>`
  See more details about your job after completion
- `scontrol show nodes <NODELIST>`
  See specific details about a node, usually `nid00NNNN`
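Putting these together, a minimal end-to-end sketch of an interactive debugging session (the time limit and commands are just examples):
```
# Ask for a 10-minute interactive shell on the debug partition
srun --account=a-a06 --time=10:00 -p debug --pty bash
# From another terminal on the login node, watch your queue, refreshing every 5 seconds
watch -n 5 squeue --me
# Cancel a job you no longer need, taking <JOBID> from the squeue output
scancel <JOBID>
```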
[4/7] Containers and Env Files
Clariden containers run with Enroot for consistent and reproducible environments, making it possible to run Docker images without requiring elevated privileges. Containers are defined by `.toml` environment definition files (EDF), which specify the container image to use along with the filesystem paths to mount inside it
- Create a simple `my_env.toml` file in `$HOME/.edf/` (this allows you to call the env file without the full path)
  ```
  mkdir -p $HOME/.edf
  cat > $HOME/.edf/my_env.toml << EOF
  image = "/capstor/store/cscs/swissai/a06/containers/nanotron_pretrain/latest/nanotron_pretrain.sqsh"
  mounts = ["/capstor", "/iopsstor", "/users"]
  workdir = "/workspace"

  [annotations]
  com.hooks.aws_ofi_nccl.enabled = "true"
  com.hooks.aws_ofi_nccl.variant = "cuda12"
  EOF
  ```
  The annotations are arguments that load the proper NCCL plugin CSCS has prepared for us
  NOTE: EDF files expect realpaths (full paths), so `$GLOBAL_VARS` are NOT allowed, e.g. '$HOME', '~', and '$USER' are NOT allowed; use `/users/<USER>` instead, replacing `<USER>` with your actual username. To figure out the realpath, run `pwd` in any directory
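  For example, to print the realpath of your home directory for use in the EDF file:
  ```
  cd $HOME && pwd   # prints something like /users/<your_username>
  ```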
in any directory -
Launch an interactive session using the env file
sdebug --environment=my_env bash
NOTE: Only files saved in mounted paths (
my_env
example/capstor
,/iopsstor
, and/users
) are persistent, changes to other paths like/workspace
will be lost once the container session ends -
- You can also change the working directory in your `~/.edf/my_env.toml` file to your `$HOME`, either manually (replacing `<USERNAME>`):
  ```
  workdir = "/users/<USERNAME>"
  ```
  or with `sed`:
  ```
  sed -i.bak "s|^workdir = .*|workdir = \"/users/$USER\"|" $HOME/.edf/my_env.toml && rm $HOME/.edf/my_env.toml.bak
  ```
  Now, when you run jobs, you will start in your `$HOME` directory and can write to `$HOME/scratch`
- If you aren't already familiar, it is worthwhile to learn a CLI text editor like vim
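- To sanity-check that your env file mounts what you expect, a small sketch (assuming the `my_env.toml` from above):
  ```
  # Inside a container session, confirm the working directory and the mounted paths
  sdebug --environment=my_env bash -c 'pwd && df -h /capstor /iopsstor /users'
  ```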
[5/7] Everyday SLURM
- `sdebug <options>` for quick jobs, max. 30min (make sure to include a shell, e.g. `bash`)
  - `squeue --me` to see your jobs
  - `ctrl+d` to exit
  - `scancel <JOBID>` to cancel a `<JOBID>`, or `scancel --me` to cancel all your jobs
- For most production workloads or long-running experiments you'll submit jobs non-interactively with `sbatch`. This allows the scheduler to queue up your jobs, allocate resources when they become available, and run your commands without you needing to stay logged in. Run `sbatch --help` to see all available options
  NOTE: 'normal' partition jobs are max. 24h, 'debug' jobs max. 30min; make sure to checkpoint
- Create a file named `my_first_sbatch.sh` with the following content (read every entry, and substitute 'a-a06' if your project is different)
  ```
  #!/bin/bash
  #SBATCH --job-name=my_first_sbatch    # A name for your job. Visible in squeue.
  #SBATCH --account=a-a06               # The account you are charged for the job
  #SBATCH --nodes=1                     # Number of compute nodes to request.
  #SBATCH --ntasks-per-node=1           # Tasks (processes) per node
  #SBATCH --time=00:10:00               # HH:MM:SS, set a time limit for this job (here 10min)
  #SBATCH --partition=debug             # Partition to use; "debug" is usually for quick tests
  #SBATCH --mem=460000                  # Memory needed (simply set the mem of a node)
  #SBATCH --cpus-per-task=288           # CPU cores per task (simply set the number of cpus a node has)
  #SBATCH --environment=my_env          # The environment to use (see [4/7])
  #SBATCH --output=/iopsstor/scratch/cscs/%u/my_first_sbatch.out   # Log file for stdout, prints, et cetera
  #SBATCH --error=/iopsstor/scratch/cscs/%u/my_first_sbatch.out    # Log file for stderr, errors

  # Exit immediately if a command exits with a non-zero status (good practice)
  set -eo pipefail

  # Print SLURM variables so you see how your resources are allocated
  echo "Job Name: $SLURM_JOB_NAME"
  echo "Job ID: $SLURM_JOB_ID"
  echo "Allocated Node(s): $SLURM_NODELIST"
  echo "Number of Tasks: $SLURM_NTASKS"
  echo "CPUs per Task: $SLURM_CPUS_PER_TASK"
  echo "Current path: $(pwd)"
  ```
- Run `sbatch my_first_sbatch.sh`, then `watch -n 1 squeue --me` and check `ST`, the status code: `PD` - Pending, `R` - Running, `CG` - Completing
- Once completed, check the output file
  ```
  cat ~/scratch/my_first_sbatch.out
  ```
- Remember to remove temporary files and transfer important data to `~/project` once jobs are finished (discussed in 'Reasoning Projects Framework'), else they will be cleaned after 30 days
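- While a job is running, you can follow its log as it is written (assuming the output path from the sbatch script above):
  ```
  tail -f ~/scratch/my_first_sbatch.out   # Ctrl+C stops following
  ```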
[6/7] TODO: VS Code Integration
- Install Remote Explorer
  - File > Preferences > Extensions
  - Select Remote Explorer by Microsoft and install it
- Enable this setting to prevent disconnects (you need to connect to Clariden with VS Code at least once before this setting appears)
  - File > Preferences > Settings
  - Search for: `remote.SSH.serverListenOnSocket`
  - Enable this setting by selecting the checkbox
- In VS Code, now click on Remote Explorer and select the Clariden server (which it took from your ssh config). Once connected, you should be able to navigate your home directory on the Clariden login node. If you keep having problems, ensure that `ssh clariden` works as expected and manually delete `.vscode-server` on Clariden so VS Code reinstalls the VS Code server from scratch
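- If `ssh clariden` does not work, inspect the entry the setup script added to your local `~/.ssh/config`. As a rough sketch only (the hostnames, username, and key path below are assumptions and may differ from what the script actually wrote), such an entry typically looks like:
  ```
  Host clariden
      HostName clariden.cscs.ch
      ProxyJump ela.cscs.ch
      User <your_cscs_username>
      IdentityFile ~/.ssh/cscs-key
  ```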
[7/7] (Optional) Building a Container
- Set up NVIDIA GPU Cloud (NGC) access to use NVIDIA containers
  - Navigate to https://ngc.nvidia.com/setup/api-key and create an account if you don't have one
  - Click the green button on the top right named "Generate API Key" and copy the key
  - Log in to Clariden with `cscs-cl` and run the following commands to configure `enroot` with your `<API_KEY>` (you will need the key again for 'ngc config set' later)
    ```
    NGC_API_KEY="<API_KEY>"
    mkdir -p $HOME/.config/enroot
    cat > $HOME/.config/enroot/.credentials << EOF
    machine nvcr.io login \$oauthtoken password $NGC_API_KEY
    machine authn.nvidia.com login \$oauthtoken password $NGC_API_KEY
    EOF
    unset NGC_API_KEY
    ```
  - Download and unzip ngc-cli for 'ARM64 Linux' from https://ngc.nvidia.com/setup/installers/cli and add it to your PATH
    ```
    cd && wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.60.2/files/ngccli_arm64.zip -O ngccli_arm64.zip && unzip ngccli_arm64.zip
    echo -e "\nexport PATH=\"\$PATH:$HOME/ngc-cli\"" >> $HOME/.${SHELL##*/}rc && source $HOME/.${SHELL##*/}rc
    rm ngc-cli.md5 ngccli_arm64.zip
    ```
  - Configure NGC by running the following command and enter your `<API_KEY>` when prompted
    ```
    ngc config set
    ```
- Replace the image in your `~/.edf/my_env.toml` file with a 'LINUX / ARM64' image that contains everything needed to run PyTorch on GPUs (see https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags)
  ```
  image = "nvcr.io#nvidia/pytorch:25.01-py3"
  ```
- Run `sdebug --environment=my_env bash` and wait a minute while the container is downloaded, then check that you can import torch (make sure you are not in a conda env)
  ```
  python -c "import torch; print(torch.cuda.get_device_name()); print(torch.cuda.device_count())"
  ```
  and check the GPUs
  ```
  nvidia-smi
  ```
- On the login node, set up your container config
  ```
  mkdir -p $HOME/.config/containers
  cat > $HOME/.config/containers/storage.conf << EOF
  [storage]
    driver = "overlay"
    runroot = "/dev/shm/\$USER/runroot"
    graphroot = "/dev/shm/\$USER/root"

  [storage.options.overlay]
    mount_program = "/usr/bin/fuse-overlayfs-1.13"
  EOF
  ```
- In your home directory on the Clariden login node, create a file named `Dockerfile`
  ```
  FROM nvcr.io/nvidia/pytorch:25.01-py3

  # Setup
  RUN apt-get update && apt-get install python3-pip python3-venv -y
  RUN pip install --upgrade pip setuptools==69.5.1

  # Install the rest of the dependencies
  RUN pip install \
      datasets \
      transformers \
      accelerate \
      wandb \
      dacite \
      pyyaml \
      numpy \
      packaging \
      safetensors \
      tqdm \
      sentencepiece \
      tensorboard \
      pandas \
      jupyter \
      deepspeed \
      seaborn

  # Create a work directory
  RUN mkdir -p /workspace
  ```
The Dockerfile defines the steps to build a container image. In this example, we build on top of NVIDIA's PyTorch container 'nvcr.io/nvidia/pytorch:25.01-py3', which comes pre-configured with GPU acceleration and optimized libraries for deep learning. The Dockerfile then installs system dependencies ('python3-pip', 'python3-venv') and a collection of Python libraries for machine learning, data processing, and visualization
Beyond installing packages, a Dockerfile can also define environment variables, set up default commands, configure network settings, expose ports, and optimize the container size using multi-stage builds; see Docker's official documentation for details
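If you later want to try some of these directives, a hedged sketch (the values below are purely illustrative and none of them are required for this tutorial) of appending a few of them to the Dockerfile:
```
cat >> Dockerfile << 'EOF'

# Illustrative extras -- not required for this tutorial
# Bake an environment variable into the image
ENV MY_VAR=example_value
# Default working directory when the container starts
WORKDIR /workspace
# Document that e.g. a Jupyter port will be used
EXPOSE 8888
# Default command if none is given at runtime
CMD ["bash"]
EOF
```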
- We will now build the container. DO NOT BUILD ON THE LOGIN NODE. You may hit space or memory limits, and it will make the login node less responsive for all other users
  - Start an interactive session without an env file
    ```
    sdebug bash
    ```
  - Once you are on the compute node, navigate to the folder with your Dockerfile and use `podman` to create an image named `my_pytorch:25.01-py3` (be patient)
    ```
    podman build -t my_pytorch:25.01-py3 .
    ```
- After you have created your image, you can see it in your local container registry
  ```
  podman images
  ```
- Use `enroot` to save the image into a `.sqsh` compressed SquashFS file, which you can share with others. `enroot import` converts the container image; `-o` specifies the output
  NOTE: Save to scratch as the file can be large (easily 20GB; your $HOME only has 50GB). If you are planning to share your image across team members, contact your supervisor so we can put it on persistent storage
  ```
  cd $HOME/scratch
  enroot import -o my_pytorch.sqsh podman://my_pytorch:25.01-py3
  ```
- Now you can replace the image in your `~/.edf/my_env.toml` file with the real filepath of the `.sqsh`
  ```
  image = "/iopsstor/scratch/cscs/<USER>/my_pytorch.sqsh"
  ```
  NOTE: EDF files expect realpaths (full paths), so `$GLOBAL_VARS` are NOT allowed; make sure to replace `<USER>` with your actual username, or use `sed`
  ```
  sed -i.bak "s|^image = .*|image = \"/iopsstor/scratch/cscs/$USER/my_pytorch.sqsh\"|" $HOME/.edf/my_env.toml && rm $HOME/.edf/my_env.toml.bak
  ```
- Try it out and check that your software packages are now available when you get a compute node
  ```
  sdebug --environment=my_env bash -c "pip list"
  ```
TODO: Building a container on top of the group's, creating a group container, r-gym, {OpenR1, TinyZero}, Reasoning Resources
Now that you know the basics of Clariden, you can set up the cluster for Reasoning Projects
[1/n] Set Up Shared Storage
- `/users/$USER` - For personal configuration files, use your home directory (`$HOME`, `~`) (50GB)
- `/iopsstor/scratch/cscs/$USER` - For compute jobs, use your personal scratch (`$SCRATCH`) (30d cleanup)
- `/capstor/scratch/cscs/$USER` - For large files, transfer to your personal storage after compute has finished (30d cleanup)
For persistent storage of the most important files and group data, use `/capstor/store/cscs/swissai/a06/reasoning` (if you don't have access, message your supervisor). DO NOT write to this during compute; it costs $$$
Currently, the structure is
```
/capstor/store/cscs/swissai/a06/reasoning
├── data/    # shared project data
├── imgs/    # project containers
├── models/  # shared models
└── users/   # individual user folders
```
- First, create a symbolic link to the project folder
  ```
  ln -s /capstor/store/cscs/swissai/a06/reasoning $HOME/shared
  ```
- Create your user folder
  ```
  mkdir -p /capstor/store/cscs/swissai/a06/reasoning/users/$USER
  ```
- Create a symbolic link to your user folder
  ```
  ln -s /capstor/store/cscs/swissai/a06/reasoning/users/$USER $HOME/project
  ```
Now, when you have data that needs to be persistent, you can use
- `~/project` - For personal persistent data (important source code, results, et cetera)
- `~/shared/*` - For shared persistent data (data, models, et cetera)
DO NOT write to these during compute (that is what `$SCRATCH` is for); only transfer data you need saved afterwards, or e.g. source code that cannot fit in your 50GB `$HOME`
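For example, after a job has finished, a minimal sketch (the folder names are placeholders) of saving results from the login node:
```
# Transfer results worth keeping from scratch to your persistent user folder
rsync -avh --progress ~/scratch/<my_experiment>/ ~/project/<my_experiment>/
# ...then clean up the scratch copy once you have verified the transfer
rm -r ~/scratch/<my_experiment>
```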