A set of scripts to simplify running interactive SSH sessions on SLURM compute nodes, particularly designed for use with tools like VSCode Remote-SSH. These scripts handle SLURM job submission, wait for the job to start, and proxy the connection, allowing you to seamlessly connect to a container on a compute node.
Automated Setup - Just run one command and follow the prompts:
git clone https://github.com/aihpi/interactive-slurm.git
cd interactive-slurm
./setup.sh
That's it! The setup script will:
- Generate SSH keys automatically
- Copy keys to your HPC cluster
- Install scripts on the remote cluster
- Configure SSH settings
- Set up VSCode integration
- Handle container options (optional)
# Direct compute node access (no container)
ssh slurm-cpu # CPU job
ssh slurm-gpu # GPU job
# Or with containers (if configured)
ssh slurm-cpu-container
ssh slurm-gpu-container
- Install the Remote-SSH extension
- Press Ctrl/Cmd+Shift+P → "Remote-SSH: Connect to Host"
- Select slurm-cpu, slurm-gpu, or the container variants
- One-Command Setup: Automated installation with interactive prompts
- Optional Containers: Use with or without enroot containers (.sqsh files)
- Smart Container Management: Auto-copies containers from /sc/projects to your home directory
- Full SLURM Integration: Access to srun, sbatch, and scancel commands
- VSCode Ready: Seamless integration with the Remote-SSH extension
- Resource Optimized: Sensible defaults (CPU: 16 GB / 4 cores, GPU: 32 GB / 12 cores)
- Architecture Aware: Targets x86 nodes to avoid compatibility issues
- Easy Management: List, cancel, and monitor jobs with simple commands
- Secure: Automatic SSH key generation and distribution
- Access to a SLURM-managed HPC cluster
- SSH access to the cluster's login node
- enroot installed on compute nodes (only if using containers)
- VSCode with the Remote-SSH extension (optional)
Clone the repository and run the setup script:
git clone https://github.com/aihpi/interactive-slurm.git
cd interactive-slurm
./setup.sh
The setup script will ask you:
- HPC Login Node: Enter your cluster's hostname
  HPC Login Node (hostname or IP) [10.130.0.6]: login.your-cluster.edu
- Your Username: Enter your HPC username
  Your username on the HPC cluster [john.doe]: your.username
- Container Usage: Choose whether to use containers
  Do you want to use containers? [Y/n]: y
- Container Source: If using containers, specify where to get them
  Do you have containers in /sc/projects that you want to copy? [Y/n]: y
  Container path to copy: /sc/projects/sci-aisc/sqsh-files/pytorch_ssh.sqsh
The script will automatically:
- Generate SSH keys (~/.ssh/interactive-slurm)
- Copy the public key to your HPC cluster
- Install scripts on the HPC cluster
- Configure your local SSH settings
- Set up VSCode integration
- Test the connection
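If you want to debug the setup or reproduce it by hand, the steps above correspond roughly to the commands below. This is a sketch only: the hostname and username are the placeholder values used elsewhere in this README, and the real setup script may differ in detail.
# Rough manual equivalent of what setup.sh automates (placeholder user/host)
ssh-keygen -t ed25519 -f ~/.ssh/interactive-slurm -N ""                               # dedicated key pair
ssh-copy-id -i ~/.ssh/interactive-slurm.pub your.username@login.your-cluster.edu      # install the public key on the cluster
ssh your.username@login.your-cluster.edu "mkdir -p ~/bin"                             # make sure ~/bin exists
scp start-ssh-job.bash ssh-session.bash your.username@login.your-cluster.edu:~/bin/   # install the scripts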
After setup completes, you can immediately connect:
ssh slurm-cpu # CPU job, direct access
ssh slurm-gpu # GPU job, direct access
ssh slurm-cpu-container # CPU job with container
ssh slurm-gpu-container # GPU job with container
- Open VSCode
- Press Ctrl/Cmd+Shift+P
- Type "Remote-SSH: Connect to Host"
- Select your desired host (e.g., slurm-cpu)
- VSCode will connect to a compute node automatically!
Try connecting via command line first:
# Test CPU connection (should submit a job and connect)
ssh slurm-cpu
Once connected (or from a direct SSH session on the login node), you can manage your jobs:
# List your running interactive jobs
~/bin/start-ssh-job.bash list
# Cancel all your interactive jobs
~/bin/start-ssh-job.bash cancel
# Get help
~/bin/start-ssh-job.bash help
- First Connection: Takes 30s-5min (job needs to start)
- Job Submission: You'll see "Submitted new vscode-remote-cpu job"
- Connection: Eventually connects to compute node
- Subsequent Connections: Should reuse existing job (faster)
Before reporting issues, verify:
- Local: SSH key works: ssh -i ~/.ssh/interactive-slurm user@hpc
- HPC: Scripts installed: ls ~/bin/start-ssh-job.bash
- HPC: SLURM works: squeue --me
- Local: SSH config generated: grep slurm-cpu ~/.ssh/config
- Local: Basic connection: ssh slurm-cpu
- Local: VSCode extension installed: Remote-SSH
- HPC: Container exists (if used): ls ~/your-container.sqsh
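If it helps, the local-side checks can be run in one go. This is just the checklist above strung together (adjust the placeholder user and host to your cluster):
# Combined check - same commands as the checklist above
ssh -i ~/.ssh/interactive-slurm your.username@login.your-cluster.edu \
    'ls ~/bin/start-ssh-job.bash && squeue --me'      # key works, scripts installed, SLURM reachable
grep slurm-cpu ~/.ssh/config                          # SSH config was generated
ssh slurm-cpu hostname                                # should eventually print a compute node's hostname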
To make this work seamlessly with VSCode, you need to configure your local ~/.ssh/config file. This tells SSH how to connect to your SLURM jobs via the login node.
Add entries like the following to your ~/.ssh/config on your local machine:
# In your ~/.ssh/config on your LOCAL machine
Host slurm-cpu
    HostName login.hpc.yourcluster.edu
    User john.doe
    IdentityFile ~/.ssh/id_ed25519
    ConnectTimeout 30
    ProxyCommand ssh login.hpc.yourcluster.edu -l john.doe "~/bin/start-ssh-job.bash cpu /sc/projects/sci-aisc/sqsh-files/pytorch_ssh.sqsh"
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

Host slurm-gpu
    HostName login.hpc.yourcluster.edu
    User john.doe
    IdentityFile ~/.ssh/id_ed25519
    ConnectTimeout 30
    ProxyCommand ssh login.hpc.yourcluster.edu -l john.doe "~/bin/start-ssh-job.bash gpu /sc/projects/sci-aisc/sqsh-files/pytorch_ssh.sqsh"
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
Replace with your actual values:
- HostName login.hpc.yourcluster.edu: Replace with your HPC login node address (e.g., login.cluster.university.edu or an IP like 192.168.1.100)
- User john.doe: Replace with your HPC cluster username
- IdentityFile ~/.ssh/id_ed25519: Path to your SSH private key (use ~/.ssh/id_rsa if you have an RSA key)
- Container path /sc/projects/sci-aisc/sqsh-files/pytorch_ssh.sqsh: Replace with your actual container image path
Container Path Flexibility:
- Container images in /sc/projects are automatically copied to your home directory for faster access
- You can use paths like /sc/projects/shared/pytorch.sqsh - they'll be cached as ~/pytorch.sqsh
- Local container files in your home directory are used directly
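Conceptually, the caching behaviour amounts to something like the sketch below. It is illustrative only; the installed script handles this for you and may differ in detail:
# Illustrative sketch of the /sc/projects caching behaviour
SRC=/sc/projects/shared/pytorch.sqsh
DEST=~/"$(basename "$SRC")"           # e.g. ~/pytorch.sqsh
[ -f "$DEST" ] || cp "$SRC" "$DEST"   # copy once, reuse the cached copy afterwards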
You can customize the SLURM job parameters and the job timeout by editing the variables at the top of the start-ssh-job.bash script:
- SBATCH_PARAM_CPU: sbatch parameters for CPU jobs.
- SBATCH_PARAM_GPU: sbatch parameters for GPU jobs.
- TIMEOUT: The time in seconds the script will wait for a job to start before giving up.
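For orientation, the top of start-ssh-job.bash contains variables along these lines. The values shown here are illustrative (based on the defaults mentioned earlier: CPU 16 GB / 4 cores, GPU 32 GB / 12 cores); your installed ~/bin/start-ssh-job.bash is authoritative:
# Illustrative excerpt - check your installed copy for the real values
SBATCH_PARAM_CPU="--job-name=vscode-remote-cpu --mem=16G --cpus-per-task=4"
SBATCH_PARAM_GPU="--job-name=vscode-remote-gpu --mem=32G --cpus-per-task=12 --gres=gpu:1"
TIMEOUT=300   # example value: seconds to wait for the job to start before giving up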
All commands (except for the initial SSH connection) should be run on the HPC login node.
Once your ~/.ssh/config is set up, you can connect:
- From VSCode: Use the "Remote Explorer" extension, find slurm-cpu or slurm-gpu in the list of SSH targets, and click the connect icon.
- From your terminal:
ssh slurm-cpu
This will trigger the ProxyCommand, which runs start-ssh-job.bash on the login node to request a compute node and establish the connection.
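If you want to watch the ProxyCommand fire and follow the job start-up, run ssh in verbose mode:
# Verbose output shows the ProxyCommand being executed and the wait for the job to start
ssh -v slurm-cpu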
To see your running interactive jobs managed by this script:
~/bin/start-ssh-job.bash list
To cancel all running interactive jobs (both CPU and GPU):
~/bin/start-ssh-job.bash cancel
If you have a running job and want a direct shell on the compute node itself (not inside the container), you can use the script's ssh command:
~/bin/start-ssh-job.bash ssh
If you have both a CPU and a GPU job running, it will prompt you to choose which node to connect to.
- Your local SSH client connects to the HPC login node and executes the ProxyCommand.
- The start-ssh-job.bash script runs on the login node.
- It checks if a suitable job (vscode-remote-cpu or vscode-remote-gpu) is already running.
- If not, it submits a new SLURM job using sbatch, requesting a compute node and launching the ssh-session.bash script on it. A random port is chosen for the SSH session.
- On the compute node, ssh-session.bash starts your specified enroot container.
- Inside the container, it starts an sshd server on the allocated port, creating a temporary host key for the session.
- Meanwhile, on the login node, start-ssh-job.bash polls squeue until the job's state is RUNNING.
- Once the job is running, it waits for the sshd port to become active on the compute node.
- Finally, it uses nc (netcat) to create a tunnel, piping the SSH connection from the login node to the sshd daemon inside the container.
- Your local SSH client can now communicate with the SSH server running in your container on the compute node.
"SSH connection test failed" during setup:
# Verify basic SSH access manually
ssh your.username@login.your-cluster.edu
# Check if the SSH key was copied correctly
ssh -i ~/.ssh/interactive-slurm your.username@login.your-cluster.edu
"Failed to copy container file":
- Container path might not exist: check the /sc/projects/... path
- On the HPC cluster: manually copy with cp /sc/projects/path/container.sqsh ~/
Connection takes too long (>5 minutes):
# Check job queue status
ssh login.your-cluster.edu
squeue --me # See if your jobs are pending
"Connection refused" or immediate disconnection:
# On your local machine: check if the job is actually running
ssh login.your-cluster.edu "~/bin/start-ssh-job.bash list"
# Cancel stuck jobs and try again
ssh login.your-cluster.edu "~/bin/start-ssh-job.bash cancel"
VSCode connection fails:
- First, test from the command line: ssh slurm-cpu
- Check VSCode settings: remote.SSH.connectTimeout should be ≥ 300
- View connection logs: VSCode → Output → Remote-SSH
"Container image not found":
- On the HPC cluster: check the file exists: ls ~/your-container.sqsh
- For /sc/projects paths, setup should auto-copy them to your home directory
"enroot-mount failed":
- CPU jobs use x86 nodes by default (should prevent this)
- On the HPC cluster: verify enroot works: enroot list
Job stays PENDING:
- On the HPC cluster: check available resources with sinfo and squeue
- Reduce resource requirements in ~/bin/start-ssh-job.bash
Multiple jobs created:
- Each SSH connection creates a separate job
- On the HPC cluster: clean up with ~/bin/start-ssh-job.bash cancel
- Check the job logs on the HPC login node: cat job.logs
- List running jobs: ~/bin/start-ssh-job.bash list
- Cancel stuck jobs: ~/bin/start-ssh-job.bash cancel
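If the helper script itself is unavailable or misbehaving, the same cleanup can be done with plain SLURM commands, using the job names described in the "How it works" steps above:
# Plain-SLURM fallback (job names as used by these scripts)
squeue --me --name=vscode-remote-cpu,vscode-remote-gpu   # list the interactive jobs
scancel --user="$USER" --name=vscode-remote-cpu          # cancel CPU jobs
scancel --user="$USER" --name=vscode-remote-gpu          # cancel GPU jobs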
Thank you to https://github.com/gmertes/vscode-remote-hpc for creating the base of these scripts.