Skip to content

Latest commit

 

History

History
534 lines (409 loc) · 10.7 KB

File metadata and controls

534 lines (409 loc) · 10.7 KB

NEXTO Cluster Configuration Guide

Overview

NEXTO uses a modular configuration system where cluster-specific settings are isolated in separate config files. The main nextflow.config contains only base settings and profile declarations.

Configuration Architecture

nextflow.config (base settings + profile declarations)
    ├── standard (local, single CPU)
    ├── local (local, multi-core)
    ├── singularity (local with containers)
    ├── test (minimal test run)
    └── HPC Clusters:
        ├── ozstar → conf/ozstar.config
        ├── hercules → conf/hercules.config
        └── contra → conf/contra.config

Available Cluster Profiles

1. OzSTAR (Swinburne University)

File: conf/ozstar.config

Executor: SLURM Queue Size: 2,000 concurrent jobs Container Path: /fred/oz418/singularity_images/ Bind Mounts: /fred, $HOME Scratch: $JOBFS Modules: apptainer

Usage:

nextflow run nexto_search.nf -profile ozstar --input obs.fil

Key Features:

  • Auto-scaling resources on retries
  • Process-specific maxForks (100-500)
  • SLURM integration with 1-minute polling

2. Hercules (MPIfR Bonn)

File: conf/hercules.config

Executor: SLURM Queue Size: 10,000 concurrent jobs Container Path: /hercules/scratch/vishnu/singularity_images/ Bind Mounts: /hercules, /mandap, /mkfs, $HOME Scratch: /tmp/$USER Modules: jdk/17.0.6

Usage:

nextflow run nexto_search.nf -profile hercules --input obs.fil

Key Features:

  • Dynamic queue selection (short.q, long.q, gpu.q)
  • Time-based queue routing (≤4h → short.q)
  • GPU queue for acceleration searches
  • Very high queue capacity (10,000 jobs)

3. Contra (MPIfR Bonn)

File: conf/contra.config

Executor: HT-Condor Queue Size: 1,500 concurrent jobs Container Path: /homes/vkrishnan/singularity_images/ Bind Mounts: /b, /bscratch, /homes Scratch: false (TMPFS not supported) Modules: jdk/17.0.4, apptainer

Usage:

nextflow run nexto_search.nf -profile contra --input obs.fil

Key Features:

  • HT-Condor executor
  • GPU support via request_GPUs = 1
  • CUDA environment variable passing
  • Error strategy: retry or ignore

Cluster Comparison

Feature OzSTAR Hercules Contra
Executor SLURM SLURM HT-Condor
Max Jobs 2,000 10,000 1,500
Queue System Single Multi (short/long/gpu) Single
Scratch $JOBFS /tmp/$USER false
GPU Support Yes Yes (gpu.q) Yes (Condor)
Max CPUs 32 48 48
Max Memory 128 GB 256 GB 256 GB
Max Time 168h 168h 240h

Creating a Custom Cluster Config

Step 1: Copy Template

cp conf/slurm_example.config conf/mycluster.config

Step 2: Customize Settings

Edit conf/mycluster.config:

params {
    // Set your container paths
    presto_container = "/your/path/presto_latest.sif"
    singularity_cachedir = "/your/cache"

    // Set resource limits
    max_cpus = 64
    max_memory = '512.GB'
    max_time = '72.h'
}

// Module loading (if needed)
process.beforeScript = """
module load singularity
module load cuda/11.8
"""

// Apptainer configuration
apptainer {
    enabled = true
    runOptions = '--env PYTHONNOUSERSITE=1 --nv -B /your/data'
}

// Executor
executor {
    name = 'slurm'  // or 'pbs', 'sge', 'condor'
    queueSize = 500
}

process {
    executor = 'slurm'
    queue = 'normal'
    scratch = '$TMPDIR'

    // Customize resources per process...
}

Step 3: Add to Main Config

Edit nextflow.config:

profiles {
    mycluster {
        includeConfig 'conf/mycluster.config'
    }
}

Step 4: Test

nextflow run nexto_search.nf -profile mycluster --input test.fil

Configuration Elements

Container Settings

params {
    presto_container = "/path/to/presto_latest.sif"
    singularity_cachedir = "/path/to/cache"
}

apptainer {
    enabled = true
    autoMounts = true
    runOptions = '--env PYTHONNOUSERSITE=1 --nv -B /data'
    cacheDir = params.singularity_cachedir
    envWhitelist = 'APPTAINER_BINDPATH,APPTAINER_LD_LIBRARY_PATH'
}

Executor Settings

SLURM

executor {
    name = 'slurm'
    pollInterval = '1 min'
    queueSize = 1000
    submitRateLimit = '20 sec'
}

process {
    executor = 'slurm'
    queue = 'normal'
    clusterOptions = '--account=proj123'
}

PBS/Torque

executor {
    name = 'pbs'
    pollInterval = '1 min'
    queueSize = 500
}

process {
    executor = 'pbs'
    queue = 'batch'
    clusterOptions = '-A proj123'
}

HT-Condor

executor {
    name = 'condor'
    pollInterval = '30 sec'
    queueSize = 1500
}

process {
    executor = 'condor'
    clusterOptions = 'request_GPUs = 1'  // For GPU jobs
}

Resource Allocation

NEXTO uses explicit process-specific resource allocation (no label-based resources). Each process has explicit CPU, memory, and time allocations optimized for pulsar searching workloads:

// Example: Process-specific resource allocation
withName: 'ACCELSEARCH' {
    cpus = {params.use_cuda ? 1 : 16}
    memory = { check_max(8.GB * task.attempt, 'memory') }
    time = { check_max(2.d * task.attempt, 'time') }
    queue = { params.use_cuda ? 'gpu' : 'normal' }
    clusterOptions = { params.use_cuda ? '--gres=gpu:1' : '' }
    maxForks = 200
}

withName: 'RFIFIND' {
    cpus = 1
    memory = { check_max(8.GB * task.attempt, 'memory') }
    time = { check_max(4.h * task.attempt, 'time') }
    maxForks = 400
}

Standard Resource Allocations (consistent across all HPC clusters):

Process CPUs Memory Time maxForks
FILTOOL 16 8GB 4h 1
RFIFIND 1 8GB 4h 400
PREPDATA.* 1 4GB 4h 400
ACCELSEARCH 16 (1 if GPU) 8GB 2d 200
ACCELSEARCH_ZMAX0 8 8GB 4h 200
PREPFOLD.* 1 8GB 4h 400
PSRFOLD_PULSARX 16 8GB 4h 400
SINGLE_PULSE_SEARCH 4 8GB 4h 500
MAKE_ZAPLIST|COMBINE_CANDIDATES|ACCELSIFT 2 4GB 1h 500

Queue Selection

Static:

process.queue = 'normal'

Dynamic (time-based):

queue = { task.time <= 4.h ? 'short' : 'long' }

Process-specific:

withName: 'ACCELSEARCH' {
    queue = 'gpu'
}

Error Handling

process {
    errorStrategy = {
        task.exitStatus in 137..140 || task.exitStatus == 124 ? 'retry' : 'finish'
    }
    maxRetries = 3
    maxErrors = '-1'
}

Exit codes:

  • 137-140: Out of memory, killed
  • 124: Timeout
  • Others: Job failed

Strategies:

  • retry: Retry failed job
  • finish: Continue pipeline
  • ignore: Ignore error
  • terminate: Stop pipeline

Scratch Space

// Use cluster scratch
process.scratch = '$TMPDIR'  // or '$JOBFS', '/tmp/$USER'

// Disable scratch
process.scratch = false

// Process-specific
withName: 'ACCELSIFT' {
    scratch = false
}

Module Loading

// Global
process.beforeScript = """
module load apptainer
module load cuda/11.8
"""

// Cluster-specific
if (System.getenv('HOSTNAME')?.startsWith('ozstar')) {
    process.module = 'apptainer'
}

GPU Configuration

SLURM

withName: 'ACCELSEARCH' {
    queue = 'gpu'
    clusterOptions = '--gres=gpu:1 --constraint=cuda'
}

HT-Condor

withName: 'ACCELSEARCH' {
    clusterOptions = 'request_GPUs = 1'
}

apptainer.runOptions = '--env="CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'

Auto-Scaling Resources

// Scale resources on retries
cpus = { check_max(task.attempt * 2, 'cpus') }
memory = { check_max(4.GB * task.attempt, 'memory') }
time = { check_max(4.h * task.attempt, 'time') }

// Attempt 1: 2 CPUs, 4 GB, 4h
// Attempt 2: 4 CPUs, 8 GB, 8h
// Attempt 3: 8 CPUs, 16 GB, 16h

Testing Your Configuration

1. Dry Run

nextflow run nexto_search.nf -profile mycluster --input test.fil -dry-run

2. Small Test

nextflow run nexto_search.nf \
    -profile mycluster,test \
    --input test.fil \
    --dm_high 10

3. Check Container

singularity exec ${params.presto_container} which rfifind

4. Monitor Execution

# Watch jobs
watch squeue -u $USER  # SLURM
watch qstat -u $USER   # PBS
condor_q              # Condor

# Check logs
tail -f .nextflow.log

5. Verify Results

# Check outputs
ls -lh results/*/

# View reports
firefox results/report.html
firefox results/timeline.html

Troubleshooting

Jobs Not Submitting

Check executor:

executor.name = 'slurm'  // Correct
executor = 'slurm'       // Wrong

Check queue exists:

sinfo  # SLURM
qstat -Q  # PBS

Out of Memory Errors

Increase memory:

withName: 'ACCELSEARCH' {
    memory = '64.GB'  // Increase
}

Check actual usage:

grep memory results/report.html

Container Not Found

Check path:

ls -lh /path/to/presto_latest.sif

Set explicitly:

nextflow run ... --presto_container /full/path/presto.sif

Bind Mount Errors

Add missing paths:

apptainer.runOptions = '-B /missing/path -B /another/path'

Permission Errors

Check cache directory:

mkdir -p /path/to/cache
chmod 755 /path/to/cache

Best Practices

1. ✅ Use Cluster-Specific Configs

Keep cluster settings isolated in conf/ directory.

2. ✅ Test Before Production

Always test with minimal data first.

3. ✅ Monitor Resource Usage

Check reports to tune resource allocation.

4. ✅ Use Auto-Scaling

Let resources scale on retries:

memory = { check_max(4.GB * task.attempt, 'memory') }

5. ✅ Set Appropriate Limits

Don't request more than cluster allows:

params.max_cpus = 32  // Cluster limit
params.max_memory = '128.GB'

6. ✅ Use Meaningful Profile Names

-profile ozstar     # Clear
-profile cluster1   # Unclear

7. ✅ Document Your Config

Add comments explaining cluster-specific choices.

8. ✅ Version Control Configs

Track changes to cluster configurations.

Summary

3 HPC cluster configs - OzSTAR, Hercules, Contra ✅ Clean separation - Base vs cluster-specific settings ✅ No cloud profiles - Focused on HPC use cases ✅ SLURM & Condor - Both executors supported ✅ Easy customization - Template-based approach ✅ Auto-scaling - Resources scale on retries ✅ Container-first - Singularity/Apptainer throughout

Your NEXTO installation is now ready for multi-cluster deployment!