37 changes: 37 additions & 0 deletions .github/scripts/run_monitored_slurm_job.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
#!/bin/bash
# Run monitor_slurm_job.sh and recover if the monitor is killed (e.g. SIGKILL
# from the runner OS) before the SLURM job completes. When the monitor exits
# non-zero, sacct is used to verify the job's actual final state; if the SLURM
# job succeeded we exit 0 so the CI step is not falsely marked as failed.
#
# Usage: run_monitored_slurm_job.sh <job_id> <output_file>

set -euo pipefail

if [ $# -ne 2 ]; then
echo "Usage: $0 <job_id> <output_file>"
exit 1
fi

job_id="$1"
output_file="$2"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

monitor_exit=0
bash "$SCRIPT_DIR/monitor_slurm_job.sh" "$job_id" "$output_file" || monitor_exit=$?

if [ "$monitor_exit" -ne 0 ]; then
echo "Monitor exited with code $monitor_exit; re-checking SLURM job $job_id final state..."
Comment on lines +22 to +25
⚠️ Potential issue | 🟠 Major

This wrapper's recovery mechanism is defeated by the child monitor's auto-cancel.

monitor_slurm_job.sh sets trap cleanup EXIT at line 19, which calls scancel if the monitor exits abnormally (line 16). When the monitor is killed before completion (e.g., SIGKILL) or exits at line 98 (terminal state reached without output file), this cleanup trap cancels the SLURM job. The wrapper's recovery logic at run_monitored_slurm_job.sh:24-36 then attempts to query the final job state via sacct, but the job has already been canceled by the child, making the recovery check unreliable.

The defensive error handling in get_job_state() with || true means transient squeue/sacct failures won't trigger cancellation, but the real problem is that the auto-cancel policy itself undermines the wrapper's fallback verification.

Recommended fix: Introduce a MONITOR_CANCEL_ON_ERROR environment variable that the wrapper can set to disable auto-cancel in the child, allowing the parent to decide job fate based on final SLURM state.
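A minimal sketch of what that opt-out could look like in the child monitor's cleanup path; MONITOR_CANCEL_ON_ERROR is the reviewer's proposed variable, not an existing knob, and the echo stands in for the real scancel call:

```shell
# Hypothetical guard for monitor_slurm_job.sh's cleanup trap: the parent
# wrapper would export MONITOR_CANCEL_ON_ERROR=0 so the child leaves the
# job alone, letting the wrapper verify the final state via sacct.
maybe_cancel() {
    local exit_code="$1" job_id="$2"
    if [ "$exit_code" -ne 0 ] && [ "${MONITOR_CANCEL_ON_ERROR:-1}" = "1" ]; then
        echo "cancel $job_id"   # stand-in for: scancel "$job_id"
    else
        echo "keep $job_id"
    fi
}

maybe_cancel 137 42                             # default policy: cancel
MONITOR_CANCEL_ON_ERROR=0 maybe_cancel 137 42   # wrapper opted out: keep
```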

# Give the SLURM epilog time to finalize if the job just finished
sleep 30
final_state=$(sacct -j "$job_id" -n -X -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1 | tr -d ' ' || echo "UNKNOWN")
final_exit=$(sacct -j "$job_id" -X --format=ExitCode --noheader --parsable2 2>/dev/null | head -n1 | tr -d ' ' || echo "")
echo "Final SLURM state=$final_state exit=$final_exit"
if [ "$final_state" = "COMPLETED" ] && [ "$final_exit" = "0:0" ]; then
echo "SLURM job $job_id completed successfully despite monitor failure — continuing."
else
echo "ERROR: SLURM job $job_id did not complete successfully (state=$final_state exit=$final_exit)"
exit 1
fi
fi
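The sacct parsing in the recovery branch can be exercised with a canned record; this sketch assumes a single combined State|ExitCode query for brevity (the wrapper issues two separate sacct calls):

```shell
# Exercise the wrapper's field extraction against a stubbed
# `sacct -n -X -P` record, so no SLURM installation is needed.
record="COMPLETED|0:0"
final_state=$(echo "$record" | head -n1 | cut -d'|' -f1 | tr -d ' ')
final_exit=$(echo "$record" | head -n1 | cut -d'|' -f2 | tr -d ' ')

if [ "$final_state" = "COMPLETED" ] && [ "$final_exit" = "0:0" ]; then
    verdict="recovered"
else
    verdict="failed"
fi
echo "state=$final_state exit=$final_exit verdict=$verdict"
```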
12 changes: 9 additions & 3 deletions .github/scripts/submit_and_monitor_bench.sh
@@ -14,12 +14,18 @@ device="$2"
interface="$3"
cluster="$4"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

echo "[$dir] Submitting benchmark for $device-$interface on $cluster..."
cd "$dir"

# Submit and monitor job (submit.sh auto-detects bench mode from script name)
bash .github/workflows/$cluster/submit.sh \
.github/workflows/$cluster/bench.sh "$device" "$interface"
# Always use the PR's submit.sh so both master and PR builds benefit from the
# run_monitored_slurm_job.sh SIGKILL recovery wrapper. The bench script is
# still resolved relative to the current directory (master/ or pr/) so the
# correct branch code is benchmarked. SLURM_SUBMIT_DIR ensures the job runs
# in the right directory regardless of which submit.sh is invoked.
PR_SUBMIT="${SCRIPT_DIR}/../workflows/${cluster}/submit.sh"
bash "$PR_SUBMIT" .github/workflows/$cluster/bench.sh "$device" "$interface"

# Verify the YAML output file was created
job_slug="bench-$device-$interface"
2 changes: 1 addition & 1 deletion .github/workflows/bench.yml
@@ -88,7 +88,7 @@ jobs:
runs-on:
group: ${{ matrix.group }}
labels: ${{ matrix.labels }}
timeout-minutes: 480
timeout-minutes: 240
⚠️ Potential issue | 🟠 Major

Timeout value contradicts PR objectives.

The PR description states "Restores timeout-minutes in bench.yml to 480 (was accidentally set to 240)", but this change sets timeout-minutes: 240. Please clarify whether the intended value is 480 (per PR objectives) or 240 (per current code).


steps:
- name: Clone - PR
uses: actions/checkout@v4
5 changes: 4 additions & 1 deletion .github/workflows/frontier/bench.sh
@@ -2,8 +2,11 @@

source .github/scripts/bench-preamble.sh

# Cap parallel jobs at 64 to avoid overwhelming MPI daemons on large nodes.
n_jobs=$(( $(nproc) > 64 ? 64 : $(nproc) ))

if [ "$job_device" = "gpu" ]; then
./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks
else
./mfc.sh bench --mem 1 -j $(nproc) -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks
./mfc.sh bench --mem 1 -j $n_jobs -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks
fi
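The arithmetic cap used above is easy to sanity-check in isolation; nproc is replaced by a plain argument so both branches of the ternary can be exercised on any machine:

```shell
# Same expression as bench.sh's n_jobs cap, parameterized on core count.
cap_jobs() {
    echo $(( $1 > 64 ? 64 : $1 ))
}

cap_jobs 192   # large GNR-class node: capped to 64
cap_jobs 32    # small node: left unchanged
```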
5 changes: 1 addition & 4 deletions .github/workflows/frontier/build.sh
@@ -20,10 +20,7 @@ build_opts="$gpu_opts"

. ./mfc.sh load -c $compiler_flag -m $([ "$job_device" = "gpu" ] && echo "g" || echo "c")

# Only set up build cache for test suite, not benchmarks
if [ "$run_bench" != "bench" ]; then
source .github/scripts/setup-build-cache.sh "$cluster_name" "$job_device" "$job_interface"
fi
rm -rf build

source .github/scripts/retry-build.sh
if [ "$run_bench" == "bench" ]; then
25 changes: 5 additions & 20 deletions .github/workflows/frontier/submit.sh
@@ -25,13 +25,6 @@ else
exit 1
fi

# Detect job type from submitted script basename
script_basename="$(basename "$1" .sh)"
case "$script_basename" in
bench*) job_type="bench" ;;
*) job_type="test" ;;
esac

if [ "$2" = "cpu" ]; then
sbatch_device_opts="\
#SBATCH -n 32 # Number of cores required"
@@ -44,17 +37,10 @@ else
fi

# Select SBATCH params based on job type
if [ "$job_type" = "bench" ]; then
sbatch_account="#SBATCH -A ENG160"
sbatch_time="#SBATCH -t 05:59:00"
sbatch_partition="#SBATCH -p extended"
sbatch_extra=""
else
sbatch_account="#SBATCH -A CFD154"
sbatch_time="#SBATCH -t 01:59:00"
sbatch_partition="#SBATCH -p batch"
sbatch_extra="#SBATCH --qos=normal"
fi
sbatch_account="#SBATCH -A CFD154"
sbatch_time="#SBATCH -t 01:59:00"
sbatch_partition="#SBATCH -p batch"
sbatch_extra="#SBATCH --qos=normal"

shard_suffix=""
if [ -n "$4" ]; then
@@ -102,5 +88,4 @@ fi

echo "Submitted batch job $job_id"

# Use resilient monitoring instead of sbatch -W
bash "$SCRIPT_DIR/../../scripts/monitor_slurm_job.sh" "$job_id" "$output_file"
bash "$SCRIPT_DIR/../../scripts/run_monitored_slurm_job.sh" "$job_id" "$output_file"
12 changes: 9 additions & 3 deletions .github/workflows/phoenix/bench.sh
@@ -2,7 +2,11 @@

source .github/scripts/bench-preamble.sh

tmpbuild=/storage/scratch1/6/sbryngelson3/mytmp_build
# Cap parallel jobs at 64 to avoid overwhelming MPI daemons on large nodes
# (GNR nodes have 192 cores but nproc is too aggressive for build/bench).
n_jobs=$(( $(nproc) > 64 ? 64 : $(nproc) ))

tmpbuild=/storage/project/r-sbryngelson3-0/sbryngelson3/mytmp_build
currentdir=$tmpbuild/run-$(( RANDOM % 900 ))
mkdir -p $tmpbuild
mkdir -p $currentdir
@@ -15,10 +19,12 @@ else
bench_opts="--mem 1"
fi

rm -rf build

source .github/scripts/retry-build.sh
RETRY_CLEAN_CMD="./mfc.sh clean" retry_build ./mfc.sh build -j $(nproc) $build_opts || exit 1
RETRY_CLEAN_CMD="./mfc.sh clean" retry_build ./mfc.sh build -j $n_jobs $build_opts || exit 1

./mfc.sh bench $bench_opts -j $(nproc) -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks
./mfc.sh bench $bench_opts -j $n_jobs -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks

sleep 10
rm -rf "$currentdir" || true
3 changes: 1 addition & 2 deletions .github/workflows/phoenix/submit.sh
@@ -94,6 +94,5 @@ fi

echo "Submitted batch job $job_id"

# Use resilient monitoring instead of sbatch -W
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
bash "$SCRIPT_DIR/../../scripts/monitor_slurm_job.sh" "$job_id" "$output_file"
bash "$SCRIPT_DIR/../../scripts/run_monitored_slurm_job.sh" "$job_id" "$output_file"
7 changes: 3 additions & 4 deletions .github/workflows/phoenix/test.sh
@@ -3,11 +3,10 @@
source .github/scripts/gpu-opts.sh
build_opts="$gpu_opts"

# Set up persistent build cache
source .github/scripts/setup-build-cache.sh phoenix "$job_device" "$job_interface"
rm -rf build

# Build with retry; smoke-test cached binaries to catch architecture mismatches
# (SIGILL from binaries compiled on a different compute node).
# Build with retry; smoke-test the freshly built syscheck binary to catch
# architecture mismatches (SIGILL from binaries compiled on a different compute node).
source .github/scripts/retry-build.sh
RETRY_VALIDATE_CMD='syscheck_bin=$(find build/install -name syscheck -type f 2>/dev/null | head -1); [ -z "$syscheck_bin" ] || "$syscheck_bin" > /dev/null 2>&1' \
retry_build ./mfc.sh test -v --dry-run -j 8 $build_opts || exit 1
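The RETRY_VALIDATE_CMD one-liner above has a subtle shape: no syscheck binary found counts as a pass (the || short-circuit), while a binary that exists but dies fails validation. A self-contained rehearsal using stub binaries in a temp dir, rather than a real build/install tree:

```shell
# Rehearse the validate command's three cases with throwaway stubs.
tmp=$(mktemp -d)
validate() {
    bin=$(find "$tmp" -name syscheck -type f 2>/dev/null | head -1)
    [ -z "$bin" ] || "$bin" > /dev/null 2>&1
}

validate && r_missing=pass || r_missing=fail    # no binary found

printf '#!/bin/sh\nexit 0\n' > "$tmp/syscheck"
chmod +x "$tmp/syscheck"
validate && r_healthy=pass || r_healthy=fail    # binary runs cleanly

printf '#!/bin/sh\nexit 132\n' > "$tmp/syscheck" # SIGILL-like failure
validate && r_broken=pass || r_broken=fail

echo "$r_missing $r_healthy $r_broken"
rm -rf "$tmp"
```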
25 changes: 18 additions & 7 deletions CMakeLists.txt
@@ -224,13 +224,24 @@ endif()

if (CMAKE_BUILD_TYPE STREQUAL "Release")
# Processor tuning: Check if we can target the host's native CPU's ISA.
CHECK_FORTRAN_COMPILER_FLAG("-march=native" SUPPORTS_MARCH_NATIVE)
if (SUPPORTS_MARCH_NATIVE)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-march=native>)
else()
CHECK_FORTRAN_COMPILER_FLAG("-mcpu=native" SUPPORTS_MCPU_NATIVE)
if (SUPPORTS_MCPU_NATIVE)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-mcpu=native>)
# Skip for gcov builds — -march=native on newer CPUs (e.g. Granite Rapids)
# can emit instructions the system assembler doesn't support.
if (NOT MFC_GCov)
CHECK_FORTRAN_COMPILER_FLAG("-march=native" SUPPORTS_MARCH_NATIVE)
if (SUPPORTS_MARCH_NATIVE)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-march=native>)
# Disable AVX-512 FP16: gfortran ≥12 emits vmovw instructions on
# Granite Rapids CPUs, but binutils <2.38 cannot assemble them.
# FP16 is unused in MFC's double-precision computations.
CHECK_FORTRAN_COMPILER_FLAG("-mno-avx512fp16" SUPPORTS_MNO_AVX512FP16)
if (SUPPORTS_MNO_AVX512FP16)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-mno-avx512fp16>)
endif()
else()
CHECK_FORTRAN_COMPILER_FLAG("-mcpu=native" SUPPORTS_MCPU_NATIVE)
if (SUPPORTS_MCPU_NATIVE)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-mcpu=native>)
endif()
endif()
endif()

4 changes: 2 additions & 2 deletions benchmarks/5eq_rk3_weno3_hllc/case.py
@@ -191,8 +191,8 @@
"cyl_coord": "F",
"dt": dt,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 3,
"model_eqns": 2,
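The multiplier drop from 7 to 2 in t_step_stop = m * (5*size + 5), applied uniformly across the benchmark cases below, cuts each run to under a third of its former step count; the scaling is easy to tabulate:

```shell
# Benchmark step count for multiplier m and size parameter s.
steps() {
    echo $(( $1 * (5 * $2 + 5) ))
}

steps 7 1   # old multiplier at size=1: 70 steps
steps 2 1   # new multiplier at size=1: 20 steps
```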
4 changes: 2 additions & 2 deletions benchmarks/hypo_hll/case.py
@@ -44,8 +44,8 @@
"p": Nz,
"dt": 1e-8,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 2,
"model_eqns": 2,
4 changes: 2 additions & 2 deletions benchmarks/ibm/case.py
@@ -48,8 +48,8 @@
"p": Nz,
"dt": mydt,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 1,
"model_eqns": 2,
4 changes: 2 additions & 2 deletions benchmarks/igr/case.py
@@ -63,8 +63,8 @@
"cyl_coord": "F",
"dt": dt,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 1,
"model_eqns": 2,
4 changes: 2 additions & 2 deletions benchmarks/viscous_weno5_sgb_acoustic/case.py
@@ -94,8 +94,8 @@
"p": Nz,
"dt": dt,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(6 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(6 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 2,
"model_eqns": 2,
5 changes: 3 additions & 2 deletions toolchain/mfc/build.py
@@ -1,6 +1,7 @@
import os, typing, hashlib, dataclasses, subprocess, re, time, sys, threading, queue

from rich.panel import Panel
from rich.text import Text
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn, TaskProgressColumn

from .case import Case
@@ -273,14 +274,14 @@ def _show_build_error(result: subprocess.CompletedProcess, stage: str):
stdout_text = result.stdout if isinstance(result.stdout, str) else result.stdout.decode('utf-8', errors='replace')
stdout_text = stdout_text.strip()
if stdout_text:
cons.raw.print(Panel(stdout_text, title="Output", border_style="yellow"))
cons.raw.print(Panel(Text(stdout_text), title="Output", border_style="yellow"))

# Show stderr if available
if result.stderr:
stderr_text = result.stderr if isinstance(result.stderr, str) else result.stderr.decode('utf-8', errors='replace')
stderr_text = stderr_text.strip()
if stderr_text:
cons.raw.print(Panel(stderr_text, title="Errors", border_style="red"))
cons.raw.print(Panel(Text(stderr_text), title="Errors", border_style="red"))

cons.print()
