Runbook: Emergency Power Mitigation for AI Clusters During Grid Capacity Shortfalls


2026-03-10

Ops runbook for safely shedding load and gracefully shutting down GPU clusters during grid shortfalls—checklists, commands, and automation for 2026.

When the grid tightens, AI ops teams can't guess their way out

AI training and large-scale inference clusters are now regular causes of peak demand events across major grids. In 2026, regulators and grid operators are reacting — some policy moves now require data centers to shoulder grid upgrade costs, and hardware trends (NVLink Fusion, RISC‑V integrations) are changing power and thermal characteristics of GPU racks. When utilities declare capacity shortfalls or a demand response event, operations teams must act fast: shed load safely, pause non‑critical training, and, when necessary, orchestrate a graceful shutdown of GPU nodes without corrupting datasets, models, or risking hardware.

Who this runbook is for

This is a pragmatic, ops‑first runbook for SREs, on‑call engineers, datacenter operators, and facilities teams running GPU clusters for AI training and inference. It's focused on action — checklists, commands, scripts, and orchestration patterns you can execute or automate from your incident tooling (PagerDuty, Slack, Prometheus Alerts, or Quickfix-style runbook automation).

Key assumptions

  • Your cluster uses common schedulers (Kubernetes, Slurm, or a custom scheduler).
  • You have root or cluster-admin access to compute nodes and orchestration tools.
  • Basic safety infrastructure exists: UPS/generators, PDUs, BMS access, and change control procedures.

High‑level strategy (inverted pyramid): What matters first

  1. Protect safety and integrity: Preserve human safety, prevent fire and cooling failures.
  2. Prioritize critical services: Keep production inference and safety‑critical workloads running where possible.
  3. Shed training load fast: Pause or suspend low‑priority and long‑running training jobs.
  4. Graceful shutdown: Checkpoint, sync state, stop services in a safe order when forced power reductions are required.
  5. Automate and document: Convert this runbook into automated playbooks and test regularly.

2026 context and why this matters now

Late 2025 and early 2026 brought two important trends that change operational risk profiles:

  • Policy changes requiring data centers to internalize grid costs in key US regions — shifting the economics in favor of aggressive demand response and enforced load shedding during peak events.
  • Hardware and platform changes (e.g., NVLink Fusion, SiFive RISC‑V integrations announced in early 2026) that increase inter‑GPU bandwidth and alter CPU/GPU power distribution, making per-node power management more nuanced.

Both make it more likely you will face demand response events or mandated power curtailments—prepare now with tested runbooks.

Before the event: Preparation checklist (run before a grid shortfall)

Prepare these once and validate quarterly.

  • Inventory & tagging: Tag compute nodes and jobs in your scheduler as critical, best-effort, or archive. Use labels like power=critical or training=noncritical.
  • Checkpointing policies: Enforce automatic periodic checkpoints (every N minutes/epochs) for long‑running training. Instrument frameworks (PyTorch, TensorFlow) to write incremental checkpoints to replicated object storage.
  • Power control interfaces: Ensure you can programmatically set GPU power caps (nvidia‑smi), throttle CPU/GPU frequencies (intel_pstate or cpufreq), and control PDUs via API.
  • Automation hooks: Map alerts to automation: Prometheus alert → webhook → runbook executor (e.g., Quickfix). Keep scripts in a git repo with signed tags.
  • Test plan: Run simulated demand response drills involving on‑call, facilities, and data plane teams annually.
  • Failover plans: Identify geographic failover targets (regions with spare capacity) and pre‑authorize cross‑region migrations for critical models.
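The checkpointing-policy bullet above reduces to a small budget calculation: pick the longest interval whose loss, if the job dies mid-interval, stays within an acceptable GPU-hour budget. A minimal sketch — the function name and defaults are illustrative, not part of any framework:

```python
# Sketch: derive a checkpoint interval from an acceptable-loss budget.
# All names here are illustrative, not part of any scheduler API.

def checkpoint_interval_minutes(gpus_per_job: int,
                                max_lost_gpu_hours: float,
                                floor_minutes: int = 5) -> int:
    """Longest interval (minutes) such that losing one interval of work
    costs at most max_lost_gpu_hours of GPU time."""
    if gpus_per_job <= 0:
        raise ValueError("gpus_per_job must be positive")
    interval = (max_lost_gpu_hours / gpus_per_job) * 60
    return max(floor_minutes, int(interval))

# A 64-GPU job with a 32 GPU-hour loss budget -> checkpoint every 30 minutes.
print(checkpoint_interval_minutes(64, 32))  # → 30
```

Large jobs end up with very short intervals, which is exactly when incremental checkpoints to replicated object storage pay off.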

Immediate checklist (first 0–15 minutes after a demand response or power alert)

Time is critical. Follow these steps in order and confirm each with a sync message in your incident channel.

  1. Mark the incident: Open incident ticket, set severity, and notify facilities. Use a template message that includes the grid notice (ISO/PJM) and expected duration.
  2. Assess available runtime: Query UPS and generator estimates. If UPS runtime < 30 minutes, move toward aggressive shedding.
  3. Protect humans and cooling: Verify CRAC/CRAH units are operational and that airflow is not compromised. If temperatures exceed safe thresholds, initiate emergency power‑off of affected racks per facilities SOP.
  4. Pause non‑critical jobs: Execute scheduler commands to suspend or hold best‑effort jobs (examples below).
  5. Power cap GPUs: Apply a temporary lower power limit to GPUs to buy time while maintaining training checkpoints.
  6. Scale down inference flexibly: Reduce replicas for non‑critical inference endpoints; maintain QoS for critical endpoints via request routing.
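The triage in steps 2–5 can be sketched as a small decision function; the workload classes, thresholds, and action names are illustrative and should be tuned to your environment:

```python
# Sketch of the immediate-checklist triage: map UPS runtime and workload
# class to a load-shedding action. Illustrative values only.

def shed_action(workload_class: str, ups_runtime_min: float) -> str:
    """Return the load-shedding action for a job during a power event."""
    if workload_class == "critical":
        # Keep running; only power-cap if remaining runtime is very short.
        return "power-cap" if ups_runtime_min < 15 else "keep"
    if ups_runtime_min < 30:
        # Aggressive shedding: checkpoint and suspend immediately.
        return "checkpoint-and-suspend"
    # Enough headroom: cap power first and checkpoint in the background.
    return "power-cap-and-checkpoint"

print(shed_action("best-effort", 20))  # → checkpoint-and-suspend
print(shed_action("critical", 45))     # → keep
```

Wiring a function like this into your runbook executor keeps the on-call decision deterministic under time pressure.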

Example commands — Kubernetes

# Scale down non-critical deployments
kubectl -n mlapps scale deployment/ml-batch-train --replicas=0

# Cordon nodes and evict non-critical pods (respect PDBs)
kubectl cordon gpu-node-12
kubectl drain gpu-node-12 --ignore-daemonsets --delete-emptydir-data --grace-period=120

# Annotate pods for pending resume
kubectl annotate pods -n mljobs my-train-pod resume-policy=checkpointed

Example commands — Slurm

# Place job on hold or suspend (test in staging first)
scontrol hold 12345
# Or suspend a running job
scontrol suspend 12345
# Resume later
scontrol resume 12345

Power capping (NVIDIA GPUs)

# Show current power limits
nvidia-smi -q -d POWER

# Set a node-level power limit (must be supported by driver)
sudo nvidia-smi -i 0 -pl 200  # set GPU 0 to 200W

# Restore later by re-applying the previously stored limit with -pl
# (note: -rac resets application clocks, not the power limit)
sudo nvidia-smi -i 0 -pl "$SAVED_PL_W"  # SAVED_PL_W: value recorded before capping
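"Store previous values in automation" implies reading the current limits before capping. A minimal sketch, assuming the standard `nvidia-smi --query-gpu=power.limit --format=csv,noheader,nounits` output (one wattage per line); the function names are illustrative:

```python
import subprocess

def parse_power_limits(csv_text: str) -> list:
    """Parse 'power.limit' CSV output (one wattage per line) into floats."""
    return [float(line) for line in csv_text.strip().splitlines() if line.strip()]

def read_power_limits() -> list:
    # One value per GPU, in watts, no header or units.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.limit",
         "--format=csv,noheader,nounits"], text=True)
    return parse_power_limits(out)

# Record limits before capping so the restore step can re-apply them, e.g.
# saved = read_power_limits(); later: nvidia-smi -i <idx> -pl <saved[idx]>
print(parse_power_limits("250.00\n300.00\n"))  # → [250.0, 300.0]
```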

Pausing training safely: application patterns and code snippets

Graceful pause must avoid wasted compute and corrupted checkpoints. Use a signal handler to request a stop on SIGTERM/SIGUSR1, then write an explicit checkpoint at a safe step boundary.

PyTorch example: SIGTERM handler for checkpointing

import os
import signal
import torch

# Assumes model, optimizer, epoch, and job_id are defined by the training setup.
stop_requested = False

def request_stop(signum, frame):
    # Only set a flag here: torch.save is not async-signal-safe, so the
    # checkpoint is written at the next safe step boundary instead.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, request_stop)

while not stop_requested:
    train_step()

# Write atomically: save to a temp file, then rename over the final path.
path = f"/mnt/checkpoints/job_{job_id}_ep{epoch}.pt"
torch.save({
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch
}, path + ".tmp")
os.replace(path + ".tmp", path)

Embed this in your training entrypoint so orchestrators can send SIGTERM and rely on the process to checkpoint within an expected grace period.

Advanced load‑shedding tactics

  • Dynamic concurrency: Reduce the number of concurrent training jobs per node (use cgroups or container limits).
  • Duty cycling: Rotate active racks to keep temperatures and power below thresholds, preserving some capacity for critical workloads.
  • Power capping + clock throttling: Combine GPU power limit with CPU frequency scaling to reduce whole‑node power draw gracefully.
  • Offload to spot/remote capacity: If your cloud provider or partner pools have flexible capacity, burst training to remote sites with spare headroom.
  • Progressive preemption: Use fair scheduler weights to preempt least‑valuable jobs first and preserve the smallest possible number of in‑progress checkpoints.
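Duty cycling, for example, can be sketched as greedy packing of racks into rotation windows so each window's combined draw stays under the site cap. Illustrative only — a real schedule must also respect cooling zones and redundancy:

```python
# Illustrative duty-cycling sketch: greedily pack racks into rotation windows
# so each window's total draw stays under the site power cap.

def rotation_windows(rack_watts: dict, cap_watts: float) -> list:
    """Return lists of rack names; racks in the same window run together."""
    windows = []
    for rack, watts in sorted(rack_watts.items(), key=lambda kv: -kv[1]):
        if watts > cap_watts:
            raise ValueError(f"{rack} alone exceeds the cap")
        for window in windows:
            # Place the rack in the first window with enough headroom.
            if window["watts"] + watts <= cap_watts:
                window["racks"].append(rack)
                window["watts"] += watts
                break
        else:
            windows.append({"racks": [rack], "watts": watts})
    return [sorted(w["racks"]) for w in windows]

print(rotation_windows({"r1": 40_000, "r2": 35_000, "r3": 30_000}, 70_000))
# → [['r1', 'r3'], ['r2']]
```

Fewer windows means a higher duty cycle per rack; the cap is the binding constraint.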

Orchestrating a graceful shutdown (when you must power off racks)

If the grid event requires total power reduction or a controlled shutdown, follow this safe sequence. Always confirm with facilities before cutting power.

  1. Notify stakeholders: Broadcast ETA and scope (which racks/nodes) via incident channels.
  2. Final checkpoint and sync: Trigger forced checkpoint saves and replicate checkpoints to remote object storage or NFS with write confirmation.
  3. Stop schedulers and user processes: For Kubernetes: drain nodes then stop kubelet. For Slurm: set nodes to DRAIN state and stop slurmd.
  4. Stop container runtime: systemctl stop containerd/docker; ensure processes exit cleanly and volumes are unmounted.
  5. Unload GPU drivers if required: Only if instructed by hardware/facilities. Use nvidia-smi to verify there are no active contexts.
  6. Power down via PDU: Follow PDU sequencing; avoid abrupt breaker trips. Facilities may prefer staggered racks to avoid inrush currents.
  7. Confirm power-off: Verify each rack reports down to the monitoring stack and record the state in the incident ticket.

Graceful shutdown commands — sample sequence

# Kubernetes node (run on each node or via SSH orchestration)
sudo systemctl stop kubelet
sudo systemctl stop containerd
# Ensure filesystem sync
sync && sudo umount /mnt/checkpoints || true
# Optional: set node maintenance flag in cluster control plane
kubectl annotate node gpu-node-12 maintenance=shutdown

# Slurm node
scontrol update NodeName=gpu-node-05 State=DRAIN Reason="Grid power event"
systemctl stop slurmd

Bringback sequence (order matters)

Restoring power carelessly can damage equipment. Coordinate with facilities to restore cooling and PDUs first, then follow this sequence:

  1. Confirm stable power and cooling: CRAC units at nominal CFM, inlet temp within spec.
  2. Power PDUs and racks in a staggered pattern: Follow inrush current guidance — bring up floor distribution panels (FDPs), then row PDUs, then rack PDUs.
  3. Boot nodes and validate hardware: Check IPMI health, BMC logs, DIMM/CPU/GPU errors.
  4. Load drivers and runtimes: Start container runtime, drivers, and orchestration agents (kubelet/slurmd).
  5. Sanity checks: Run GPU diagnostic (nvidia-smi -q), disk checks, and network tests. Verify dataset mounts and object storage connectivity.
  6. Resume workloads: Resume held/suspended jobs in controlled batches, monitor checkpoints and job health closely.
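Step 2's staggered power-up can be expressed as a simple schedule of (rack, offset) pairs; the batch size and delay below are placeholders for your facilities' inrush guidance:

```python
# Sketch of a staggered bring-up schedule to limit inrush current.
# batch_size and delay_s are illustrative defaults.

def bringup_schedule(racks, batch_size=2, delay_s=30):
    """Return (rack, power_on_offset_seconds) pairs, batch_size racks per step."""
    schedule = []
    for i, rack in enumerate(racks):
        schedule.append((rack, (i // batch_size) * delay_s))
    return schedule

print(bringup_schedule(["r1", "r2", "r3", "r4", "r5"]))
# → [('r1', 0), ('r2', 0), ('r3', 30), ('r4', 30), ('r5', 60)]
```

Feed the offsets to whatever drives your PDU API so the sequencing is recorded in the incident ticket automatically.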

Operational scripts & automation examples

Automate the mechanical parts of this runbook. Below is a minimal shell example to cap power and trigger checkpoint via Kubernetes job annotation — adapt and sign before running in production.

#!/bin/bash
# simple automation: apply GPU power cap and scale down batch training
set -euo pipefail

NODES=(gpu-node-01 gpu-node-02)
POWER_W=200

for n in "${NODES[@]}"; do
  ssh "root@${n}" "nvidia-smi -i 0 -pl ${POWER_W}"
done

# scale down non-critical
kubectl -n mlapps scale deployment/ml-batch-train --replicas=0
kubectl -n mljobs label jobs -l policy=best-effort paused=true --overwrite

Monitoring, alerting & telemetry to capture during an event

  • Node power draw (PDUs, outlet-level telemetry)
  • GPU power and temperature (nvidia-smi/DCGM)
  • UPS and generator state (BMS metrics)
  • Scheduler job states and checkpoint timestamps
  • Environmental sensors: inlet temp, humidity, airflow

Post-incident: lessons, metrics, and upgrades

After the incident, run a formal blameless postmortem with the following deliverables:

  • MTTR for load shed and full restore.
  • Number of interrupted jobs and total lost GPU-hours.
  • Checklist gaps and automation failures.
  • Plan for hardware/config upgrades (e.g., enable MIG, increase checkpoint frequency, procure PDUs with automation APIs).
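The lost-GPU-hours deliverable is simply the work done since each interrupted job's last good checkpoint, summed. A sketch with illustrative field names:

```python
# Postmortem metric sketch: lost GPU-hours = work since the last good
# checkpoint, summed over interrupted jobs. Field names are illustrative.

def lost_gpu_hours(interrupted_jobs) -> float:
    """Each job: dict with 'gpus' and 'hours_since_last_checkpoint'."""
    return sum(j["gpus"] * j["hours_since_last_checkpoint"]
               for j in interrupted_jobs)

jobs = [
    {"gpus": 64, "hours_since_last_checkpoint": 0.5},   # 32 GPU-hours
    {"gpus": 8,  "hours_since_last_checkpoint": 2.0},   # 16 GPU-hours
]
print(lost_gpu_hours(jobs))  # → 48.0
```

Tracking this per event makes the case for tighter checkpoint intervals concrete.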

Safety & compliance notes

Never bypass fire suppression/lockout procedures to force power cycling. Coordinate with facilities and compliance; document every step for auditors. If a regulator or ISO mandates curtailment, preserve logs and ticketed evidence of compliance.

Role-based responsibilities (quick reference)

  • On‑call SRE: Execute workload shedding and checkpoint triggers; update incident ticket.
  • Facilities: Provide UPS/gen runtime, approve PDU sequencing and power-off commands.
  • Data scientists: Ensure training scripts support checkpointing and resumability.
  • Security/Compliance: Validate that backup and replication completed before shutdown.

Case study (short): How a hyperscale AI team avoided data loss in 2025

In late 2025 a large research org faced a PJM curtailment notice during a multi‑week model run. Their pre‑tested runbook did three things in the first 10 minutes: 1) cut GPU power limits by 20% across non‑critical nodes, 2) suspended low‑priority Slurm jobs after triggering their checkpoint handlers, and 3) scaled inference down to preserve cooling headroom. UPS runtime stretched far enough to allow full checkpoint replication; no model corruption occurred, and only 2% of GPU‑hours were lost. This example underlines the value of automated, rehearsed processes.

Future predictions: 2026 and beyond

  • Expect tighter coupling between grid operators and large data consumers. Automated demand response APIs will be standard in major regions.
  • Hardware evolution (NVLink Fusion, RISC‑V integrations, more efficient accelerators) will enable more granular per‑device power management, but will increase orchestration complexity.
  • Regulation will push owners/operators to provide deterministic power usage data for planning — make sure your telemetry and incident records are audit-ready.

Operational readiness is now a cross‑discipline problem: SRE + facilities + data science. The better you orchestrate those teams under a tested runbook, the lower your MTTR and business risk.

Actionable takeaways

  • Tag and classify workloads so you can quickly target non‑critical jobs during a grid event.
  • Implement signal‑based checkpointing at the application level (SIGTERM handlers in training scripts).
  • Automate power capping and PDU control exposed via APIs; test in staging.
  • Run quarterly drills with facilities and rehearse the bringdown/bringback sequence.
  • Record evidence of curtailment responses to meet evolving regulatory requirements in 2026.

Downloadable checklist & next steps

Convert this runbook into a formal playbook for your on‑call rotation. Export the commands and scripts into an automation platform (CI/CD pipeline, runbook executor) and sign them into your repo. Test them in staging under supervised conditions.

Call to action

Ready to reduce MTTR and automate emergency power mitigation for your AI clusters? Download our executable runbook template and sample automation scripts, or contact Quickfix Cloud to integrate these playbooks with your alerting and CI/CD pipelines. Prepare your team before the next grid curtailment — schedule a runbook drill this quarter.
