# Set up HyperHerd with Claude Code
Paste this document (or just the URL) into Claude Code and Claude will walk you through HyperHerd setup end-to-end — installing the package, scaffolding a workspace, authoring `hyperherd.yaml` + `launch.sh`, validating, and (optionally) bringing up the autonomous monitor with Discord.
The rest of this page is written in second person to Claude. If you're a human reading this, the Getting started page is the friendlier read.
## Your role, Claude
You're helping the user get HyperHerd running on their cluster. They may be brand-new to the package, may already have a trainer they want to sweep, and may or may not want the autonomous monitor. Drive the conversation — don't dump everything at once.
Do not skip Phase 0. The user's environment dictates almost every later choice (Python version, container runtime, trainer harness, monitor opt-in). Ask all the questions in one message, then wait for answers.
Defer to the user's existing setup. If they already have a `train.py` and a working environment, your job is to wire HyperHerd around it — don't rewrite their trainer.
When in doubt about HyperHerd specifics, fetch the relevant page from https://allenwlynch.github.io/hyperherd/ (e.g. `getting-started/`, `configuration/`, `monitor/`, `discord-setup/`, `launcher/`) rather than guessing.
## Phase 0 — Setup interview (ask all of these in one message)
- What's the cluster? SLURM partition name, typical resources for one trial (GPUs, memory, walltime). Is `sbatch`/`sacct`/`squeue` already on `$PATH`?
- What's the trainer? Path to the existing training script (e.g. `python train.py`), and which arg style: Hydra-style `name=value` overrides, `--flags`, or something else.
- What's the environment? Container (Apptainer/Singularity, Docker), conda/mamba env, `module load`, plain `pip`, or `uv`. You'll need to know how the user normally invokes their trainer on a compute node.
- What parameters do they want to sweep? Names + types (discrete values vs continuous range), and whether it's a full grid, a partial grid (sweep some, hold others fixed), or one-at-a-time around a baseline.
- Do they want the autonomous monitor? That's the agent daemon that ramps trials, diagnoses failures, and chats over Discord. It needs Python ≥ 3.10 and an Anthropic API key (or a Claude Code subscription). If they say no, skip Phase 5.
## Phase 1 — Install
Pick the right install line based on the monitor answer from the Phase 0 interview:
```bash
# Base CLI only — Python ≥ 3.8
pip install hyperherd

# With the autonomous monitor — Python ≥ 3.10
pip install 'hyperherd[monitor]'
```
After install, verify the CLI is on `$PATH`. A standard version check works as a smoke test (assuming the usual `--version` convention):
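```bash
herd --version   # assumption: the CLI follows the standard --version convention
```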
Then install the bundled Claude Code skill so the user (and you) get deeper config help in future sessions. The subcommand below is an assumption; confirm it against `herd --help` before running:
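```bash
# Hypothetical subcommand name; verify with `herd --help`
herd install-skill
```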
This drops a skill into `~/.claude/skills/hyperherd-config/`. Tell the user that future `hyperherd.yaml` editing sessions will pick it up automatically.
## Phase 2 — Scaffold the workspace
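Scaffold with an init-style subcommand. The name below is an assumption; confirm against `herd --help` or the getting-started page:

```bash
# Assumed subcommand name; it should emit hyperherd.yaml + launch.sh stubs
herd init my_workspace
```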
This creates two files: `hyperherd.yaml` (declarative sweep) and `launch.sh` (bash entry point). Open both and edit them with the user.
### `hyperherd.yaml` essentials
A minimal config:
```yaml
name: my_sweep          # used as the SLURM job name and Discord channel name
launcher: ./launch.sh   # path is resolved relative to this file

parameters:
  learning_rate:
    type: continuous
    abbrev: lr
    low: 1e-5
    high: 1e-2
    scale: log
    steps: 5
  optimizer:
    type: discrete
    abbrev: opt
    values: [adam, sgd]

grid: all  # full Cartesian product of the above

slurm:
  partition: gpu
  time: "04:00:00"
  mem: 16G
  cpus_per_task: 4
  gres: "gpu:1"
```
Key decisions to walk through:
- Grid mode. `grid: all` (Cartesian), `grid: [param1, param2]` (sweep these, fix others at their `default:`), or omit `grid` for one-at-a-time. See the sketch after this list for a partial grid.
- Discrete vs continuous. Continuous needs `low`/`high`/`scale`/`steps`. Log scale requires `low > 0`.
- `abbrev`. Short, distinct token used in trial names like `lr-0.001_opt-adam`. Required when the parameter name has anything outside `[A-Za-z0-9._-]`.
- Conditions. If parameters interact (e.g. `optimizer=adam` should never use `momentum`), use `conditions:` — fetch `configuration/` and `conditions/` if needed.
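For example, a partial grid that sweeps only `learning_rate` while holding `optimizer` fixed. This is a sketch that assumes `default:` sits on the parameter itself, per the grid-mode bullet above; verify the placement against `configuration/`:

```yaml
parameters:
  learning_rate:
    type: continuous
    abbrev: lr
    low: 1e-5
    high: 1e-2
    scale: log
    steps: 5
  optimizer:
    type: discrete
    abbrev: opt
    values: [adam, sgd]
    default: adam       # assumed placement; used because optimizer is not in grid

grid: [learning_rate]   # sweep lr only; optimizer stays at its default
```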
For the full reference, fetch https://allenwlynch.github.io/hyperherd/configuration/.
### `launch.sh` contract
The script is invoked as `bash launch.sh "<overrides>"`. `$1` is a space-separated `name=value` string. Available env vars inside the script:
- `$HYPERHERD_WORKSPACE` — absolute workspace path
- `$HYPERHERD_SWEEP_NAME` — `name:` from `hyperherd.yaml` (shared across trials)
- `$HYPERHERD_TRIAL_ID` — array task index
- `$HYPERHERD_TRIAL_NAME` — auto-generated per-trial id (e.g. `lr-0.001_opt-adam`)
For Hydra trainers, the launcher is a one-liner because Hydra consumes the override string natively:
```bash
#!/bin/bash
set -euo pipefail

OVERRIDES="$1"
# $OVERRIDES is intentionally unquoted so each name=value pair becomes its own argument
apptainer exec --nv container.sif python train.py $OVERRIDES
```
For non-Hydra trainers, parse the string — either with the `parse_overrides()` helper:
```python
import sys

from hyperherd import parse_overrides

parsed = parse_overrides(sys.argv[1])  # → {"learning_rate": "0.001", "optimizer": "adam"}
```
…or with bash word-splitting + the user's CLI conventions. Don't invent a parser if one already exists in their trainer.
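For instance, a launcher body that turns `name=value` pairs into `--name value` flags for an argparse-style trainer. This is a sketch; adapt the flag mapping to the trainer's real CLI:

```bash
#!/bin/bash
set -euo pipefail

# $1 is the space-separated "name=value ..." string HyperHerd passes in
FLAGS=()
for kv in $1; do                       # intentional word-splitting on spaces
  FLAGS+=("--${kv%%=*}" "${kv#*=}")    # learning_rate=0.001 → --learning_rate 0.001
done

python train.py "${FLAGS[@]}"
```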
For container/conda/module patterns, fetch https://allenwlynch.github.io/hyperherd/launcher/.
## Phase 3 — Validate before submitting
```bash
# List every trial the YAML produces (status-agnostic):
herd ls <workspace>

# Submission preview — pending indices + the sbatch script:
herd run <workspace> --dry-run

# Run a single trial locally (no SLURM) to sanity-check the launcher + trainer:
herd test <workspace> 0

# Hydra users: print the resolved config for trial 0 without training:
herd test <workspace> 0 --cfg-job
```
`herd ls` answers "what's in the sweep?"; `herd run --dry-run` answers "what would `herd run` do right now?". Read the dry-run output carefully — it prints the exact bash that will run on the compute node. If anything looks off (wrong container path, missing `module load`, wrong override key), fix before submitting.
## Phase 4 — Submit
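Submitting is the same `herd run` used for the dry run, minus the flag:

```bash
herd run <workspace>
```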
Then track:
```bash
herd status   # one-shot status table
herd tail 3   # last 20 lines of trial 3's stdout/stderr
herd stats    # sacct accounting once trials finish
herd res      # TSV of params + logged metrics
```
If a trial fails, fix the issue and re-run `herd run` — it's idempotent and only resubmits ready/failed/cancelled trials.
To log per-trial metrics from the trainer, add this to the training code:
```python
from hyperherd import log_result

# Inside the training loop (epoch is the user's loop variable):
log_result("val_accuracy", 0.94, step=epoch)
log_result("final_loss", 0.12)
```
For PyTorch Lightning users, the bundled logger forwards every `pl_module.log()` call automatically:
```python
import pytorch_lightning as pl

from hyperherd.integrations.lightning import HyperHerdLogger

# wandb_logger stands in for whatever loggers the user already has
trainer = pl.Trainer(logger=[wandb_logger, HyperHerdLogger()])
```
## Phase 5 — Autonomous monitor (optional)
Skip this phase if the user declined the monitor in Phase 0.
The monitor is a long-running daemon that:
- Watches the sweep and posts state-change events to Discord
- Runs the staged-rollout / failure-triage / pruning policy via a Claude agent loop
- Accepts slash commands (`/status`, `/tail`, `/run`, `/cancel`, `/prune`, `/metrics`, `/stop`) and free-form @-mentions
### One-time Discord bot setup
This is a multi-step walkthrough — fetch https://allenwlynch.github.io/hyperherd/discord-setup/ and run it interactively. The summary:
- Create an application + bot at https://discord.com/developers/applications
- Enable the MESSAGE CONTENT privileged gateway intent
- Generate an invite URL with scopes `bot` + `applications.commands` and the permissions View Channels, Send Messages, Read Message History, Manage Channels (see the URL shape after this list)
- Invite the bot to the user's server
- Copy the bot token → `DISCORD_BOT_TOKEN` env var
- Right-click the server → Copy Server ID → `discord.guild_id` in `hyperherd.yaml`
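The generated invite URL has this shape; the permissions integer is the sum of the four permission bits (View Channels 1024 + Send Messages 2048 + Read Message History 65536 + Manage Channels 16 = 68624), which the developer portal computes for you:

```
https://discord.com/api/oauth2/authorize?client_id=<APP_ID>&scope=bot+applications.commands&permissions=68624
```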
If the user plans to run multiple sweeps in parallel, create one bot per workspace — Discord allows only one gateway connection per token, so two daemons sharing a token will kick each other off. The daemon's startup preflight detects this via a per-channel heartbeat marker and refuses to start. The marker is cleared on clean shutdown, so a normal restart isn't blocked; pass `--force-discord` if a previous daemon was killed uncleanly and the stale heartbeat hasn't aged out yet (~18 min).
### Anthropic credentials
Either:

- API console billing: `export ANTHROPIC_API_KEY=sk-ant-...`
- Claude Code subscription: the user runs `claude /login` once; the daemon picks up the OAuth token automatically
### `hyperherd.yaml` additions
```yaml
discord:
  guild_id: 1234567890123456789

# Optional — external MCP servers the agent should have access to
mcp_servers:
  - name: wandb
    command: npx
    args: ["-y", "@wandb/mcp-server"]
    env:
      WANDB_API_KEY: ${WANDB_API_KEY}
```
### Per-workspace `.env`
Drop secrets in `<workspace>/.env` so the daemon picks them up without leaking them to git:
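For example, using the variable names from the sections above:

```bash
# <workspace>/.env — keep this file out of version control
ANTHROPIC_API_KEY=sk-ant-...
DISCORD_BOT_TOKEN=...
```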
The daemon auto-loads `<workspace>/.env` at startup and only fills in keys not already set in the environment.
### Run
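Start the daemon from the workspace. The subcommand name below is an assumption; confirm with `herd --help` before running:

```bash
# `monitor` is an assumed subcommand name; verify against the CLI help
herd monitor <workspace>
```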
The daemon connects to Discord, creates a channel for the sweep, runs a short setup interview (metric, remediation policy, metric source), then operates the sweep autonomously. Wrap in tmux/screen to survive disconnects.
If the user wants the dashboard + slash-command surface but not the agent-driven cost, suggest:
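Something along these lines, where the flag name is a guess; confirm on the monitor page before suggesting it:

```bash
# --no-agent is a hypothetical flag name; check the monitor docs for the real one
herd monitor <workspace> --no-agent
```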
For the full picture, fetch https://allenwlynch.github.io/hyperherd/monitor/.
## Common pitfalls
- `HYPERHERD_TRIAL_NAME` confusion. This is the auto-generated per-trial id; `HYPERHERD_SWEEP_NAME` is the shared sweep name. Older code may reference `HYPERHERD_EXPERIMENT_NAME` — that's a legacy alias for `TRIAL_NAME`, still set by HyperHerd, but don't write new code against it.
- Idempotent training. Trials may be resubmitted (after SLURM-side failures or `scancel`). Use `$HYPERHERD_TRIAL_NAME` for a stable output dir, resume from checkpoint on startup, and don't fail on existing output dirs — see the sketch after this list.
- Compute nodes without Python. HyperHerd doesn't need Python on the compute node — the per-trial values are baked into the sbatch `case` statement at submission time. Only `bash` is required outside the container.
- `abbrev` collisions. Two parameters with the same `abbrev` will produce ambiguous trial names; the validator catches this but the error can be cryptic — pick distinct short tokens.
- `launcher:` path. Resolved relative to the `hyperherd.yaml` file's directory, not the cwd. Use `./launch.sh` and keep them in the same workspace dir.
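A minimal idempotent-startup sketch for the trainer side, where `load_checkpoint`, `train_one_epoch`, and `save_checkpoint` are hypothetical stand-ins for the user's own functions:

```python
import os
from pathlib import Path

# Stable per-trial output dir: HYPERHERD_TRIAL_NAME is identical across resubmissions
out_dir = Path("runs") / os.environ["HYPERHERD_TRIAL_NAME"]
out_dir.mkdir(parents=True, exist_ok=True)  # don't fail if the dir already exists

ckpt = out_dir / "last.ckpt"
start_epoch = 0
if ckpt.exists():
    start_epoch = load_checkpoint(ckpt)     # hypothetical: restore model/optim state

num_epochs = 10
for epoch in range(start_epoch, num_epochs):
    train_one_epoch(epoch)                  # hypothetical: the user's training step
    save_checkpoint(ckpt, epoch + 1)        # hypothetical: write resumable state
```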
## When something breaks
- `herd run --dry-run` is the first thing to try — it does the same validation `herd run` does without submitting.
- `herd test <workspace> 0` runs trial 0 locally (no SLURM) so the launcher / trainer / overrides can be debugged in isolation.
- `.hyperherd/logs/<idx>.out` and `.err` capture each trial's stdout/stderr.
- `herd snapshot <workspace>` prints a JSON bundle of the whole sweep state — useful for debugging or for handing the user a diff.
For anything beyond this, point the user at the relevant docs page or fetch it directly:
| Topic | URL |
|---|---|
| Sweep config reference | https://allenwlynch.github.io/hyperherd/configuration/ |
| Conditions | https://allenwlynch.github.io/hyperherd/conditions/ |
| Launcher patterns | https://allenwlynch.github.io/hyperherd/launcher/ |
| Command reference | https://allenwlynch.github.io/hyperherd/commands/ |
| Autonomous monitor | https://allenwlynch.github.io/hyperherd/monitor/ |
| Discord setup | https://allenwlynch.github.io/hyperherd/discord-setup/ |
| Workspace layout | https://allenwlynch.github.io/hyperherd/workspace/ |
| Results & logging | https://allenwlynch.github.io/hyperherd/results/ |