
Getting started

Skip the scaffolding

Want to see HyperHerd run before you write any YAML? The repo ships a complete MNIST sweep you can clone and submit as-is. Two minutes from git clone to trials on the queue. Come back here once you want to set up your own.

Installation

# Base CLI (Python ≥ 3.8):
pip install hyperherd

# With the autonomous monitor (Python ≥ 3.10 — Discord, Claude Agent SDK):
pip install 'hyperherd[monitor]'

Install the Claude Code skill

HyperHerd ships with a Claude Code skill for authoring sweep configs. After pip install, run:

herd install-skill

This drops the skill into ~/.claude/skills/hyperherd-config/. Next time you open Claude Code, you can ask it to write or edit a hyperherd.yaml and it will use the skill's checklist + patterns.
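
If you want to confirm the install landed, the skill directory named above should now exist:

ls ~/.claude/skills/hyperherd-config/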

Verify

herd --help

This should print the available subcommands.

Dependencies

  • Python ≥ 3.8 for the base herd CLI
  • Python ≥ 3.10 for the [monitor] extras (autonomous monitor daemon — Discord, Claude Agent SDK)
  • PyYAML and Pydantic ≥ 2.0 (installed automatically)
  • A SLURM cluster with sbatch, sacct, squeue, and scancel on the submission host

The trial training environment (where SLURM jobs run) only needs the base package — no Python 3.10 requirement, no monitor deps.

Your first sweep

1. Scaffold a workspace

herd init my_experiment

This creates my_experiment/hyperherd.yaml and my_experiment/launch.sh with template content. Edit both to match your setup:

  • hyperherd.yaml — define your parameters, grid mode, SLURM resources, and conditions
  • launch.sh — set up your container, conda environment, or module loads
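
To give a feel for what the template is asking for, here is a minimal sketch of a hyperherd.yaml. The key names and layout below are illustrative assumptions, not the canonical schema; treat the generated template and the Claude Code skill as the authoritative reference.

# Hypothetical layout: key names are illustrative, not the real schema.
parameters:
  lr: [0.0003, 0.001, 0.003]
  batch_size: [64, 128]
mode: grid                      # every combination becomes one trial
slurm:
  partition: gpu
  gres: gpu:1
  time: "04:00:00"
conditions:
  - exclude: lr == 0.003 and batch_size == 64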

If you already have a config or launcher you want to reuse from another experiment, drop them in directly:

herd init my_experiment --config /path/to/existing.yaml --launcher /path/to/launch.sh

Flag              What it does
--config FILE     Copy FILE in as hyperherd.yaml instead of generating a template
--launcher FILE   Copy FILE in as launch.sh instead of generating a template
-f, --force       Overwrite existing files in the target directory
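
For example, to re-scaffold a workspace around a config you already trust, overwriting whatever is there (the paths are placeholders):

herd init my_experiment --config ../shared/hyperherd.yaml -f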

2. Validate before submitting

# List every trial in the sweep (no SLURM interaction):
herd ls my_experiment

# Preview the submission plan — pending indices + the sbatch script:
herd run my_experiment --dry-run

# Hydra users: validate Hydra config for trial 0 by running locally with --cfg job:
herd test my_experiment

The two preview commands answer different questions: herd ls is "what does my sweep look like?" (every combination, status-agnostic); herd run --dry-run is "what would herd run do right now?" (only the trials that would actually be submitted, plus the sbatch script).

herd test is the only Hydra-specific subcommand — it appends --cfg job to the override string so Hydra prints the resolved config and exits without running training. If your trainer doesn't use Hydra, skip it; herd ls and herd run --dry-run cover the preview step.
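
For reference, the manual equivalent for a single trial is just Hydra's own --cfg job flag appended to that trial's overrides. train.py and the override names here are placeholders:

python train.py lr=0.001 batch_size=64 --cfg job   # prints the resolved config, runs nothing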

3. Launch

herd run my_experiment

You'll see the SLURM job array submitted. Most herd subcommands work without an argument when run from inside the workspace:

cd my_experiment
herd status
herd tail 3

4a. Let the monitor run it

For anything longer than a few minutes, let the autonomous monitor operate the sweep for you:

pip install 'hyperherd[monitor]'   # one-time, on the daemon machine
herd monitor

The daemon connects to Discord (one-time bot setup — see Discord setup), creates a channel for the sweep, walks you through a 3-question setup interview (metric, remediation policy, metric source), then handles staged rollout and failure triage. You direct it from the channel via slash commands or by @-mentioning the bot.

Wrap in tmux to keep it running after you log out:

tmux new -s monitor 'herd monitor'

See Autonomous monitor for the full picture.

4b. Drive it manually (if you'd rather)

If you don't want the agent in the loop, the same workspace can be operated by hand:

herd status        # one-shot status table
herd stats         # sacct accounting (runtime + memory) for every trial
herd tail 3        # last 20 lines of trial 3's stdout

5. Resubmit failures, or edit the sweep mid-flight

herd run is idempotent. Re-running the same workspace only submits trials that are ready, failed, or cancelled:

herd run        # picks up just the trials that need to run

Editing the config while trials are running

You can update hyperherd.yaml between (or even during) runs. On the next herd run, HyperHerd diffs the new manifest against the existing one, appends new trials, and asks before dropping any completed/running trials your edits would remove. Already-completed trials keep their results — they aren't re-run.

Typical workflow: launch a coarse sweep, watch a few results come in, then add another value to a parameter or tighten a condition and run again. The new trials get submitted, the existing ones stay. See reconciliation for the exact rules.
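
As a concrete sketch, reusing the hypothetical keys from the config example above: adding one value to a swept parameter is often the whole edit, and the next herd run appends just the new combinations.

parameters:
  lr: [0.0003, 0.001, 0.003, 0.01]   # 0.01 is new; its combinations are appended as fresh trials
  batch_size: [64, 128]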

6. Collect results

Log per-trial summary metrics from your training code:

from hyperherd import log_result

# One call per summary metric; herd res collects these across the sweep.
log_result("test_accuracy", 0.95)
log_result("final_loss", 0.12)

Then print a TSV summary:

herd res
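
The summary goes to stdout, so ordinary shell tooling works on it; for example, saving it and pretty-printing the columns (the exact layout depends on which metrics you logged):

herd res > results.tsv
column -t -s $'\t' results.tsv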

7. Clean up

herd clean --all

Next steps