# HyperHerd
**Pre-release / actively developed**
HyperHerd is in soft launch — the YAML schema, CLI flags, and Python API may change without notice between versions. Pin to an exact version (`hyperherd==X.Y.Z`) if you build on top of it, and expect breaking changes until a tagged 1.0.
Hyperparameter sweeps on SLURM, run by an autonomous agent. Declare your search in YAML, hand over a one-line launcher script, and walk away — `herd monitor` submits trials in stages, diagnoses failures, retries when SLURM can fix the problem, and posts only when it can't.
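In practice that hand-off is a couple of commands. A sketch, assuming a pip install and that both subcommands are invoked bare from the workspace directory (only the subcommand names are documented on this page; the exact invocations are assumptions):

```bash
# Hypothetical quickstart; exact flags may differ in your version.
pip install "hyperherd==X.Y.Z"   # pin exactly while pre-1.0

cd my-sweep/                     # workspace with hyperherd.yaml + launch.sh
herd run                         # generate and submit the SLURM array
herd monitor                     # hand the sweep to the autonomous agent
```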
**Want to skip ahead?**
The repo ships a complete MNIST sweep you can clone and run as-is. PyTorch Lightning + Hydra trainer, 11 trials, all four condition forms in use. Two minutes from `git clone` to trials on the queue.
**Have Claude Code set you up**
Open Claude Code in your project directory and paste the block below — Claude will walk you through install, config authoring, validation, and (if you want it) the autonomous monitor end-to-end. Full guide at Set up with Claude Code.
## What you write
Two files in a workspace directory:
- `hyperherd.yaml` — declares your sweep: parameters, grid mode, SLURM resources, conditions.
- `launch.sh` — a one-line bash script that receives a `name=value` override string as `$1` and runs your training command in whatever environment you need (container, conda, uv, modules).
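To make the shapes concrete, here is a hypothetical `hyperherd.yaml` for a small sweep. Every field name below is a guess reconstructed from the prose above, not the authoritative schema; the sweep config reference has the real one.

```yaml
# hyperherd.yaml: illustrative sketch only. Field names are guesses;
# the schema is pre-1.0 and may change.
parameters:
  optimizer.lr: [1e-4, 3e-4, 1e-3]   # sent to launch.sh as optimizer.lr=3e-4
  model.width: [128, 256]
grid: full                           # assumed switch for grid mode
slurm:                               # resources the monitor may bump
  mem: 16G
  time: "02:00:00"
conditions: []                       # the condition forms would go here
```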
## What you get
- **One-command sweeps.** No sbatch boilerplate, no manual resubmits — `herd run` generates and submits the array, tracks state, and resumes failed/pending trials on rerun.
- **An agent that actually operates the sweep.** `herd monitor` ramps trials in stages, diagnoses failures, bumps memory or wall-time when that's the right fix, and pings you only when it can't.
- **Two-way Discord control.** A dedicated channel per sweep with deterministic slash commands (`/status`, `/run`, `/cancel`, `/tail`, …) and free-form mentions for the agent.
- **Edit your sweep mid-run.** Bump a parameter range or add a value; the next `herd run` appends new trials without touching the ones already running.
- **Configs you don't have to memorize.** The bundled Claude Code skill writes `hyperherd.yaml` for you from a one-paragraph description.
- **An audit trail.** Every trial's parameters, status, and logged metrics live in `.hyperherd/` and come out as TSV via `herd res` or JSON via `herd snapshot`, as sketched below.
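For example, exporting that audit trail could look like the following. The command names come from the list above; writing to stdout is an assumption:

```bash
# Run from the workspace directory; .hyperherd/ holds the state.
herd res > results.tsv        # per-trial parameters, status, metrics (TSV)
herd snapshot > sweep.json    # the same state as JSON
```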
Hydra is the recommended trainer harness — its CLI consumes `name=value` overrides natively, so the string passes through unchanged — but the launcher is free-form bash, so parse the arguments however you want.
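A minimal launcher in that spirit, as a sketch: `train.py` and the environment line are placeholders for your own trainer and cluster setup.

```bash
#!/usr/bin/env bash
# launch.sh: receives the name=value override string as $1 and
# forwards it to the trainer.
source .venv/bin/activate     # stand-in for container/conda/uv/modules
python train.py $1            # unquoted on purpose: if $1 carries several
                              # space-separated overrides, let it word-split
```

A single trial's dispatch then boils down to something like `./launch.sh "optimizer.lr=3e-4 model.width=256"` (whether multiple overrides arrive in one `$1` string is an assumption, and the parameter names are from the sketch above).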
## Scope
HyperHerd is opinionated. It assumes:
- SLURM job arrays as the dispatch mechanism.
- `name=value` overrides as the parameter contract.
- A bash launcher script as the integration point.
## Where to next
- Try the MNIST example — the fastest way to see HyperHerd work. Clone, install, run.
- Autonomous monitor — start here if you want the agent runner. Setup, Discord channel, slash commands, failure triage.
- Discord setup — one-time bot creation walkthrough.
- Getting started — install, scaffold, run your first sweep.
- Sweep config reference — every field in `hyperherd.yaml`.
- Command reference — every `herd` subcommand.
- Claude Code skill — generate configs by asking Claude.