# Command reference
Every herd subcommand is documented below. Most subcommands take an optional workspace directory as their first positional argument (default: .); when run from inside the workspace you can usually omit it.
## Agent (--json) mode
The read- and run-style commands (run, status, stats, tail, res, stop) accept a --json flag that emits a structured JSON document to stdout instead of the human-formatted table. Use it when an agent or other automation is driving HyperHerd:
- Memory values are numeric bytes (not `1.50G`); elapsed time is in seconds (not `01:30:00`).
- Status uses the stable internal enum: `ready`, `submitted`, `queued`, `running`, `completed`, `failed`, `cancelled`.
- Empty / unknown values come through as `null`, never an empty string.
- Errors still go to stderr with a non-zero exit code; stdout in JSON mode is always a single valid JSON document or empty.
- Warnings (preflight, partition checks) print to stderr as in normal mode and do not corrupt the stdout JSON, so `herd run --dry-run --json | jq ...` is always safe.
The JSON shape for each command is documented inline below.
## herd init
Scaffold a new sweep workspace.
Creates DIRECTORY/hyperherd.yaml and DIRECTORY/launch.sh with template content; the experiment name is taken from the directory name. If DIRECTORY is omitted, files are written to the current directory.
The templates have placeholder SLURM resource fields (partition, time, mem, cpus_per_task) you'll edit to match your cluster — herd init doesn't try to be a substitute for opening the YAML.
| Flag | Description |
|---|---|
| `--config FILE` | Copy FILE in as hyperherd.yaml instead of generating a template — useful for cloning an existing sweep |
| `--launcher FILE` | Copy FILE in as launch.sh instead of generating a template |
| `-f, --force` | Overwrite existing files in the target directory |
Example output
Created my_experiment/hyperherd.yaml
Created my_experiment/launch.sh
Next steps:
1. Edit hyperherd.yaml to define your parameters and SLURM resources
2. Edit launch.sh to set up your container/environment
3. Run: herd run my_experiment --dry-run
## herd run
Submit (or resubmit) the sweep.
Generates the trial manifest, runs preflight checks, writes the sbatch script to .hyperherd/job.sbatch, submits it, and records the SLURM job ID.
herd run is idempotent: it only submits trials whose status is ready, failed, or cancelled. Trials that are submitted, queued, running, or completed are skipped unless you opt in with --force.
| Flag | Description |
|---|---|
| `-n, --dry-run` | Print the submission plan (sbatch script + pending indices); don't submit. Use herd ls for the full trial list. |
| `-j, --max-concurrent N` | Cap concurrent running tasks (overrides slurm.max_concurrent) |
| `-i, --indices SPEC` | Submit only these trial indices, e.g. 0-3,5,7-9 |
| `-p, --pin NAME=VALUE [NAME=VALUE ...]` | Submit only trials whose swept params match every pin (e.g. --pin batch_size=32 optimizer=adam). Names must be sweep parameters; values are coerced to int/float/str. |
| `-f, --force` | With --indices, allow resubmitting running/completed trials. Without, allow config edits that drop running/completed trials (kept as orphans). |
Editing the config mid-sweep is supported. If you edit hyperherd.yaml between runs, herd run reconciles the new manifest against the old one: new trials are appended, removed trials are dropped (or kept as orphans with -f if they were already running/completed). See Re-running and reconciliation for the rules.
Example output — successful submission
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Launched 11 trials as SLURM job array 38384012
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
workspace: my_experiment/.hyperherd
logs: my_experiment/.hyperherd/logs
monitor: herd status
Example output — --dry-run
============================================================
DRY RUN — No jobs will be submitted
============================================================
Generated sbatch script (per-trial lookup elided for brevity):
----------------------------------------
#!/bin/bash
#SBATCH --job-name=hyperherd_lr_sweep
#SBATCH --array=0-11
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=2G
#SBATCH --cpus-per-task=1
#SBATCH --output=/path/to/lr_sweep/.hyperherd/logs/%a.out
#SBATCH --error=/path/to/lr_sweep/.hyperherd/logs/%a.err
#SBATCH --open-mode=append
# Run divider — visible in both stdout and stderr after append
_HH_DIVIDER="==== HyperHerd run: job ${SLURM_JOB_ID} array-task ${SLURM_ARRAY_TASK_ID} $(date -Iseconds) ===="
printf "\n%s\n\n" "$_HH_DIVIDER"
printf "\n%s\n\n" "$_HH_DIVIDER" >&2
# Export HyperHerd environment variables
export HYPERHERD_WORKSPACE=/path/to/lr_sweep
export HYPERHERD_SWEEP_NAME=lr_sweep
export HYPERHERD_TRIAL_ID="$SLURM_ARRAY_TASK_ID"
# Per-trial lookup baked at submission time (no Python required here).
case "$SLURM_ARRAY_TASK_ID" in
0)
HYPERHERD_TRIAL_NAME=lr-0.0001_opt-adam
HYPERHERD_EXPERIMENT_NAME=lr-0.0001_opt-adam
OVERRIDES='experiment_name=lr-0.0001_opt-adam learning_rate=0.0001 optimizer=adam'
;;
# ... [11 more trial arm(s) elided in dry-run; full script is submitted] ...
*)
echo "HyperHerd: no lookup entry for SLURM_ARRAY_TASK_ID=$SLURM_ARRAY_TASK_ID" >&2
exit 1
;;
esac
export HYPERHERD_TRIAL_NAME HYPERHERD_EXPERIMENT_NAME
# Invoke the user's launcher script
bash /path/to/lr_sweep/launch.sh "$OVERRIDES"
----------------------------------------
Submission plan
Pending: 12 of 12 trial(s)
Indices: 0-11
Use herd ls to see every trial in the sweep.
Agent mode — herd run --dry-run --json emits the submission plan (the indices + sbatch script that would actually be submitted right now, given current status / --pin / --indices). For the full sweep enumeration regardless of status, use herd ls (or its JSON variant when added). The intended workflow for an agent is to inspect the trials, then call herd run --json to submit.
{
"dry_run": true,
"slurm_job_id": null,
"sbatch_path": null,
"submitted_indices": [0, 1, 2, 3],
"sbatch_script": "#!/bin/bash\n#SBATCH --array=0-3\n...",
"trials": [
{"index": 0, "status": "ready", "experiment_name": "lr-0.01_bs-32",
"params": {"lr": 0.01, "bs": 32}},
{"index": 1, "status": "ready", "experiment_name": "lr-0.01_bs-64",
"params": {"lr": 0.01, "bs": 64}}
]
}
A real (non-dry-run) herd run --json returns the same shape with dry_run: false, slurm_job_id populated, sbatch_path set to where the script was written (.hyperherd/job.sbatch), and sbatch_script: null.
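A common agent pattern is: read failed indices from `herd status --json`, then resubmit just those via `--indices`. The compressor below is a hypothetical helper (not shipped with HyperHerd) that produces the SPEC format the flag accepts:

```python
def to_indices_spec(indices: list[int]) -> str:
    """Compress trial indices into the --indices SPEC format,
    e.g. [0, 1, 2, 3, 5, 7, 8, 9] -> "0-3,5,7-9"."""
    spans: list[list[int]] = []
    for i in sorted(set(indices)):
        if spans and i == spans[-1][1] + 1:
            spans[-1][1] = i          # extend the current run
        else:
            spans.append([i, i])      # start a new run
    return ",".join(f"{a}-{b}" if a != b else str(a) for a, b in spans)
```

An agent would then invoke `herd run --indices {to_indices_spec(failed)} --json` to target only the failed trials.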
## herd ls
List every trial in the sweep with its swept parameters.
Status-agnostic — shows the shape of the sweep, not what herd run would do next. Reads the manifest if present; otherwise materializes the combinations from hyperherd.yaml so you can herd ls BEFORE the first herd run to sanity-check the YAML.
| Flag | Description |
|---|---|
| `-p, --pin NAME=VALUE [NAME=VALUE ...]` | Filter to trials whose swept params match every pin (e.g. --pin batch_size=32 optimizer=adam). |
Use herd status for the SLURM-synced status table (this command does not touch SLURM); use herd run --dry-run for a submission preview.
Example output
(no manifest yet — showing combinations from hyperherd.yaml)
Trials: 12
[0] lr-0.0001_opt-adam
learning_rate=0.0001
optimizer=adam
[1] lr-0.0001_opt-sgd
learning_rate=0.0001
optimizer=sgd
[2] lr-0.0001_opt-adamw
learning_rate=0.0001
optimizer=adamw
[3] lr-0.001_opt-adam
learning_rate=0.001
optimizer=adam
[4] lr-0.001_opt-sgd
learning_rate=0.001
optimizer=sgd
[5] lr-0.001_opt-adamw
learning_rate=0.001
optimizer=adamw
...
## herd status
Show the current status table for every trial.
Status values:
| Status | Meaning |
|---|---|
| `ready` | Never submitted |
| `submitted` | Sent to SLURM, not yet picked up |
| `queued` | SLURM PENDING |
| `running` | SLURM RUNNING |
| `completed` | SLURM COMPLETED |
| `failed` | SLURM FAILED / TIMEOUT / OUT_OF_MEMORY / NODE_FAIL |
| `cancelled` | SLURM CANCELLED (or via herd stop) |
herd status syncs from SLURM each time it runs (via sacct).
Example output
Trial Params Status Last Log
------------------------------------------------------------------------------
0 learning_rate=0.0001 optimizer=adam batch_size=... COMPLETED Test acc: 0.9412
1 learning_rate=0.0001 optimizer=sgd batch_size=6... COMPLETED Test acc: 0.9385
2 learning_rate=0.0001 optimizer=adamw batch_size... COMPLETED Test acc: 0.9404
3 learning_rate=0.001 optimizer=adam batch_size=6... COMPLETED Test acc: 0.9601
4 learning_rate=0.001 optimizer=sgd batch_size=64... COMPLETED Test acc: 0.9512
5 learning_rate=0.001 optimizer=adamw batch_size=... COMPLETED Test acc: 0.9588
6 learning_rate=0.01 optimizer=adam batch_size=64... RUNNING Epoch 7/10 loss=0.142
7 learning_rate=0.01 optimizer=sgd batch_size=64 ... RUNNING Epoch 4/10 loss=0.318
8 learning_rate=0.01 optimizer=adamw batch_size=6... RUNNING Epoch 5/10 loss=0.287
9 learning_rate=0.1 optimizer=adam batch_size=64 ... QUEUED
10 learning_rate=0.1 optimizer=adamw batch_size=64... QUEUED
Total: 11 | COMPLETED: 6 RUNNING: 3 QUEUED: 2
Agent mode — herd status --json:
{
"totals": {"total": 11, "running": 4, "completed": 5, "failed": 1, "queued": 1},
"trials": [
{"index": 0, "status": "completed", "experiment_name": "lr-0.001_opt-adam",
"params": {"lr": 0.001, "optimizer": "adam"},
"last_log_line": "Test acc: 0.978"}
]
}
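Two questions an agent typically asks of this payload are "which trials failed?" and "is the sweep done?". A sketch, assuming the helper names are illustrative and that `totals` only carries keys for statuses actually present (hence the `.get` defaults):

```python
def failed_indices(status_payload: dict) -> list[int]:
    """Indices of failed trials from a herd status --json payload."""
    return sorted(t["index"] for t in status_payload["trials"]
                  if t["status"] == "failed")

def sweep_finished(status_payload: dict) -> bool:
    """True once no trial is live (submitted, queued, or running)."""
    totals = status_payload["totals"]
    live = (totals.get("submitted", 0)
            + totals.get("queued", 0)
            + totals.get("running", 0))
    return live == 0
```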
## herd stats
Print runtime + memory accounting for one or all trials, sourced from sacct.
Columns: index, state, elapsed, max RSS (GB), avg RSS (GB), requested mem (GB), experiment name. Memory values are converted from sacct's raw units to gigabytes.
Example output
idx state elapsed max_rss ave_rss req_mem name
--- --------- -------- ------- ------- ------- -------------------------------------
0 COMPLETED 00:00:38 0.36G 0.36G 4.00G lr-0.0001_opt-adam_bs-64_hd-128_do-0
1 COMPLETED 00:00:34 0.48G 0.48G 4.00G lr-0.0001_opt-sgd_bs-64_hd-128_do-0
2 COMPLETED 00:00:34 0.48G 0.48G 4.00G lr-0.0001_opt-adamw_bs-64_hd-128_do-0
3 COMPLETED 00:02:35 0.54G 0.54G 4.00G lr-0.001_opt-adam_bs-64_hd-128_do-0
4 COMPLETED 00:00:20 0.00G 0.00G 4.00G lr-0.001_opt-sgd_bs-64_hd-128_do-0
5 COMPLETED 00:00:44 0.53G 0.53G 4.00G lr-0.001_opt-adamw_bs-64_hd-128_do-0
6 COMPLETED 00:04:34 0.57G 0.57G 4.00G lr-0.01_opt-adam_bs-64_hd-128_do-0.2
7 COMPLETED 00:04:24 0.55G 0.55G 4.00G lr-0.01_opt-sgd_bs-64_hd-128_do-0.2
8 COMPLETED 00:04:44 0.56G 0.56G 4.00G lr-0.01_opt-adamw_bs-64_hd-128_do-0.2
9 RUNNING 00:01:17 - - 4.00G lr-0.1_opt-adam_bs-64_hd-128_do-0.2
10 COMPLETED 00:02:04 0.55G 0.55G 4.00G lr-0.1_opt-adamw_bs-64_hd-128_do-0.2
Agent mode — herd stats --json emits memory in bytes and elapsed time in seconds, with the SLURM state and the original sacct strings preserved so callers don't have to re-derive them:
{
"trials": [
{"index": 0, "experiment_name": "lr-0.001_opt-adam",
"slurm_state": "COMPLETED",
"elapsed": "00:01:30", "elapsed_seconds": 90,
"max_rss_bytes": 1610612736, "ave_rss_bytes": 858993459,
"req_mem_bytes": 1610612736}
]
}
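Because the JSON carries raw bytes, headroom checks need no unit parsing of sacct strings. A sketch (the threshold and helper name are assumptions, not HyperHerd behaviour) that flags trials whose peak RSS came close to the requested memory:

```python
def near_memory_limit(stats_payload: dict, frac: float = 0.9) -> list[int]:
    """Indices of trials whose max RSS exceeded `frac` of requested memory.

    Trials still running report null RSS fields and are skipped.
    """
    hot = []
    for t in stats_payload["trials"]:
        rss, req = t["max_rss_bytes"], t["req_mem_bytes"]
        if rss is not None and req and rss > frac * req:
            hot.append(t["index"])
    return hot
```

An agent might use this to bump the `mem` resource in hyperherd.yaml before resubmitting borderline trials.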
## herd tail
Print the last N lines of a trial's logs.
By default herd tail prints both .hyperherd/logs/<index>.out (stdout) and .err (stderr), each prefixed by a labelled header. Use --stdout or --stderr (mutually exclusive) to restrict to one stream. -n (default 20) controls how many lines per stream.
Agent mode — herd tail --json returns each requested stream's path and lines as a structured payload. A stream that doesn't exist on disk shows up with lines: null so an agent can distinguish "no log file" from "empty log file":
{
"index": 3,
"status": "failed",
"experiment_name": "lr-0.1_opt-sgd",
"streams": {
"stdout": {"path": ".hyperherd/logs/3.out", "lines": ["epoch 1", "..."], "requested": 20},
"stderr": {"path": ".hyperherd/logs/3.err", "lines": ["RuntimeError: CUDA OOM"], "requested": 20}
}
}
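The `lines: null` convention matters when diagnosing a trial that produced no stderr: a missing file usually means the job never started, while an empty one means it started cleanly. A small illustrative classifier (helper name is an assumption):

```python
def stream_state(tail_payload: dict, stream: str = "stderr") -> str:
    """Classify a stream from herd tail --json output:
    'missing' (lines: null, no file on disk), 'empty' (file exists,
    zero lines), or 'content'."""
    lines = tail_payload["streams"][stream]["lines"]
    if lines is None:
        return "missing"
    return "content" if lines else "empty"
```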
## herd res
Print a TSV of every trial's parameters and logged metrics.
Combines manifest.json (parameters, experiment name) with .hyperherd/results/*.json (metrics written by log_result()). Trials without results show empty cells.
Agent mode — herd res --json emits one entry per trial (including those without logged metrics, with metrics: {}):
{
"trials": [
{"index": 0, "experiment_name": "lr-0.001_opt-adam",
"params": {"lr": 0.001, "optimizer": "adam"},
"metrics": {"test_acc": 0.978, "test_loss": 0.071}}
]
}
## herd test
Run a single trial locally (no SLURM) via the configured launcher.
Default INDEX is 0. The launcher is invoked exactly as the SLURM array would invoke it, so this is the right place to debug the launcher script itself, exercise the trainer end-to-end on a login node, or verify a fix before resubmitting the array.
For safety, herd test refuses any index that has previously been submitted to SLURM — running again would clobber its outputs and logs. Pick a different index, or herd clean --all first.
| Flag | Description |
|---|---|
| `--cfg-job` | Append --cfg job to the override string. For Hydra trainers, this prints the fully resolved config and exits without running training — handy for catching unknown parameter names, type mismatches, or missing required fields. Because nothing real runs, the previously-submitted guard is skipped in this mode. Hydra-specific — has no effect on launchers whose trainers don't recognize --cfg job. |
herd test runs on the login node, so your launcher's environment must be accessible there. If your launcher requires a GPU container that isn't available on the login node, adapt it to gate the heavy parts on a HYPERHERD_TEST flag, or test manually.
## herd stop
Cancel a running/queued trial.
Calls scancel <jobid>_<index> and updates the manifest to cancelled. Pass either an INDEX or --all, not both. With --all, every trial whose status is in (submitted, queued, running) is cancelled.
Agent mode — herd stop --json returns one record per cancelled trial (empty list if there was nothing live):
{
"cancelled": [
{"index": 3, "slurm_job_id": "12345", "previous_status": "running"},
{"index": 7, "slurm_job_id": "12345", "previous_status": "queued"}
]
}
## herd snapshot
Bundle every read-style command's output (status + sacct + logged metrics + per-trial last-log line + recent failed-trial stderr) into a single JSON document.
herd snapshot is JSON-only: it has no human-formatted form. It exists for agent loops where one cheap CLI call per tick beats firing four (status, stats, res, tail) and re-stitching the results — and avoids partial-state races between calls when SLURM transitions a trial mid-snapshot.
| Flag | Description |
|---|---|
| `-n, --lines` | Max stderr lines to include per failed trial (default: 20) |
| `--max-failed` | Cap on number of failed trials to attach stderr for (default: 20) |
Shape:
{
"sweep_name": "mnist_sweep",
"workspace": "/home/you/sweeps/mnist_sweep",
"totals": {"total": 11, "running": 4, "completed": 5, "failed": 2},
"trials": [
{
"index": 0, "status": "completed", "experiment_name": "lr-0.001_opt-adam",
"params": {"lr": 0.001, "optimizer": "adam"},
"slurm_job_id": "12345",
"slurm_state": "COMPLETED",
"elapsed": "00:01:30", "elapsed_seconds": 90,
"max_rss_bytes": 1610612736, "ave_rss_bytes": 858993459,
"req_mem_bytes": 1610612736,
"metrics": {"test_acc": 0.978, "test_loss": 0.071},
"last_log_line": "Test acc: 0.978"
}
],
"failed_stderr": [
{
"index": 5,
"stderr_path": ".hyperherd/logs/5.err",
"stderr_lines": ["RuntimeError: CUDA out of memory", "..."],
"stderr_truncated": false
}
]
}
metrics is whatever the trial called log_result() with — empty dict for trials that haven't logged anything yet (not silently dropped). last_log_line is the same one-liner the human herd status table shows in its rightmost column. failed_stderr is keyed by index in ascending order; an agent that wants to group failures by root cause should fingerprint these stderr blocks.
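One way to follow that fingerprinting suggestion — this heuristic (hash the last error-looking stderr line) is an illustration, not HyperHerd behaviour:

```python
import hashlib

def fingerprint_stderr(lines: list[str]) -> str:
    """Crude failure fingerprint: hash the last line containing an
    error marker, so identical root causes (e.g. CUDA OOM) collide."""
    signal = [ln for ln in lines if "rror" in ln]  # matches Error/error
    key = (signal[-1] if signal else (lines[-1] if lines else "")).strip()
    return hashlib.sha1(key.encode()).hexdigest()[:12]

def group_failures(snapshot: dict) -> dict[str, list[int]]:
    """Group failed trial indices from a snapshot by stderr fingerprint."""
    groups: dict[str, list[int]] = {}
    for entry in snapshot.get("failed_stderr", []):
        fp = fingerprint_stderr(entry["stderr_lines"] or [])
        groups.setdefault(fp, []).append(entry["index"])
    return groups
```

Two trials that both died with `RuntimeError: CUDA out of memory` then land in one group, separate from, say, a `ValueError` from a bad hyperparameter.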
## herd monitor
Run the autonomous monitor daemon. Connects to Discord, runs the boot interview, operates the sweep until it halts. See Autonomous monitor for the full picture.
| Flag | Description |
|---|---|
| `--once` | Run exactly one tick and exit (live — calls the model once) |
| `--dry-run` | Assemble the per-tick state and render the prompt without calling the model. For verifying the deterministic path before paying tokens. |
| `--trigger {scheduled,failure,completion,user_message,boot}` | Trigger for --once / --dry-run (daemon mode picks its own) |
| `--max-ticks N` | Stop after N ticks (safety cap for testing) |
If WORKSPACE/.hyperherd doesn't exist, the daemon auto-initializes the manifest first (equivalent to herd run --dry-run) so the agent has trial state to read from its first tick.
Requires Python 3.10+ and the [monitor] extras (pip install 'hyperherd[monitor]'). Discord setup is one-time per server — see Discord setup.
## herd clean
Cancel jobs and clean up workspace state.
| Flag | Description |
|---|---|
| (none) | Cancel any running jobs but leave the manifest in place |
| `-l, --logs` | Also remove .hyperherd/logs/ |
| `-a, --all` | Remove the entire .hyperherd/ state directory |
herd clean -a is destructive — manifests, results, and logs are gone afterwards.
## herd install-skill
Install the Claude Code skill for authoring sweep configs.
Default scope is user (writes to ~/.claude/skills/hyperherd-config/SKILL.md); project writes to ./.claude/skills/. Use -f to overwrite an existing install.