
Command reference

Every herd subcommand is documented below. Most subcommands take an optional workspace directory as their first positional argument (default: .); when run from inside the workspace you can usually omit it.

Agent (--json) mode

The read- and run-style commands (run, status, stats, tail, res, stop) accept a --json flag that emits a structured JSON document to stdout instead of the human-formatted table. Use it when an agent or other automation is driving HyperHerd:

  • Memory is reported as numeric bytes (not 1.50G); elapsed time as numeric seconds (not 01:30:00).
  • Status uses the stable internal enum: ready, submitted, queued, running, completed, failed, cancelled.
  • Empty / unknown values come through as null, never an empty string.
  • Errors still go to stderr with a non-zero exit code; stdout in JSON mode is always a single valid JSON document or empty.
  • Warnings (preflight, partition checks) print to stderr as in normal mode and do not corrupt the stdout JSON, so herd run --dry-run --json | jq ... is always safe.

The JSON shape for each command is documented inline below.
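
As a quick illustration, an agent can drive HyperHerd with ordinary shell plumbing. The jq filters below are a sketch, assuming the JSON shapes documented in the sections that follow:

# Headline counts for the whole sweep
herd status --json | jq '.totals'

# Indices of failed trials, one per line (stable lowercase enum values)
herd status --json | jq -r '.trials[] | select(.status == "failed") | .index'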

herd init

Scaffold a new sweep workspace.

herd init [DIRECTORY] [--config FILE] [--launcher FILE] [-f]

Creates DIRECTORY/hyperherd.yaml and DIRECTORY/launch.sh with template content; the experiment name is taken from the directory name. If DIRECTORY is omitted, files are written to the current directory.

The templates have placeholder SLURM resource fields (partition, time, mem, cpus_per_task) you'll edit to match your cluster — herd init doesn't try to be a substitute for opening the YAML.

Flag Description
--config FILE Copy FILE in as hyperherd.yaml instead of generating a template — useful for cloning an existing sweep
--launcher FILE Copy FILE in as launch.sh instead of generating a template
-f, --force Overwrite existing files in the target directory
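
For example, cloning an existing sweep's config and launcher into a fresh workspace (directory names illustrative):

herd init new_sweep --config old_sweep/hyperherd.yaml --launcher old_sweep/launch.sh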

Example output

Created my_experiment/hyperherd.yaml
Created my_experiment/launch.sh

Next steps:
  1. Edit hyperherd.yaml to define your parameters and SLURM resources
  2. Edit launch.sh to set up your container/environment
  3. Run: herd run my_experiment --dry-run

herd run

Submit (or resubmit) the sweep.

herd run [WORKSPACE] [flags]

Generates the trial manifest, runs preflight checks, writes the sbatch script to .hyperherd/job.sbatch, submits it, and records the SLURM job ID.

herd run is idempotent: it only submits trials whose status is ready, failed, or cancelled. Trials that are submitted, queued, running, or completed are skipped unless you opt in with --force.

Flag Description
-n, --dry-run Print the submission plan (sbatch script + pending indices); don't submit. Use herd ls for the full trial list.
-j, --max-concurrent N Cap concurrent running tasks (overrides slurm.max_concurrent)
-i, --indices SPEC Submit only these trial indices, e.g. 0-3,5,7-9
-p, --pin NAME=VALUE [NAME=VALUE ...] Submit only trials whose swept params match every pin (e.g. --pin batch_size=32 optimizer=adam). Names must be sweep parameters; values are coerced to int/float/str.
-f, --force With --indices, allow resubmitting running/completed trials. Without --indices, allow config edits that drop running/completed trials (kept as orphans).
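
A few illustrative invocations (workspace and parameter names assumed):

# Preview the plan without submitting anything
herd run my_experiment --dry-run

# Resubmit two specific (e.g. failed) trials
herd run my_experiment --indices 3,7

# Submit only the adam arm of the sweep
herd run my_experiment --pin optimizer=adam

# Resubmit an already-completed trial (needs --force with --indices)
herd run my_experiment --indices 3 --force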

Editing the config mid-sweep is supported. If you edit hyperherd.yaml between runs, herd run reconciles the new manifest against the old one: new trials are appended, removed trials are dropped (or kept as orphans with -f if they were already running/completed). See Re-running and reconciliation for the rules.

Example output — successful submission


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✓ Launched 11 trials as SLURM job array 38384012
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  workspace: my_experiment/.hyperherd
  logs:      my_experiment/.hyperherd/logs
  monitor:   herd status

Example output — dry run

============================================================
DRY RUN — No jobs will be submitted
============================================================

Generated sbatch script (per-trial lookup elided for brevity):
----------------------------------------
#!/bin/bash
#SBATCH --job-name=hyperherd_lr_sweep
#SBATCH --array=0-11
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=2G
#SBATCH --cpus-per-task=1
#SBATCH --output=/path/to/lr_sweep/.hyperherd/logs/%a.out
#SBATCH --error=/path/to/lr_sweep/.hyperherd/logs/%a.err
#SBATCH --open-mode=append

# Run divider — visible in both stdout and stderr after append
_HH_DIVIDER="==== HyperHerd run: job ${SLURM_JOB_ID} array-task ${SLURM_ARRAY_TASK_ID} $(date -Iseconds) ===="
printf "\n%s\n\n" "$_HH_DIVIDER"
printf "\n%s\n\n" "$_HH_DIVIDER" >&2

# Export HyperHerd environment variables
export HYPERHERD_WORKSPACE=/path/to/lr_sweep
export HYPERHERD_SWEEP_NAME=lr_sweep
export HYPERHERD_TRIAL_ID="$SLURM_ARRAY_TASK_ID"

# Per-trial lookup baked at submission time (no Python required here).
case "$SLURM_ARRAY_TASK_ID" in
  0)
    HYPERHERD_TRIAL_NAME=lr-0.0001_opt-adam
    HYPERHERD_EXPERIMENT_NAME=lr-0.0001_opt-adam
    OVERRIDES='experiment_name=lr-0.0001_opt-adam learning_rate=0.0001 optimizer=adam'
    ;;
  # ... [11 more trial arm(s) elided in dry-run; full script is submitted] ...
  *)
    echo "HyperHerd: no lookup entry for SLURM_ARRAY_TASK_ID=$SLURM_ARRAY_TASK_ID" >&2
    exit 1
    ;;
esac
export HYPERHERD_TRIAL_NAME HYPERHERD_EXPERIMENT_NAME

# Invoke the user's launcher script
bash /path/to/lr_sweep/launch.sh "$OVERRIDES"
----------------------------------------

Submission plan
  Pending: 12 of 12 trial(s)
  Indices: 0-11
  Use herd ls to see every trial in the sweep.

Agent mode: herd run --dry-run --json emits the submission plan (the indices and sbatch script that would actually be submitted right now, given current status and any --pin / --indices filters). For the full sweep enumeration regardless of status, use herd ls (or its JSON variant when added). The intended agent workflow is to inspect the trials, then call herd run --json to submit.

{
  "dry_run": true,
  "slurm_job_id": null,
  "sbatch_path": null,
  "submitted_indices": [0, 1, 2, 3],
  "sbatch_script": "#!/bin/bash\n#SBATCH --array=0-3\n...",
  "trials": [
    {"index": 0, "status": "ready", "experiment_name": "lr-0.01_bs-32",
     "params": {"lr": 0.01, "bs": 32}},
    {"index": 1, "status": "ready", "experiment_name": "lr-0.01_bs-64",
     "params": {"lr": 0.01, "bs": 64}}
  ]
}

A real (non-dry-run) herd run --json returns the same shape with dry_run: false, slurm_job_id populated, sbatch_path set to where the script was written (.hyperherd/job.sbatch), and sbatch_script: null.
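
A sketch of a submit-and-record step an agent might run (workspace name assumed):

# Capture the SLURM job ID from a real submission
JOB_ID=$(herd run my_experiment --json | jq -r '.slurm_job_id')
echo "submitted as SLURM job $JOB_ID"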

herd ls

List every trial in the sweep with its swept parameters.

herd ls [WORKSPACE] [-p NAME=VALUE ...]

Status-agnostic — shows the shape of the sweep, not what herd run would do next. Reads the manifest if present; otherwise it materializes the combinations from hyperherd.yaml, so you can run herd ls before the first herd run to sanity-check the YAML.

Flag Description
-p, --pin NAME=VALUE [NAME=VALUE ...] Filter to trials whose swept params match every pin (e.g. --pin batch_size=32 optimizer=adam).

Use herd status for the SLURM-synced status table (this command does not touch SLURM); use herd run --dry-run for a submission preview.
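
For example, to sanity-check one arm of the sweep before anything is submitted (parameter value illustrative):

herd ls my_experiment --pin optimizer=adam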

Example output

(no manifest yet — showing combinations from hyperherd.yaml)
Trials: 12

[0]  lr-0.0001_opt-adam
    learning_rate=0.0001
    optimizer=adam

[1]  lr-0.0001_opt-sgd
    learning_rate=0.0001
    optimizer=sgd

[2]  lr-0.0001_opt-adamw
    learning_rate=0.0001
    optimizer=adamw

[3]  lr-0.001_opt-adam
    learning_rate=0.001
    optimizer=adam

[4]  lr-0.001_opt-sgd
    learning_rate=0.001
    optimizer=sgd

[5]  lr-0.001_opt-adamw
    learning_rate=0.001
    optimizer=adamw

... (trials 6-11 elided)

herd status

Show the current status table for every trial.

herd status [WORKSPACE]

Status values:

Status Meaning
ready Never submitted
submitted Sent to SLURM, not yet picked up
queued SLURM PENDING
running SLURM RUNNING
completed SLURM COMPLETED
failed SLURM FAILED / TIMEOUT / OUT_OF_MEMORY / NODE_FAIL
cancelled SLURM CANCELLED (or via herd stop)

herd status syncs from SLURM each time it runs (via sacct).

Example output

Trial  Params                                              Status     Last Log
------------------------------------------------------------------------------
    0  learning_rate=0.0001 optimizer=adam batch_size=...  COMPLETED  Test acc: 0.9412
    1  learning_rate=0.0001 optimizer=sgd batch_size=6...  COMPLETED  Test acc: 0.9385
    2  learning_rate=0.0001 optimizer=adamw batch_size...  COMPLETED  Test acc: 0.9404
    3  learning_rate=0.001 optimizer=adam batch_size=6...  COMPLETED  Test acc: 0.9601
    4  learning_rate=0.001 optimizer=sgd batch_size=64...  COMPLETED  Test acc: 0.9512
    5  learning_rate=0.001 optimizer=adamw batch_size=...  COMPLETED  Test acc: 0.9588
    6  learning_rate=0.01 optimizer=adam batch_size=64...  RUNNING    Epoch 7/10  loss=0.142
    7  learning_rate=0.01 optimizer=sgd batch_size=64 ...  RUNNING    Epoch 4/10  loss=0.318
    8  learning_rate=0.01 optimizer=adamw batch_size=6...  RUNNING    Epoch 5/10  loss=0.287
    9  learning_rate=0.1 optimizer=adam batch_size=64 ...  QUEUED   
   10  learning_rate=0.1 optimizer=adamw batch_size=64...  QUEUED   

Total: 11  |  COMPLETED: 6  RUNNING: 3  QUEUED: 2

Agent mode: herd status --json returns:

{
  "totals": {"total": 11, "running": 4, "completed": 5, "failed": 1, "queued": 1},
  "trials": [
    {"index": 0, "status": "completed", "experiment_name": "lr-0.001_opt-adam",
     "params": {"lr": 0.001, "optimizer": "adam"},
     "last_log_line": "Test acc: 0.978"}
  ]
}
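
A sketch of a gating check built on this shape; the // 0 guards against the failed key being absent when its count is zero (an assumption based on the examples here):

# Exit non-zero if any trial has failed (usable as a cron/CI gate)
herd status --json | jq -e '(.totals.failed // 0) == 0' > /dev/null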

herd stats

Print runtime + memory accounting for one or all trials, sourced from sacct.

herd stats [WORKSPACE] [INDEX]

Columns: index, state, elapsed, max RSS (GB), avg RSS (GB), requested mem (GB), experiment name. Memory values are converted from sacct's raw units to gigabytes.

Example output

idx  state      elapsed   max_rss  ave_rss  req_mem  name                                 
---  ---------  --------  -------  -------  -------  -------------------------------------
0    COMPLETED  00:00:38  0.36G    0.36G    4.00G    lr-0.0001_opt-adam_bs-64_hd-128_do-0
1    COMPLETED  00:00:34  0.48G    0.48G    4.00G    lr-0.0001_opt-sgd_bs-64_hd-128_do-0
2    COMPLETED  00:00:34  0.48G    0.48G    4.00G    lr-0.0001_opt-adamw_bs-64_hd-128_do-0
3    COMPLETED  00:02:35  0.54G    0.54G    4.00G    lr-0.001_opt-adam_bs-64_hd-128_do-0
4    COMPLETED  00:00:20  0.00G    0.00G    4.00G    lr-0.001_opt-sgd_bs-64_hd-128_do-0
5    COMPLETED  00:00:44  0.53G    0.53G    4.00G    lr-0.001_opt-adamw_bs-64_hd-128_do-0
6    COMPLETED  00:04:34  0.57G    0.57G    4.00G    lr-0.01_opt-adam_bs-64_hd-128_do-0.2
7    COMPLETED  00:04:24  0.55G    0.55G    4.00G    lr-0.01_opt-sgd_bs-64_hd-128_do-0.2
8    COMPLETED  00:04:44  0.56G    0.56G    4.00G    lr-0.01_opt-adamw_bs-64_hd-128_do-0.2
9    RUNNING    00:01:17  -        -        4.00G    lr-0.1_opt-adam_bs-64_hd-128_do-0.2
10   COMPLETED  00:02:04  0.55G    0.55G    4.00G    lr-0.1_opt-adamw_bs-64_hd-128_do-0.2

Agent mode: herd stats --json emits memory in bytes and elapsed time in seconds, with the SLURM state and the original sacct strings preserved so callers don't have to re-derive them:

{
  "trials": [
    {"index": 0, "experiment_name": "lr-0.001_opt-adam",
     "slurm_state": "COMPLETED",
     "elapsed": "00:01:30", "elapsed_seconds": 90,
     "max_rss_bytes": 1610612736, "ave_rss_bytes": 858993459,
     "req_mem_bytes": 1610612736}
  ]
}
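
One use of this shape is checking memory headroom against the request; a sketch (running trials report null RSS and are filtered out):

herd stats --json | jq -r '.trials[]
  | select(.max_rss_bytes != null)
  | "\(.index)\t\(.max_rss_bytes * 100 / .req_mem_bytes | floor)% of requested mem"'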

herd tail

Print the last N lines of a trial's logs.

herd tail [WORKSPACE] INDEX [-n LINES] [--stdout | --stderr]

By default herd tail prints both .hyperherd/logs/<index>.out (stdout) and .err (stderr), each prefixed by a labelled header. Use --stdout or --stderr (mutually exclusive) to restrict to one stream. -n (default 20) controls how many lines per stream.

Agent mode: herd tail --json returns each requested stream's path and lines as a structured payload. A stream that doesn't exist on disk shows up with lines: null, so an agent can distinguish "no log file" from "empty log file":

{
  "index": 3,
  "status": "failed",
  "experiment_name": "lr-0.1_opt-sgd",
  "streams": {
    "stdout": {"path": ".hyperherd/logs/3.out", "lines": ["epoch 1", "..."], "requested": 20},
    "stderr": {"path": ".hyperherd/logs/3.err", "lines": ["RuntimeError: CUDA OOM"], "requested": 20}
  }
}
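
A sketch that makes the three cases explicit (workspace and index illustrative):

# Distinguish "no log file" / "empty log file" / real content
herd tail my_experiment 3 --stderr --json | jq '
  if   .streams.stderr.lines == null then "no log file"
  elif .streams.stderr.lines == []   then "empty log file"
  else .streams.stderr.lines end'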

herd res

Print a TSV of every trial's parameters and logged metrics.

herd res [WORKSPACE]

Combines manifest.json (parameters, experiment name) with .hyperherd/results/*.json (metrics written by log_result()). Trials without results show empty cells.

Agent mode: herd res --json emits one entry per trial, including trials without logged metrics (these get metrics: {}):

{
  "trials": [
    {"index": 0, "experiment_name": "lr-0.001_opt-adam",
     "params": {"lr": 0.001, "optimizer": "adam"},
     "metrics": {"test_acc": 0.978, "test_loss": 0.071}}
  ]
}
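
A quick leaderboard sketch built on this shape (test_acc is the metric name from the example above; substitute your own):

herd res --json | jq -r '.trials[]
  | select(.metrics.test_acc != null)
  | "\(.metrics.test_acc)\t\(.experiment_name)"' | sort -rn | head -n 5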

herd test

Run a single trial locally (no SLURM) via the configured launcher.

herd test [WORKSPACE] [INDEX] [--cfg-job]

Default INDEX is 0. The launcher is invoked exactly as the SLURM array would invoke it, so this is the right place to debug the launcher script itself, exercise the trainer end-to-end on a login node, or verify a fix before resubmitting the array.

For safety, herd test refuses any index that has previously been submitted to SLURM — running again would clobber its outputs and logs. Pick a different index, or herd clean --all first.

Flag Description
--cfg-job Append --cfg job to the override string. For Hydra trainers, this prints the fully resolved config and exits without running training — handy for catching unknown parameter names, type mismatches, or missing required fields. Because nothing real runs, the previously-submitted guard is skipped in this mode. Hydra-specific — has no effect on launchers whose trainers don't recognize --cfg job.

herd test runs on the login node, so your launcher's environment must be accessible there. If your launcher requires a GPU container that isn't available on the login node, adapt it to gate the heavy parts on a HYPERHERD_TEST flag, or test manually.
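
A sketch of such a gate in launch.sh. The trainer entrypoint and container image are illustrative, and the flag is set by hand when invoking herd test rather than assumed to be exported automatically:

#!/bin/bash
# $1 is the space-separated override string HyperHerd passes in
OVERRIDES="$1"
if [ -n "${HYPERHERD_TEST:-}" ]; then
  # Login-node path: run the trainer directly, no GPU container.
  # Word-splitting of $OVERRIDES is intentional: overrides are separate args.
  python train.py $OVERRIDES
else
  # Cluster path: full container launch (image name illustrative)
  apptainer exec --nv train.sif python train.py $OVERRIDES
fi

Invoked for a local check as: HYPERHERD_TEST=1 herd test my_experiment 0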

herd stop

Cancel a running/queued trial.

herd stop [WORKSPACE] INDEX
herd stop [WORKSPACE] --all

Calls scancel <jobid>_<index> and updates the manifest to cancelled. Pass either an INDEX or --all, not both. With --all, every trial whose status is in (submitted, queued, running) is cancelled.

Agent mode: herd stop --json returns one record per cancelled trial (an empty list if nothing was live):

{
  "cancelled": [
    {"index": 3, "slurm_job_id": "12345", "previous_status": "running"},
    {"index": 7, "slurm_job_id": "12345", "previous_status": "queued"}
  ]
}
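
For example, a one-liner to confirm how many trials were actually cancelled:

herd stop my_experiment --all --json | jq '.cancelled | length'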

herd snapshot

Bundle every read-style command's output (status + sacct + logged metrics + per-trial last-log line + recent failed-trial stderr) into a single JSON document.

herd snapshot [WORKSPACE] [-n LINES] [--max-failed N]

herd snapshot is JSON-only: it has no human-formatted form. It exists for agent loops where one cheap CLI call per tick beats firing four (status, stats, res, tail) and re-stitching the results — and avoids partial-state races between calls when SLURM transitions a trial mid-snapshot.

Flag Description
-n, --lines Max stderr lines to include per failed trial (default: 20)
--max-failed Cap on number of failed trials to attach stderr for (default: 20)

Shape:

{
  "sweep_name": "mnist_sweep",
  "workspace": "/home/you/sweeps/mnist_sweep",
  "totals": {"total": 11, "running": 4, "completed": 5, "failed": 2},
  "trials": [
    {
      "index": 0, "status": "completed", "experiment_name": "lr-0.001_opt-adam",
      "params": {"lr": 0.001, "optimizer": "adam"},
      "slurm_job_id": "12345",
      "slurm_state": "COMPLETED",
      "elapsed": "00:01:30", "elapsed_seconds": 90,
      "max_rss_bytes": 1610612736, "ave_rss_bytes": 858993459,
      "req_mem_bytes": 1610612736,
      "metrics": {"test_acc": 0.978, "test_loss": 0.071},
      "last_log_line": "Test acc: 0.978"
    }
  ],
  "failed_stderr": [
    {
      "index": 5,
      "stderr_path": ".hyperherd/logs/5.err",
      "stderr_lines": ["RuntimeError: CUDA out of memory", "..."],
      "stderr_truncated": false
    }
  ]
}

metrics is whatever the trial called log_result() with — an empty dict for trials that haven't logged anything yet (not silently dropped). last_log_line is the same one-liner the human herd status table shows in its rightmost column. failed_stderr entries are ordered by index, ascending; an agent that wants to group failures by root cause should fingerprint these stderr blocks.
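
A crude fingerprinting sketch that groups failures by their first stderr line (a real agent may want something sturdier, such as normalizing paths and addresses first):

herd snapshot | jq -r '.failed_stderr
  | group_by(.stderr_lines[0])
  | .[]
  | "\(length) trial(s): \(.[0].stderr_lines[0])"'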

herd monitor

Run the autonomous monitor daemon. Connects to Discord, runs the boot interview, operates the sweep until it halts. See Autonomous monitor for the full picture.

herd monitor [WORKSPACE] [flags]
Flag Description
--once Run exactly one tick and exit (live — calls the model once)
--dry-run Assemble the per-tick state and render the prompt without calling the model. For verifying the deterministic path before paying tokens.
--trigger {scheduled,failure,completion,user_message,boot} Trigger for --once / --dry-run (daemon mode picks its own)
--max-ticks N Stop after N ticks (safety cap for testing)

If WORKSPACE/.hyperherd doesn't exist, the daemon auto-initializes the manifest first (equivalent to herd run --dry-run) so the agent has trial state to read from its first tick.

Requires Python 3.10+ and the [monitor] extras (pip install 'hyperherd[monitor]'). Discord setup is one-time per server — see Discord setup.
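
A typical bring-up sequence (workspace name illustrative): verify the deterministic path first, then pay for a single live tick before trusting the daemon:

# No model call: check state assembly and the rendered prompt
herd monitor my_experiment --dry-run --trigger boot

# One live tick, then exit
herd monitor my_experiment --once

# Full daemon, capped for an initial shakedown
herd monitor my_experiment --max-ticks 10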

herd clean

Cancel jobs and clean up workspace state.

herd clean [WORKSPACE] [-l] [-a]
Flag Description
(none) Cancel any running jobs but leave the manifest in place
-l, --logs Also remove .hyperherd/logs/
-a, --all Remove the entire .hyperherd/ state directory

herd clean -a is destructive — manifests, results, and logs are all gone afterwards.
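
For example, an incremental cleanup that keeps the manifest and results (workspace name illustrative):

# Cancel anything live and prune logs, but keep manifest and results
herd clean my_experiment --logs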

herd install-skill

Install the Claude Code skill for authoring sweep configs.

herd install-skill [--scope user|project] [-f]

Default scope is user (writes to ~/.claude/skills/hyperherd-config/SKILL.md); project writes to ./.claude/skills/. Use -f to overwrite an existing install.