Sweep config (hyperherd.yaml)¶
The configuration file is YAML. Every field is documented below with its type, whether it's required, and its default. The config is validated with Pydantic at parse time — invalid configs produce clear error messages immediately.
Top-level fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | string | yes | — | Unique experiment name. Used in the SLURM job name (`hyperherd_<name>`). Must be a valid filename component. |
| `grid` | string or list | no | omitted | Controls which parameters are swept via Cartesian product. See grid. |
| `launcher` | string | yes | — | Path to the launcher bash script. Relative paths are resolved relative to the config file's directory. See Launcher Script. |
| `slurm` | object | no | see below | SLURM resource requests. See slurm. |
| `static_overrides` | list[string] | no | `[]` | Extra `name=value` tokens appended to every trial's overrides. See static_overrides. The legacy `hydra: { static_overrides: [...] }` form still parses. |
| `discord` | object | no | unset | Discord channel for the herd monitor daemon. See discord. |
| `parameters` | object | yes | — | Hyperparameter definitions. At least one parameter is required. See parameters. |
| `conditions` | list | no | `[]` | Conditional rules that filter or modify parameter combinations. See Conditions. |
The workspace is the directory containing hyperherd.yaml. HyperHerd stores its state in a .hyperherd/ subdirectory within the workspace.
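For orientation, a typical workspace might look like this (the launcher filename and the contents of `.hyperherd/` are illustrative):

```
my_sweep/
├── hyperherd.yaml    # this config
├── launch.sh         # launcher script, pointed to by launcher:
└── .hyperherd/       # HyperHerd state, created automatically
```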
grid¶
Controls how parameter combinations are generated. Three forms:
| Value | Meaning |
|---|---|
| `grid: all` | Full Cartesian product of all parameter values. No defaults required. |
| `grid: [lr, weight_decay]` | Cartesian product of the listed parameters only. All other parameters are held at their default. |
| (omitted) | One-at-a-time: start from defaults, vary each parameter independently. All parameters must have a default. |
**Full grid** (`grid: all`) generates every combination. With 3 parameters of 4, 3, and 2 levels, you get 4 × 3 × 2 = 24 trials.

**Partial grid** (`grid: [lr, wd]`) generates a Cartesian product of only the listed parameters while holding everything else at its default. Use it to explore interactions between specific parameters without exploding the trial count.

**One-at-a-time** (no `grid` field) starts from a default combination, then varies each parameter independently. This produces 1 + Σ(levels_i − 1) trials. For the 4/3/2 example: 1 + (4 − 1) + (3 − 1) + (2 − 1) = 7 trials.
```yaml
# Full grid:
grid: all

# Partial grid:
grid: [learning_rate, weight_decay]

# One-at-a-time: omit the grid field entirely.
```
When `grid` is a list, every parameter not in that list must have a default. When `grid` is omitted entirely, every parameter must have a default. Only `grid: all` requires no defaults.
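To make the one-at-a-time count concrete, here is a sketch with three hypothetical parameters: `a` with 4 values (default `1`), `b` with 3 values (default `x`), and `c` with 2 values (default `on`):

```
a=1 b=x c=on                   # the single all-defaults trial
a=2 b=x c=on                   # vary a: 3 non-default values
a=3 b=x c=on
a=4 b=x c=on
a=1 b=y c=on                   # vary b: 2 non-default values
a=1 b=z c=on
a=1 b=x c=off                  # vary c: 1 non-default value
                               # total: 1 + 3 + 2 + 1 = 7 trials
```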
slurm¶
SLURM resource requests. These map directly to #SBATCH directives in the generated batch script.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `partition` | string | no | `"default"` | SLURM partition name. |
| `time` | string | no | `"01:00:00"` | Wall-clock time limit in HH:MM:SS format. |
| `mem` | string | no | `"8G"` | Memory per node (e.g. `"16G"`, `"512M"`). |
| `cpus_per_task` | integer | no | `1` | Number of CPU cores per task. |
| `gres` | string | no | omitted | Generic resources (e.g. `"gpu:1"`, `"gpu:a100:2"`). If omitted, the `--gres` line is not included. |
| `max_concurrent` | integer | no | omitted | Cap on simultaneously running array tasks. Appended as `%N` to the SLURM array spec (e.g. `--array=0-49%5`). Override per run with `herd run --max-concurrent N`. |
| `extra_args` | list[string] | no | `[]` | Additional raw `#SBATCH` flags. Each string is placed after `#SBATCH` verbatim. |
```yaml
slurm:
  partition: gpu
  time: "04:00:00"
  mem: "32G"
  cpus_per_task: 4
  gres: "gpu:1"
  max_concurrent: 8
  extra_args:
    - "--export=ALL"
    - "--exclusive"
```
discord¶
Settings for the herd monitor daemon's Discord channel. Required if you want the monitor to be able to talk to you. The token comes from the DISCORD_BOT_TOKEN environment variable — don't put secrets in YAML. Walk through Discord setup once per server to mint the bot and copy the guild ID.
| Field | Type | Default | Description |
|---|---|---|---|
| `guild_id` | string | unset | Discord server ID. Required to enable the channel — without it the daemon runs without notifications. |
| `channel_id` | string | unset | Pin to a specific existing channel (skips auto-create). |
| `channel_name` | string | unset | Override the sweep-derived channel name. Discord lowercases it automatically; characters outside `[a-z0-9-]` are stripped. |
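A minimal sketch (the guild ID below is a placeholder; the bot token is read from `DISCORD_BOT_TOKEN`, never from the YAML):

```yaml
discord:
  guild_id: "123456789012345678"  # placeholder; copy yours during Discord setup
  channel_name: resnet-sweep      # optional; otherwise derived from the sweep name
```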
static_overrides¶
A list of extra name=value tokens appended to every trial's override string. Use them for values that are constant across the sweep but differ from your trainer's defaults (dataset paths, fixed seeds, debug flags). The values are passed through verbatim — your launcher is free to parse them however it wants.
The legacy `hydra: { static_overrides: [...] }` form still parses for back-compat, but new configs should use the top-level field.
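For example (the keys and values here are illustrative, not required by HyperHerd):

```yaml
static_overrides:
  - "data.path=/scratch/datasets"  # constant across the whole sweep
  - "trainer.seed=1234"            # hypothetical fixed seed
```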
How overrides reach the training command¶
HyperHerd builds an override string for each trial by combining:
- `experiment_name=<name>` (built from parameter abbreviations, e.g. `lr-0.001_opt-adam_bs-64`)
- Each parameter's `name=value`
- Any `static_overrides`
- Any condition `set` extras (last → these win)
This string is passed as `$1` to your launcher script. For example:

```
experiment_name=lr-0.001_opt-adam_bs-64 learning_rate=0.001 optimizer=adam batch_size=64 data.path=/scratch/datasets
```
If you're using Hydra, this string passes through unmodified. If not, your `launch.sh` parses these `name=value` pairs into whatever flags your trainer accepts, as in the sketch below.
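A minimal non-Hydra launcher sketch, assuming a trainer invoked as `python train.py --key value ...` (the entry point and flag style are assumptions; adapt them to your own trainer):

```bash
#!/usr/bin/env bash
set -euo pipefail

# $1 is the space-separated override string from HyperHerd.
# Split each name=value token into a --name value flag pair.
args=()
for token in $1; do
  args+=("--${token%%=*}" "${token#*=}")
done

# Hypothetical entry point; replace with your trainer's CLI.
python train.py "${args[@]}"
```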
Four environment variables are also exported in the SLURM script:
- `HYPERHERD_WORKSPACE` — absolute path to the workspace directory
- `HYPERHERD_SWEEP_NAME` — the sweep's `name:` from the YAML (shared across all trials)
- `HYPERHERD_TRIAL_ID` — the array task index (same as `$SLURM_ARRAY_TASK_ID`)
- `HYPERHERD_TRIAL_NAME` — the auto-generated per-trial identifier (e.g. `lr-0.001_opt-adam_bs-64`)
Use these for output directories, wandb run names, logging paths, and the `log_result()` API.
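Inside `launch.sh` you might use them like this (the output layout and the extra override names are illustrative, not required by HyperHerd):

```bash
# Per-trial output directory under the workspace (layout is an assumption).
OUT_DIR="${HYPERHERD_WORKSPACE}/outputs/${HYPERHERD_SWEEP_NAME}/${HYPERHERD_TRIAL_NAME}"
mkdir -p "$OUT_DIR"

# Tag the run so logs and dashboards line up with HyperHerd's trial names.
python train.py $1 output_dir="$OUT_DIR" run_name="$HYPERHERD_TRIAL_NAME"
```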
parameters¶
A YAML mapping where each key is the parameter name and the value is a specification object. Parameter names should match the Hydra config keys you want to override (e.g. model.learning_rate, training.batch_size). Dotted names work for nested Hydra paths.
Every parameter specification accepts these common fields:

- `type` — either `"discrete"` or `"continuous"`. Required.
- `abbrev` — (optional) a short name used to build `experiment_name` (e.g. `"lr"`, `"opt"`, `"bs"`). Defaults to the parameter name itself. Required when the parameter name contains characters outside `[A-Za-z0-9._-]` (e.g. spaces, `/`), since the token ends up as a file-path component.
- `default` — (optional) required when the parameter is not in the `grid`.
**Allowed characters in abbrev.** Whether implicit (from the parameter name) or explicit, the abbrev token must match `[A-Za-z0-9._-]+`. Anything else would corrupt the `experiment_name` directory layout or the space-separated `key=value` override string.
Discrete parameters¶
| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | yes | Must be `"discrete"`. |
| `abbrev` | string | no | Short name for experiment naming. Defaults to the parameter name; required when the parameter name contains characters outside `[A-Za-z0-9._-]`. |
| `values` | list[any] | yes | List of values to sweep over. Strings, integers, floats, or booleans. |
| `labels` | list[string] | no | Per-value short display tokens used in the experiment name. Must be the same length as `values` and contain unique, non-empty strings without `/`. Required when any value contains `/` (e.g. file paths). |
| `default` | any | no* | Default value. Must be one of `values`. Required when not in `grid`. |
```yaml
parameters:
  optimizer:
    abbrev: opt
    type: discrete
    values: [adam, sgd, adamw]
    default: adam

  num_layers:
    abbrev: nl
    type: discrete
    values: [2, 4, 8]
    default: 4

  # Path-like values must declare labels so experiment names stay short.
  pretrained:
    abbrev: pre
    type: discrete
    values: ["/scratch/ckpts/resnet50.ckpt", "/scratch/ckpts/vit_base.ckpt"]
    labels: [resnet50, vit_base]
    default: "/scratch/ckpts/resnet50.ckpt"
```
The full value is still passed to Hydra as the override; `labels` only affect the auto-generated `experiment_name`.
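Concretely, for the `pretrained` parameter above a single trial surfaces both forms, roughly (the exact name layout may differ):

```
experiment name:  lr-0.001_nl-4_pre-resnet50                # label appears here
override token:   pretrained=/scratch/ckpts/resnet50.ckpt   # full value passed through
```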
Continuous parameters¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | yes | — | Must be `"continuous"`. |
| `abbrev` | string | no | param name | Short name for experiment naming. Defaults to the parameter name; required when the parameter name contains characters outside `[A-Za-z0-9._-]`. |
| `low` | float | yes | — | Lower bound (inclusive). |
| `high` | float | yes | — | Upper bound (inclusive). |
| `scale` | string | no | `"linear"` | `"linear"` for uniform spacing, `"log"` for log-uniform. Log scale requires `low > 0`. |
| `steps` | integer | no | `5` | Number of evenly-spaced points to discretize into. |
| `default` | float | no* | — | Default value. Must be within `[low, high]`. Required when not in `grid`. |
```yaml
parameters:
  learning_rate:
    abbrev: lr
    type: continuous
    low: 1e-5
    high: 1e-2
    scale: log
    steps: 5
    default: 0.001
```
**Log-scale discretization.** With `low: 1e-4`, `high: 1e-2`, `scale: log`, `steps: 3`, values are spaced evenly in log10 space: `[0.0001, 0.001, 0.01]`.
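In general, assuming standard log-uniform spacing (which matches the example above), the k-th value is:

$$x_k = 10^{\,\log_{10}(\mathrm{low}) \,+\, \frac{k}{\mathrm{steps}-1}\left(\log_{10}(\mathrm{high}) - \log_{10}(\mathrm{low})\right)}, \qquad k = 0, 1, \dots, \mathrm{steps}-1$$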
**Default validation.** Discrete defaults must be in `values`. Continuous defaults must be within `[low, high]`. Both are validated at parse time.
Complete example¶
```yaml
name: resnet_sweep
grid: [learning_rate, weight_decay]

slurm:
  partition: gpu
  time: "08:00:00"
  mem: "32G"
  cpus_per_task: 8
  gres: "gpu:a100:1"
  extra_args:
    - "--export=ALL"

launcher: ./launch.sh

static_overrides:
  - "data.root=/scratch/imagenet"
  - "trainer.max_epochs=90"

parameters:
  learning_rate:
    abbrev: lr
    type: continuous
    low: 1e-4
    high: 1e-1
    scale: log
    steps: 4
    default: 0.001
  optimizer:
    abbrev: opt
    type: discrete
    values: [sgd, adamw]
    default: adamw
  weight_decay:
    abbrev: wd
    type: continuous
    low: 0.0
    high: 0.01
    scale: linear
    steps: 3
    default: 0.001
  batch_size:
    abbrev: bs
    type: discrete
    values: [64, 128, 256]
    default: 128

conditions:
  - name: sgd_no_high_lr
    when:
      optimizer: sgd
    exclude:
      learning_rate: [0.1]
```
`grid: [learning_rate, weight_decay]` creates a 4 × 3 = 12-trial grid over those two parameters; `optimizer` and `batch_size` stay at their defaults (`adamw` and `128`). The `sgd_no_high_lr` condition never fires here, because `optimizer` is held at `adamw`; it would only start excluding trials if `optimizer` were added to the grid.
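For reference, the values the two swept parameters discretize to (float formatting in the actual trial names may differ):

```
learning_rate (log, 4 steps):    [0.0001, 0.001, 0.01, 0.1]
weight_decay  (linear, 3 steps): [0.0, 0.005, 0.01]

# 4 × 3 = 12 combinations, e.g.:
#   lr-0.0001_opt-adamw_wd-0.0_bs-128
#   lr-0.001_opt-adamw_wd-0.005_bs-128
```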