## Workspace layout
After `herd run`, the workspace directory contains:

```
my_experiment/
├── hyperherd.yaml
├── launch.sh
└── .hyperherd/
    ├── manifest.json      # Trial records: {index, params, experiment_name, status}
    ├── job_ids.json       # Records of submitted SLURM jobs
    ├── job.sbatch         # The generated sbatch script
    ├── logs/
    │   ├── 0.out, 0.err   # stdout / stderr for array task 0
    │   ├── 1.out, 1.err
    │   └── ...
    └── results/
        ├── 0.json         # Logged metrics for trial 0 (from log_result())
        ├── 1.json
        └── ...
```
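For example, you can count trials in each state straight from the manifest. A minimal sketch, assuming `jq` is installed and that `manifest.json` is a JSON array of trial records (the actual on-disk shape may differ):

```bash
# Count trials by status, e.g. {"completed": 8, "running": 4}.
# Assumes manifest.json is a JSON array of {index, params, experiment_name, status} records.
jq 'group_by(.status) | map({(.[0].status): length}) | add' \
  my_experiment/.hyperherd/manifest.json
```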
| File | Purpose |
|---|---|
| `manifest.json` | Authoritative mapping of array index → parameter values + experiment name + status. Don't edit manually. |
| `job_ids.json` | Records each `herd run` submission with its SLURM job ID and the indices it covered. Used to reconcile status across resubmissions. |
| `job.sbatch` | The generated SLURM script. Inspect it to verify directives. |
| `logs/` | SLURM-captured stdout/stderr. `herd status` reads the last line of each `.out`. |
| `results/` | Per-trial JSON metrics written by `log_result()`. Read by `herd res`. |
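Since the per-trial metrics are plain JSON, you can also collate them without `herd res`. A minimal sketch, assuming `jq` is available and you are in the workspace root:

```bash
# Print each trial's metrics on one line, keyed by results file name.
for f in .hyperherd/results/*.json; do
  printf '%s\t%s\n' "${f##*/}" "$(jq -c . "$f")"
done
```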
## Generated SLURM script
For reference, HyperHerd generates a batch script roughly like this:
```bash
#!/bin/bash
#SBATCH --job-name=hyperherd_resnet_sweep
#SBATCH --array=0-11
#SBATCH --partition=gpu
#SBATCH --time=08:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:a100:1
#SBATCH --output=<workspace>/.hyperherd/logs/%a.out
#SBATCH --error=<workspace>/.hyperherd/logs/%a.err
#SBATCH --open-mode=append
#SBATCH --export=ALL

# Export HyperHerd environment variables
export HYPERHERD_WORKSPACE="<workspace>"
export HYPERHERD_SWEEP_NAME="resnet_sweep"
export HYPERHERD_TRIAL_ID="$SLURM_ARRAY_TASK_ID"

# Per-trial values (HYPERHERD_TRIAL_NAME + OVERRIDES) are baked into the
# generated script via a `case "$SLURM_ARRAY_TASK_ID" in ... esac` block.
HYPERHERD_TRIAL_NAME="lr-0.001_opt-adam_bs-64"  # set per-task
export HYPERHERD_TRIAL_NAME

OVERRIDES="experiment_name=$HYPERHERD_TRIAL_NAME learning_rate=0.001 optimizer=adam batch_size=64 data.root=/scratch/imagenet trainer.max_epochs=90"

# Invoke the user's launcher script
bash "<workspace>/launch.sh" "$OVERRIDES"
```
The actual generated script bakes the per-trial values directly into a `case` statement at submission time, so neither Python nor the `hyperherd` package needs to be available on the compute node; only `bash` does.
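Concretely, the baked-in block has roughly this shape (a sketch only: the branch for index 0 uses the trial from the example above, while the values for index 1 are hypothetical; the real block is emitted by HyperHerd with one branch per array index):

```bash
case "$SLURM_ARRAY_TASK_ID" in
  0)
    HYPERHERD_TRIAL_NAME="lr-0.001_opt-adam_bs-64"
    OVERRIDES="experiment_name=$HYPERHERD_TRIAL_NAME learning_rate=0.001 optimizer=adam batch_size=64"
    ;;
  1)
    # Hypothetical second trial, shown for illustration.
    HYPERHERD_TRIAL_NAME="lr-0.001_opt-sgd_bs-64"
    OVERRIDES="experiment_name=$HYPERHERD_TRIAL_NAME learning_rate=0.001 optimizer=sgd batch_size=64"
    ;;
  # ... one branch per array index up to 11 ...
esac
export HYPERHERD_TRIAL_NAME
```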
## Re-running and reconciliation
`herd run` against an existing workspace:

- Loads the existing manifest.
- Generates the new manifest from the current `hyperherd.yaml`.
- Diffs them:
    - New trials are appended.
    - Removed trials:
        - If their status is in (`submitted`, `queued`, `running`, `completed`), `herd run` refuses unless you pass `-f` (in which case they're kept as orphans for traceability).
        - Otherwise they're dropped.
- Submits all `ready`/`failed`/`cancelled` trials.
This means you can edit your sweep mid-experiment (add a new parameter value, tighten a condition) without losing already-completed trials.
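A typical mid-sweep edit then looks like this (a sketch; it assumes `herd` commands are run from inside the workspace, which may not match your setup):

```bash
cd my_experiment/
"$EDITOR" hyperherd.yaml   # e.g. add a new optimizer value to the sweep
herd run                   # diffs manifests, appends new trials, submits ready ones
herd status                # completed trials keep their status
```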