Skip to content

hemlock research reproduce

Walks a deposit directory (the kind hemlock research deposit build produces) for run-results JSONL files under data/<bundle>/, then re-runs the strict-canary detector against every recorded response and compares the new verdict to the verdict that was recorded at run time.

Use this to:

  • Audit detector drift after a canary registry version bump
  • Confirm an AE deposit still reproduces under the current binary
  • Compare two registry versions on the same response corpus

Synopsis

hemlock research reproduce --deposit <root> [flags]

Flags

Flag Type Default Description
--deposit string (required) Deposit directory (must contain data/<bundle>/)
--bundle string Limit replay to bundles whose name contains this substring
--canary-version string v1 Canary registry version used for replay
--output string Replay-record JSON output path (default: stderr summary only)
--max-rows int 0 Cap rows replayed per file (0 = all)
--json-stdout bool false Emit the full replay record as JSON to stdout

Output

Replay record JSON includes:

  • deposit_root, replay_canary_version
  • bundles_replayed[] — list of bundle names processed
  • totalsfiles, rows, agreed, disagreed_now_positive, disagreed_now_negative, confidence_changed
  • disagreements[] — one entry per row whose replay verdict differs; each carries kind (now_positive | now_negative | confidence_changed), the original verdict, the replay verdict, and a 300-char response excerpt
  • per_bundle[] — per-bundle, per-file rollup
  • hemlock_sha, hemlock_version, generated_at — pins which binary did the replay

Disagreement kinds

Kind Meaning
now_positive Replay fires; original missed. The new registry version detects a previously-undetected injection.
now_negative Replay quiet; original fired. The new registry version no longer fires on a previously-recorded detection (typically a false-positive correction).
confidence_changed Both fire (or both quiet), but the qualitative confidence tier (high / medium / low) differs.

Examples

Replay every bundle in a deposit at the current registry version:

hemlock research reproduce --deposit ./paper/artifact/paper-a --output ./replay.json

Sample stderr summary:

[hemlock reproduce] deposit: ./paper/artifact/paper-a
                    replay canary: v1
                    bundles:       1
                    files:         1
                    rows:          5
                    agreed:        5
                    now positive:  0 (replay fires; original missed)
                    now negative:  0 (replay quiet; original fired)
                    conf changed:  0 (verdict equal, confidence differs)
                    record:        ./replay.json

Replay only a single bundle, capping per-file rows for a quick spot-check:

hemlock research reproduce \
  --deposit ./paper/artifact/paper-a \
  --bundle smoke-bundle \
  --max-rows 50 \
  --output ./replay-spot.json