hemlock research reproduce
Walks a deposit directory (the kind that `hemlock research deposit build` produces) looking for run-results JSONL files under `data/<bundle>/`. It then re-runs the strict-canary detector against every recorded response and compares the new verdict with the one recorded at run time.
Use this to:
- Audit detector drift after a canary registry version bump
- Confirm an AE deposit still reproduces under the current binary
- Compare two registry versions on the same response corpus
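For the last use case, one possible workflow is to replay the same deposit once per registry version, save each record with `--output`, and diff the `totals` objects. The sketch below assumes both records are already on disk as JSON. `totals_delta` and `load_record` are hypothetical helpers, not part of hemlock; the key names are the ones listed under Output.

```python
import json
from pathlib import Path

# Keys documented under `totals` in the replay record.
TOTAL_KEYS = (
    "files",
    "rows",
    "agreed",
    "disagreed_now_positive",
    "disagreed_now_negative",
    "confidence_changed",
)


def load_record(path: str) -> dict:
    """Read a replay record that was written with --output."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def totals_delta(old_record: dict, new_record: dict) -> dict:
    """Return (new - old) for each documented totals key.

    Keys missing from either record are treated as 0 (an assumption).
    """
    old = old_record.get("totals", {})
    new = new_record.get("totals", {})
    return {key: new.get(key, 0) - old.get(key, 0) for key in TOTAL_KEYS}
```

A call such as `totals_delta(load_record("replay-a.json"), load_record("replay-b.json"))` (file names are placeholders) then shows how many rows moved between `agreed` and each disagreement bucket.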
Synopsis
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| `--deposit` | string | (required) | Deposit directory (must contain `data/<bundle>/`) |
| `--bundle` | string | | Limit replay to bundles whose name contains this substring |
| `--canary-version` | string | `v1` | Canary registry version used for replay |
| `--output` | string | | Replay-record JSON output path (default: stderr summary only) |
| `--max-rows` | int | `0` | Cap rows replayed per file (0 = all) |
| `--json-stdout` | bool | `false` | Emit the full replay record as JSON to stdout |
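When driving the command from a script, the flags above can be assembled programmatically. The following is a minimal sketch that uses only the flags in the table. `build_reproduce_argv` and `run_reproduce` are hypothetical helpers, and two things are assumed rather than documented: that `--json-stdout` is accepted as a bare switch, and that stdout then carries nothing but the JSON record.

```python
import json
import subprocess


def build_reproduce_argv(deposit, bundle=None, canary_version=None,
                         max_rows=0, output=None, json_stdout=False):
    """Build an argv list for `hemlock research reproduce`.

    Optional flags are omitted when left at their documented defaults.
    """
    argv = ["hemlock", "research", "reproduce", "--deposit", str(deposit)]
    if bundle:
        argv += ["--bundle", bundle]
    if canary_version:
        argv += ["--canary-version", canary_version]
    if max_rows:
        argv += ["--max-rows", str(max_rows)]
    if output:
        argv += ["--output", str(output)]
    if json_stdout:
        # Assumption: the bool flag works as a bare switch.
        argv.append("--json-stdout")
    return argv


def run_reproduce(deposit, **options) -> dict:
    """Run the replay and parse the record from stdout (assumed to be pure JSON)."""
    options["json_stdout"] = True
    argv = build_reproduce_argv(deposit, **options)
    proc = subprocess.run(argv, check=True, capture_output=True, text=True)
    return json.loads(proc.stdout)
```

Keeping argv construction separate from execution makes the flag handling easy to unit-test without the hemlock binary installed.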
Output
Replay record JSON includes:
- `deposit_root`, `replay_canary_version`
- `bundles_replayed[]`: list of bundle names processed
- `totals`: `files`, `rows`, `agreed`, `disagreed_now_positive`, `disagreed_now_negative`, `confidence_changed`
- `disagreements[]`: one entry per row whose replay verdict differs; each carries `kind` (`now_positive` | `now_negative` | `confidence_changed`), the original verdict, the replay verdict, and a 300-char response excerpt
- `per_bundle[]`: per-bundle, per-file rollup
- `hemlock_sha`, `hemlock_version`, `generated_at`: pins which binary did the replay
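A record with this shape can be checked mechanically, for example as a CI gate. The sketch below relies only on the `totals` keys and the `kind` field listed above. `disagreement_counts` and `has_drift` are hypothetical helper names, and treating a missing key as zero is an assumption, not documented behaviour.

```python
from collections import Counter

# The three totals that indicate the replay diverged from the recorded run.
DRIFT_KEYS = (
    "disagreed_now_positive",
    "disagreed_now_negative",
    "confidence_changed",
)


def disagreement_counts(record: dict) -> Counter:
    """Count entries in disagreements[] grouped by their `kind` field."""
    return Counter(
        entry.get("kind", "unknown") for entry in record.get("disagreements", [])
    )


def has_drift(record: dict) -> bool:
    """True if any disagreement total in the record is non-zero."""
    totals = record.get("totals", {})
    return any(totals.get(key, 0) for key in DRIFT_KEYS)
```

A caller could load the file written by `--output` with `json.load` and exit non-zero when `has_drift` returns `True`.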
Disagreement kinds
| Kind | Meaning |
|---|---|
| `now_positive` | Replay fires; original missed. The new registry version detects a previously undetected injection. |
| `now_negative` | Replay quiet; original fired. The new registry version no longer fires on a previously recorded detection (typically a false-positive correction). |
| `confidence_changed` | Both fire (or both quiet), but the qualitative confidence tier (high / medium / low) differs. |
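The three kinds amount to a small decision rule over a pair of verdicts. The sketch below illustrates that rule as described in the table. It is not hemlock's implementation: the `Verdict` type, its field names, and the `classify` function are assumptions made for this example.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    """Hypothetical verdict shape: whether the detector fired, plus a tier."""
    fired: bool
    confidence: str  # "high", "medium", or "low"


def classify(original: Verdict, replay: Verdict) -> str:
    """Map an (original, replay) verdict pair to a disagreement kind."""
    if replay.fired and not original.fired:
        return "now_positive"        # replay fires; original missed
    if original.fired and not replay.fired:
        return "now_negative"        # replay quiet; original fired
    if original.confidence != replay.confidence:
        return "confidence_changed"  # same verdict, different tier
    return "agreed"
```

For example, `classify(Verdict(True, "high"), Verdict(True, "low"))` yields `"confidence_changed"`, because the verdict itself is unchanged and only the tier moved.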
Examples
Replay every bundle in a deposit at the current registry version:
Sample stderr summary:
```
[hemlock reproduce] deposit: ./paper/artifact/paper-a
replay canary: v1
bundles: 1
files: 1
rows: 5
agreed: 5
now positive: 0 (replay fires; original missed)
now negative: 0 (replay quiet; original fired)
conf changed: 0 (verdict equal, confidence differs)
record: ./replay.json
```
Replay only a single bundle, capping per-file rows for a quick spot-check:
```
hemlock research reproduce \
  --deposit ./paper/artifact/paper-a \
  --bundle smoke-bundle \
  --max-rows 50 \
  --output ./replay-spot.json
```
Related
- `hemlock research deposit`: produces the deposit tree `reproduce` walks
- `hemlock run`: produces the JSONL files inside `data/<bundle>/`
- `hemlock defend detect`: the underlying detector `reproduce` re-runs