# Test Harness
The test harness is the core of hemlock-lab: a three-layer validation system that compares hemlock's predictions against real-world RAG framework behavior.
## Three Layers of Validation
```mermaid
graph TB
    subgraph layer1["Layer 1: Extraction"]
        E1["POST /extract"] --> E2["Did the payload<br/>survive parsing?"]
    end
    subgraph layer2["Layer 2: Retrieval"]
        R1["POST /ingest"] --> R2["POST /query"]
        R2 --> R3["Was the poisoned doc<br/>retrieved for a target query?"]
    end
    subgraph layer3["Layer 3: Injection"]
        I1["POST /query<br/>(with poisoned context)"] --> I2["Did the LLM output<br/>contain injected content?"]
    end
    layer1 --> layer2
    layer2 --> layer3
```
Each layer builds on the previous:
| Layer | Question | Input | Output |
|---|---|---|---|
| Extraction | Does the payload survive parsing? | Poisoned document | Extracted text |
| Retrieval | Is the poisoned doc retrieved? | Ingested collection + query | Ranked results |
| Injection | Does the payload influence the LLM? | Full RAG chain + query | LLM response |
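Conceptually, each layer reduces to a boolean check on one artifact. A minimal sketch of those three checks (function names, thresholds, and inputs are illustrative, not the harness's actual API):

```python
# Hypothetical per-layer success predicates mirroring the table above.
# Names and signatures are illustrative, not hemlock-lab's real code.

def extraction_survived(payload: str, extracted_text: str) -> bool:
    """Layer 1: did the payload survive the framework's document parsing?"""
    return payload in extracted_text

def doc_retrieved(poisoned_doc_id: str, ranked_ids: list[str], top_k: int = 5) -> bool:
    """Layer 2: was the poisoned document ranked in the top-k results?"""
    return poisoned_doc_id in ranked_ids[:top_k]

def injection_succeeded(keywords: list[str], llm_response: str) -> bool:
    """Layer 3: does the LLM output contain any injected keyword?"""
    lowered = llm_response.lower()
    return any(k.lower() in lowered for k in keywords)
```

Each layer only runs meaningfully if the previous one passed: a payload stripped at parse time cannot be retrieved, and an unretrieved document cannot influence the LLM.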
## Orchestration
`harness/run_all.sh` is the core single-run orchestrator. It generates a hemlock corpus, records hemlock's predictions, then tests the corpus against all framework pipelines:
```bash
# Basic run (all 3 layers, override payload)
bash harness/run_all.sh

# Single layer
bash harness/run_all.sh --layer extraction

# With adaptation flags
bash harness/run_all.sh \
  --payload authority \
  --authority-style academic \
  --target-model qwen \
  --target-framework langchain
```
### `run_all.sh` flags
| Flag | Default | Description |
|---|---|---|
| `--layer` | `all` | Layer to run: `all`, `extraction`, `retrieval`, `injection` |
| `--host` | `localhost` | RAG service hostname |
| `--query` | `What is the refund policy?` | Target retrieval query |
| `--payload` | `override` | Payload category |
| `--topic` | `general knowledge base` | Cover text topic |
| `--cover-text-file` | | Path to explicit cover text file |
| `--target-model` | | Target LLM for adaptive wrapping: `qwen`, `llama3`, `mistral`, etc. |
| `--target-framework` | | Target RAG framework: `langchain`, `llamaindex`, `haystack`, `generic` |
| `--adaptation-order` | | Layer ordering: `model-first`, `framework-first` |
| `--authority-style` | | Authority-mimicry wrapper: `academic`, `institutional`, `regulatory` |
| `--jailbreak` | | Jailbreak wrapper style |
| `--dialogue-turns` | | Dialogue injection setup turns |
| `--guardrail-bypass` | | Guardrail evasion technique |
| `--system-prompt` | | Custom system prompt name (maps to `/app/system-prompts/<name>.txt`) |
| `--optimize` | | Optimization strategy: `cem`, `genetic`, `whitebox` |
| `--cluster-size` | | Generate cross-referencing document cluster of this size |
| `--reuse-corpus` | | Path to existing corpus (skip generation) |
| `--save-corpus` | | Path to save generated corpus for reuse |
| `--hybrid-retrieval` | `false` | Enable BM25 + dense hybrid retrieval |
| `--adaptive` | `false` | Enable adaptive feedback loop mode |
| `--injection-weight` | | Joint optimization injection weight [0.0–1.0]. Requires reward server. |
| `--injection-model-host` | `http://localhost:9090` | Reward model server URL |
| `--cover-text-density` | | Fraction of cover text to retain [0.3–1.0] |
| `--payload-position` | | Payload placement: `start` or `end` |
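For scripted multi-run experiments, invocations of `run_all.sh` can be assembled programmatically from the flags above. A sketch, using a hypothetical helper (the flag names mirror the table; the payload list is illustrative):

```python
# Hypothetical helper that assembles a run_all.sh command line.
# Flag names mirror the documented table; payload values are examples.

def build_run_cmd(payload: str, layer: str = "all", host: str = "localhost") -> list[str]:
    """Build an argv list suitable for subprocess.run(cmd, check=True)."""
    return [
        "bash", "harness/run_all.sh",
        "--layer", layer,
        "--payload", payload,
        "--host", host,
    ]

# One extraction-only invocation per payload category.
cmds = [build_run_cmd(p, layer="extraction") for p in ["override", "authority"]]
```

Building the argv as a list (rather than a shell string) avoids quoting issues for values like the default query, which contains spaces.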
> **Sweep scripts:** For multi-payload and multi-model runs, see Sweep Scripts.
## Run Commands
| Command | Layers | Purpose |
|---|---|---|
| `make test` | All 3 | Full validation suite |
| `make test-extract` | Layer 1 | Extraction survival only |
| `make test-retrieval` | Layer 2 | Retrieval ranking only |
| `make test-injection` | Layer 3 | End-to-end injection only |
## Report Structure
Each test run produces a timestamped report directory:
```
reports/
└── 2026-04-02T10-30-00/
    ├── run-config.json           # Effective runtime configuration
    ├── hemlock-predictions.json  # hemlock's predictions for all combinations
    ├── extraction-results.json   # Layer 1 results
    ├── drift.md                  # Human-readable drift analysis
    ├── retrieval-results.json    # Layer 2 results
    └── injection-results.json    # Layer 3 results
```
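The prediction and result files pair naturally for drift analysis: wherever a layer's observed outcome differs from hemlock's prediction for the same document-framework pair, that is a drift. A sketch under an assumed flat schema (`"doc_id/framework" -> bool`; the real JSON layout may differ):

```python
# Illustrative drift computation over an assumed flat schema where both
# predictions and results map "doc_id/framework" keys to a success bool.
# The actual report-file layout may differ.

def find_drifts(predictions: dict[str, bool],
                results: dict[str, bool]) -> dict[str, tuple[bool, bool]]:
    """Return {key: (predicted, observed)} for every mismatched pair."""
    return {
        key: (predictions[key], results[key])
        for key in predictions.keys() & results.keys()
        if predictions[key] != results[key]
    }

predictions = {"doc1/langchain": True, "doc1/haystack": True}
results = {"doc1/langchain": True, "doc1/haystack": False}
# find_drifts(predictions, results) -> {"doc1/haystack": (True, False)}
```

Only keys present in both files are compared, so a partial run (e.g. `--layer extraction`) does not spuriously flag untested combinations.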
## Layer Details
Each layer has a dedicated page:
| Layer | Page | Key Output |
|---|---|---|
| Extraction | Extraction | Match/Drift/Error per document×framework |
| Retrieval | Retrieval | Ranking position of poisoned docs |
| Injection | Injection | Keyword detection in LLM output |
| Drift Report | Drift Report | Actionable list of prediction errors |
## Snapshot Workflow
The test harness modifies ChromaDB state (it creates collections and ingests documents). To reset between runs, use `make restore`.

> **Always restore between test runs.** Leftover collections from previous runs can interfere with retrieval and injection tests. `make restore` rolls back to the lab-ready snapshot, which contains only seed data.
## Next Steps
- Extraction Tests — Layer 1 deep dive
- Drift Report — How to interpret and act on drifts
- Sweep Scripts — Multi-payload and multi-model orchestration
- Quick Start — First test run walkthrough
## Optimization & Analysis
- Bayesian Optimizer — GP hyperparameter search over 10 dimensions
- Reward Model — Training data pipeline, MLP training, and HTTP serving
- Pareto Sweep — Injection-weight ablation for trade-off analysis
- Validation Experiments — Controlled A/B experiment orchestration
- Statistical Analysis — Bootstrap CIs, effect sizes, significance tests
- Figure Generation — Publication-ready plots