# Test Harness
The test harness is the core of hemlock-lab: a three-layer validation system that compares hemlock's predictions against real-world RAG framework behavior.
## Three Layers of Validation
```mermaid
graph TB
    subgraph layer1["Layer 1: Extraction"]
        E1["POST /extract"] --> E2["Did the payload<br/>survive parsing?"]
    end
    subgraph layer2["Layer 2: Retrieval"]
        R1["POST /ingest"] --> R2["POST /query"]
        R2 --> R3["Was the poisoned doc<br/>retrieved for a target query?"]
    end
    subgraph layer3["Layer 3: Injection"]
        I1["POST /query<br/>(with poisoned context)"] --> I2["Did the LLM output<br/>contain injected content?"]
    end
    layer1 --> layer2
    layer2 --> layer3
```
Each layer builds on the previous:
| Layer | Question | Input | Output |
|---|---|---|---|
| Extraction | Does the payload survive parsing? | Poisoned document | Extracted text |
| Retrieval | Is the poisoned doc retrieved? | Ingested collection + query | Ranked results |
| Injection | Does the payload influence the LLM? | Full RAG chain + query | LLM response |
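Conceptually, each layer reduces to a boolean check on one artifact. A minimal sketch of those three checks (function names, thresholds, and inputs are illustrative, not the harness's actual API):

```python
# Hypothetical per-layer success predicates mirroring the table above.
# Names and signatures are illustrative, not hemlock-lab's real code.

def extraction_survived(payload: str, extracted_text: str) -> bool:
    """Layer 1: did the payload survive the framework's document parsing?"""
    return payload in extracted_text

def doc_retrieved(poisoned_doc_id: str, ranked_ids: list[str], top_k: int = 5) -> bool:
    """Layer 2: was the poisoned document ranked in the top-k results?"""
    return poisoned_doc_id in ranked_ids[:top_k]

def injection_succeeded(keywords: list[str], llm_response: str) -> bool:
    """Layer 3: does the LLM output contain any injected keyword?"""
    lowered = llm_response.lower()
    return any(k.lower() in lowered for k in keywords)
```

Each layer only runs meaningfully if the previous one passed: a payload stripped at parse time cannot be retrieved, and an unretrieved document cannot influence the LLM.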
## Orchestration
`harness/run_all.sh` is the core single-run orchestrator. It generates a hemlock corpus, records hemlock's predictions, then tests the corpus against all framework pipelines:
```bash
# Basic run (all 3 layers, override payload)
bash harness/run_all.sh

# Single layer
bash harness/run_all.sh --layer extraction

# With adaptation flags
bash harness/run_all.sh \
  --payload authority \
  --authority-style academic \
  --target-model qwen \
  --target-framework langchain
```
### `run_all.sh` flags
| Flag | Default | Description |
|---|---|---|
| `--layer` | `all` | Layer to run: `all`, `extraction`, `retrieval`, `injection` |
| `--host` | `localhost` | RAG service hostname |
| `--query` | `What is the refund policy?` | Target retrieval query |
| `--payload` | `override` | Payload category |
| `--topic` | `general knowledge base` | Cover text topic |
| `--cover-text-file` | | Path to explicit cover text file |
| `--target-model` | | Target LLM for adaptive wrapping: `qwen`, `llama3`, `mistral`, etc. |
| `--target-framework` | | Target RAG framework: `langchain`, `llamaindex`, `haystack`, `generic` |
| `--adaptation-order` | | Layer ordering: `model-first`, `framework-first` |
| `--authority-style` | | Authority-mimicry wrapper: `academic`, `institutional`, `regulatory` |
| `--jailbreak` | | Jailbreak wrapper style |
| `--dialogue-turns` | | Dialogue injection setup turns |
| `--guardrail-bypass` | | Guardrail evasion technique |
| `--system-prompt` | | Custom system prompt name (maps to `/app/system-prompts/<name>.txt`) |
| `--optimize` | | Optimization strategy: `cem`, `genetic`, `whitebox` |
| `--cluster-size` | | Generate cross-referencing document cluster of this size |
| `--reuse-corpus` | | Path to existing corpus (skip generation) |
| `--save-corpus` | | Path to save generated corpus for reuse |
| `--hybrid-retrieval` | `false` | Enable BM25 + dense hybrid retrieval |
| `--adaptive` | `false` | Enable adaptive feedback loop mode |
| `--injection-weight` | | Joint optimization injection weight [0.0–1.0]. Requires reward server. |
| `--injection-model-host` | `http://localhost:9090` | Reward model server URL |
| `--cover-text-density` | | Fraction of cover text to retain [0.3–1.0] |
| `--payload-position` | | Payload placement: `start` or `end` |
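For scripted multi-run experiments, invocations of `run_all.sh` can be assembled programmatically from the flags above. A sketch, using a hypothetical helper (the flag names mirror the table; the payload list is illustrative):

```python
# Hypothetical helper that assembles a run_all.sh command line.
# Flag names mirror the documented table; payload values are examples.

def build_run_cmd(payload: str, layer: str = "all", host: str = "localhost") -> list[str]:
    """Build an argv list suitable for subprocess.run(cmd, check=True)."""
    return [
        "bash", "harness/run_all.sh",
        "--layer", layer,
        "--payload", payload,
        "--host", host,
    ]

# One extraction-only invocation per payload category.
cmds = [build_run_cmd(p, layer="extraction") for p in ["override", "authority"]]
```

Building the argv as a list (rather than a shell string) avoids quoting issues for values like the default query, which contains spaces.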
> **Sweep scripts:** For multi-payload and multi-model runs, see Sweep Scripts.
## Run Commands
| Command | Layers | Purpose |
|---|---|---|
| `make test` | All 3 | Full validation suite |
| `make test-extract` | Layer 1 | Extraction survival only |
| `make test-retrieval` | Layer 2 | Retrieval ranking only |
| `make test-injection` | Layer 3 | End-to-end injection only |
## Report Structure
Each test run produces a timestamped report directory:
```
reports/
└── 2026-04-02T10-30-00/
    ├── run-config.json           # Effective runtime configuration
    ├── hemlock-predictions.json  # hemlock's predictions for all combinations
    ├── extraction-results.json   # Layer 1 results
    ├── drift.md                  # Human-readable drift analysis
    ├── retrieval-results.json    # Layer 2 results
    └── injection-results.json    # Layer 3 results
```
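The prediction and result files pair naturally for drift analysis: wherever a layer's observed outcome differs from hemlock's prediction for the same document-framework pair, that is a drift. A sketch under an assumed flat schema (`"doc_id/framework" -> bool`; the real JSON layout may differ):

```python
# Illustrative drift computation over an assumed flat schema where both
# predictions and results map "doc_id/framework" keys to a success bool.
# The actual report-file layout may differ.

def find_drifts(predictions: dict[str, bool],
                results: dict[str, bool]) -> dict[str, tuple[bool, bool]]:
    """Return {key: (predicted, observed)} for every mismatched pair."""
    return {
        key: (predictions[key], results[key])
        for key in predictions.keys() & results.keys()
        if predictions[key] != results[key]
    }

predictions = {"doc1/langchain": True, "doc1/haystack": True}
results = {"doc1/langchain": True, "doc1/haystack": False}
# find_drifts(predictions, results) -> {"doc1/haystack": (True, False)}
```

Only keys present in both files are compared, so a partial run (e.g. `--layer extraction`) does not spuriously flag untested combinations.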
## Layer Details
Each layer has a dedicated page:
| Layer | Page | Key Output |
|---|---|---|
| Extraction | Extraction | Match/Drift/Error per document×framework |
| Retrieval | Retrieval | Ranking position of poisoned docs |
| Injection | Injection | Keyword detection in LLM output |
| Drift Report | Drift Report | Actionable list of prediction errors |
## Snapshot Workflow
The test harness modifies ChromaDB state (it creates collections and ingests documents). To reset between runs, use `make restore`.

> **Always restore between test runs.** Leftover collections from previous runs can interfere with retrieval and injection tests. `make restore` rolls back to the lab-ready snapshot, which contains only seed data.
## Next Steps
- Extraction Tests — Layer 1 deep dive
- Drift Report — How to interpret and act on drifts
- Sweep Scripts — Multi-payload and multi-model orchestration
- Quick Start — First test run walkthrough
## Optimization & Analysis
- Bayesian Optimizer — GP hyperparameter search over 10 dimensions
- Reward Model — Training data pipeline, MLP training, and HTTP serving
- Pareto Sweep — Injection-weight ablation for trade-off analysis
- Validation Experiments — Controlled A/B experiment orchestration
- Statistical Analysis — Bootstrap CIs, effect sizes, significance tests
- Figure Generation — Publication-ready plots