
Test Harness

The test harness is hemlock-lab's core: a three-layer validation system that compares hemlock's predictions against the observed behavior of real RAG frameworks.


Three Layers of Validation

graph TB
    subgraph layer1["Layer 1: Extraction"]
        E1["POST /extract"] --> E2["Did the payload<br/>survive parsing?"]
    end

    subgraph layer2["Layer 2: Retrieval"]
        R1["POST /ingest"] --> R2["POST /query"]
        R2 --> R3["Was the poisoned doc<br/>retrieved for a target query?"]
    end

    subgraph layer3["Layer 3: Injection"]
        I1["POST /query<br/>(with poisoned context)"] --> I2["Did the LLM output<br/>contain injected content?"]
    end

    layer1 --> layer2
    layer2 --> layer3

Each layer builds on the previous:

| Layer | Question | Input | Output |
| --- | --- | --- | --- |
| Extraction | Does the payload survive parsing? | Poisoned document | Extracted text |
| Retrieval | Is the poisoned doc retrieved? | Ingested collection + query | Ranked results |
| Injection | Does the payload influence the LLM? | Full RAG chain + query | LLM response |
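
As a rough illustration of how the layers map onto the service endpoints in the diagram above, the following sketch prints the endpoint sequence per layer. The endpoint paths are taken from the diagram; the function names `layer_endpoints` and `print_layer_plan` are hypothetical and are not part of the harness.

```shell
# Sketch only: function names are hypothetical, endpoint paths come from the diagram.

# Print the endpoint(s) that a given layer exercises, one per line.
layer_endpoints() {
  case "$1" in
    extraction) printf '%s\n' "POST /extract" ;;
    retrieval)  printf '%s\n' "POST /ingest" "POST /query" ;;
    injection)  printf '%s\n' "POST /query" ;;
    *) echo "unknown layer: $1" >&2; return 1 ;;
  esac
}

# Print every layer, in order, with the endpoints it touches.
print_layer_plan() {
  local layer endpoint
  for layer in extraction retrieval injection; do
    layer_endpoints "$layer" | while IFS= read -r endpoint; do
      printf '%s: %s\n' "$layer" "$endpoint"
    done
  done
}
```

Calling `print_layer_plan` lists the layers in the order the harness runs them, which can be handy when reading logs from a multi-layer run.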

Orchestration

harness/run_all.sh is the core single-run orchestrator. It generates a hemlock corpus, runs validate to produce hemlock's predictions, and then tests the corpus against every framework pipeline:

# Basic run (all 3 layers, override payload)
bash harness/run_all.sh

# Single layer
bash harness/run_all.sh --layer extraction

# With adaptation flags
bash harness/run_all.sh \
  --payload authority \
  --authority-style academic \
  --target-model qwen \
  --target-framework langchain

run_all.sh flags

| Flag | Default | Description |
| --- | --- | --- |
| --layer | all | Layer to run: all, extraction, retrieval, injection |
| --host | localhost | RAG service hostname |
| --query | What is the refund policy? | Target retrieval query |
| --payload | override | Payload category |
| --topic | general knowledge base | Cover text topic |
| --cover-text-file | | Path to explicit cover text file |
| --target-model | | Target LLM for adaptive wrapping: qwen, llama3, mistral, etc. |
| --target-framework | | Target RAG framework: langchain, llamaindex, haystack, generic |
| --adaptation-order | | Layer ordering: model-first, framework-first |
| --authority-style | | Authority-mimicry wrapper: academic, institutional, regulatory |
| --jailbreak | | Jailbreak wrapper style |
| --dialogue-turns | | Dialogue injection setup turns |
| --guardrail-bypass | | Guardrail evasion technique |
| --system-prompt | | Custom system prompt name (maps to /app/system-prompts/<name>.txt) |
| --optimize | | Optimization strategy: cem, genetic, whitebox |
| --cluster-size | | Generate cross-referencing document cluster of this size |
| --reuse-corpus | | Path to existing corpus (skip generation) |
| --save-corpus | | Path to save generated corpus for reuse |
| --hybrid-retrieval | false | Enable BM25 + dense hybrid retrieval |
| --adaptive | false | Enable adaptive feedback loop mode |
| --injection-weight | | Joint optimization injection weight [0.0–1.0]. Requires reward server. |
| --injection-model-host | http://localhost:9090 | Reward model server URL |
| --cover-text-density | | Fraction of cover text to retain [0.3–1.0] |
| --payload-position | | Payload placement: start or end |
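
Several flags above accept only a bounded range or a fixed set of values. The sketch below shows one way to check such values in a wrapper script before invoking run_all.sh. The helper names are hypothetical, and run_all.sh may well perform its own validation; the accepted ranges and values are copied from the table.

```shell
# Sketch only: helper names are hypothetical; ranges and values come from the flags table.

# Succeed when $1 is a plain decimal number within [$2, $3] (inclusive).
in_range() {
  awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN {
    if (v !~ /^[0-9]*\.?[0-9]+$/) exit 1   # reject non-numeric input
    if (v + 0 >= lo + 0 && v + 0 <= hi + 0) exit 0
    exit 1
  }'
}

# --injection-weight accepts [0.0-1.0].
check_injection_weight() { in_range "$1" 0.0 1.0; }

# --cover-text-density accepts [0.3-1.0].
check_cover_text_density() { in_range "$1" 0.3 1.0; }

# --payload-position accepts start or end.
check_payload_position() {
  case "$1" in
    start|end) return 0 ;;
    *) return 1 ;;
  esac
}

# --layer accepts all, extraction, retrieval, or injection.
check_layer() {
  case "$1" in
    all|extraction|retrieval|injection) return 0 ;;
    *) return 1 ;;
  esac
}
```

A wrapper could then guard the call, for example `check_cover_text_density 0.5 && bash harness/run_all.sh --cover-text-density 0.5`, so that an out-of-range value fails before any corpus is generated.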

Sweep scripts

For multi-payload and multi-model runs, see Sweep Scripts.


Run Commands

| Command | Layers | Purpose |
| --- | --- | --- |
| make test | All 3 | Full validation suite |
| make test-extract | Layer 1 | Extraction survival only |
| make test-retrieval | Layer 2 | Retrieval ranking only |
| make test-injection | Layer 3 | End-to-end injection only |

Report Structure

Each test run produces a timestamped report directory:

reports/
└── 2026-04-02T10-30-00/
    ├── run-config.json            # Effective runtime configuration
    ├── hemlock-predictions.json   # Predictions from hemlock for all combinations
    ├── extraction-results.json    # Layer 1 results
    ├── drift.md                   # Human-readable drift analysis
    ├── retrieval-results.json     # Layer 2 results
    └── injection-results.json     # Layer 3 results
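
To inspect the most recent run, a small helper can locate the newest report directory and list any of the files above that are absent. This is a sketch under stated assumptions: the function names are hypothetical, the file names come from the tree above, and a single-layer run may legitimately omit the results files of layers it did not execute.

```shell
# Sketch only: function names are hypothetical; file names come from the report tree.

# Print the newest report directory under $1 (default: reports).
# Directory names share one timestamp format, so glob order is chronological.
latest_report() {
  local root="${1:-reports}" dir latest=""
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    latest="${dir%/}"
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Print each expected report file that is missing from directory $1.
missing_report_files() {
  local dir="$1" f
  for f in run-config.json hemlock-predictions.json extraction-results.json \
           drift.md retrieval-results.json injection-results.json; do
    [ -f "$dir/$f" ] || printf '%s\n' "$f"
  done
}
```

For example, `missing_report_files "$(latest_report)"` prints nothing when the latest run produced every file listed above.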

Layer Details

Each layer has a dedicated page:

| Layer | Page | Key Output |
| --- | --- | --- |
| Extraction | Extraction | Match/Drift/Error per document×framework |
| Retrieval | Retrieval | Ranking position of poisoned docs |
| Injection | Injection | Keyword detection in LLM output |
| Drift Report | Drift Report | Actionable list of prediction errors |

Snapshot Workflow

The test harness modifies ChromaDB state (creates collections, ingests documents). To reset:

# Restore to clean state
make restore

# Run again
make test

Always restore between test runs

Leftover collections from previous runs can interfere with retrieval and injection tests. make restore rolls back to the lab-ready snapshot with only seed data.
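
When running the suite several times in a row, the restore step can be folded into a loop so that it is never skipped. The sketch below uses only the `make restore` and `make test` targets documented on this page; the function name `run_clean` and the stop-on-first-failure policy are assumptions.

```shell
# Sketch only: run_clean is a hypothetical helper, not part of the harness.

# Run the full suite $1 times (default: 1), restoring the snapshot before each run.
# Stops at the first failing restore or test.
run_clean() {
  local runs="${1:-1}" i=1
  while [ "$i" -le "$runs" ]; do
    echo "run $i/$runs: restoring snapshot"
    make restore || return 1
    make test || return 1
    i=$((i + 1))
  done
}
```

For instance, `run_clean 3` performs three restore-then-test cycles, so each run starts from the lab-ready snapshot.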


Next Steps

Optimization & Analysis