Quick Start
This guide assumes you have a running stack (docker compose up -d completed with all health checks passing). If not, see Installation first.
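Run the Full Test Suite
Run the harness service with its default command (the same command used in the Example Workflow below):
docker compose --profile test run --rm harness
This runs the complete validation workflow: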
graph LR
A["hemlock batch"] --> B["hemlock validate"]
B --> C["extraction_test.py"]
C --> D["drift_report.py"]
B --> E["retrieval_test.py"]
B --> F["injection_test.py"]
D --> G["reports/"]
E --> G
F --> G
What Happens
- hemlock batch generates poisoned documents across all 10 formats × 36 techniques.
- hemlock validate --json produces predictions for each document/framework combination.
- Layer 1 (Extraction) POSTs each document to each pipeline's /extract endpoint and checks whether the payload survives.
- The Drift Report compares extraction results against hemlock's predictions.
- Layer 2 (Retrieval) ingests documents into ChromaDB, queries for target topics, and checks ranking.
- Layer 3 (Injection) runs the full RAG chain (retrieve → prompt → LLM) and checks whether injected content influences the output.
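The Layer 1 check described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the JSON response shape (a "text" field), the `payload_survives` helper, and the injectable `post` transport are all assumptions made so the example runs without a live stack.

```python
import json
import re
from typing import Callable

# Hypothetical transport: takes (url, raw document bytes), returns the response body.
PostFn = Callable[[str, bytes], bytes]


def payload_survives(post: PostFn, base_url: str, document: bytes, payload: str) -> bool:
    """POST a document to a pipeline's /extract endpoint and report whether
    the injected payload is still present in the extracted text."""
    body = post(f"{base_url}/extract", document)
    # Assumption: the endpoint answers with JSON shaped like {"text": "..."}.
    extracted = json.loads(body).get("text", "")
    # Case-insensitive substring match; a real harness may normalise whitespace too.
    return payload.lower() in extracted.lower()


def fake_post(url: str, document: bytes) -> bytes:
    """Stand-in pipeline that strips HTML comments before 'extracting' text."""
    text = re.sub(r"<!--.*?-->", "", document.decode("utf-8"), flags=re.DOTALL)
    return json.dumps({"text": text}).encode("utf-8")
```

Passing the transport in as a parameter keeps the survival check testable offline; against a running stack, `post` would be a thin wrapper around an HTTP client.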
Reading the Output
Extraction Results
[langchain] poisoned-comment-001.html: MATCH (predicted: survive, actual: survive)
[langchain] poisoned-css-hide-001.html: DRIFT (predicted: survive, actual: stripped)
[llamaindex] poisoned-metadata-001.docx: MATCH (predicted: survive, actual: survive)
| Status | Meaning |
|---|---|
| MATCH | hemlock's prediction was correct |
| DRIFT | hemlock's prediction was wrong — needs updating |
| NO_PRED | No prediction available for this combination |
| ERROR | Framework failed to process the file |
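One way to read the status table is as a small decision function. The sketch below is illustrative only: the `classify` name, the use of `None` to signal a missing prediction or a failed extraction, and the choice to report ERROR before NO_PRED are assumptions, not documented harness behaviour.

```python
from typing import Optional


def classify(predicted: Optional[str], actual: Optional[str]) -> str:
    """Map a prediction/outcome pair to one of the four statuses.

    Hypothetical convention: `predicted` is None when hemlock has no
    prediction, and `actual` is None when the framework failed to
    process the file.
    """
    if actual is None:
        return "ERROR"
    if predicted is None:
        return "NO_PRED"
    return "MATCH" if predicted == actual else "DRIFT"
```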
Drift Report
After extraction tests, a Markdown report is generated in reports/<timestamp>/:
reports/
└── 2026-04-02T10-30-00/
    ├── run-config.json
    ├── extraction-results.json
    ├── drift.md
    ├── retrieval-results.json
    └── injection-results.json
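Because each run gets its own timestamped directory, it can help to locate the newest one programmatically. The helper below is a sketch under the assumption that every directory under reports/ follows the zero-padded timestamp pattern shown above; `latest_report_dir` is a hypothetical name, not part of the harness.

```python
from pathlib import Path
from typing import Optional


def latest_report_dir(reports_root: Path) -> Optional[Path]:
    """Return the newest run directory under reports/, or None if there are none.

    Names like 2026-04-02T10-30-00 are zero-padded, so lexicographic order
    matches chronological order and no date parsing is needed.
    """
    runs = [p for p in reports_root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.name, default=None)
```

For example, `latest_report_dir(Path("reports")) / "drift.md"` would point at the most recent drift report, provided at least one run exists.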
The drift report contains:
- Summary — Total tests, matches, drifts, errors
- Drifted Predictions — Table of every combination where hemlock was wrong
- Action Items — Grouped by framework, specific changes needed in hemlock's validators
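The summary and per-framework grouping can be approximated directly from the extraction output lines shown under Extraction Results. This is a rough sketch that assumes every result line follows that exact format; `LINE_RE` and `summarize` are illustrative names, and the real drift report is produced by drift_report.py, whose internals are not shown here.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as:
# [langchain] poisoned-css-hide-001.html: DRIFT (predicted: survive, actual: stripped)
LINE_RE = re.compile(
    r"^\[(?P<framework>[^\]]+)\]\s+(?P<file>\S+):\s+(?P<status>[A-Z_]+)"
    r"(?:\s+\(predicted:\s*(?P<predicted>[^,]+),\s*actual:\s*(?P<actual>[^)]+)\))?"
)


def summarize(lines):
    """Count statuses and group drifted files by framework."""
    counts = Counter()
    drifts = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a result line
        counts[m["status"]] += 1
        if m["status"] == "DRIFT":
            drifts[m["framework"]].append((m["file"], m["predicted"], m["actual"]))
    return counts, dict(drifts)


sample = [
    "[langchain] poisoned-comment-001.html: MATCH (predicted: survive, actual: survive)",
    "[langchain] poisoned-css-hide-001.html: DRIFT (predicted: survive, actual: stripped)",
    "[llamaindex] poisoned-metadata-001.docx: MATCH (predicted: survive, actual: survive)",
]
counts, drifts = summarize(sample)
print(dict(counts))
print(drifts)
```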
What to do with drifts
Each drift is a bug in hemlock's survival matrix. The drift report guide explains how to trace a drift back to the specific hemlock validator code that needs updating.
Run Individual Layers
# Layer 1 only — extraction survival
docker compose --profile test run --rm harness bash -c "python extraction_test.py"
# Layer 2 only — retrieval ranking
docker compose --profile test run --rm harness bash -c "python retrieval_test.py"
# Layer 3 only — end-to-end injection
docker compose --profile test run --rm harness bash -c "python injection_test.py"
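If you run single layers often, the three commands above can be wrapped in a small script. The following is a hypothetical convenience wrapper, not something shipped with the harness; `layer_command` and `run_layers` are invented names, and only the docker compose arguments are taken from the commands above.

```python
import subprocess
from typing import List, Optional

# Script names are taken from the commands above.
LAYER_SCRIPTS = {
    1: "extraction_test.py",
    2: "retrieval_test.py",
    3: "injection_test.py",
}


def layer_command(layer: int) -> List[str]:
    """Build the docker compose argv that runs a single test layer."""
    script = LAYER_SCRIPTS[layer]  # raises KeyError for an unknown layer number
    return [
        "docker", "compose", "--profile", "test",
        "run", "--rm", "harness",
        "bash", "-c", f"python {script}",
    ]


def run_layers(layers, runner=subprocess.run) -> Optional[int]:
    """Run layers in order; return the first failing layer, or None if all pass."""
    for layer in layers:
        result = runner(layer_command(layer))
        if result.returncode != 0:
            return layer
    return None
```

Calling `run_layers([1, 2])` from the directory containing the compose file would run extraction and retrieval only, stopping early if extraction fails.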
Reset and Repeat
After a test run modifies ChromaDB state (ingested collections), reset to a clean state:
docker compose down -v && docker compose up -d
This destroys the chromadb-data volume and starts fresh. Then run tests again:
docker compose --profile test run --rm harness
Example Workflow
A typical session for updating hemlock's survival matrix:
# 1. Start stack
cd docker && docker compose up -d
# 2. Run tests
docker compose --profile test run --rm harness
# 3. Read the drift report
cat reports/*/drift.md
# 4. Fix hemlock validators based on drifts
cd ~/projects/hemlock
# ... edit pkg/validate/*.go ...
go test ./...
# 5. Reset and retest
cd ~/projects/hemlock-lab/docker
docker compose down -v && docker compose up -d
docker compose --profile test run --rm harness
Next Steps
- Test Harness — Deep dive on each test layer
- Drift Report — How to interpret and act on drift reports
- Operations — Service management and troubleshooting