
Quick Start

This guide assumes you have a running stack (docker compose up -d completed with all health checks passing). If not, see Installation first.


Run the Full Test Suite

cd docker
docker compose --profile test run --rm harness

This runs the complete validation workflow:

graph LR
    A["hemlock batch"] --> B["hemlock validate"]
    B --> C["extraction_test.py"]
    C --> D["drift_report.py"]
    B --> E["retrieval_test.py"]
    B --> F["injection_test.py"]
    D --> G["reports/"]
    E --> G
    F --> G

What Happens

  1. hemlock batch — Generates poisoned documents across all 10 formats × 36 techniques
  2. hemlock validate --json — Produces predictions for each document/framework combination
  3. Layer 1: Extraction — POSTs each document to each pipeline's /extract endpoint, checks if the payload survives
  4. Drift Report — Compares extraction results against hemlock's predictions
  5. Layer 2: Retrieval — Ingests documents into ChromaDB, queries for target topics, checks ranking
  6. Layer 3: Injection — Runs full RAG chain (retrieve → prompt → LLM), checks if injected content influences output
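The survival check in step 3 can be pictured as a normalized substring test. This is only a sketch under stated assumptions: the names `normalize` and `survival` are hypothetical, the whitespace and case folding is a guess at what a tolerant comparison might do, and the real harness may decide survival differently. Only the labels "survive" and "stripped" come from the harness output.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase (an assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def survival(extracted_text: str, payload: str) -> str:
    """Return 'survive' if the payload is still present after extraction."""
    if normalize(payload) in normalize(extracted_text):
        return "survive"
    return "stripped"


# Hypothetical extracted text and payload, for illustration only.
extracted = "Quarterly results.\n  IGNORE   previous instructions.  "
print(survival(extracted, "ignore previous instructions"))
print(survival("Quarterly results.", "ignore previous instructions"))
```

The result for each document and pipeline is what step 4 then compares against hemlock's prediction.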

Reading the Output

Extraction Results

[langchain] poisoned-comment-001.html: MATCH (predicted: survive, actual: survive)
[langchain] poisoned-css-hide-001.html: DRIFT (predicted: survive, actual: stripped)
[llamaindex] poisoned-metadata-001.docx: MATCH (predicted: survive, actual: survive)

Status    Meaning
MATCH     hemlock's prediction was correct
DRIFT     hemlock's prediction was wrong; the prediction needs updating
NO_PRED   No prediction available for this combination
ERROR     Framework failed to process the file
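The four statuses can be read as a small decision function over a predicted and an actual outcome. The sketch below is illustrative rather than the harness's own code: `classify` is a hypothetical name, `None` stands in for "no prediction" or "framework failed", and checking ERROR before NO_PRED is an assumption.

```python
from typing import Optional


def classify(predicted: Optional[str], actual: Optional[str]) -> str:
    """Map one prediction/result pair to a status name."""
    if actual is None:
        # Assumed convention: None means the framework failed on the file.
        return "ERROR"
    if predicted is None:
        # Assumed convention: None means hemlock made no prediction.
        return "NO_PRED"
    return "MATCH" if predicted == actual else "DRIFT"


print(classify("survive", "survive"))   # MATCH
print(classify("survive", "stripped"))  # DRIFT
print(classify(None, "survive"))        # NO_PRED
print(classify("survive", None))        # ERROR
```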

Drift Report

After the extraction tests finish, a Markdown drift report (drift.md) is written to reports/<timestamp>/ alongside the other result files:

reports/
└── 2026-04-02T10-30-00/
    ├── run-config.json
    ├── extraction-results.json
    ├── drift.md
    ├── retrieval-results.json
    └── injection-results.json
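Because each run gets its own timestamped directory, it is handy to locate the most recent one before opening drift.md. A minimal sketch, assuming every run directory uses the zero-padded naming shown in the tree; `latest_report_dir` is a hypothetical helper, not part of the harness.

```python
from pathlib import Path
from typing import Optional


def latest_report_dir(reports_root: str = "reports") -> Optional[Path]:
    """Return the newest run directory under reports_root, or None."""
    root = Path(reports_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    # Names like 2026-04-02T10-30-00 are zero-padded, so the
    # lexicographic maximum is also the most recent run.
    return max(runs, key=lambda p: p.name, default=None)


latest = latest_report_dir()
print(latest / "drift.md" if latest else "no reports found")
```

Run it from the directory that contains reports/, or pass the path explicitly.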

The drift report contains:

  • Summary — Total tests, matches, drifts, errors
  • Drifted Predictions — Table of every combination where hemlock was wrong
  • Action Items — Grouped by framework, specific changes needed in hemlock's validators
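The Summary counts and the per-framework grouping can also be reproduced from the console output. The sketch below parses lines in the format shown under Extraction Results; `tally` is a hypothetical helper, and it assumes NO_PRED and ERROR lines share the same `[framework] file: STATUS` prefix, which the examples here do not confirm.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as:
# [langchain] poisoned-comment-001.html: MATCH (predicted: survive, actual: survive)
LINE_RE = re.compile(
    r"^\[(?P<framework>[^\]]+)\]\s+(?P<file>\S+):\s+(?P<status>[A-Z_]+)"
)


def tally(lines):
    """Count statuses and collect drifted files per framework."""
    totals = Counter()
    drifts = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a result line
        totals[m["status"]] += 1
        if m["status"] == "DRIFT":
            drifts[m["framework"]].append(m["file"])
    return totals, dict(drifts)


sample = [
    "[langchain] poisoned-comment-001.html: MATCH (predicted: survive, actual: survive)",
    "[langchain] poisoned-css-hide-001.html: DRIFT (predicted: survive, actual: stripped)",
    "[llamaindex] poisoned-metadata-001.docx: MATCH (predicted: survive, actual: survive)",
]
totals, drifts = tally(sample)
print(dict(totals))
print(drifts)
```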

What to do with drifts

Each drift is a bug in hemlock's survival matrix. The drift report guide explains how to trace a drift back to the specific hemlock validator code that needs updating.


Run Individual Layers

# Layer 1 only — extraction survival
docker compose --profile test run --rm harness bash -c "python extraction_test.py"

# Layer 2 only — retrieval ranking
docker compose --profile test run --rm harness bash -c "python retrieval_test.py"

# Layer 3 only — end-to-end injection
docker compose --profile test run --rm harness bash -c "python injection_test.py"

Reset and Repeat

A test run leaves ingested collections behind in ChromaDB. Reset to a clean state before running again:

# Remove ChromaDB data and restart
docker compose down -v
docker compose up -d

The -v flag removes the named volumes declared in the Compose file, including chromadb-data, so the stack restarts with an empty ChromaDB. Then run the tests again:

docker compose --profile test run --rm harness

Example Workflow

A typical session for updating hemlock's survival matrix:

# 1. Start stack
cd docker && docker compose up -d

# 2. Run tests
docker compose --profile test run --rm harness

# 3. Read the drift report
cat reports/*/drift.md

# 4. Fix hemlock validators based on drifts
cd ~/projects/hemlock
# ... edit pkg/validate/*.go ...
go test ./...

# 5. Reset and retest
cd ~/projects/hemlock-lab/docker
docker compose down -v && docker compose up -d
docker compose --profile test run --rm harness

Next Steps