hemlock-lab

RAG Pipeline Validation Lab — validates hemlock's survival matrix against real framework implementations.

hemlock's pkg/validate/ contains Go simulations of how LangChain, LlamaIndex, Unstructured, Haystack, and ColPALI extract text from documents. This lab runs the actual Python frameworks and compares their behavior against hemlock's predictions, producing a drift report that catches when framework updates break the survival matrix.


Why This Exists

hemlock ships with static Go validators that simulate RAG framework extraction. Those simulations are educated guesses based on reading framework source code and documentation. But frameworks change — loaders get rewritten, default behaviors shift, new preprocessing steps appear.

Without ground truth, hemlock's survival predictions quietly go stale.

hemlock-lab solves this by running the real Python frameworks in Docker containers and comparing their extraction, retrieval, and injection behavior against hemlock's predictions. Every discrepancy is a drift — a prediction that needs updating.
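
As a rough sketch, a drift check boils down to comparing two booleans per payload per framework. The names below are illustrative only, not hemlock-lab's actual schema:

```python
# Illustrative only -- field and function names are hypothetical,
# not hemlock-lab's real schema.
from dataclasses import dataclass

@dataclass
class DriftItem:
    framework: str   # e.g. "langchain"
    payload_id: str  # which hemlock payload was tested
    predicted: bool  # hemlock's survival prediction
    observed: bool   # what the real framework actually did

def find_drift(predictions: dict, observations: dict, framework: str) -> list:
    """Every prediction that disagrees with reality is a drift."""
    return [
        DriftItem(framework, pid, predicted, observations[pid])
        for pid, predicted in predictions.items()
        if pid in observations and observations[pid] != predicted
    ]
```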


Architecture

Six Docker containers on a hemlock-net bridge network, with Ollama running on the host:

```mermaid
graph TB
    subgraph docker["Docker Compose Stack"]
        direction TB
        subgraph infra["Infrastructure"]
            chromadb["chromadb :8000\nVector Store"]
        end
        subgraph pipelines["Pipeline Containers"]
            lc["langchain-rag :8100"]
            li["llamaindex-rag :8101"]
            un["unstructured-rag :8102"]
            hs["haystack-rag :8103"]
            cp["colpali-rag :8104"]
        end
    end

    ollama["Ollama :11434\n(host)"]

    lc & li & un & hs & cp --> chromadb
    lc & li & un & hs & cp --> ollama
```

All 5 pipeline services expose an identical API: /extract, /ingest, /query, /health.
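
Because the API is uniform, one client can drive every framework; only the port changes. In the sketch below, the endpoint paths come from this page, but the multipart field name and sample document path are assumptions, not the documented request schema:

```python
# Illustrative client for the uniform pipeline API. Endpoint paths are from
# the docs above; the multipart field name and sample path are assumptions.
import requests

PIPELINES = {
    "langchain": 8100,
    "llamaindex": 8101,
    "unstructured": 8102,
    "haystack": 8103,
    "colpali": 8104,
}

def extract_all(doc_path: str) -> dict:
    """POST one document to every pipeline's /extract endpoint."""
    results = {}
    for name, port in PIPELINES.items():
        with open(doc_path, "rb") as fh:
            resp = requests.post(
                f"http://localhost:{port}/extract",
                files={"file": fh},  # field name is an assumption
                timeout=120,
            )
        results[name] = resp.status_code
    return results

print(extract_all("samples/poisoned.pdf"))  # hypothetical sample document
```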


Test Layers

The test harness validates hemlock's predictions across three layers:

| Layer | What It Tests | Script |
| --- | --- | --- |
| Extraction | Do payloads survive framework loaders? | `extraction_test.py` |
| Retrieval | Do poisoned docs rank in top results? | `retrieval_test.py` |
| Injection | Does the LLM follow injected instructions? | `injection_test.py` |

A drift report compares predictions vs reality and produces actionable items grouped by framework.
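
Continuing the illustrative `DriftItem` sketch above, grouping by framework is a small fold over the drift list; the output shape here is illustrative, not the real report format:

```python
# Continues the hypothetical DriftItem sketch from earlier; the output
# shape is illustrative, not hemlock-lab's real report format.
from collections import defaultdict

def group_by_framework(drift_items) -> dict:
    """One actionable list per framework."""
    report = defaultdict(list)
    for item in drift_items:
        report[item.framework].append(
            f"payload {item.payload_id}: predicted "
            f"{'survives' if item.predicted else 'stripped'}, "
            f"observed {'survives' if item.observed else 'stripped'}"
        )
    return dict(report)
```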


Quick Start

```bash
# Clone
git clone https://github.com/professor-moody/hemlock-lab.git
cd hemlock-lab

# Start all services
cd docker && docker compose up -d

# Verify health (a plain `curl | jq` pipe would let jq's exit status
# mask a failed curl, so check curl's result explicitly)
for port in 8000 8100 8101 8102 8103 8104; do
  if body=$(curl -sf "http://localhost:${port}/health"); then
    echo "${body}" | jq .
  else
    echo "Port ${port}: DOWN"
  fi
done

# Run full test suite
docker compose --profile test run --rm harness
```

First time?

See the Prerequisites and Installation guides for detailed setup instructions.


Key Features

  • 5 real RAG frameworks — LangChain, LlamaIndex, Unstructured, Haystack, and ColPALI running actual Python code
  • 3 test layers — Extraction survival, retrieval ranking, and end-to-end injection testing
  • Drift detection — Automated comparison of hemlock predictions vs real framework behavior
  • Docker Compose workflow — docker compose up -d → test → docker compose down -v → repeat
  • Environment-driven config — .env overrides for models, prompts, and sweep parameters
  • Ollama on host — Containers reach the LLM via host.docker.internal:11434 (see the connectivity sketch below)
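
As a minimal connectivity sketch from a container's point of view: the OLLAMA_URL environment variable below is an assumption, while /api/tags is Ollama's stock model-listing endpoint.

```python
# Minimal Ollama reachability check, as a pipeline container would see it.
# OLLAMA_URL is a hypothetical env var; /api/tags lists installed models.
import os
import requests

OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://host.docker.internal:11434")

resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print(f"Ollama reachable; {len(models)} model(s) available: {models}")
```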

Documentation

| Section | Contents |
| --- | --- |
| Getting Started | Prerequisites, installation, first test run |
| Architecture | Container topology, services, configuration |
| Pipelines | Uniform API, per-framework deep dives |
| Test Harness | 3 test layers, drift reports, orchestrator |
| Operations | Deploy, verify, reset, teardown, troubleshooting |
| API Reference | REST endpoint documentation |
| Integration | hemlock feedback loop, aipostex-lab connectivity |
| Development | Contributing, adding new frameworks |
| Reference | Makefile targets, port map |