hemlock-lab

RAG Pipeline Validation Lab — validates hemlock's survival matrix against real framework implementations.

hemlock's pkg/validate/ contains Go simulations of how LangChain, LlamaIndex, Unstructured, Haystack, and ColPALI extract text from documents. This lab runs the actual Python frameworks and compares their behavior against hemlock's predictions, producing a drift report that catches when framework updates break the survival matrix.


Why This Exists

hemlock ships with static Go validators that simulate RAG framework extraction. Those simulations are educated guesses based on reading framework source code and documentation. But frameworks change — loaders get rewritten, default behaviors shift, new preprocessing steps appear.

Without ground truth, hemlock's survival predictions quietly go stale.

hemlock-lab solves this by running the real Python frameworks in Docker containers and comparing their extraction, retrieval, and injection behavior against hemlock's predictions. Every discrepancy is a drift — a prediction that needs updating.
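
As a rough sketch, a drift check boils down to comparing two booleans per payload per framework. The names below are illustrative only, not hemlock-lab's actual schema:

```python
# Illustrative only -- field and function names are hypothetical,
# not hemlock-lab's real schema.
from dataclasses import dataclass

@dataclass
class DriftItem:
    framework: str   # e.g. "langchain"
    payload_id: str  # which hemlock payload was tested
    predicted: bool  # hemlock's survival prediction
    observed: bool   # what the real framework actually did

def find_drift(predictions: dict, observations: dict, framework: str) -> list:
    """Every prediction that disagrees with reality is a drift."""
    return [
        DriftItem(framework, pid, predicted, observations[pid])
        for pid, predicted in predictions.items()
        if pid in observations and observations[pid] != predicted
    ]
```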


Architecture

Six Docker containers on a hemlock-net bridge network, with Ollama running on the host:

```mermaid
graph TB
    subgraph docker["Docker Compose Stack"]
        direction TB
        subgraph infra["Infrastructure"]
            chromadb["chromadb :8000\nVector Store"]
        end
        subgraph pipelines["Pipeline Containers"]
            lc["langchain-rag :8100"]
            li["llamaindex-rag :8101"]
            un["unstructured-rag :8102"]
            hs["haystack-rag :8103"]
            cp["colpali-rag :8104"]
        end
    end

    ollama["Ollama :11434\n(host)"]

    lc & li & un & hs & cp --> chromadb
    lc & li & un & hs & cp --> ollama
```

All 5 pipeline services expose an identical API: /extract, /ingest, /query, /health.
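
Because the API is uniform, one client can drive every framework; only the port changes. In the sketch below, the endpoint paths come from this page, but the multipart field name and sample document path are assumptions, not the documented request schema:

```python
# Illustrative client for the uniform pipeline API. Endpoint paths are from
# the docs above; the multipart field name and sample path are assumptions.
import requests

PIPELINES = {
    "langchain": 8100,
    "llamaindex": 8101,
    "unstructured": 8102,
    "haystack": 8103,
    "colpali": 8104,
}

def extract_all(doc_path: str) -> dict:
    """POST one document to every pipeline's /extract endpoint."""
    results = {}
    for name, port in PIPELINES.items():
        with open(doc_path, "rb") as fh:
            resp = requests.post(
                f"http://localhost:{port}/extract",
                files={"file": fh},  # field name is an assumption
                timeout=120,
            )
        results[name] = resp.status_code
    return results

print(extract_all("samples/poisoned.pdf"))  # hypothetical sample document
```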


Test Layers

The test harness validates hemlock's predictions across three layers:

| Layer | What It Tests | Script |
| --- | --- | --- |
| Extraction | Do payloads survive framework loaders? | `extraction_test.py` |
| Retrieval | Do poisoned docs rank in top results? | `retrieval_test.py` |
| Injection | Does the LLM follow injected instructions? | `injection_test.py` |

A drift report compares predictions vs reality and produces actionable items grouped by framework.
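
Continuing the illustrative `DriftItem` sketch above, grouping by framework is a small fold over the drift list; the output shape here is illustrative, not the real report format:

```python
# Continues the hypothetical DriftItem sketch from earlier; the output
# shape is illustrative, not hemlock-lab's real report format.
from collections import defaultdict

def group_by_framework(drift_items) -> dict:
    """One actionable list per framework."""
    report = defaultdict(list)
    for item in drift_items:
        report[item.framework].append(
            f"payload {item.payload_id}: predicted "
            f"{'survives' if item.predicted else 'stripped'}, "
            f"observed {'survives' if item.observed else 'stripped'}"
        )
    return dict(report)
```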


Quick Start

```bash
# Clone
git clone https://github.com/professor-moody/hemlock-lab.git
cd hemlock-lab

# Start all services
cd docker && docker compose up -d

# Verify health (a plain `curl | jq` pipe would let jq's exit status
# mask a failed curl, so check curl's result explicitly)
for port in 8000 8100 8101 8102 8103 8104; do
  if body=$(curl -sf "http://localhost:${port}/health"); then
    echo "${body}" | jq .
  else
    echo "Port ${port}: DOWN"
  fi
done

# Run full test suite
docker compose --profile test run --rm harness
```

First time?

See the Prerequisites and Installation guides for detailed setup instructions.


Key Features

  • 5 real RAG frameworks — LangChain, LlamaIndex, Unstructured, Haystack, and ColPALI running actual Python code
  • 3 test layers — Extraction survival, retrieval ranking, and end-to-end injection testing
  • Drift detection — Automated comparison of hemlock predictions vs real framework behavior
  • Docker Compose workflow — docker compose up -d → test → docker compose down -v → repeat
  • Environment-driven config — .env overrides for models, prompts, and sweep parameters
  • Ollama on host — Containers reach the LLM via host.docker.internal:11434 (see the connectivity sketch below)
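
As a minimal connectivity sketch from a container's point of view: the OLLAMA_URL environment variable below is an assumption, while /api/tags is Ollama's stock model-listing endpoint.

```python
# Minimal Ollama reachability check, as a pipeline container would see it.
# OLLAMA_URL is a hypothetical env var; /api/tags lists installed models.
import os
import requests

OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://host.docker.internal:11434")

resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print(f"Ollama reachable; {len(models)} model(s) available: {models}")
```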

Documentation

| Section | Contents |
| --- | --- |
| Getting Started | Prerequisites, installation, first test run |
| Architecture | Container topology, services, configuration |
| Pipelines | Uniform API, per-framework deep dives |
| Test Harness | 3 test layers, drift reports, orchestrator |
| Operations | Deploy, verify, reset, teardown, troubleshooting |
| API Reference | REST endpoint documentation |
| Integration | hemlock feedback loop, aipostex-lab connectivity |
| Development | Contributing, adding new frameworks |
| Reference | Makefile targets, port map |