# hemlock-lab
RAG Pipeline Validation Lab — validates hemlock's survival matrix against real framework implementations.
hemlock's `pkg/validate/` contains Go simulations of how LangChain, LlamaIndex, Unstructured, Haystack, and ColPali extract text from documents. This lab runs the actual Python frameworks and compares their behavior against hemlock's predictions, producing a drift report that catches when framework updates break the survival matrix.
## Why This Exists
hemlock ships with static Go validators that simulate RAG framework extraction. Those simulations are educated guesses based on reading framework source code and documentation. But frameworks change — loaders get rewritten, default behaviors shift, new preprocessing steps appear.
Without ground truth, hemlock's survival predictions quietly go stale.
hemlock-lab solves this by running the real Python frameworks in Docker containers and comparing their extraction, retrieval, and injection behavior against hemlock's predictions. Every discrepancy is a drift — a prediction that needs updating.
## Architecture
Six Docker containers on a hemlock-net bridge network, with Ollama running on the host:
```mermaid
graph TB
    subgraph docker["Docker Compose Stack"]
        direction TB
        subgraph infra["Infrastructure"]
            chromadb["chromadb :8000\nVector Store"]
        end
        subgraph pipelines["Pipeline Containers"]
            lc["langchain-rag :8100"]
            li["llamaindex-rag :8101"]
            un["unstructured-rag :8102"]
            hs["haystack-rag :8103"]
            cp["colpali-rag :8104"]
        end
    end
    ollama["Ollama :11434\n(host)"]
    lc & li & un & hs & cp --> chromadb
    lc & li & un & hs & cp --> ollama
```
All five pipeline services expose the same API: `/extract`, `/ingest`, `/query`, and `/health`.
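Because the surface is uniform, one small client can drive every pipeline. A minimal sketch using only the standard library (the port map comes from the diagram above; the JSON shape of the responses is an assumption, not hemlock-lab's documented schema):

```python
import json
from urllib import request

# Host ports from the Compose topology above.
PIPELINES = {
    "langchain": 8100,
    "llamaindex": 8101,
    "unstructured": 8102,
    "haystack": 8103,
    "colpali": 8104,
}

def endpoint(pipeline: str, path: str) -> str:
    """Build the URL for one of the uniform endpoints (/extract, /ingest, /query, /health)."""
    return f"http://localhost:{PIPELINES[pipeline]}{path}"

def check_health(pipeline: str) -> dict:
    """GET /health and return the decoded JSON body (assumes a JSON response)."""
    with request.urlopen(endpoint(pipeline, "/health"), timeout=5) as resp:
        return json.load(resp)
```

The same `endpoint()` helper serves all four routes, which is the point of the uniform API: the harness can loop over frameworks without per-pipeline branching.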
## Test Layers
The test harness validates hemlock's predictions across three layers:
| Layer | What It Tests | Script |
|---|---|---|
| Extraction | Do payloads survive framework loaders? | `extraction_test.py` |
| Retrieval | Do poisoned docs rank in top results? | `retrieval_test.py` |
| Injection | Does the LLM follow injected instructions? | `injection_test.py` |
A drift report compares predictions vs reality and produces actionable items grouped by framework.
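At its core, a drift check reduces to comparing two maps of boolean outcomes, prediction versus ground truth. A minimal sketch (the function name, field names, and report wording here are illustrative, not hemlock's actual report format):

```python
def drift_report(predicted: dict[str, bool], observed: dict[str, bool]) -> list[str]:
    """Return one actionable item per test case where hemlock's prediction
    disagrees with the real framework's observed behavior."""
    items = []
    for case, expected in sorted(predicted.items()):
        actual = observed.get(case)
        if actual is None:
            # The framework run produced no result for this case.
            items.append(f"{case}: no ground-truth result")
        elif actual != expected:
            items.append(
                f"{case}: predicted {'survives' if expected else 'stripped'}, "
                f"observed {'survives' if actual else 'stripped'}"
            )
    return items
```

An empty report means the survival matrix still matches reality; every item is a prediction that needs updating.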
## Quick Start
```bash
# Clone
git clone https://github.com/professor-moody/hemlock-lab.git
cd hemlock-lab

# Start all services
cd docker && docker compose up -d

# Verify health
for port in 8000 8100 8101 8102 8103 8104; do
  curl -sf "http://localhost:${port}/health" | jq . || echo "Port ${port}: DOWN"
done

# Run full test suite
docker compose --profile test run --rm harness
```
**First time?** See the Prerequisites and Installation guides for detailed setup instructions.
## Key Features
- 5 real RAG frameworks — LangChain, LlamaIndex, Unstructured, Haystack, and ColPali running actual Python code
- 3 test layers — Extraction survival, retrieval ranking, and end-to-end injection testing
- Drift detection — Automated comparison of hemlock predictions vs real framework behavior
- Docker Compose workflow — `docker compose up -d` → test → `docker compose down -v` → repeat
- Environment-driven config — `.env` overrides for models, prompts, and sweep parameters
- Ollama on host — Containers reach the LLM via `host.docker.internal:11434`
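One wrinkle worth noting: on Linux, `host.docker.internal` is not defined by default. A common Compose sketch for making it resolve to the host gateway (the service name and `OLLAMA_BASE_URL` variable here are assumptions for illustration; check the lab's actual `docker-compose.yml`):

```yaml
services:
  langchain-rag:
    extra_hosts:
      # Map host.docker.internal to the host gateway (needed on Linux).
      - "host.docker.internal:host-gateway"
    environment:
      # Hypothetical variable name; the point is that the container-side
      # LLM URL targets the host's Ollama port.
      OLLAMA_BASE_URL: "http://host.docker.internal:11434"
```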
## Documentation
| Section | Contents |
|---|---|
| Getting Started | Prerequisites, installation, first test run |
| Architecture | Container topology, services, configuration |
| Pipelines | Uniform API, per-framework deep dives |
| Test Harness | 3 test layers, drift reports, orchestrator |
| Operations | Deploy, verify, reset, teardown, troubleshooting |
| API Reference | REST endpoint documentation |
| Integration | hemlock feedback loop, aipostex-lab connectivity |
| Development | Contributing, adding new frameworks |
| Reference | Makefile targets, port map |