RAG Pipelines¶
hemlock-lab runs 5 production RAG frameworks side-by-side, each with identical endpoints but different internal implementations. This design reveals how the same poisoned document can produce different outcomes depending on which framework processes it.
Why Five Frameworks?¶
hemlock's survival matrix predicts whether a payload survives extraction by a specific framework. Testing against one framework validates one column of the matrix. Testing against all five validates the full prediction surface.
graph LR
doc["Poisoned<br/>Document"] --> lc["LangChain"]
doc --> li["LlamaIndex"]
doc --> un["Unstructured"]
doc --> hy["Haystack"]
doc --> cp["ColPALI"]
lc --> s1["survive ✓"]
li --> s2["stripped ✗"]
un --> s3["survive ✓"]
hy --> s4["survive ✓"]
cp --> s5["survive ✓"]
Each framework has different parsers, different extraction logic, and different default behaviors. A CSS-hidden payload might survive LangChain (which uses BeautifulSoup and preserves hidden elements) but get stripped by Unstructured (which extracts visible text only).
Uniform API Contract¶
Every pipeline exposes the same 4 endpoints:
| Endpoint | Method | Content-Type | Purpose |
|---|---|---|---|
/health |
GET | — | Service health + framework version |
/extract |
POST | multipart/form-data |
Extract text from uploaded document |
/ingest |
POST | multipart/form-data |
Ingest document into ChromaDB collection |
/query |
POST | application/json |
Run RAG query (retrieve → prompt → LLM) |
This contract means the test harness can iterate over all 5 frameworks with identical requests:
PIPELINES = [
("langchain", 8100),
("llamaindex", 8101),
("unstructured", 8102),
("haystack", 8103),
("colpali", 8104),
]
for name, port in PIPELINES:
response = requests.post(
f"http://localhost:{port}/extract",
files={"file": open(doc_path, "rb")}
)
Shared Infrastructure¶
All pipelines share:
| Component | Role |
|---|---|
| ChromaDB (:8000) | Vector storage for ingested documents |
| Ollama (:11434) | Embedding generation + LLM inference |
| nomic-embed-text | Embedding model |
| smollm2:135m | Generation (LLM) model |
Each pipeline connects to these shared services — ChromaDB via the hemlock-net Docker network, Ollama via host.docker.internal — but wraps them in framework-specific abstractions.
Framework Comparison¶
| Framework | Approach | HTML Parser | PDF Parser | DOCX Parser |
|---|---|---|---|---|
| LangChain | Extension → Loader map | BSHTMLLoader |
PyPDFLoader |
Docx2txtLoader |
| LlamaIndex | SimpleDirectoryReader |
Auto-detect | Auto-detect | Auto-detect |
| Unstructured | partition() |
Element-based | Element-based | Element-based |
| Haystack | Extension → Converter map | HTMLToDocument |
PyPDFToDocument |
DocxToDocument |
| Framework | Retriever | Chain Type |
|---|---|---|
| LangChain | Chroma.as_retriever() |
RetrievalQA |
| LlamaIndex | VectorStoreIndex |
query_engine |
| Unstructured | Manual embed + search | Direct prompt |
| Haystack | ChromaDocumentStore |
Pipeline components |
| Framework | Wrapper |
|---|---|
| LangChain | OllamaEmbeddings(model="nomic-embed-text") |
| LlamaIndex | Ollama via Settings.embed_model |
| Unstructured | Direct /api/embed call |
| Haystack | OllamaDocumentEmbedder |
Pipeline Pages¶
Each framework has a dedicated page with implementation details, curl examples, and known behaviors:
| Framework | Version | Port | Page |
|---|---|---|---|
| LangChain | 0.3.35 | 8100 | LangChain |
| LlamaIndex | 0.12.33 | 8101 | LlamaIndex |
| Unstructured | 0.17.2 | 8102 | Unstructured |
| Haystack | 2.12.1 | 8103 | Haystack |
| ColPALI | — | 8104 | ColPALI |
Next Steps¶
- Pick a framework to dive into: LangChain, LlamaIndex, Unstructured, Haystack, ColPALI
- API Reference — Full endpoint documentation
- Test Harness — How pipelines are tested