RAG Pipelines

hemlock-lab runs five production RAG frameworks side by side, each exposing identical endpoints over a different internal implementation. This design reveals how the same poisoned document can produce different outcomes depending on which framework processes it.


Why Five Frameworks?

hemlock's survival matrix predicts whether a payload survives extraction by a specific framework. Testing against one framework validates one column of the matrix. Testing against all five validates the full prediction surface.
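Conceptually, the matrix is a lookup from (payload technique, framework) to a predicted outcome, validated by comparing each prediction against what a pipeline actually extracted. A minimal sketch — the entries and names below are illustrative, not hemlock's real predictions:

```python
# Hypothetical survival matrix: (payload_technique, framework) -> predicted survival.
# Entries are illustrative examples, not hemlock's actual prediction data.
SURVIVAL_MATRIX = {
    ("css_hidden", "langchain"): True,
    ("css_hidden", "unstructured"): False,
    ("zero_width", "langchain"): True,
    ("zero_width", "llamaindex"): True,
}

def validate_prediction(technique: str, framework: str,
                        extracted_text: str, payload: str) -> bool:
    """True if the matrix prediction matches the observed extraction result."""
    predicted = SURVIVAL_MATRIX[(technique, framework)]
    observed = payload in extracted_text
    return predicted == observed
```

Running one framework exercises one column of this table; running all five exercises every prediction at once.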

```mermaid
graph LR
    doc["Poisoned<br/>Document"] --> lc["LangChain"]
    doc --> li["LlamaIndex"]
    doc --> un["Unstructured"]
    doc --> hy["Haystack"]
    doc --> cp["ColPALI"]
    lc --> s1["survive ✓"]
    li --> s2["stripped ✗"]
    un --> s3["survive ✓"]
    hy --> s4["survive ✓"]
    cp --> s5["survive ✓"]
```

Each framework has different parsers, different extraction logic, and different default behaviors. A CSS-hidden payload might survive LangChain (which uses BeautifulSoup and preserves hidden elements) but get stripped by Unstructured (which extracts visible text only).
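The divergence can be reproduced with nothing but the standard library: a DOM-oblivious extractor keeps text inside `display:none` elements, while a visibility-aware one drops the whole hidden subtree. This is a simplified stand-in for the BeautifulSoup-vs-Unstructured behavior described above, not either framework's actual code:

```python
from html.parser import HTMLParser

HTML = '<p>Visible text.</p><div style="display:none">INJECTED PAYLOAD</div>'

class TextExtractor(HTMLParser):
    """Collects text; optionally skips subtrees hidden with display:none."""
    def __init__(self, skip_hidden: bool):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # >0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style", "")
        if self.hidden_depth or (self.skip_hidden and
                                 "display:none" in style.replace(" ", "")):
            self.hidden_depth += 1  # track nesting so the matching endtag unwinds it

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chunks.append(data)

def extract(html: str, skip_hidden: bool) -> str:
    parser = TextExtractor(skip_hidden)
    parser.feed(html)
    return " ".join(parser.chunks)
```

With `skip_hidden=False` the payload survives extraction; with `skip_hidden=True` it is stripped — the same document, two different outcomes.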


Uniform API Contract

Every pipeline exposes the same four endpoints:

| Endpoint | Method | Content-Type | Purpose |
|---|---|---|---|
| `/health` | GET | | Service health + framework version |
| `/extract` | POST | `multipart/form-data` | Extract text from uploaded document |
| `/ingest` | POST | `multipart/form-data` | Ingest document into ChromaDB collection |
| `/query` | POST | `application/json` | Run RAG query (retrieve → prompt → LLM) |

This contract lets the test harness iterate over all five frameworks with identical requests:

```python
import requests

PIPELINES = [
    ("langchain", 8100),
    ("llamaindex", 8101),
    ("unstructured", 8102),
    ("haystack", 8103),
    ("colpali", 8104),
]

for name, port in PIPELINES:
    # Reopen the file per request so each upload starts at offset 0
    # and no file handle is leaked.
    with open(doc_path, "rb") as f:
        response = requests.post(
            f"http://localhost:{port}/extract",
            files={"file": f},
        )
```
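The `/query` endpoint takes JSON rather than a file upload. A hedged sketch of building such a request — the field names (`question`, `collection`) and the helper are assumptions for illustration; check each pipeline's page for the real schema:

```python
import json

def build_query_request(port: int, question: str, collection: str):
    """Assemble a hypothetical /query request (URL + JSON body).

    Field names are assumed for illustration; the actual schema may differ
    per pipeline.
    """
    url = f"http://localhost:{port}/query"
    body = json.dumps({"question": question, "collection": collection}).encode()
    return url, body

# Sent as, e.g.:
#   requests.post(url, data=body, headers={"Content-Type": "application/json"})
url, body = build_query_request(8100, "What does the report conclude?", "docs")
```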

Shared Infrastructure

All pipelines share:

| Component | Role |
|---|---|
| ChromaDB (`:8000`) | Vector storage for ingested documents |
| Ollama (`:11434`) | Embedding generation + LLM inference |
| `nomic-embed-text` | Embedding model |
| `smollm2:135m` | Generation (LLM) model |

Each pipeline connects to these shared services — ChromaDB via the hemlock-net Docker network, Ollama via host.docker.internal — but wraps them in framework-specific abstractions.


Framework Comparison

| Framework | Approach | HTML Parser | PDF Parser | DOCX Parser |
|---|---|---|---|---|
| LangChain | Extension → Loader map | `BSHTMLLoader` | `PyPDFLoader` | `Docx2txtLoader` |
| LlamaIndex | `SimpleDirectoryReader` | Auto-detect | Auto-detect | Auto-detect |
| Unstructured | `partition()` | Element-based | Element-based | Element-based |
| Haystack | Extension → Converter map | `HTMLToDocument` | `PyPDFToDocument` | `DocxToDocument` |

| Framework | Retriever | Chain Type |
|---|---|---|
| LangChain | `Chroma.as_retriever()` | `RetrievalQA` |
| LlamaIndex | `VectorStoreIndex` | `query_engine` |
| Unstructured | Manual embed + search | Direct prompt |
| Haystack | `ChromaDocumentStore` | Pipeline components |

| Framework | Embedding Wrapper |
|---|---|
| LangChain | `OllamaEmbeddings(model="nomic-embed-text")` |
| LlamaIndex | Ollama via `Settings.embed_model` |
| Unstructured | Direct `/api/embed` call |
| Haystack | `OllamaDocumentEmbedder` |
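Despite the different wrappers, every pipeline ends up issuing the same kind of request to Ollama. A minimal sketch of the direct call the Unstructured pipeline makes — the request shape follows Ollama's `/api/embed` REST endpoint, but the helper and base URL here are illustrative assumptions:

```python
import json

# Assumed base URL; inside the containers this is reached via host.docker.internal.
OLLAMA_URL = "http://host.docker.internal:11434"

def build_embed_request(texts, model="nomic-embed-text"):
    """Build the URL and JSON body for Ollama's /api/embed endpoint."""
    url = f"{OLLAMA_URL}/api/embed"
    body = json.dumps({"model": model, "input": texts}).encode()
    return url, body

# Sent as: requests.post(url, data=body). The response JSON carries an
# "embeddings" list with one vector per input string.
```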

Pipeline Pages

Each framework has a dedicated page with implementation details, curl examples, and known behaviors:

| Framework | Version | Port | Page |
|---|---|---|---|
| LangChain | 0.3.35 | 8100 | LangChain |
| LlamaIndex | 0.12.33 | 8101 | LlamaIndex |
| Unstructured | 0.17.2 | 8102 | Unstructured |
| Haystack | 2.12.1 | 8103 | Haystack |
| ColPALI | | 8104 | ColPALI |

Next Steps