RAG Pipelines

hemlock-lab runs five production RAG frameworks side by side, each exposing identical endpoints over a different internal implementation. This design reveals how the same poisoned document can produce different outcomes depending on which framework processes it.


Why Five Frameworks?

hemlock's survival matrix predicts whether a payload survives extraction by a specific framework. Testing against one framework validates one column of the matrix. Testing against all five validates the full prediction surface.
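Conceptually, the matrix is a lookup from (payload technique, framework) to a predicted outcome, validated by comparing each prediction against what a pipeline actually extracted. A minimal sketch — the entries and names below are illustrative, not hemlock's real predictions:

```python
# Hypothetical survival matrix: (payload_technique, framework) -> predicted survival.
# Entries are illustrative examples, not hemlock's actual prediction data.
SURVIVAL_MATRIX = {
    ("css_hidden", "langchain"): True,
    ("css_hidden", "unstructured"): False,
    ("zero_width", "langchain"): True,
    ("zero_width", "llamaindex"): True,
}

def validate_prediction(technique: str, framework: str,
                        extracted_text: str, payload: str) -> bool:
    """True if the matrix prediction matches the observed extraction result."""
    predicted = SURVIVAL_MATRIX[(technique, framework)]
    observed = payload in extracted_text
    return predicted == observed
```

Running one framework exercises one column of this table; running all five exercises every prediction at once.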

```mermaid
graph LR
    doc["Poisoned<br/>Document"] --> lc["LangChain"]
    doc --> li["LlamaIndex"]
    doc --> un["Unstructured"]
    doc --> hy["Haystack"]
    doc --> cp["ColPALI"]
    lc --> s1["survive ✓"]
    li --> s2["stripped ✗"]
    un --> s3["survive ✓"]
    hy --> s4["survive ✓"]
    cp --> s5["survive ✓"]
```

Each framework has different parsers, different extraction logic, and different default behaviors. A CSS-hidden payload might survive LangChain (which uses BeautifulSoup and preserves hidden elements) but get stripped by Unstructured (which extracts visible text only).
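The divergence can be reproduced with nothing but the standard library: a DOM-oblivious extractor keeps text inside `display:none` elements, while a visibility-aware one drops the whole hidden subtree. This is a simplified stand-in for the BeautifulSoup-vs-Unstructured behavior described above, not either framework's actual code:

```python
from html.parser import HTMLParser

HTML = '<p>Visible text.</p><div style="display:none">INJECTED PAYLOAD</div>'

class TextExtractor(HTMLParser):
    """Collects text; optionally skips subtrees hidden with display:none."""
    def __init__(self, skip_hidden: bool):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # >0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style", "")
        if self.hidden_depth or (self.skip_hidden and
                                 "display:none" in style.replace(" ", "")):
            self.hidden_depth += 1  # track nesting so the matching endtag unwinds it

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chunks.append(data)

def extract(html: str, skip_hidden: bool) -> str:
    parser = TextExtractor(skip_hidden)
    parser.feed(html)
    return " ".join(parser.chunks)
```

With `skip_hidden=False` the payload survives extraction; with `skip_hidden=True` it is stripped — the same document, two different outcomes.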


Uniform API Contract

Every pipeline exposes the same four endpoints:

| Endpoint | Method | Content-Type | Purpose |
|---|---|---|---|
| `/health` | GET | | Service health + framework version |
| `/extract` | POST | `multipart/form-data` | Extract text from uploaded document |
| `/ingest` | POST | `multipart/form-data` | Ingest document into ChromaDB collection |
| `/query` | POST | `application/json` | Run RAG query (retrieve → prompt → LLM) |

This contract lets the test harness iterate over all five frameworks with identical requests:

```python
import requests

PIPELINES = [
    ("langchain", 8100),
    ("llamaindex", 8101),
    ("unstructured", 8102),
    ("haystack", 8103),
    ("colpali", 8104),
]

for name, port in PIPELINES:
    # Reopen the file per request so each upload starts at offset 0
    # and no file handle is leaked.
    with open(doc_path, "rb") as f:
        response = requests.post(
            f"http://localhost:{port}/extract",
            files={"file": f},
        )
```
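The `/query` endpoint takes JSON rather than a file upload. A hedged sketch of building such a request — the field names (`question`, `collection`) and the helper are assumptions for illustration; check each pipeline's page for the real schema:

```python
import json

def build_query_request(port: int, question: str, collection: str):
    """Assemble a hypothetical /query request (URL + JSON body).

    Field names are assumed for illustration; the actual schema may differ
    per pipeline.
    """
    url = f"http://localhost:{port}/query"
    body = json.dumps({"question": question, "collection": collection}).encode()
    return url, body

# Sent as, e.g.:
#   requests.post(url, data=body, headers={"Content-Type": "application/json"})
url, body = build_query_request(8100, "What does the report conclude?", "docs")
```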

Shared Infrastructure

All pipelines share:

| Component | Role |
|---|---|
| ChromaDB (`:8000`) | Vector storage for ingested documents |
| Ollama (`:11434`) | Embedding generation + LLM inference |
| `nomic-embed-text` | Embedding model |
| `smollm2:135m` | Generation (LLM) model |

Each pipeline connects to these shared services — ChromaDB via the hemlock-net Docker network, Ollama via host.docker.internal — but wraps them in framework-specific abstractions.


Framework Comparison

| Framework | Approach | HTML Parser | PDF Parser | DOCX Parser |
|---|---|---|---|---|
| LangChain | Extension → Loader map | `BSHTMLLoader` | `PyPDFLoader` | `Docx2txtLoader` |
| LlamaIndex | `SimpleDirectoryReader` | Auto-detect | Auto-detect | Auto-detect |
| Unstructured | `partition()` | Element-based | Element-based | Element-based |
| Haystack | Extension → Converter map | `HTMLToDocument` | `PyPDFToDocument` | `DocxToDocument` |

| Framework | Retriever | Chain Type |
|---|---|---|
| LangChain | `Chroma.as_retriever()` | `RetrievalQA` |
| LlamaIndex | `VectorStoreIndex` | `query_engine` |
| Unstructured | Manual embed + search | Direct prompt |
| Haystack | `ChromaDocumentStore` | Pipeline components |

| Framework | Embedding Wrapper |
|---|---|
| LangChain | `OllamaEmbeddings(model="nomic-embed-text")` |
| LlamaIndex | Ollama via `Settings.embed_model` |
| Unstructured | Direct `/api/embed` call |
| Haystack | `OllamaDocumentEmbedder` |
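Despite the different wrappers, every pipeline ends up issuing the same kind of request to Ollama. A minimal sketch of the direct call the Unstructured pipeline makes — the request shape follows Ollama's `/api/embed` REST endpoint, but the helper and base URL here are illustrative assumptions:

```python
import json

# Assumed base URL; inside the containers this is reached via host.docker.internal.
OLLAMA_URL = "http://host.docker.internal:11434"

def build_embed_request(texts, model="nomic-embed-text"):
    """Build the URL and JSON body for Ollama's /api/embed endpoint."""
    url = f"{OLLAMA_URL}/api/embed"
    body = json.dumps({"model": model, "input": texts}).encode()
    return url, body

# Sent as: requests.post(url, data=body). The response JSON carries an
# "embeddings" list with one vector per input string.
```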

Pipeline Pages

Each framework has a dedicated page with implementation details, curl examples, and known behaviors:

| Framework | Version | Port | Page |
|---|---|---|---|
| LangChain | 0.3.35 | 8100 | LangChain |
| LlamaIndex | 0.12.33 | 8101 | LlamaIndex |
| Unstructured | 0.17.2 | 8102 | Unstructured |
| Haystack | 2.12.1 | 8103 | Haystack |
| ColPALI | | 8104 | ColPALI |

Next Steps