Haystack Pipeline

Property	Value
Framework	Haystack
Version	2.12.1
Port	8103
Container	`haystack-rag`
Image	`docker/haystack-rag`

Extraction

Haystack uses an extension-to-converter mapping, similar to LangChain's approach:

CONVERTERS = {
    ".html": HTMLToDocument,
    ".pdf":  PDFMinerToDocument,
    ".docx": DOCXToDocument,
    ".md":   MarkdownToDocument,
    ".txt":  TextFileToDocument,
}

How It Works

1. The uploaded file is saved to a temporary path.
2. The file extension is matched to a converter class.
3. The converter produces Document objects.
4. The text is read from each document's doc.content.
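
The four steps above can be sketched as follows. This is a minimal illustration, not the service's actual code: `Document` and `PlainTextConverter` are stand-ins defined here so the example runs without Haystack installed, and `extract_text` is a hypothetical helper name. The stand-in mirrors the `run(sources=[...])` call that returns a `{"documents": [...]}` dictionary, which is the shape Haystack 2.x converters use.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Document:
    """Stand-in for haystack.Document (only the field used here)."""
    content: str


class PlainTextConverter:
    """Stand-in converter with the run(sources=[...]) -> {"documents": [...]} shape."""

    def run(self, sources):
        docs = [Document(Path(s).read_text(encoding="utf-8")) for s in sources]
        return {"documents": docs}


# Hypothetical mapping for this sketch; the real service maps to Haystack classes.
CONVERTERS = {".txt": PlainTextConverter, ".md": PlainTextConverter}


def extract_text(filename: str, data: bytes) -> str:
    suffix = Path(filename).suffix.lower()
    converter_cls = CONVERTERS.get(suffix)
    if converter_cls is None:
        raise ValueError(f"unsupported extension: {suffix!r}")
    # Step 1: save the upload to a temp path, keeping the extension
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        tmp_path = Path(tmp.name)
    try:
        # Steps 2 and 3: instantiate the matched converter and build Documents
        docs = converter_cls().run(sources=[tmp_path])["documents"]
        # Step 4: join the text held in each document's content field
        return "\n".join(doc.content for doc in docs)
    finally:
        tmp_path.unlink()
```

Calling `extract_text("note.txt", b"hello")` returns the file's text, while an unmapped extension raises `ValueError` before anything is written to disk.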

Implications for hemlock

Haystack 2.x uses a component-based architecture where converters are explicit pipeline components:

HTMLToDocument: Parses HTML and extracts text. In Haystack 2.x the extraction is delegated to the trafilatura library, so what survives depends on that library's content heuristics
PDFMinerToDocument: Extracts the PDF text layer using pdfminer
DOCXToDocument: Reads document body paragraphs
MarkdownToDocument: Converts Markdown to plain text
TextFileToDocument: Reads the file directly as text

RAG Chain

Haystack 2.x uses a Pipeline with connected components:

graph LR
    Q["Query"] --> E["OllamaDocumentEmbedder"]
    E --> R["ChromaDocumentStore<br/>Retriever"]
    R --> D["Top-k<br/>Documents"]
    D --> G["Ollama<br/>Generator"]
    G --> A["Answer"]

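from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.ollama import OllamaGenerator
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

# Illustrative Jinja2 template (an assumption, not the service's actual prompt)
template = """Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ query }}"""

document_store = ChromaDocumentStore(
    collection_name=collection,  # collection name taken from the request
    host="localhost",
    port=8000,
)
retriever = ChromaQueryTextRetriever(document_store=document_store)
generator = OllamaGenerator(model="smollm2:135m")

pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("generator", generator)
# The retriever emits documents and the generator expects a prompt string,
# so a PromptBuilder sits between them.
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

# query_text is the user's question taken from the request
result = pipeline.run({
    "retriever": {"query": query_text},
    "prompt_builder": {"query": query_text},
})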

Curl Examples

Health Check

curl http://localhost:8103/health

{"framework": "haystack", "status": "ok", "version": "2.12.1"}

Extract Text

curl -X POST http://localhost:8103/extract \
  -F "file=@poisoned-document.docx"

{"text": "...", "metadata": {"source": "poisoned-document.docx"}}

Ingest Document

curl -X POST http://localhost:8103/ingest \
  -F "file=@document.docx" \
  -F "collection=test-collection"

{"status": "ingested", "collection": "test-collection", "chunks": 4}

Query

curl -X POST http://localhost:8103/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the refund policy?", "collection": "test-collection"}'

{"answer": "...", "sources": ["refund-policy.html"]}
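
For scripted tests, the same request can be built from Python with only the standard library. The sketch below constructs the POST /query request from the curl example without sending it; `BASE_URL` and `build_query_request` are names chosen for this illustration, and the commented-out lines assume a container is listening on port 8103.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8103"  # port listed in the table at the top of this page


def build_query_request(query: str, collection: str) -> urllib.request.Request:
    """Build (but do not send) the POST /query request shown in the curl example."""
    body = json.dumps({"query": query, "collection": collection}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_query_request("What is the refund policy?", "test-collection")
# To send it against a running container:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     answer = json.loads(resp.read())["answer"]
```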

Hybrid Query (BM25 + Dense)

curl -X POST http://localhost:8103/query-hybrid \
  -F "q=What is the refund policy?" \
  -F "collection=test-collection" \
  -F "k=3"

{"answer": "...", "sources": [...], "retrieval_mode": "hybrid-bm25-dense"}

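The /query-hybrid endpoint combines dense vector retrieval (ChromaDB) with a BM25 keyword scorer and merges the two rankings with reciprocal rank fusion (RRF, smoothing constant 60). Each retriever fetches 2×k candidates, where k is the request's k parameter. In the standard RRF formulation, a candidate earns 1/(60 + rank) from each list it appears in, the contributions are summed, and the top-k fused results feed the LLM prompt. This provides resilience against adversarial documents that are optimized for embedding similarity but lack keyword overlap with the target query, because a document ranked by only one retriever collects only one such contribution.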
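
The fusion step can be sketched in a few lines. This is an illustration of standard reciprocal rank fusion under the parameters stated above (constant 60, 2×k candidates per retriever), not the endpoint's actual implementation; `rrf_fuse` and the document ids are hypothetical.

```python
from collections import defaultdict

RRF_K = 60  # RRF smoothing constant stated above


def rrf_fuse(dense_ranked, bm25_ranked, top_k, rrf_k=RRF_K):
    """Fuse two ranked lists of document ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in (dense_ranked, bm25_ranked):
        # Each retriever contributes at most 2 * top_k candidates
        for rank, doc_id in enumerate(ranking[: 2 * top_k], start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by id for determinism
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_k]


# "adv" stands for a document tuned for embedding similarity only:
# it tops the dense ranking but never appears in the BM25 ranking.
dense = ["adv", "a", "b", "c"]
bm25 = ["a", "b", "c", "d"]
fused = rrf_fuse(dense, bm25, top_k=2)
```

Here `fused` is `["a", "b"]`: the document that only the dense retriever ranked first is outscored by documents that appear in both lists.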

Haystack-only

/query-hybrid is currently implemented only on the Haystack pipeline. Other pipelines use dense-only retrieval. When --hybrid-retrieval is passed to run_all.sh, the harness automatically sends /query-hybrid only to Haystack and keeps dense-only /query for the other frameworks.

Known Behaviors

Technique	Format	Survives?	Notes
CSS hide	HTML	Varies	Depends on HTMLToDocument parser
HTML comment	HTML	✗	Comments stripped by converter
ARIA hidden	HTML	Varies	Depends on parsing depth
Invisible text	HTML	✓	Zero-width chars pass through
Metadata	PDF	✗	PDFMiner ignores metadata
White font	DOCX	✗	DOCXToDocument reads text only
Custom XML	DOCX	✗	Converter reads body paragraphs
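
Because zero-width characters pass through extraction, a check on the extracted text can flag them. The sketch below is one possible approach, not part of hemlock: `find_invisible_chars` is a hypothetical helper that reports every character in Unicode category Cf (format characters), which covers zero-width spaces and joiners but also soft hyphens and bidirectional marks.

```python
import unicodedata


def find_invisible_chars(text: str):
    """Return (index, character name) for each Unicode format character (category Cf)."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


sample = "refund\u200bpolicy\u200d"
hits = find_invisible_chars(sample)
# hits == [(6, "ZERO WIDTH SPACE"), (13, "ZERO WIDTH JOINER")]
```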

Haystack 2.x vs 1.x

hemlock-lab uses Haystack 2.x, which has a significantly different architecture from 1.x:

Aspect	Haystack 1.x	Haystack 2.x
Pipeline	Sequential	Component graph
Document Store	Class-based	Integration-based
Converters	`FileTypeClassifier`	Explicit converter components
API	`Finder` → `Reader` → `Generator`	Connected pipeline components

Version matters

If hemlock's validators were built against Haystack 1.x behaviors, drift is expected when testing against 2.x. The drift report will surface these differences.

Next Steps

LangChain — Compare with LangChain's loader approach
Pipeline Overview — Side-by-side framework comparison
API Reference — Full endpoint documentation