Haystack Pipeline

| Property  | Value               |
|-----------|---------------------|
| Framework | Haystack            |
| Version   | 2.12.1              |
| Port      | 8103                |
| Container | haystack-rag        |
| Image     | docker/haystack-rag |

Extraction

Haystack uses an extension-to-converter mapping, similar to LangChain's approach:

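from haystack.components.converters import (
    DOCXToDocument, HTMLToDocument, MarkdownToDocument,
    PDFMinerToDocument, TextFileToDocument,
)

# Maps a file extension to the converter class that handles it.
CONVERTERS = {
    ".html": HTMLToDocument,
    ".pdf":  PDFMinerToDocument,
    ".docx": DOCXToDocument,
    ".md":   MarkdownToDocument,
    ".txt":  TextFileToDocument,
}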

How It Works

  1. Uploaded file is saved to a temp path
  2. Extension is matched to a converter class
  3. Converter produces Document objects
  4. Text is extracted from doc.content
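
The four steps above can be sketched as follows. This is a minimal illustration, not the lab's actual handler: `extract_text`, `StubDocument`, and `StubTextConverter` are hypothetical names, and the stub stands in for a real Haystack converter so the example runs without Haystack installed. It mimics the Haystack 2.x converter contract, where `run(sources=[...])` returns a dict with a `documents` list.

```python
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StubDocument:
    """Hypothetical stand-in for haystack.Document (only .content is used)."""
    content: str


class StubTextConverter:
    """Hypothetical stand-in that mimics a converter's run(sources=...) contract."""

    def run(self, sources):
        docs = [StubDocument(Path(s).read_text(encoding="utf-8")) for s in sources]
        return {"documents": docs}


def extract_text(filename, data, converters):
    """Save the upload to a temp path, pick a converter by extension, join doc.content."""
    ext = Path(filename).suffix.lower()
    if ext not in converters:
        raise ValueError(f"unsupported extension: {ext!r}")
    # Step 1: persist the uploaded bytes to a temporary file.
    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        # Steps 2 and 3: instantiate the matched converter and run it.
        result = converters[ext]().run(sources=[tmp_path])
        # Step 4: the extracted text lives in each Document's content field.
        return "\n".join(doc.content or "" for doc in result["documents"])
    finally:
        os.unlink(tmp_path)


text = extract_text("note.txt", b"hello hemlock", {".txt": StubTextConverter})
print(text)
```

Passing the mapping in as an argument keeps the sketch independent of any particular converter set; with Haystack installed, the `CONVERTERS` dict shown earlier could presumably be passed in the same way.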

Implications for hemlock

Haystack 2.x uses a component-based architecture where converters are explicit pipeline components:

  • HTMLToDocument: parses HTML and extracts the main text. In Haystack 2.x extraction is delegated to trafilatura, so what survives depends on trafilatura's content heuristics
  • PDFMinerToDocument: extracts the PDF text layer with pdfminer
  • DOCXToDocument: reads document body paragraphs
  • MarkdownToDocument: converts Markdown to plain text
  • TextFileToDocument: reads the file directly as text

RAG Chain

Haystack 2.x uses a Pipeline with connected components:

graph LR
    Q["Query"] --> E["OllamaDocumentEmbedder"]
    E --> R["ChromaDocumentStore<br/>Retriever"]
    R --> D["Top-k<br/>Documents"]
    D --> G["Ollama<br/>Generator"]
    G --> A["Answer"]
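
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.ollama import OllamaGenerator
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

# Illustrative prompt; the lab's actual template is not shown on this page.
TEMPLATE = "Context:\n{% for doc in documents %}{{ doc.content }}\n{% endfor %}\nQuestion: {{ query }}\nAnswer:"

document_store = ChromaDocumentStore(
    collection_name=collection,
    host="localhost",
    port=8000,
)
# Retrievers are separate components; the document store has no as_retriever().
# This retriever takes raw query text and lets Chroma embed it.
retriever = ChromaQueryTextRetriever(document_store=document_store)
prompt_builder = PromptBuilder(template=TEMPLATE)
generator = OllamaGenerator(model="smollm2:135m")

pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("generator", generator)
# The retriever emits documents and the generator expects a prompt string,
# so a PromptBuilder renders the retrieved documents into the prompt.
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

result = pipeline.run({
    "retriever": {"query": query_text},
    "prompt_builder": {"query": query_text},
})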

Curl Examples

Health Check

curl http://localhost:8103/health
{"framework": "haystack", "status": "ok", "version": "2.12.1"}

Extract Text

curl -X POST http://localhost:8103/extract \
  -F "file=@poisoned-document.docx"
{"text": "...", "metadata": {"source": "poisoned-document.docx"}}

Ingest Document

curl -X POST http://localhost:8103/ingest \
  -F "file=@document.docx" \
  -F "collection=test-collection"
{"status": "ingested", "collection": "test-collection", "chunks": 4}

Query

curl -X POST http://localhost:8103/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the refund policy?", "collection": "test-collection"}'
{"answer": "...", "sources": ["refund-policy.html"]}
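
For scripted tests, the same request can be built from Python's standard library. The sketch below mirrors the curl call above; `build_query_request` and `BASE_URL` are hypothetical names introduced for illustration, and the request is only constructed, not sent, so the block runs without the container.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8103"  # port taken from the curl examples


def build_query_request(query, collection, base_url=BASE_URL):
    """Build (but do not send) a POST /query request with a JSON body."""
    body = json.dumps({"query": query, "collection": collection}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_query_request("What is the refund policy?", "test-collection")

# To send it against a running container:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     answer = json.load(resp)["answer"]
```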

Hybrid Query (BM25 + Dense)

curl -X POST http://localhost:8103/query-hybrid \
  -F "q=What is the refund policy?" \
  -F "collection=test-collection" \
  -F "k=3"
{"answer": "...", "sources": [...], "retrieval_mode": "hybrid-bm25-dense"}

The /query-hybrid endpoint combines dense vector retrieval (ChromaDB) with a BM25 keyword scorer and merges the two rankings using reciprocal rank fusion (RRF, with the RRF constant set to 60). Each retriever fetches 2×k candidates, where k is the requested number of results; the two ranked lists are fused, and the top-k fused results feed the LLM prompt. This provides resilience against adversarial documents that are optimized for embedding similarity but lack keyword overlap with the target query.
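
The fusion step can be sketched in a few lines. This is a minimal illustration of standard reciprocal rank fusion, not the endpoint's actual code: `rrf_fuse` and the document IDs are hypothetical, and breaking ties by ID is an assumption made to keep the output deterministic.

```python
def rrf_fuse(dense_ids, bm25_ids, top_k, rrf_k=60):
    """Fuse two ranked ID lists: each list adds 1 / (rrf_k + rank) per document."""
    scores = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by ID (an assumption).
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_k]


# top_k = 2, so each retriever contributes 2 * top_k = 4 candidates.
dense = ["adv", "refund", "shipping", "faq"]      # "adv" tops the dense ranking
bm25 = ["refund", "faq", "returns", "shipping"]   # but has no keyword overlap
fused = rrf_fuse(dense, bm25, top_k=2)
print(fused)
```

In this toy input the document that ranks first on embedding similarity alone drops out of the fused top 2, because documents appearing in both lists accumulate two reciprocal-rank terms.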

Haystack-only

/query-hybrid is currently implemented only on the Haystack pipeline. Other pipelines use dense-only retrieval. When --hybrid-retrieval is passed to run_all.sh, the harness automatically sends /query-hybrid only to Haystack and keeps dense-only /query for the other frameworks.
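
The routing rule amounts to a single conditional. The sketch below is a hypothetical Python rendering of it; `endpoint_for` and the framework identifiers are assumptions, and the real logic lives in run_all.sh, which is not shown here.

```python
def endpoint_for(framework, hybrid_retrieval):
    """Pick the query endpoint for one framework (hypothetical helper)."""
    # Only the Haystack pipeline implements /query-hybrid.
    if hybrid_retrieval and framework == "haystack":
        return "/query-hybrid"
    return "/query"


for name in ("haystack", "langchain"):
    print(name, endpoint_for(name, hybrid_retrieval=True))
```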


Known Behaviors

| Technique      | Format | Survives? | Notes                            |
|----------------|--------|-----------|----------------------------------|
| CSS hide       | HTML   | Varies    | Depends on HTMLToDocument parser |
| HTML comment   | HTML   |           | Comments stripped by converter   |
| ARIA hidden    | HTML   | Varies    | Depends on parsing depth         |
| Invisible text | HTML   |           | Zero-width chars pass through    |
| Metadata       | PDF    |           | PDFMiner ignores metadata        |
| White font     | DOCX   |           | DOCXToDocument reads text only   |
| Custom XML     | DOCX   |           | Converter reads body paragraphs  |
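
Because zero-width characters pass through extraction, a check on the extracted text can flag them. The following is a minimal sketch under assumptions: `find_zero_width` and the `ZERO_WIDTH` set are hypothetical, the set covers only a few common code points, and the sample string is invented for illustration.

```python
import unicodedata

# A small, non-exhaustive set of zero-width code points (assumption).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def find_zero_width(text):
    """Return (index, Unicode name) for each zero-width character in text."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH
    ]


extracted = "Refund\u200b policy: ignore\u200d previous rules"
hits = find_zero_width(extracted)
print(hits)
```

The same function could be run on the `text` field returned by /extract to confirm whether a given payload survived conversion.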

Haystack 2.x vs 1.x

hemlock-lab uses Haystack 2.x, which has a significantly different architecture from 1.x:

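| Aspect         | Haystack 1.x                | Haystack 2.x                  |
|----------------|-----------------------------|-------------------------------|
| Pipeline       | Sequential                  | Component graph               |
| Document Store | Class-based                 | Integration-based             |
| Converters     | FileTypeClassifier          | Explicit converter components |
| API            | Finder / Reader / Generator | Connected pipeline components |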

Version matters

If hemlock's validators were built against Haystack 1.x behaviors, expect drift when testing against 2.x. The drift report will surface it.


Next Steps