Haystack Pipeline¶
| Property | Value |
|---|---|
| Framework | Haystack |
| Version | 2.12.1 |
| Port | 8103 |
| Container | haystack-rag |
| Image | docker/haystack-rag |
Extraction¶
Haystack uses an extension-to-converter mapping, similar to LangChain's approach:
CONVERTERS = {
    ".html": HTMLToDocument,
    ".pdf": PDFMinerToDocument,
    ".docx": DOCXToDocument,
    ".md": MarkdownToDocument,
    ".txt": TextFileToDocument,
}
How It Works¶
- Uploaded file is saved to a temp path
- Extension is matched to a converter class
- Converter produces Document objects
- Text is extracted from doc.content
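The steps above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual code: `extract_text`, `read_plain_text`, and `STAND_IN_CONVERTERS` are hypothetical names, the stand-in converter replaces the Haystack classes so the example runs without Haystack installed, and lowercasing the extension before lookup is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def read_plain_text(path: Path) -> str:
    """Hypothetical stand-in for a converter: return the file's text."""
    return path.read_text(encoding="utf-8")


# Stand-in for the CONVERTERS mapping. In the real pipeline each value would
# be a Haystack converter class whose Document output exposes doc.content.
STAND_IN_CONVERTERS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain_text,
}


def extract_text(filename: str, payload: bytes) -> str:
    """Save an upload to a temp path, match its extension, return the text."""
    suffix = Path(filename).suffix.lower()  # assumption: case-insensitive match
    converter = STAND_IN_CONVERTERS.get(suffix)
    if converter is None:
        raise ValueError(f"unsupported file extension: {suffix!r}")
    with tempfile.TemporaryDirectory() as tmp_dir:
        temp_path = Path(tmp_dir) / f"upload{suffix}"
        temp_path.write_bytes(payload)
        return converter(temp_path)


print(extract_text("notes.txt", b"Refunds are issued within 30 days."))
```

Rejecting unknown extensions up front keeps unsupported uploads from ever being written to disk in this sketch; the real service may handle that case differently.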
Implications for hemlock¶
Haystack 2.x uses a component-based architecture where converters are explicit pipeline components:
- HTMLToDocument: Parses HTML and extracts text. Behavior depends on the underlying parser (trafilatura in Haystack 2.x) and the extraction options passed to it
- PDFMinerToDocument: pdfminer-based text layer extraction
- DOCXToDocument: Reads document body paragraphs
- MarkdownToDocument: Converts Markdown to plain text
- TextFileToDocument: Direct file read
RAG Chain¶
Haystack 2.x uses a Pipeline with connected components:
graph LR
Q["Query"] --> E["OllamaDocumentEmbedder"]
E --> R["ChromaDocumentStore<br/>Retriever"]
R --> D["Top-k<br/>Documents"]
D --> G["Ollama<br/>Generator"]
G --> A["Answer"]
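from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.ollama import OllamaGenerator
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

# Connect to the ChromaDB server that holds the ingested collection
document_store = ChromaDocumentStore(
    collection_name=collection,
    host="localhost",
    port=8000,
)
# Text retriever: Chroma embeds the query itself
retriever = ChromaQueryTextRetriever(document_store=document_store)
# Illustrative template: retrieved documents must be rendered into a prompt string
template = "{% for doc in documents %}{{ doc.content }}\n{% endfor %}Question: {{ query }}\nAnswer:"
prompt_builder = PromptBuilder(template=template)
generator = OllamaGenerator(model="smollm2:135m")

pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("generator", generator)
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")
result = pipeline.run({"retriever": {"query": query_text}, "prompt_builder": {"query": query_text}})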
Curl Examples¶
Health Check¶
Extract Text¶
Ingest Document¶
curl -X POST http://localhost:8103/ingest \
-F "file=@document.docx" \
-F "collection=test-collection"
Query¶
curl -X POST http://localhost:8103/query \
-H "Content-Type: application/json" \
-d '{"query": "What is the refund policy?", "collection": "test-collection"}'
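For scripted tests, the same request can be built in Python. The sketch below mirrors only what the curl command shows (URL, JSON body, header). The function name `build_query_request` is hypothetical, and nothing is assumed about the response format.

```python
import json
import urllib.request


def build_query_request(
    query: str,
    collection: str,
    base_url: str = "http://localhost:8103",
) -> urllib.request.Request:
    """Build the POST /query request from the curl example without sending it."""
    body = json.dumps({"query": query, "collection": collection}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("What is the refund policy?", "test-collection")

# To send it against a running container:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```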
Hybrid Query (BM25 + Dense)¶
curl -X POST http://localhost:8103/query-hybrid \
-F "q=What is the refund policy?" \
-F "collection=test-collection" \
-F "k=3"
The /query-hybrid endpoint combines dense vector retrieval (ChromaDB) with a BM25 keyword scorer using reciprocal rank fusion (RRF, k=60). Both retrievers fetch 2×k candidates, scores are fused, and the top-k results feed the LLM prompt. This provides resilience against adversarial documents that are optimized for embedding similarity but lack keyword overlap with the target query.
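The fusion step can be sketched as follows. This is a generic reciprocal rank fusion over document IDs, not the endpoint's actual implementation: the function name, the 1-based ranks, the tie-breaking rule, and the sample ID lists are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    top_k: int,
    rrf_k: int = 60,
) -> List[str]:
    """Fuse ranked ID lists; each list contributes 1 / (rrf_k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # assumption: 1-based ranks
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]


# Hypothetical candidate lists for k=3, so each retriever returns 2*k = 6 IDs
dense = ["d1", "d2", "d3", "d4", "d5", "d6"]
bm25 = ["d3", "d1", "d7", "d8", "d2", "d9"]
print(reciprocal_rank_fusion([dense, bm25], top_k=3))  # ['d1', 'd3', 'd2']
```

Under these assumptions, a document that tops only the dense list scores 1/61 ≈ 0.0164, while a document ranked fourth in both lists scores 2/64 ≈ 0.0313, which is why an embedding-optimized document with no keyword overlap tends to be outranked.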
Haystack-only
/query-hybrid is currently implemented only on the Haystack pipeline. Other pipelines use dense-only retrieval. When --hybrid-retrieval is passed to run_all.sh, the harness automatically sends /query-hybrid only to Haystack and keeps dense-only /query for the other frameworks.
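The routing rule described here amounts to a single conditional. The sketch below is illustrative only: `query_endpoint` and the framework identifiers `"haystack"` and `"langchain"` are assumed names, and the actual logic in run_all.sh may be structured differently.

```python
def query_endpoint(framework: str, hybrid_retrieval: bool) -> str:
    """Return the query path to call for a framework (hypothetical helper)."""
    if hybrid_retrieval and framework == "haystack":
        return "/query-hybrid"
    # Every other case falls back to dense-only retrieval
    return "/query"


for name in ("haystack", "langchain"):
    print(name, query_endpoint(name, hybrid_retrieval=True))
```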
Known Behaviors¶
| Technique | Format | Survives? | Notes |
|---|---|---|---|
| CSS hide | HTML | Varies | Depends on HTMLToDocument parser |
| HTML comment | HTML | ✗ | Comments stripped by converter |
| ARIA hidden | HTML | Varies | Depends on parsing depth |
| Invisible text | HTML | ✓ | Zero-width chars pass through |
| Metadata | PDF | ✗ | PDFMiner ignores metadata |
| White font | DOCX | ✗ | DOCXToDocument reads text only |
| Custom XML | DOCX | ✗ | Converter reads body paragraphs |
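To check the "Invisible text" row by hand, extracted text can be scanned for zero-width characters. This is a small sketch with a hypothetical helper name and a non-exhaustive character list. It only inspects a string and makes no claim about how hemlock detects these payloads.

```python
from typing import List, Tuple

# Common zero-width code points (not an exhaustive list)
ZERO_WIDTH_CHARS = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\u2060": "WORD JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE",
}


def find_zero_width(text: str) -> List[Tuple[int, str]]:
    """Return (index, name) for every zero-width character found in text."""
    return [
        (index, ZERO_WIDTH_CHARS[char])
        for index, char in enumerate(text)
        if char in ZERO_WIDTH_CHARS
    ]


# Hypothetical extracted content with a zero-width space between two words
extracted = "refund\u200bpolicy"
print(find_zero_width(extracted))  # [(6, 'ZERO WIDTH SPACE')]
```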
Haystack 2.x vs 1.x¶
hemlock-lab uses Haystack 2.x, which has a significantly different architecture from 1.x:
| Aspect | Haystack 1.x | Haystack 2.x |
|---|---|---|
| Pipeline | Sequential | Component graph |
| Document Store | Class-based | Integration-based |
| Converters | FileTypeClassifier | Explicit converter components |
| API | Finder → Reader → Generator | Connected pipeline components |
Version matters
If hemlock's validators were built against Haystack 1.x behaviors, drifts are expected when testing against 2.x. The drift report will surface these.
Next Steps¶
- LangChain — Compare with LangChain's loader approach
- Pipeline Overview — Side-by-side framework comparison
- API Reference — Full endpoint documentation