Adding Pipelines¶
This guide walks through adding a new RAG framework to hemlock-lab. Every pipeline follows the same Docker container pattern, making it straightforward to extend.
Overview¶
Adding a new pipeline requires five changes:
- Dockerfile + app.py — Container with FastAPI application
- docker-compose.yml — Service definition
- Test harness — Add the framework to the pipeline list
- hemlock validator — Add prediction logic
- Documentation — Add framework pages and update the reference tables
Step 1: Create the Container¶
Create docker/<framework>-rag/ with two files:
Dockerfile¶
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE <PORT>
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "<PORT>"]
app.py¶
import os
from fastapi import FastAPI, File, UploadFile, Form

app = FastAPI()

CHROMADB_HOST = os.getenv("CHROMADB_HOST", "chromadb")
CHROMADB_PORT = int(os.getenv("CHROMADB_PORT", "8000"))
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "smollm2:135m")

@app.get("/health")
async def health():
    return {
        "framework": "<framework_name>",
        "status": "ok",
        "version": "<version>",
    }

@app.post("/extract")
async def extract(file: UploadFile = File(...)):
    content = await file.read()
    # Use the framework's native extraction
    text = framework_extract(content, file.filename)
    return {
        "text": text,
        "metadata": {"source": file.filename, "format": get_ext(file.filename)},
    }

@app.post("/ingest")
async def ingest(file: UploadFile = File(...), collection: str = Form(...)):
    content = await file.read()
    text = framework_extract(content, file.filename)
    # Chunk, embed, and store in ChromaDB
    chunks = chunk_text(text)
    store_in_chromadb(chunks, collection)
    return {"status": "ingested", "collection": collection, "chunks": len(chunks)}

@app.post("/query")
async def query(request: dict):
    query_text = request["query"]
    collection = request["collection"]
    # Retrieve from ChromaDB, build the prompt, call the LLM
    answer = rag_chain(query_text, collection)
    return {"answer": answer, "sources": [...]}
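The helpers above (framework_extract, chunk_text, store_in_chromadb, rag_chain) are placeholders you implement per framework. As one illustration, a minimal chunk_text could split on fixed-size character windows with overlap — the sizes below are hypothetical defaults, not hemlock-lab requirements, and in practice you should prefer the framework's own splitter so chunking matches the framework under test:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Hypothetical sketch; real pipelines should use the framework's
    native splitter where one exists.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```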
Use the framework's native extraction
The whole point of hemlock-lab is testing how each framework's parser handles poisoned content. Don't write custom extraction logic — use the framework's built-in document loaders/parsers.
Step 2: Add to docker-compose.yml¶
Add a service definition in docker/docker-compose.yml:
  <framework>-rag:
    build: ./<framework>-rag
    container_name: <framework>-rag
    ports:
      - "<PORT>:<PORT>"
    environment:
      - CHROMADB_HOST=chromadb
      - CHROMADB_PORT=8000
      - OLLAMA_URL=http://host.docker.internal:11434
      - OLLAMA_MODEL=${OLLAMA_MODEL:-smollm2:135m}
      - OLLAMA_EMBED_MODEL=${OLLAMA_EMBED_MODEL:-nomic-embed-text}
    depends_on:
      chromadb:
        condition: service_healthy
    networks:
      - hemlock-net
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:<PORT>/health"]
      interval: 10s
      timeout: 5s
      retries: 3
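With this healthcheck, Docker only marks the service unhealthy after several consecutive probe failures. The same retry-with-interval semantics, sketched in Python (the function and its probe argument are illustrative, not part of hemlock-lab; a real probe would issue GET /health and check for HTTP 200):

```python
import time
from typing import Callable

def wait_until_healthy(probe: Callable[[], bool], retries: int = 3,
                       interval: float = 10.0) -> bool:
    """Return True as soon as probe() succeeds, False after `retries`
    consecutive failures. Mirrors Docker's interval/retries behavior."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            time.sleep(interval)
    return False
```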
Step 3: Update Test Harness¶
Add to the pipeline list in harness/extraction_test.py:
PIPELINES = [
    ("langchain", 8100),
    ("llamaindex", 8101),
    ("unstructured", 8102),
    ("haystack", 8103),
    ("colpali", 8104),
    ("<framework>", <PORT>),  # New
]
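The harness iterates this list and calls each service by port. A pure helper for building those endpoint URLs might look like this (the function names and localhost base are assumptions about the harness, not its actual code):

```python
def endpoint(port: int, route: str, host: str = "localhost") -> str:
    """Build the URL the harness would call for a pipeline service."""
    return f"http://{host}:{port}/{route.lstrip('/')}"

def health_urls(pipelines: list[tuple[str, int]]) -> dict[str, str]:
    """Map each framework name to its /health endpoint URL."""
    return {name: endpoint(port, "health") for name, port in pipelines}
```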
Step 4: Add hemlock Validator¶
In hemlock itself, add a validator for the new framework:
// pkg/validate/<framework>.go
package validate

type FrameworkValidator struct{}

// Predict returns the expected extraction outcome for a technique/format pair.
// Initial predictions are based on framework documentation;
// hemlock-lab will validate and correct these.
func (v *FrameworkValidator) Predict(technique, format string) string {
	return "unknown"
}
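The stub returns "unknown" for every input; as predictions firm up, Predict typically becomes a (technique, format) lookup with an "unknown" fallback. The shape of that logic, sketched in Python (the table entry is a placeholder, not a real hemlock prediction):

```python
# Hypothetical prediction table: (technique, format) -> expected outcome.
PREDICTIONS = {
    ("<technique>", "<format>"): "extracted",  # placeholder entry
}

def predict(technique: str, format: str) -> str:
    """Mirror of FrameworkValidator.Predict: return a documented
    prediction if one exists, else "unknown" until hemlock-lab
    validates the real behavior."""
    return PREDICTIONS.get((technique, format), "unknown")
```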
Step 5: Documentation¶
Add documentation pages:
- Create docs/pipelines/<framework>.md — Framework details
- Update docs/pipelines/index.md — Add to comparison tables
- Update docs/architecture/services.md — Add container details
- Update docs/reference/ports.md — Add port mapping
Testing the New Pipeline¶
# Build and start the new container
docker compose up -d --build <framework>-rag
# Verify health
curl http://localhost:<PORT>/health
# Run extraction tests
docker compose --profile test run --rm harness
# Check for any framework-specific quirks
cat reports/*/drift.md | grep "<framework>"
Next Steps¶
- Contributing — General contribution guidelines
- Pipelines — Existing pipeline documentation