Adding Pipelines

This guide walks through adding a new RAG framework to hemlock-lab. Every pipeline follows the same Docker container pattern, making it straightforward to extend.


Overview

Adding a new pipeline requires four changes:

  1. Dockerfile + app.py — Container with a FastAPI application
  2. docker-compose.yml — Service definition
  3. Test harness — Add the framework to the pipeline list
  4. hemlock validator — Add prediction logic

Step 1: Create the Container

Create docker/<framework>-rag/ with two files:

Dockerfile

FROM python:3.11-slim

# curl is required by the compose healthcheck (python:3.11-slim does not ship it)
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .

EXPOSE <PORT>
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "<PORT>"]

app.py

import os
from fastapi import FastAPI, File, UploadFile, Form

# framework_extract, chunk_text, store_in_chromadb, rag_chain, and get_ext
# are placeholders; implement them with the framework's native APIs.
app = FastAPI()

CHROMADB_HOST = os.getenv("CHROMADB_HOST", "chromadb")
CHROMADB_PORT = int(os.getenv("CHROMADB_PORT", "8000"))
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "smollm2:135m")

@app.get("/health")
async def health():
    return {
        "framework": "<framework_name>",
        "status": "ok",
        "version": "<version>"
    }

@app.post("/extract")
async def extract(file: UploadFile = File(...)):
    content = await file.read()
    # Use framework's native extraction
    text = framework_extract(content, file.filename)
    return {
        "text": text,
        "metadata": {"source": file.filename, "format": get_ext(file.filename)}
    }

@app.post("/ingest")
async def ingest(file: UploadFile = File(...), collection: str = Form(...)):
    content = await file.read()
    text = framework_extract(content, file.filename)
    # Chunk, embed, store in ChromaDB
    chunks = chunk_text(text)
    store_in_chromadb(chunks, collection)
    return {"status": "ingested", "collection": collection, "chunks": len(chunks)}

@app.post("/query")
async def query(request: dict):
    query_text = request["query"]
    collection = request["collection"]
    # Retrieve from ChromaDB, build prompt, call LLM
    answer = rag_chain(query_text, collection)
    return {"answer": answer, "sources": [...]}
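The helpers referenced in the template (framework_extract, chunk_text, store_in_chromadb, rag_chain) are deliberately left unimplemented, since each framework supplies its own. As one illustration, chunk_text could be a fixed-size splitter with overlap. This is a minimal sketch only; prefer the framework's native text splitter where one exists:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Placeholder sketch only: real pipelines should use the framework's
    own chunking so its behavior is what gets tested.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults, consecutive chunks share their last/first 50 characters, which helps retrieval when an answer spans a chunk boundary.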

Use the framework's native extraction

The whole point of hemlock-lab is testing how each framework's parser handles poisoned content. Don't write custom extraction logic — use the framework's built-in document loaders/parsers.
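In practice, framework_extract is usually just a thin dispatch from file extension to the framework's own loader. A hypothetical sketch, where the loaders mapping and get_ext are illustrative names rather than any framework's API:

```python
import os


def get_ext(filename: str) -> str:
    """Return the lowercase file extension without the leading dot."""
    return os.path.splitext(filename)[1].lstrip(".").lower()


def framework_extract(content: bytes, filename: str) -> str:
    """Dispatch raw bytes to a parser selected by file extension.

    The loaders dict is a placeholder: each entry should call the
    framework's native document loader, not hand-rolled parsing code.
    """
    loaders = {
        "txt": lambda data: data.decode("utf-8", errors="replace"),
        "md": lambda data: data.decode("utf-8", errors="replace"),
        # "pdf": <framework's PDF loader>, "docx": <framework's DOCX loader>, ...
    }
    loader = loaders.get(get_ext(filename))
    if loader is None:
        raise ValueError(f"unsupported format: {filename}")
    return loader(content)
```

Keeping the dispatch table explicit makes it obvious which formats the pipeline claims to support, which matters when comparing extraction drift across frameworks.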


Step 2: Add to docker-compose.yml

Add a service definition in docker/docker-compose.yml:

  <framework>-rag:
    build: ./<framework>-rag
    container_name: <framework>-rag
    ports:
      - "<PORT>:<PORT>"
    environment:
      - CHROMADB_HOST=chromadb
      - CHROMADB_PORT=8000
      - OLLAMA_URL=http://host.docker.internal:11434
      - OLLAMA_MODEL=${OLLAMA_MODEL:-smollm2:135m}
      - OLLAMA_EMBED_MODEL=${OLLAMA_EMBED_MODEL:-nomic-embed-text}
    depends_on:
      chromadb:
        condition: service_healthy
    networks:
      - hemlock-net
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:<PORT>/health"]
      interval: 10s
      timeout: 5s
      retries: 3
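One caveat: `host.docker.internal` resolves automatically under Docker Desktop (macOS/Windows), but on a plain Linux Docker Engine it usually does not. A sketch of the extra mapping you would add to the same service definition on Linux:

```yaml
    # Linux only: make host.docker.internal resolve to the host
    extra_hosts:
      - "host.docker.internal:host-gateway"
```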

Step 3: Update Test Harness

Add to the pipeline list in harness/extraction_test.py:

PIPELINES = [
    ("langchain", 8100),
    ("llamaindex", 8101),
    ("unstructured", 8102),
    ("haystack", 8103),
    ("colpali", 8104),
    ("<framework>", <PORT>),   # New
]
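When adding an entry, make sure the new name and port don't collide with an existing pipeline. A small guard like this (a hypothetical helper, not part of the harness) catches mistakes before containers start:

```python
def validate_pipelines(pipelines: list[tuple[str, int]]) -> None:
    """Raise ValueError if any pipeline name or port appears twice."""
    names = [name for name, _ in pipelines]
    ports = [port for _, port in pipelines]
    if len(set(names)) != len(names):
        raise ValueError("duplicate pipeline name")
    if len(set(ports)) != len(ports):
        raise ValueError("duplicate pipeline port")
```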

Step 4: Add hemlock Validator

In hemlock itself, add a validator for the new framework:

// pkg/validate/<framework>.go
package validate

type FrameworkValidator struct{}

func (v *FrameworkValidator) Predict(technique, format string) string {
    // Initial predictions based on framework documentation
    // hemlock-lab will validate and correct these
    return "unknown"
}

Step 5: Documentation

Add documentation pages:

  1. docs/pipelines/<framework>.md — Framework details
  2. Update docs/pipelines/index.md — Add to comparison tables
  3. Update docs/architecture/services.md — Add container details
  4. Update docs/reference/ports.md — Add port mapping

Testing the New Pipeline

# Build and start the new container
docker compose up -d --build <framework>-rag

# Verify health
curl http://localhost:<PORT>/health

# Run extraction tests
docker compose --profile test run --rm harness

# Check for any framework-specific quirks
cat reports/*/drift.md | grep "<framework>"

Next Steps