ColPALI Pipeline¶

Property	Value
Framework	ColPALI
Version	0.1.0
Port	8104
Container	`colpali-rag`
Image	`docker/colpali-rag`

Overview¶

ColPALI is a multimodal RAG pipeline that handles both image and text documents. Unlike the text-only pipelines, ColPALI extracts metadata from image formats (PNG tEXt/iTXt chunks, EXIF fields) alongside standard text extraction. This makes it relevant for testing image-based payload hiding techniques.

Simplified Implementation

This is a simplified implementation for hemlock testing. Production ColPALI uses late-interaction vision-language models (e.g., ColQwen2) for document understanding. The hemlock-lab version focuses on image metadata extraction and text-based embedding via Ollama.

Extraction¶

ColPALI uses a modality-based extraction strategy:

File Type	Extraction Method
PNG	Parse tEXt and iTXt chunks for metadata keywords
JPG/JPEG/GIF/WEBP	Image filename as metadata
Text-based (HTML, TXT, MD, etc.)	Direct UTF-8 decode
Binary	Filename only

Image Metadata Extraction¶

For PNG files, ColPALI parses raw chunk data to extract:

tEXt chunks — keyword/value pairs (e.g., Description, Comment, Author)
iTXt chunks — internationalized text with compression support
Standard metadata keys: Description, Comment, Author, Title, Subject, XMP

This is significant for hemlock testing because payloads hidden in image metadata will be extracted and ingested by ColPALI, potentially surviving into RAG query responses.

RAG Chain¶

graph LR
    Q["Query"] --> E["Embed via<br/>nomic-embed-text"]
    E --> R["ChromaDB<br/>Retrieval"]
    R --> C["Build Context"]
    C --> G["Ollama<br/>smollm2:135m"]
    G --> A["Answer"]

# Embed query text
query_embedding = embed_text(query, model="nomic-embed-text")

# Retrieve from ChromaDB
results = collection.query(query_embeddings=[query_embedding], n_results=k)

# Build prompt with context
context = "\n\n".join(results["documents"][0])
prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

# Generate via Ollama
response = ollama_generate(prompt, model="smollm2:135m")

API Endpoints¶

Health Check¶

curl http://localhost:8104/health

{"framework": "colpali", "version": "0.1.0", "status": "ok"}

Extract¶

curl -X POST http://localhost:8104/extract \
  -F "file=@document.html"

Ingest¶

curl -X POST http://localhost:8104/ingest \
  -F "file=@image.png" \
  -F "collection=test-collection"

Query¶

curl -X POST http://localhost:8104/query \
  -F "q=What is the refund policy?" \
  -F "collection=test-collection"

Query format difference

ColPALI uses form-encoded query parameters (q, collection, k) rather than JSON body. This differs from the other 4 pipelines which use application/json for /query.

Environment Variables¶

Variable	Default	Purpose
`CHROMADB_HOST`	`chromadb`	ChromaDB hostname (Docker network)
`CHROMADB_PORT`	`8000`	ChromaDB port
`OLLAMA_HOST`	`http://host.docker.internal:11434`	Ollama URL
`OLLAMA_MODEL`	`smollm2:135m`	Generation model
`OLLAMA_EMBED_MODEL`	`nomic-embed-text`	Embedding model
`SYSTEM_PROMPT_FILE`	—	Optional system prompt file path

Known Behaviors¶

Behavior	Notes
PNG metadata extraction	Extracts tEXt/iTXt chunks — payloads in image metadata will survive
Non-image text files	Direct UTF-8 decode — no framework-specific parsing
Single-chunk ingestion	Each document ingested as one chunk (no splitting)
Multimodal metadata tag	Documents tagged `type: multimodal` in ChromaDB

Next Steps¶

Pipeline Comparison — How ColPALI compares to other frameworks
API Reference — Full endpoint documentation