ColPALI Pipeline¶
| Property | Value |
|---|---|
| Framework | ColPALI |
| Version | 0.1.0 |
| Port | 8104 |
| Container | colpali-rag |
| Image | docker/colpali-rag |
Overview¶
ColPALI is a multimodal RAG pipeline that handles both image and text documents. Unlike the text-only pipelines, ColPALI extracts metadata from image formats (PNG tEXt/iTXt chunks, EXIF fields) alongside standard text extraction. This makes it relevant for testing image-based payload hiding techniques.
Simplified Implementation
This is a simplified implementation for hemlock testing. Production ColPALI uses late-interaction vision-language models (e.g., ColQwen2) for document understanding. The hemlock-lab version focuses on image metadata extraction and text-based embedding via Ollama.
Extraction¶
ColPALI uses a modality-based extraction strategy:
| File Type | Extraction Method |
|---|---|
| PNG | Parse tEXt and iTXt chunks for metadata keywords |
| JPG/JPEG/GIF/WEBP | Image filename as metadata |
| Text-based (HTML, TXT, MD, etc.) | Direct UTF-8 decode |
| Binary | Filename only |
Image Metadata Extraction¶
For PNG files, ColPALI parses raw chunk data to extract:
tEXtchunks — keyword/value pairs (e.g.,Description,Comment,Author)iTXtchunks — internationalized text with compression support- Standard metadata keys: Description, Comment, Author, Title, Subject, XMP
This is significant for hemlock testing because payloads hidden in image metadata will be extracted and ingested by ColPALI, potentially surviving into RAG query responses.
RAG Chain¶
graph LR
Q["Query"] --> E["Embed via<br/>nomic-embed-text"]
E --> R["ChromaDB<br/>Retrieval"]
R --> C["Build Context"]
C --> G["Ollama<br/>smollm2:135m"]
G --> A["Answer"]
# Embed query text
query_embedding = embed_text(query, model="nomic-embed-text")
# Retrieve from ChromaDB
results = collection.query(query_embeddings=[query_embedding], n_results=k)
# Build prompt with context
context = "\n\n".join(results["documents"][0])
prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
# Generate via Ollama
response = ollama_generate(prompt, model="smollm2:135m")
API Endpoints¶
Health Check¶
Extract¶
Ingest¶
Query¶
curl -X POST http://localhost:8104/query \
-F "q=What is the refund policy?" \
-F "collection=test-collection"
Query format difference
ColPALI uses form-encoded query parameters (q, collection, k) rather than JSON body. This differs from the other 4 pipelines which use application/json for /query.
Environment Variables¶
| Variable | Default | Purpose |
|---|---|---|
CHROMADB_HOST |
chromadb |
ChromaDB hostname (Docker network) |
CHROMADB_PORT |
8000 |
ChromaDB port |
OLLAMA_HOST |
http://host.docker.internal:11434 |
Ollama URL |
OLLAMA_MODEL |
smollm2:135m |
Generation model |
OLLAMA_EMBED_MODEL |
nomic-embed-text |
Embedding model |
SYSTEM_PROMPT_FILE |
— | Optional system prompt file path |
Known Behaviors¶
| Behavior | Notes |
|---|---|
| PNG metadata extraction | Extracts tEXt/iTXt chunks — payloads in image metadata will survive |
| Non-image text files | Direct UTF-8 decode — no framework-specific parsing |
| Single-chunk ingestion | Each document ingested as one chunk (no splitting) |
| Multimodal metadata tag | Documents tagged type: multimodal in ChromaDB |
Next Steps¶
- Pipeline Comparison — How ColPALI compares to other frameworks
- API Reference — Full endpoint documentation