Skip to content

ColPALI Pipeline

Property Value
Framework ColPALI
Version 0.1.0
Port 8104
Container colpali-rag
Image docker/colpali-rag

Overview

ColPALI is a multimodal RAG pipeline that handles both image and text documents. Unlike the text-only pipelines, ColPALI extracts metadata from image formats (PNG tEXt/iTXt chunks, EXIF fields) alongside standard text extraction. This makes it relevant for testing image-based payload hiding techniques.

Simplified Implementation

This is a simplified implementation for hemlock testing. Production ColPALI uses late-interaction vision-language models (e.g., ColQwen2) for document understanding. The hemlock-lab version focuses on image metadata extraction and text-based embedding via Ollama.


Extraction

ColPALI uses a modality-based extraction strategy:

File Type Extraction Method
PNG Parse tEXt and iTXt chunks for metadata keywords
JPG/JPEG/GIF/WEBP Image filename as metadata
Text-based (HTML, TXT, MD, etc.) Direct UTF-8 decode
Binary Filename only

Image Metadata Extraction

For PNG files, ColPALI parses raw chunk data to extract:

  • tEXt chunks — keyword/value pairs (e.g., Description, Comment, Author)
  • iTXt chunks — internationalized text with compression support
  • Standard metadata keys: Description, Comment, Author, Title, Subject, XMP

This is significant for hemlock testing because payloads hidden in image metadata will be extracted and ingested by ColPALI, potentially surviving into RAG query responses.


RAG Chain

graph LR
    Q["Query"] --> E["Embed via<br/>nomic-embed-text"]
    E --> R["ChromaDB<br/>Retrieval"]
    R --> C["Build Context"]
    C --> G["Ollama<br/>smollm2:135m"]
    G --> A["Answer"]
# Embed query text
query_embedding = embed_text(query, model="nomic-embed-text")

# Retrieve from ChromaDB
results = collection.query(query_embeddings=[query_embedding], n_results=k)

# Build prompt with context
context = "\n\n".join(results["documents"][0])
prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

# Generate via Ollama
response = ollama_generate(prompt, model="smollm2:135m")

API Endpoints

Health Check

curl http://localhost:8104/health
{"framework": "colpali", "version": "0.1.0", "status": "ok"}

Extract

curl -X POST http://localhost:8104/extract \
  -F "file=@document.html"

Ingest

curl -X POST http://localhost:8104/ingest \
  -F "file=@image.png" \
  -F "collection=test-collection"

Query

curl -X POST http://localhost:8104/query \
  -F "q=What is the refund policy?" \
  -F "collection=test-collection"

Query format difference

ColPALI uses form-encoded query parameters (q, collection, k) rather than JSON body. This differs from the other 4 pipelines which use application/json for /query.


Environment Variables

Variable Default Purpose
CHROMADB_HOST chromadb ChromaDB hostname (Docker network)
CHROMADB_PORT 8000 ChromaDB port
OLLAMA_HOST http://host.docker.internal:11434 Ollama URL
OLLAMA_MODEL smollm2:135m Generation model
OLLAMA_EMBED_MODEL nomic-embed-text Embedding model
SYSTEM_PROMPT_FILE Optional system prompt file path

Known Behaviors

Behavior Notes
PNG metadata extraction Extracts tEXt/iTXt chunks — payloads in image metadata will survive
Non-image text files Direct UTF-8 decode — no framework-specific parsing
Single-chunk ingestion Each document ingested as one chunk (no splitting)
Multimodal metadata tag Documents tagged type: multimodal in ChromaDB

Next Steps