Ingest Endpoint¶

Extracts text from an uploaded document, generates embeddings, and stores it in a ChromaDB collection.

Request¶

POST /ingest
Content-Type: multipart/form-data

Field	Type	Required	Description
`file`	file	✓	The document to ingest
`collection`	string	✓	Target ChromaDB collection name

Response¶

Success (200)¶

{
  "status": "ingested",
  "collection": "test-collection",
  "chunks": 5,
  "document_id": "poisoned-001.html"
}

Field	Type	Description
`status`	string	Always `"ingested"` on success
`collection`	string	The collection name used
`chunks`	integer	Number of chunks created
`document_id`	string	Original filename used as document ID

Error (400)¶

{
  "error": "Missing required field: collection"
}

Error (500)¶

{
  "error": "ChromaDB connection failed",
  "detail": "Connection refused on port 8000"
}

How Ingestion Works¶

graph LR
    F["Upload file"] --> E["Extract text"]
    E --> C["Chunk text"]
    C --> V["Generate embeddings<br/>(Ollama)"]
    V --> S["Store in ChromaDB"]

Extract — Same extraction logic as /extract endpoint
Chunk — Split text into overlapping chunks (size varies by framework)
Embed — Each chunk is embedded via Ollama (nomic-embed-text)
Store — Chunks + embeddings stored in the specified ChromaDB collection

Collection Auto-Creation¶

If the specified collection doesn't exist, it's created automatically. If it exists, documents are added to the existing collection.

Examples¶

Ingest a Single Document¶

curl -X POST http://localhost:8100/ingest \
  -F "file=@document.html" \
  -F "collection=test-collection"

Ingest Multiple Documents¶

for doc in /tmp/hemlock-batch/*.html; do
  curl -s -X POST http://localhost:8100/ingest \
    -F "file=@${doc}" \
    -F "collection=test-collection"
done

Verify Ingestion¶

# Check collection document count via ChromaDB API
curl http://localhost:8000/api/v1/collections/test-collection/count

Chunking Strategies¶

Each framework chunks documents differently:

Framework	Default Chunk Size	Overlap
LangChain	1000 chars	200 chars
LlamaIndex	1024 tokens	20 tokens
Unstructured	Element-based	N/A
Haystack	500 words	50 words

Chunking affects retrieval

Smaller chunks are more precise but may lose context. Larger chunks preserve context but may dilute payload embeddings. This is why the same document can have different retrieval rankings across frameworks.

Next Steps¶

Query — Retrieve and query ingested documents
Retrieval Tests — How ingestion is tested