# Ingest Endpoint

Extracts text from an uploaded document, generates embeddings, and stores it in a ChromaDB collection.
## Request

| Field | Type | Required | Description |
|---|---|---|---|
| `file` | file | ✓ | The document to ingest |
| `collection` | string | ✓ | Target ChromaDB collection name |
## Response

### Success (200)

```json
{
  "status": "ingested",
  "collection": "test-collection",
  "chunks": 5,
  "document_id": "poisoned-001.html"
}
```
| Field | Type | Description |
|---|---|---|
| `status` | string | Always `"ingested"` on success |
| `collection` | string | The collection name used |
| `chunks` | integer | Number of chunks created |
| `document_id` | string | Original filename, used as the document ID |
### Error (400)

### Error (500)
## How Ingestion Works

```mermaid
graph LR
    F["Upload file"] --> E["Extract text"]
    E --> C["Chunk text"]
    C --> V["Generate embeddings<br/>(Ollama)"]
    V --> S["Store in ChromaDB"]
```
- **Extract** — Same extraction logic as the `/extract` endpoint
- **Chunk** — Split text into overlapping chunks (size varies by framework)
- **Embed** — Each chunk is embedded via Ollama (`nomic-embed-text`)
- **Store** — Chunks and embeddings are stored in the specified ChromaDB collection
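The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the service's actual code: `chunk_text` mimics a LangChain-style character splitter, and `embed_chunk` is a fake stand-in for the real Ollama `nomic-embed-text` call.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks (LangChain-style defaults)."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last chunk reached the end of the text
    return chunks

def embed_chunk(chunk: str) -> list[float]:
    """Placeholder for the Ollama embedding call (nomic-embed-text)."""
    return [float(len(chunk))]  # fake 1-dimensional embedding

def ingest(text: str, document_id: str) -> dict:
    chunks = chunk_text(text)
    embeddings = [embed_chunk(c) for c in chunks]
    # Real pipeline: collection.add(ids=..., documents=chunks, embeddings=embeddings)
    assert len(embeddings) == len(chunks)
    return {"status": "ingested", "chunks": len(chunks), "document_id": document_id}

print(ingest("x" * 2500, "doc-001"))
# → {'status': 'ingested', 'chunks': 3, 'document_id': 'doc-001'}
```

Note how 2,500 characters become 3 chunks: with a 1,000-character window and 200-character overlap, each chunk advances only 800 characters.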
## Collection Auto-Creation

If the specified collection doesn't exist, it is created automatically. If it exists, documents are added to the existing collection.
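A minimal sketch of this get-or-create behavior, using a plain dict in place of a real ChromaDB client (whose equivalent call is `client.get_or_create_collection(name=...)`):

```python
class CollectionStore:
    """Toy stand-in for a ChromaDB client, illustrating auto-creation."""

    def __init__(self):
        self._collections: dict[str, list] = {}

    def get_or_create(self, name: str) -> list:
        # Create the collection on first use; return the existing one otherwise.
        return self._collections.setdefault(name, [])

store = CollectionStore()
first = store.get_or_create("test-collection")   # created automatically
first.append("doc-1")
again = store.get_or_create("test-collection")   # same collection, not a new one
print(again)
# → ['doc-1']
```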
## Examples

### Ingest a Single Document

```bash
curl -X POST http://localhost:8100/ingest \
  -F "file=@document.html" \
  -F "collection=test-collection"
```
### Ingest Multiple Documents

```bash
for doc in /tmp/hemlock-batch/*.html; do
  curl -s -X POST http://localhost:8100/ingest \
    -F "file=@${doc}" \
    -F "collection=test-collection"
done
```
### Verify Ingestion

```bash
# Check collection document count via the ChromaDB API
curl http://localhost:8000/api/v1/collections/test-collection/count
```
## Chunking Strategies

Each framework chunks documents differently:
| Framework | Default Chunk Size | Overlap |
|---|---|---|
| LangChain | 1000 chars | 200 chars |
| LlamaIndex | 1024 tokens | 20 tokens |
| Unstructured | Element-based | N/A |
| Haystack | 500 words | 50 words |
> **Chunking affects retrieval**
>
> Smaller chunks are more precise but may lose context. Larger chunks preserve context but may dilute payload embeddings. This is why the same document can have different retrieval rankings across frameworks.
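To make the effect concrete, here is a rough chunk-count calculation (an illustration with hypothetical document sizes, not the frameworks' actual splitters): the same document yields very different numbers of chunks under LangChain-style character settings and Haystack-style word settings.

```python
def count_chunks(n_units: int, size: int, overlap: int) -> int:
    """Chunks needed to cover n_units units (chars/words/tokens) with overlap."""
    if n_units <= size:
        return 1
    step = size - overlap
    # One full chunk, then ceiling division over the remaining units.
    return 1 + -(-(n_units - size) // step)

# Hypothetical document: ~5,000 characters, ~900 words.
print(count_chunks(5000, 1000, 200))  # LangChain-style (1000 chars / 200 overlap) → 6
print(count_chunks(900, 500, 50))     # Haystack-style (500 words / 50 overlap) → 2
```

Six chunks versus two means the two frameworks embed very different slices of the same document, so a poisoned passage may rank highly in one index and poorly in the other.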
## Next Steps

- Query — Retrieve and query ingested documents
- Retrieval Tests — How ingestion is tested