Extract Endpoint
Extracts text from an uploaded document using the framework's native document parser.
Request
| Field | Type | Required | Description |
|---|---|---|---|
| `file` | file | ✓ | The document to extract text from |
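
The single `file` field is sent as a multipart form upload, which is what the `-F` flag in the payload example on this page produces. A minimal sketch, assuming the service listens at `http://localhost:8100/extract` as in that example; the `extract_file` helper and the `EXTRACT_URL` variable are illustrative names, not part of the API:

```shell
# Hypothetical helper: upload one document to the extract endpoint.
# EXTRACT_URL is an assumed override; the default matches the curl
# example elsewhere on this page.
extract_file() {
  local path="$1"
  curl -s -X POST "${EXTRACT_URL:-http://localhost:8100/extract}" \
    -F "file=@${path}"
}
```

Calling `extract_file report.pdf` prints the raw JSON response, which can then be piped to `jq`.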
Supported Formats
| Extension | MIME Type |
|---|---|
| `.html` | `text/html` |
| `.pdf` | `application/pdf` |
| `.docx` | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` |
| `.md` | `text/markdown` |
| `.txt` | `text/plain` |
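
The table above can be turned into a local pre-upload check. The following is a sketch only: `mime_for_file` is a hypothetical helper, and whether the service decides by extension or by declared MIME type is not stated here.

```shell
# Hypothetical helper: map a filename to the MIME type listed in the table.
# Prints the MIME type and returns 0, or returns 1 for an unlisted extension.
# Matching is case-sensitive, so "REPORT.PDF" is treated as unlisted.
mime_for_file() {
  case "${1##*.}" in
    html) echo "text/html" ;;
    pdf)  echo "application/pdf" ;;
    docx) echo "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ;;
    md)   echo "text/markdown" ;;
    txt)  echo "text/plain" ;;
    *)    return 1 ;;
  esac
}
```

If an explicit content type is needed, curl accepts it inline in the form field, for example `-F "file=@report.pdf;type=$(mime_for_file report.pdf)"`.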
Response
Success (200)
{
  "text": "Extracted text content from the document...",
  "metadata": {
    "source": "filename.html",
    "format": "html",
    "length": 2451
  }
}
| Field | Type | Description |
|---|---|---|
| `text` | string | The extracted text content |
| `metadata.source` | string | Original filename |
| `metadata.format` | string | Detected file format |
| `metadata.length` | integer | Character count of extracted text |
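
A script that consumes this response may want to confirm the documented fields are present before using them. A minimal sketch, assuming `jq` is installed; `check_extract_response` is a hypothetical helper, and it checks types only, not values:

```shell
# Hypothetical check: read a response body on stdin and confirm it has the
# documented fields with the documented types.
# Exits 0 when the shape matches, non-zero otherwise.
check_extract_response() {
  jq -e '
    (.text | type == "string")
    and (.metadata.source | type == "string")
    and (.metadata.format | type == "string")
    and (.metadata.length | type == "number")
  ' >/dev/null
}
```

For example, `curl -s -X POST http://localhost:8100/extract -F "file=@doc.html" | check_extract_response` would fail on a response body that lacks the `metadata` object.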
Error (400)
Error (415)
Examples
Extract HTML
Extract PDF
Extract and Check for Payload
# Extract the text and count the lines that contain a specific string
curl -s -X POST http://localhost:8100/extract \
  -F "file=@poisoned.html" \
  | jq -r '.text' \
  | grep -c "PAYLOAD_STRING"
Framework Differences
The same document can produce different extracted text depending on the framework.
This is exactly what the extraction tests measure.
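
One way to observe such differences by hand is to send the same file to more than one deployment and compare the results. The sketch below assumes each framework is served at its own extract URL, which this page does not specify; `compare_payload_counts` and the URLs passed to it are hypothetical, and `jq` is assumed to be installed.

```shell
# Hypothetical comparison: send the same file to several extract endpoints
# (assumed to be one per framework) and report, for each, the number of
# lines of extracted text that contain a marker string.
# The endpoint URLs are placeholders supplied by the caller.
compare_payload_counts() {
  local file="$1" marker="$2" url count
  shift 2
  for url in "$@"; do
    # grep -c exits non-zero when the count is 0, so tolerate that status.
    count="$(curl -s -X POST "$url" -F "file=@${file}" \
      | jq -r '.text' \
      | grep -c -F -- "$marker" || true)"
    printf '%s\t%s\n' "$url" "$count"
  done
}
```

A call such as `compare_payload_counts poisoned.html PAYLOAD_STRING "$URL_A" "$URL_B"` prints one tab-separated line per endpoint, so differing counts stand out immediately.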
Next Steps
- Ingest — Store extracted text in ChromaDB
- Extraction Tests — Automated testing of this endpoint