Extract Endpoint

Extracts text from an uploaded document using the framework's native document parser.

Request

POST /extract
Content-Type: multipart/form-data

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `file` | file | ✓ | The document to extract text from |
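For clients that cannot shell out to curl, the request can be assembled by hand. The following is a minimal sketch using only the Python standard library; the base URL is taken from the curl examples on this page, and the helper names `build_multipart` and `extract` are hypothetical, not part of the service.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple

# Assumption: the service listens on localhost:8100, as in the curl examples.
BASE_URL = "http://localhost:8100"


def build_multipart(filename: str, data: bytes, field: str = "file") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    # Fall back to a generic type when the extension is unknown to mimetypes.
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract(path: str, base_url: str = BASE_URL) -> dict:
    """POST the file at `path` to /extract and return the decoded JSON response."""
    file_path = Path(path)
    body, content_type = build_multipart(file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a service running, `extract("document.html")["text"]` would return the extracted text. The form field must be named `file`, matching the table above.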

Supported Formats

| Extension | MIME Type |
|-----------|-----------|
| `.html` | `text/html` |
| `.pdf` | `application/pdf` |
| `.docx` | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` |
| `.md` | `text/markdown` |
| `.txt` | `text/plain` |
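A client can check the extension before uploading and avoid a round trip that would end in a 415. This sketch copies the mapping from the table above; the function name `mime_type_for` is hypothetical, and the case-insensitive match is an assumption, since the server's own matching rules are not documented here.

```python
from pathlib import Path
from typing import Optional

# Extension-to-MIME mapping copied from the Supported Formats table.
SUPPORTED_FORMATS = {
    ".html": "text/html",
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".md": "text/markdown",
    ".txt": "text/plain",
}


def mime_type_for(filename: str) -> Optional[str]:
    """Return the documented MIME type for `filename`, or None if unsupported."""
    # Assumption: extensions are compared case-insensitively; the server may be stricter.
    return SUPPORTED_FORMATS.get(Path(filename).suffix.lower())
```

For example, `mime_type_for("report.pdf")` yields `application/pdf`, while `mime_type_for("sheet.xlsx")` yields `None`.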

Response

Success (200)

{
  "text": "Extracted text content from the document...",
  "metadata": {
    "source": "filename.html",
    "format": "html",
    "length": 2451
  }
}

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | The extracted text content |
| `metadata.source` | string | Original filename |
| `metadata.format` | string | Detected file format |
| `metadata.length` | integer | Character count of extracted text |
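The response fields map naturally onto a small typed record. The sketch below parses a success body and checks `metadata.length` against the text; `ExtractResult`, `parse_extract_response`, and `length_matches` are hypothetical names, and the check assumes that "character count" means Unicode code points, which is what Python's `len()` reports.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractResult:
    text: str
    source: str
    format: str
    length: int


def parse_extract_response(raw: str) -> ExtractResult:
    """Decode a 200 response body into an ExtractResult."""
    payload = json.loads(raw)
    meta = payload["metadata"]
    return ExtractResult(payload["text"], meta["source"], meta["format"], meta["length"])


def length_matches(result: ExtractResult) -> bool:
    """Compare the reported length with the actual length of the text."""
    # Assumption: the server counts Unicode code points, like Python's len().
    return result.length == len(result.text)
```

A mismatch does not necessarily indicate a bug; it may only mean the server counts bytes or some other unit.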

Error (400)

{
  "error": "No file provided"
}

Error (415)

{
  "error": "Unsupported format: .xlsx"
}
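Both error responses carry a JSON object with an `error` string, so a client can report them uniformly. This is a sketch only: `explain_failure` is a hypothetical helper, and the fallback for other status codes is an assumption, since only 400 and 415 are documented here.

```python
import json

# Labels for the two documented error statuses.
STATUS_LABELS = {400: "bad request", 415: "unsupported media type"}


def explain_failure(status: int, body: bytes) -> str:
    """Turn a non-200 /extract response into a one-line message."""
    try:
        detail = json.loads(body).get("error", "")
    except (ValueError, AttributeError):
        detail = ""  # body was not a JSON object
    label = STATUS_LABELS.get(status, "unexpected status")
    return f"{status} {label}: {detail or 'no error message in body'}"
```

When calling the endpoint through `urllib.request`, a non-2xx status raises `urllib.error.HTTPError`; its `code` attribute and `read()` method supply the two arguments.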

Examples

Extract HTML

curl -X POST http://localhost:8100/extract \
  -F "file=@document.html"

Extract PDF

curl -X POST http://localhost:8100/extract \
  -F "file=@report.pdf"

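Extract and Check for Payload

# Extract the text and count the lines that contain a specific string.
# grep -c counts matching lines, not individual occurrences; -F matches the string literally.
curl -s -X POST http://localhost:8100/extract \
  -F "file=@poisoned.html" \
| jq -r '.text' \
| grep -cF "PAYLOAD_STRING"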

Framework Differences

The same document can produce different extracted text depending on the framework:

LangChain (port 8100)

curl -s -X POST http://localhost:8100/extract -F "file=@hidden.html" | jq -r '.text'
# May include CSS-hidden text (BSHTMLLoader preserves structure)

LlamaIndex (port 8101)

curl -s -X POST http://localhost:8101/extract -F "file=@hidden.html" | jq -r '.text'
# Behavior depends on HTMLTagReader version

Unstructured (port 8102)

curl -s -X POST http://localhost:8102/extract -F "file=@hidden.html" | jq -r '.text'
# Typically strips hidden content (visible text only)

Haystack (port 8103)

curl -s -X POST http://localhost:8103/extract -F "file=@hidden.html" | jq -r '.text'
# Depends on HTMLToDocument implementation

These per-framework differences in extracted text are exactly what the extraction tests measure.
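Once each service's `.text` value has been collected, the comparison itself is a few lines. The sketch below is illustrative and is not the project's test suite: the function names are hypothetical, and the framework-to-port mapping assumes the four curl commands above appear in the order LangChain, LlamaIndex, Unstructured, Haystack.

```python
from typing import Dict

# Assumption: ports follow the order of the curl commands above.
FRAMEWORK_PORTS = {
    "LangChain": 8100,
    "LlamaIndex": 8101,
    "Unstructured": 8102,
    "Haystack": 8103,
}


def payload_survival(texts: Dict[str, str], payload: str) -> Dict[str, bool]:
    """Map each framework name to whether its extracted text contains `payload`."""
    return {name: payload in text for name, text in texts.items()}


def frameworks_disagree(survival: Dict[str, bool]) -> bool:
    """True when at least one framework kept the payload and at least one dropped it."""
    return len(set(survival.values())) > 1
```

Here `texts` maps a framework name to the text its service returned for the same uploaded file. A `True` from `frameworks_disagree` means the hidden content reached some pipelines but not others.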

Next Steps

Ingest — Store extracted text in ChromaDB
Extraction Tests — Automated testing of this endpoint