Extract Endpoint

Extracts text from an uploaded document using the framework's native document parser.


Request

POST /extract
Content-Type: multipart/form-data
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| file | file | Yes | The document to extract text from |
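
For callers that cannot shell out to curl, the request above can be sketched with the Python standard library alone. This is a minimal sketch, not the project's client: `build_multipart` and `extract` are hypothetical helper names, the per-part `application/octet-stream` content type is an assumption (the server may instead rely on the filename extension), and the base URL is taken from the curl examples on this page.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, content, content_type="application/octet-stream"):
    """Encode a single file as a multipart/form-data body.

    Returns (body, headers). Assumes `filename` contains no double quotes.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = head + content + tail
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Content-Length": str(len(body)),
    }
    return body, headers


def extract(path, base_url="http://localhost:8100"):
    """POST a local file to /extract and return the decoded JSON body."""
    file_path = Path(path)
    body, headers = build_multipart("file", file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/extract", data=body, headers=headers, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a service running, `extract("document.html")["text"]` would hold the extracted text.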

Supported Formats

| Extension | MIME Type |
|-----------|-----------|
| .html | text/html |
| .pdf | application/pdf |
| .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
| .md | text/markdown |
| .txt | text/plain |
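
A client can check a filename against the formats listed above before uploading, which avoids a round trip that would end in a 415. The sketch below is illustrative: `SUPPORTED_FORMATS` and `mime_type_for` are hypothetical names, and the case-insensitive extension match is an assumption about how the server behaves.

```python
from pathlib import Path

# Extension to MIME type, copied from the supported formats listed above.
SUPPORTED_FORMATS = {
    ".html": "text/html",
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".md": "text/markdown",
    ".txt": "text/plain",
}


def mime_type_for(filename):
    """Return the MIME type for a supported file, or raise ValueError."""
    # Assumption: extensions are matched case-insensitively.
    extension = Path(filename).suffix.lower()
    try:
        return SUPPORTED_FORMATS[extension]
    except KeyError:
        raise ValueError(f"Unsupported format: {extension}") from None
```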

Response

Success (200)

{
  "text": "Extracted text content from the document...",
  "metadata": {
    "source": "filename.html",
    "format": "html",
    "length": 2451
  }
}

| Field | Type | Description |
|-------|------|-------------|
| text | string | The extracted text content |
| metadata.source | string | Original filename |
| metadata.format | string | Detected file format |
| metadata.length | integer | Character count of extracted text |
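
The success response maps naturally onto a small typed record. The following is a sketch under the assumption that all four fields are always present; `ExtractResult` and `parse_extract_response` are hypothetical names, not part of the service.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractResult:
    text: str      # the extracted text content
    source: str    # metadata.source: original filename
    format: str    # metadata.format: detected file format
    length: int    # metadata.length: character count reported by the server


def parse_extract_response(raw):
    """Parse a 200 response body (str or bytes) into an ExtractResult."""
    data = json.loads(raw)
    metadata = data["metadata"]
    return ExtractResult(
        text=data["text"],
        source=metadata["source"],
        format=metadata["format"],
        length=int(metadata["length"]),
    )
```

A missing field raises `KeyError`, which surfaces schema drift early instead of passing partial data downstream.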

Error (400)

{
  "error": "No file provided"
}

Error (415)

{
  "error": "Unsupported format: .xlsx"
}
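
Both error responses share the same `{"error": ...}` shape, so a client can turn them into exceptions keyed on the status code. This is a hedged sketch: the exception classes and `error_from_response` are hypothetical, and only the 400 and 415 codes documented here are mapped; any other status falls back to the generic base class.

```python
import json


class ExtractError(Exception):
    """Base class for errors returned by the /extract endpoint."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


class BadRequestError(ExtractError):
    """HTTP 400, for example when no file was provided."""


class UnsupportedFormatError(ExtractError):
    """HTTP 415, the uploaded format is not supported."""


_ERROR_TYPES = {400: BadRequestError, 415: UnsupportedFormatError}


def error_from_response(status, raw_body):
    """Build an exception from a status code and a response body string."""
    try:
        message = json.loads(raw_body).get("error", "")
    except (ValueError, AttributeError):
        # Body is not a JSON object; keep the raw text as the message.
        message = raw_body
    return _ERROR_TYPES.get(status, ExtractError)(status, message)
```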

Examples

Extract HTML

curl -X POST http://localhost:8100/extract \
  -F "file=@document.html"

Extract PDF

curl -X POST http://localhost:8100/extract \
  -F "file=@report.pdf"

Extract and Check for Payload

# Extract the text and count the lines that contain a specific string
curl -s -X POST http://localhost:8100/extract \
  -F "file=@poisoned.html" \
| jq -r '.text' \
| grep -c "PAYLOAD_STRING"
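
Note that `grep -c` reports the number of matching lines, not the number of occurrences. The sketch below shows both counts in Python for text that has already been extracted; the function names are hypothetical, and the needle is treated as a literal string (closer to `grep -F` than to a regular expression).

```python
def count_matching_lines(text, needle):
    """Count lines containing `needle`, similar to `grep -cF`."""
    return sum(1 for line in text.split("\n") if needle in line)


def count_occurrences(text, needle):
    """Count non-overlapping occurrences of `needle` anywhere in the text."""
    return text.count(needle)
```

A payload that appears twice on one line counts once in the first function and twice in the second, so the choice matters when a test asserts on an exact number.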

Framework Differences

The same document can produce different extracted text depending on the framework:

# May include CSS-hidden text (BSHTMLLoader preserves structure)
curl -s -X POST http://localhost:8100/extract -F "file=@hidden.html" | jq -r '.text'

# Behavior depends on HTMLTagReader version
curl -s -X POST http://localhost:8101/extract -F "file=@hidden.html" | jq -r '.text'

# Typically strips hidden content (visible text only)
curl -s -X POST http://localhost:8102/extract -F "file=@hidden.html" | jq -r '.text'

# Depends on HTMLToDocument implementation
curl -s -X POST http://localhost:8103/extract -F "file=@hidden.html" | jq -r '.text'

These per-framework differences in extracted text are exactly what the extraction tests measure.
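
Once the text from each service has been collected, the comparison can be sketched as below. This is an illustration, not the actual test suite: the function names are hypothetical, the outputs are keyed by whatever label the caller chooses (a base URL, for instance), and the marker is any string known to appear only in the hidden part of the document.

```python
def marker_presence(outputs, marker):
    """Map each label to whether its extracted text contains `marker`."""
    return {label: marker in text for label, text in outputs.items()}


def group_identical(outputs):
    """Group labels whose extracted text is byte-for-byte identical.

    Groups are returned in first-seen order; labels inside a group are sorted.
    """
    groups = {}
    for label, text in outputs.items():
        groups.setdefault(text, []).append(label)
    return [sorted(labels) for labels in groups.values()]
```

A service that leaks hidden content shows up as `True` in `marker_presence`, while `group_identical` shows which services agree on the full output.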


Next Steps