Unstructured Pipeline¶

Property	Value
Framework	Unstructured
Version	0.16.11
Port	8102
Container	`unstructured-rag`
Image	`docker/unstructured-rag`

Extraction¶

Unstructured uses partition() which returns typed Element objects:

from unstructured.partition.auto import partition

elements = partition(filename=temp_path)
text = "\n\n".join(el.text for el in elements if el.text)

How It Works¶

Uploaded file is saved to a temp path
partition() auto-detects the file type and applies the appropriate partitioner
Returns a list of typed Elements: Title, NarrativeText, ListItem, Table, etc.
Text is concatenated from all elements that have non-empty .text

Implications for hemlock¶

Unstructured's element-based approach is fundamentally different from the other frameworks. Instead of extracting raw text, it creates a structured representation of the document:

HTML — partition_html() parses the raw DOM, not the rendered view. CSS visibility rules are not enforced — elements with display:none, zero font size, offscreen positioning, or aria-hidden all contribute text to the element list.
PDF — partition_pdf() extracts text from the content stream, including text at off-page coordinates.
DOCX — partition_docx() reads all character runs in the document body. Font size 0 and white-on-white text still contribute to the element list. Custom XML data parts are stripped.
RTF — partition_rtf() reads the document body; the info block (metadata) is stripped.
Markdown — partition_md() parses Markdown structure. YAML front matter is included. HTML comments and link/image attributes are stripped.
TXT — partition_text() reads content verbatim, preserving all Unicode characters including zero-width and combining marks.
CSV — Field values are extracted; formula strings (e.g., =CONCATENATE(...)) are not evaluated.
XLSX — Cell values from all sheets (including hidden sheets) are extracted. Cell comments and document metadata are stripped.
EPUB/DOCX metadata — OPC document property fields (dc:title, dc:description, etc.) are stripped.
PDF XMP metadata — Metadata streams are stripped.
JSON — Unstructured only accepts its own serialised output format; arbitrary JSON files are rejected with an error.

Text-layer extraction, not render-layer

Unstructured processes documents at the text/parse layer, not the visual rendering layer. It strips metadata containers, comments, and non-text structure nodes, but it does not enforce CSS visibility or font styling. Many "visually hidden" techniques therefore survive Unstructured's extraction.

Phase 6 empirical results (2026-04-11)

A full sweep of all 63 hemlock techniques showed that 25/58 (43%) testable technique–format combinations survive Unstructured extraction. The "0% Unstructured" figure in earlier aggregate reports refers to injection ASR (attack success rate against the LLM), not extraction survival rate. Unstructured does not run generation natively, so injection tests produce 0% regardless of extraction outcomes.

RAG Chain¶

Unlike the other frameworks, Unstructured doesn't have a built-in RAG chain abstraction. hemlock-lab implements the chain manually:

graph LR
    Q["Query"] --> E1["Embed Query<br/>(Ollama)"]
    E1 --> S["ChromaDB<br/>Search"]
    S --> D["Top-k<br/>Chunks"]
    D --> P["Prompt<br/>Template"]
    P --> L["Ollama<br/>smollm2:135m"]
    L --> A["Answer"]

# Manual embedding
query_embedding = ollama_embed(query_text)

# Manual ChromaDB search
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)

# Manual prompt construction
context = "\n".join(results["documents"][0])
prompt = f"Context:\n{context}\n\nQuestion: {query_text}\nAnswer:"

# Direct Ollama call
response = ollama_generate(prompt)

Curl Examples¶

Health Check¶

curl http://localhost:8102/health

{"framework": "unstructured", "status": "ok", "version": "0.17.2"}

Extract Text¶

curl -X POST http://localhost:8102/extract \
  -F "file=@poisoned-document.html"

{"text": "...", "elements": 12}

Ingest Document¶

curl -X POST http://localhost:8102/ingest \
  -F "file=@document.html" \
  -F "collection=test-collection"

{"status": "ingested", "collection": "test-collection", "chunks": 8}

Query¶

curl -X POST http://localhost:8102/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the refund policy?", "collection": "test-collection"}'

{"answer": "...", "sources": ["refund-policy.html"]}

Extraction Survival Matrix (Phase 6 Empirical Survey)¶

Empirical results from testing all 63 hemlock techniques against the live Unstructured pipeline (v0.16.11). Custom payload HEMLOCK-PHASE6-BYPASS-TEST used throughout. ERR = Unstructured rejected the file (e.g., JSON schema mismatch). N/A = not applicable (langchain also errored on PNG inputs).

Surviving techniques (25)¶

Format	Technique	Stealth	Notes
csv	bom-prefix	50	BOM-prefixed field value extracted verbatim
csv	extra-column	45	Extra column value extracted
csv	header-inject	50	Header cell value extracted
csv	quoted-field	55	Multiline quoted field extracted
docx	fontzero	80	Zero-point font run extracted (styling ignored)
docx	hidden-paragraph	75	Word `vanish` property ignored, text extracted
docx	whitefont	70	White font run extracted (colour ignored)
epub	aria-hidden	65	`aria-hidden` attribute does not suppress extraction
epub	css-hide	70	CSS zero-font-size span text extracted
html	aria-hidden	70	`aria-hidden` span text extracted
html	camouflage	80	Behind-image text extracted (CSS z-index ignored)
html	color-transparent	85	Transparent text extracted
html	css-hide	75	`display:none` / zero-size CSS text extracted
html	invisible-div	55	`display:none` div text extracted
html	offscreen	80	Offscreen positioned text extracted
markdown	frontmatter	55	YAML front matter included in output
pdf	offpage	70	Off-page text in content stream extracted
rtf	fontzero	75	Zero-point font body text extracted
rtf	white-text	70	White-on-white body text extracted
txt	bidi-override	70	BiDi override characters preserved
txt	diacritical	85	Combining diacritical marks preserved
txt	homoglyph	80	Cyrillic homoglyphs preserved
txt	zero-width	85	Zero-width Unicode characters preserved
xlsx	fontzero	80	Zero-point font cell value extracted
xlsx	hidden-sheet	75	Hidden sheet cell values extracted

Blocked techniques (33)¶

Format	Technique	Blocked because
csv	formula-injection	Formula string returned, not evaluated value
docx	chunk-boundary	Payload split across paragraphs; marker not present in any single element
docx	comment	Comments stripped
docx	custom-xml	Custom XML data parts stripped
docx	metadata	OPC document properties stripped
docx	metadata-distributed	OPC properties stripped (`docProps/custom.xml`)
epub	comment	XHTML comment nodes stripped
epub	metadata	OPF `dc:description` stripped
epub	metadata-distributed	OPF metadata stripped
epub	toc	NCX navigation labels not part of content elements
html	chunk-boundary	Payload split across `<p>` tags; marker not in a single element
html	comment	HTML comment nodes stripped
html	microdata	Microdata attributes (`itemprop`, `content`) stripped
html	noscript	`<noscript>` content stripped
markdown	chunk-boundary	Payload split across heading sections
markdown	html-comment	HTML comments within Markdown stripped
markdown	image-alt	Image alt text not included in element text
markdown	link-title	Link title attributes not included in element text
pdf	annotation	Annotations not in content stream
pdf	chunk-boundary	Payload split across pages
pdf	invisible-text	Invisible rendering mode text stripped
pdf	javascript	JS actions stripped
pdf	xmp-distributed	XMP metadata stripped
pdf	xmp-metadata	XMP metadata stripped
png	multi-chunk	PNG tEXt chunks not extracted (no OCR)
png	steganographic	LSB steganography not detected
png	text-chunk	PNG tEXt chunk not extracted
png	xmp-metadata	PNG iTXt/XMP chunk not extracted
rtf	comment	RTF annotation group stripped
rtf	fonttable	RTF font table not in text layer
rtf	metadata	RTF info block stripped
txt	chunk-boundary	Payload split across ~512-char boundary
xlsx	comment	Cell comments stripped
xlsx	metadata	`docProps/core.xml` properties stripped

Error cases (5)¶

Format	Technique	Error
docx	metadata-distributed	`docProps/custom.xml` not in ZIP archive
json	metadata-key	`partition_json` only accepts Unstructured schema
json	nested-object	`partition_json` only accepts Unstructured schema
json	prototype-key	`partition_json` only accepts Unstructured schema
json	unicode-escape	`partition_json` only accepts Unstructured schema

Known Behaviors (Summary)¶

Technique	Format	Survives?	Notes
CSS hide	HTML	✓	DOM text extracted; CSS visibility not enforced
HTML comment	HTML	✗	Comment nodes stripped
ARIA hidden	HTML	✓	Attribute not enforced; text extracted
Invisible text	HTML	✓	Zero-width chars in any element pass through
White font	DOCX	✓	Font colour ignored; run text extracted
Hidden paragraph	DOCX	✓	Word `vanish` property ignored
Metadata	DOCX/EPUB/XLSX	✗	OPC/OPF property containers stripped
Frontmatter	Markdown	✓	YAML header included in extraction
Offpage text	PDF	✓	Off-page coordinates in content stream extracted
Annotation	PDF	✗	Annotations not in content stream
XMP metadata	PDF	✗	Metadata streams stripped
RTF metadata	RTF	✗	Info block stripped; body text preserved
Zero-width / homoglyph / diacritical	TXT	✓	Characters preserved verbatim
PNG metadata	PNG	✗	tEXt/iTXt chunks not extracted
JSON (arbitrary)	JSON	ERR	Only Unstructured-schema JSON accepted

Next Steps¶

Haystack — Compare with Haystack's converter approach
LangChain — Compare with LangChain's loader approach