Framework Comparison

Each RAG framework extracts text from documents differently. These differences determine which hiding techniques survive extraction and which are stripped. This page documents the extraction behavior of each framework by format and provides the complete survival matrix for all technique-framework combinations in the original 17-technique set.

LangChain

LangChain's document loaders are the most permissive of the four frameworks. They prioritize extracting as much text content as possible, which means many hiding techniques survive.

Extraction Behavior by Format

Format	Loader	Behavior
HTML	`BSHTMLLoader`	Uses `BeautifulSoup.get_text()` to strip tags. Retains text from `display:none` elements. Strips HTML comments.
DOCX	`Docx2txtLoader`	Extracts `w:t` elements from `word/document.xml` via `docx2txt`. Also reads `docProps/core.xml` metadata (title, subject, description). Does not extract comments or custom XML parts.
PDF	`PyPDFLoader`	Basic text extraction from content streams. Includes annotations (`/Contents` fields). Does not reliably extract XMP metadata in all configurations.
TXT	`TextLoader`	Raw content pass-through. No transformation or normalization.
Markdown	`TextLoader`	Raw content pass-through. HTML comments within Markdown are preserved.

LangChain Key Insight

LangChain's HTML loader strips comments but does not strip content from hidden elements. This makes invisible-div and css-hide effective techniques. For DOCX, the metadata extraction path makes the metadata technique viable.
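The contrast between dropped comments and retained hidden-element text can be reproduced without LangChain installed. The following is a minimal sketch that uses Python's standard-library `html.parser` as a stand-in for a tag-stripping extractor. It is not `BSHTMLLoader` or `BeautifulSoup.get_text()` itself, and the `TextOnlyExtractor` class, `extract_text` helper, and sample markup are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Collects text nodes only; tags and their attributes are ignored.

    Assumption: this approximates a tag-stripping extractor. It never looks
    at the style attribute, so CSS-hidden elements are treated like any other.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Every text node is kept, regardless of the enclosing element.
        if data.strip():
            self.chunks.append(data.strip())

    # handle_comment is deliberately not overridden: the base class
    # implementation does nothing, so comment text is dropped.


def extract_text(html: str) -> str:
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


SAMPLE = (
    "<html><body>"
    "<p>Visible paragraph.</p>"
    "<!-- comment text -->"
    '<div style="display:none">hidden div text</div>'
    "</body></html>"
)

extracted = extract_text(SAMPLE)
print(extracted)
```

The comment never reaches the output because it is a separate node type, while the `display:none` text is an ordinary text node. An ingestion pipeline that wants to drop such text has to inspect styles explicitly.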

LlamaIndex

LlamaIndex uses html2text for HTML conversion, which applies more aggressive stripping than LangChain's BeautifulSoup approach. DOCX and PDF extraction is similar but with less metadata coverage.

Extraction Behavior by Format

Format	Loader	Behavior
HTML	`SimpleDirectoryReader`	Uses `html2text` which strips comments and hidden elements (`display:none`, `visibility:hidden`). More aggressive than `get_text()`.
DOCX	`SimpleDirectoryReader`	Extracts `w:t` text content from `word/document.xml`. Does not extract metadata from `docProps/core.xml`, comments, or custom XML.
PDF	`SimpleDirectoryReader`	Text extraction with annotation support. Generally includes `/Contents` annotation text alongside stream text.
TXT	`SimpleDirectoryReader`	Raw content pass-through.
Markdown	`SimpleDirectoryReader`	Raw content pass-through.

LlamaIndex Key Insight

The html2text library strips hidden elements, which makes invisible-div fail under LlamaIndex. The survival matrix records css-hide and aria-hidden as failing as well, so none of the HTML techniques in the original set survives LlamaIndex extraction.

Unstructured.io

Unstructured applies the most aggressive sanitization. It is designed for production data pipelines where clean text extraction is the priority, which means it actively strips many of the hiding vectors that other frameworks miss.

Extraction Behavior by Format

Format	Loader	Behavior
HTML	`partition_html`	Strips comments, hidden elements, `aria-hidden` elements, and most CSS-hidden content. The most thorough HTML sanitizer of the four.
DOCX	`partition_docx`	Extracts only visible `w:t` text from `word/document.xml`. Strips comments, metadata, and custom XML entirely. Font-size-zero and white-font runs in `w:t` elements still pass through because the extractor does not evaluate run properties.
PDF	`partition_pdf`	Text extraction only. Strips annotations, metadata, and XMP data. Only content stream text survives.
TXT	`partition_text`	Raw text with Unicode normalization. Zero-width characters are stripped. Homoglyphs survive because they are ordinary visible codepoints. BiDi overrides survive because the control characters change only display order and the underlying text is left in place.
Markdown	`partition_md`	Renders Markdown to HTML and partitions the result, so HTML comments within Markdown are stripped like any other HTML comment.

Unstructured Aggressiveness

Unstructured strips the widest range of hiding techniques. Only techniques that embed the payload in the body text itself (fontzero, whitefont, homoglyph, and bidi-override) reliably survive.
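The difference between stripping zero-width characters and stripping every invisible control character can be shown with the standard library alone. This is a hedged sketch, not Unstructured's normalization code: the `ZERO_WIDTH` set, both helper functions, and the sample string are assumptions chosen for illustration.

```python
import unicodedata

# Assumed set of zero-width codepoints: ZWSP, ZWNJ, ZWJ, WORD JOINER, BOM.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> str:
    """Remove only the listed zero-width characters."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def strip_format_chars(text: str) -> str:
    """Stricter variant: remove every character in Unicode category Cf."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# U+200B zero-width space, U+0430 Cyrillic small a, U+202E right-to-left override.
sample = "pay\u200bload \u0430dmin \u202edesrever"

zero_width_only = strip_zero_width(sample)
format_stripped = strip_format_chars(sample)

for label, value in (("zero-width filter", zero_width_only),
                     ("Cf filter", format_stripped)):
    remaining = [f"U+{ord(ch):04X}" for ch in value if ord(ch) > 127]
    print(label, remaining)
```

The narrow filter leaves both the Cyrillic homoglyph and the BiDi override in place. The category-based filter also removes the override, because BiDi controls share the `Cf` category with the zero-width characters, but the homoglyph remains since it is a normal letter.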

Haystack

Haystack uses format-specific converters (HTMLToDocument, DOCXToDocument, PyPDFToDocument, TextFileToDocument). Its HTML handling falls between LangChain and Unstructured in aggressiveness: it strips comments and hidden elements but does not strip aria-hidden content.

Extraction Behavior by Format

Format	Converter	Behavior
HTML	`HTMLToDocument`	Uses trafilatura-style extraction. Strips HTML comments and elements with `display:none` / `visibility:hidden`. Does not strip `aria-hidden` content (unlike Unstructured).
DOCX	`DOCXToDocument`	Extracts `w:t` text from `word/document.xml` via `python-docx`. Does not extract metadata, comments, or custom XML parts.
PDF	`PyPDFToDocument`	Text extraction via PyPDF. Includes annotation `/Contents` text alongside content stream text. Does not extract XMP metadata.
TXT	`TextFileToDocument`	Raw content pass-through. No transformation or normalization.
Markdown	`TextFileToDocument`	Raw content pass-through. HTML comments within Markdown are preserved.

Haystack Key Insight

Haystack occupies a middle ground: its HTML handling is more aggressive than LangChain (strips hidden elements) but less aggressive than Unstructured (retains aria-hidden). This makes aria-hidden the one HTML technique that survives Haystack; LangChain is the only other framework that retains it. DOCX extraction is identical to LlamaIndex behavior, and PDF is similar to LangChain.
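The middle ground described here can be modeled as a single extractor with one switch. The sketch below is an approximation built on the standard-library `html.parser`, not the actual code of `HTMLToDocument` or `partition_html`. The `VisibleTextExtractor` class and its `strip_aria_hidden` parameter are hypothetical, and only inline `style` attributes are inspected.

```python
from html.parser import HTMLParser

# Void elements never receive an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Skips text inside elements hidden via inline style (and optionally aria-hidden)."""

    def __init__(self, strip_aria_hidden: bool = False):
        super().__init__()
        self.strip_aria_hidden = strip_aria_hidden
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def _is_hidden(self, attrs) -> bool:
        attr_map = {name: (value or "") for name, value in attrs}
        style = attr_map.get("style", "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            return True
        return self.strip_aria_hidden and attr_map.get("aria-hidden", "").lower() == "true"

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a hidden subtree, every nested element deepens it.
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def extract(html: str, strip_aria_hidden: bool) -> str:
    parser = VisibleTextExtractor(strip_aria_hidden)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


SAMPLE = (
    "<p>Visible.</p>"
    '<div style="display: none">inline hidden <b>nested</b></div>'
    '<span aria-hidden="true">aria text</span>'
)

retains_aria = extract(SAMPLE, strip_aria_hidden=False)
strips_aria = extract(SAMPLE, strip_aria_hidden=True)
print(retains_aria)
print(strips_aria)
```

With the switch off, the `aria-hidden` span passes through while the `display:none` subtree is dropped, which matches the Haystack row of the table. With the switch on, both are removed, which matches the Unstructured row. Class-based hiding defined in a stylesheet is outside what this sketch can detect.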

Survival Matrix

The following table shows whether each technique's payload survives extraction by each framework. Results are based on hemlock's validation engine, which simulates the extraction behavior documented above.

Technique	Format	LangChain	LlamaIndex	Unstructured	Haystack
`comment`	HTML	Fail	Fail	Fail	Fail
`invisible-div`	HTML	Pass	Fail	Fail	Fail
`aria-hidden`	HTML	Pass	Fail	Fail	Pass
`css-hide`	HTML	Pass	Fail	Fail	Fail
`metadata`	DOCX	Pass	Fail	Fail	Fail
`fontzero`	DOCX	Pass	Pass	Pass	Pass
`whitefont`	DOCX	Pass	Pass	Pass	Pass
`comment`	DOCX	Fail	Fail	Fail	Fail
`custom-xml`	DOCX	Fail	Fail	Fail	Fail
`annotation`	PDF	Pass	Pass	Fail	Pass
`invisible-text`	PDF	Pass	Pass	Fail	Pass
`javascript`	PDF	Fail	Fail	Fail	Fail
`xmp-metadata`	PDF	Pass	Fail	Fail	Fail
`zero-width`	TXT	Pass	Pass	Fail	Pass
`homoglyph`	TXT	Pass	Pass	Pass	Pass
`bidi-override`	TXT	Pass	Pass	Pass	Pass
`html-comment`	Markdown	Pass	Pass	Fail	Pass

Summary by Framework

Framework	Pass	Fail	Pass Rate
LangChain	13	4	76.5%
Haystack	9	8	52.9%
LlamaIndex	8	9	47.1%
Unstructured	4	13	23.5%

Key Takeaways

Universal Survivors

Four techniques survive all four frameworks:

fontzero (DOCX) --- 1-point font runs are extracted as w:t text by all loaders.
whitefont (DOCX) --- White-on-white text is also within w:t elements and passes through.
homoglyph (TXT) --- Visually similar Unicode codepoints are real characters; no framework strips them.
bidi-override (TXT) --- BiDi control characters alter display order but the underlying text remains.
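The two DOCX entries above survive because a plain `w:t` walk never consults run properties. The sketch below illustrates that with a hand-written WordprocessingML fragment and the standard-library XML parser, then adds a formatting-aware check that an ingestion pipeline could apply. The fragment, both function names, and the `min_half_points` threshold are assumptions for illustration, not any framework's code. `w:sz` values are expressed in half-points, so `2` means a 1-point font.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical word/document.xml fragment with three runs.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    "<w:r><w:t>Normal text.</w:t></w:r>"
    '<w:r><w:rPr><w:sz w:val="2"/></w:rPr><w:t>tiny run</w:t></w:r>'
    '<w:r><w:rPr><w:color w:val="FFFFFF"/></w:rPr><w:t>white run</w:t></w:r>'
    "</w:p></w:body></w:document>"
)


def extract_wt_text(xml: str) -> list[str]:
    """Collect every w:t element, ignoring run properties entirely."""
    root = ET.fromstring(xml)
    return [t.text or "" for t in root.iter(f"{{{W}}}t")]


def flag_suspicious_runs(xml: str, min_half_points: int = 8) -> list[str]:
    """Return text of runs whose formatting suggests they are not meant to be read."""
    root = ET.fromstring(xml)
    flagged = []
    for run in root.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        size = run.find("w:rPr/w:sz", NS)
        color = run.find("w:rPr/w:color", NS)
        tiny = size is not None and int(size.get(f"{{{W}}}val", "0")) < min_half_points
        white = color is not None and color.get(f"{{{W}}}val", "").upper() == "FFFFFF"
        if text and (tiny or white):
            flagged.append(text)
    return flagged


all_text = extract_wt_text(DOCUMENT_XML)
flagged = flag_suspicious_runs(DOCUMENT_XML)
print(all_text)
print(flagged)
```

The plain walk returns all three runs, while the property-aware pass singles out the two that a reader would not see. A real check would also need to resolve styles and theme colors, which this sketch does not attempt.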

Framework-Specific Opportunities

LangChain only: invisible-div (HTML), css-hide (HTML), metadata (DOCX), xmp-metadata (PDF)
LangChain + Haystack: aria-hidden (HTML)
LangChain + LlamaIndex + Haystack: annotation (PDF), invisible-text (PDF), zero-width (TXT), html-comment (Markdown)

Universally Stripped

Four techniques fail across all frameworks:

comment (HTML) --- Every framework strips HTML comments.
comment (DOCX) --- No framework extracts word/comments.xml.
javascript (PDF) --- No framework executes or extracts PDF JavaScript.
custom-xml (DOCX) --- No framework extracts custom XML parts.
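The summary counts and the three groupings above follow mechanically from the survival matrix. As a cross-check, the sketch below transcribes the matrix into a compact form and recomputes both. The encoding (one `P`/`F` flag per framework) and the function names are illustrative choices, not part of hemlock.

```python
from collections import defaultdict

FRAMEWORKS = ("LangChain", "LlamaIndex", "Unstructured", "Haystack")

# (technique, format, flags): one P/F flag per framework, in FRAMEWORKS order.
MATRIX = [
    ("comment", "HTML", "FFFF"),
    ("invisible-div", "HTML", "PFFF"),
    ("aria-hidden", "HTML", "PFFP"),
    ("css-hide", "HTML", "PFFF"),
    ("metadata", "DOCX", "PFFF"),
    ("fontzero", "DOCX", "PPPP"),
    ("whitefont", "DOCX", "PPPP"),
    ("comment", "DOCX", "FFFF"),
    ("custom-xml", "DOCX", "FFFF"),
    ("annotation", "PDF", "PPFP"),
    ("invisible-text", "PDF", "PPFP"),
    ("javascript", "PDF", "FFFF"),
    ("xmp-metadata", "PDF", "PFFF"),
    ("zero-width", "TXT", "PPFP"),
    ("homoglyph", "TXT", "PPPP"),
    ("bidi-override", "TXT", "PPPP"),
    ("html-comment", "Markdown", "PPFP"),
]


def summarize(matrix):
    """Per framework: (pass count, fail count, pass rate in percent)."""
    total = len(matrix)
    summary = {}
    for col, framework in enumerate(FRAMEWORKS):
        passed = sum(1 for _, _, flags in matrix if flags[col] == "P")
        summary[framework] = (passed, total - passed, round(100 * passed / total, 1))
    return summary


def group_by_survivors(matrix):
    """Map each set of surviving frameworks to the techniques that share it."""
    groups = defaultdict(list)
    for technique, fmt, flags in matrix:
        survivors = tuple(fw for fw, flag in zip(FRAMEWORKS, flags) if flag == "P")
        groups[survivors].append(f"{technique} ({fmt})")
    return dict(groups)


summary = summarize(MATRIX)
groups = group_by_survivors(MATRIX)

for framework, (passed, failed, rate) in summary.items():
    print(f"{framework}: {passed} pass, {failed} fail, {rate}%")
for survivors, techniques in groups.items():
    print(" + ".join(survivors) or "none", "->", ", ".join(techniques))
```

Keeping the matrix in one data structure and deriving the summaries from it makes it harder for prose and tables to drift apart when a framework's behavior changes.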

Recommendations for Operators

Prioritize Universal Survivors

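When the target's RAG stack is unknown, prioritize DOCX fontzero/whitefont and TXT homoglyph/bidi-override. These four techniques survive regardless of which framework processes the documents. Markdown html-comment is a weaker fallback: it survives LangChain, LlamaIndex, and Haystack but is stripped by Unstructured.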

Maximize Coverage with Format Diversity

Generate documents across multiple formats using the same payload. A target knowledge base that ingests both DOCX and TXT files gives you two independent survival paths.

Validate Before Deployment

Always run hemlock validate against crafted documents before inserting them into a target knowledge base. Framework behavior can change between library versions, and validation confirms survival against the current simulation.

LangChain-Specific Engagements

If you have confirmed the target uses LangChain, the full range of 13 passing techniques is available. The metadata (DOCX) and xmp-metadata (PDF) techniques are particularly useful because they embed the payload outside the document body entirely.

Haystack-Specific Engagements

Haystack's HTML converter retains aria-hidden content, making it the only framework besides LangChain where aria-hidden survives. It is also the only HTML technique that survives Haystack: comment, invisible-div, and css-hide are all stripped. PDF annotation and invisible-text also survive due to PyPDF's annotation inclusion.

Next Steps

Validation Engine Overview --- How the validation pipeline works
Validate API Reference --- Programmatic validation in Go
Techniques Reference --- Detailed documentation for each technique