Framework Comparison
Each RAG framework extracts text from documents differently. These differences determine which hiding techniques survive extraction and which are stripped. This page documents the extraction behavior of each framework by format and provides the complete survival matrix for all technique-framework combinations in the original 17-technique set.
LangChain
LangChain's document loaders are the most permissive of the four frameworks. They prioritize extracting as much text content as possible, which means many hiding techniques survive.
Extraction Behavior by Format
| Format | Loader | Behavior |
|---|---|---|
| HTML | BSHTMLLoader | Uses BeautifulSoup.get_text() to strip tags. Retains text from display:none elements. Strips HTML comments. |
| DOCX | Docx2txtLoader | Extracts w:t elements from word/document.xml via docx2txt. Also reads docProps/core.xml metadata (title, subject, description). Does not extract comments or custom XML parts. |
| PDF | PyPDFLoader | Basic text extraction from content streams. Includes annotations (/Contents fields). Does not reliably extract XMP metadata in all configurations. |
| TXT | TextLoader | Raw content pass-through. No transformation or normalization. |
| Markdown | TextLoader | Raw content pass-through. HTML comments within Markdown are preserved. |
LangChain Key Insight
LangChain's HTML loader strips comments but does not strip content from hidden elements. This makes invisible-div and css-hide effective techniques. For DOCX, the metadata extraction path makes the metadata technique viable.
LlamaIndex
LlamaIndex uses html2text for HTML conversion, which applies more aggressive stripping than LangChain's BeautifulSoup approach. DOCX and PDF extraction is similar but with less metadata coverage.
Extraction Behavior by Format
| Format | Loader | Behavior |
|---|---|---|
| HTML | SimpleDirectoryReader | Uses html2text which strips comments and hidden elements (display:none, visibility:hidden). More aggressive than get_text(). |
| DOCX | SimpleDirectoryReader | Extracts w:t text content from word/document.xml. Does not extract metadata from docProps/core.xml, comments, or custom XML. |
| PDF | SimpleDirectoryReader | Text extraction with annotation support. Generally includes /Contents annotation text alongside stream text. |
| TXT | SimpleDirectoryReader | Raw content pass-through. |
| Markdown | SimpleDirectoryReader | Raw content pass-through. |
LlamaIndex Key Insight
The html2text library strips hidden elements, which makes invisible-div fail under LlamaIndex. css-hide uses class-based hiding with zero font size; the survival matrix below records it as failing under LlamaIndex as well.
Unstructured.io
Unstructured applies the most aggressive sanitization. It is designed for production data pipelines where clean text extraction is the priority, which means it actively strips many of the hiding vectors that other frameworks miss.
Extraction Behavior by Format
| Format | Loader | Behavior |
|---|---|---|
| HTML | partition_html | Strips comments, hidden elements, aria-hidden elements, and most CSS-hidden content. The most thorough HTML sanitizer of the four. |
| DOCX | partition_docx | Extracts only visible w:t text from word/document.xml. Strips comments, metadata, and custom XML entirely. Font-size-zero and white-font runs in w:t elements still pass through because the extractor does not evaluate run properties. |
| PDF | partition_pdf | Text extraction only. Strips annotations, metadata, and XMP data. Only content stream text survives. |
| TXT | partition_text | Raw text with Unicode normalization. Zero-width characters are stripped. Homoglyphs and BiDi overrides survive because they use visible Unicode codepoints. |
| Markdown | partition_md | Converts Markdown to HTML and partitions the result with the HTML pipeline, so HTML comments within Markdown are stripped. |
Unstructured Aggressiveness
Unstructured strips the widest range of hiding techniques. Only techniques that embed payloads within the visible text layer---fontzero, whitefont, homoglyph, and bidi-override---reliably survive.
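The TXT behavior described for Unstructured can be illustrated with a small standard-library sketch. It is not Unstructured's code: the `ZERO_WIDTH` set and the choice of NFKC are assumptions made for illustration, and the sample strings are hypothetical.

```python
import unicodedata

# Assumed set of zero-width codepoints; a real pipeline may strip more or fewer.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize_text(text: str) -> str:
    """Drop zero-width characters, then apply NFKC normalization."""
    stripped = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize("NFKC", stripped)


zero_width_sample = "pay\u200bload"   # zero-width space inside a word
homoglyph_sample = "p\u0430yload"     # Cyrillic U+0430 in place of Latin "a"
bidi_sample = "\u202edaolyap\u202c"   # wrapped in BiDi override controls

cleaned_zero_width = normalize_text(zero_width_sample)
cleaned_homoglyph = normalize_text(homoglyph_sample)
cleaned_bidi = normalize_text(bidi_sample)
```

Under these assumptions the zero-width space disappears, while the Cyrillic letter and the BiDi controls pass through unchanged, because NFKC has no mapping that folds or removes them. A defender who wants to catch those cases needs an explicit confusable or control-character check in addition to normalization.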
Haystack
Haystack uses format-specific converters (HTMLToDocument, DocxToDocument, PyPDFToDocument, TextFileToDocument). Its HTML handling falls between LlamaIndex and Unstructured in aggressiveness: it strips comments and hidden elements but does not strip aria-hidden content.
Extraction Behavior by Format
| Format | Converter | Behavior |
|---|---|---|
| HTML | HTMLToDocument | Uses trafilatura-style extraction. Strips HTML comments and elements with display:none / visibility:hidden. Does not strip aria-hidden content (unlike Unstructured). |
| DOCX | DocxToDocument | Extracts w:t text from word/document.xml via python-docx. Does not extract metadata, comments, or custom XML parts. |
| PDF | PyPDFToDocument | Text extraction via PyPDF. Includes annotation /Contents text alongside content stream text. Does not extract XMP metadata. |
| TXT | TextFileToDocument | Raw content pass-through. No transformation or normalization. |
| Markdown | TextFileToDocument | Raw content pass-through. HTML comments within Markdown are preserved. |
Haystack Key Insight
Haystack occupies a middle ground: its HTML handling is more aggressive than LangChain's (it strips hidden elements) but less aggressive than Unstructured's (it retains aria-hidden content). As a result, aria-hidden is the one HTML technique that survives Haystack; outside Haystack it survives only under LangChain. DOCX extraction matches LlamaIndex's behavior, and PDF extraction resembles LangChain's except that XMP metadata is not extracted.
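The three HTML stripping levels described on this page can be compared side by side with a standard-library sketch. This is an imitation of the documented behavior, not code from any of the frameworks: `PolicyExtractor`, `extract`, and the two flags are hypothetical names, and only inline `style` attributes are inspected (class-based hiding is out of scope here).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PolicyExtractor(HTMLParser):
    """Collect text, optionally skipping CSS-hidden or aria-hidden subtrees."""

    def __init__(self, strip_css_hidden: bool, strip_aria_hidden: bool):
        super().__init__()
        self.strip_css_hidden = strip_css_hidden
        self.strip_aria_hidden = strip_aria_hidden
        self.hidden_depth = 0  # > 0 while inside a subtree being dropped
        self.chunks = []

    def _is_hidden(self, attrs) -> bool:
        attr = {key: (value or "") for key, value in attrs}
        style = attr.get("style", "").replace(" ", "").lower()
        css_hidden = "display:none" in style or "visibility:hidden" in style
        if self.strip_css_hidden and css_hidden:
            return True
        return self.strip_aria_hidden and attr.get("aria-hidden", "").lower() == "true"

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())

    # handle_comment is left as the base-class no-op, so comments are dropped.


def extract(html: str, *, strip_css_hidden: bool, strip_aria_hidden: bool) -> str:
    parser = PolicyExtractor(strip_css_hidden, strip_aria_hidden)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


SAMPLE = (
    "<p>Visible text.</p>"
    "<!-- comment text -->"
    '<div style="display: none">css-hidden text</div>'
    '<span aria-hidden="true">aria-hidden text</span>'
)

permissive = extract(SAMPLE, strip_css_hidden=False, strip_aria_hidden=False)
middle = extract(SAMPLE, strip_css_hidden=True, strip_aria_hidden=False)
strict = extract(SAMPLE, strip_css_hidden=True, strip_aria_hidden=True)
```

With no flags set the sketch behaves like the LangChain description (comments dropped, hidden text kept); with only `strip_css_hidden` it matches the Haystack description; with both flags it matches the Unstructured description. The same sketch can serve as a quick check for whether a sanitization step in an ingestion pipeline actually removes hidden subtrees.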
Survival Matrix
The following table shows whether each technique's payload survives extraction by each framework. Results are based on hemlock's validation engine, which simulates the extraction behavior documented above.
| Technique | Format | LangChain | LlamaIndex | Unstructured | Haystack |
|---|---|---|---|---|---|
| comment | HTML | Fail | Fail | Fail | Fail |
| invisible-div | HTML | Pass | Fail | Fail | Fail |
| aria-hidden | HTML | Pass | Fail | Fail | Pass |
| css-hide | HTML | Pass | Fail | Fail | Fail |
| metadata | DOCX | Pass | Fail | Fail | Fail |
| fontzero | DOCX | Pass | Pass | Pass | Pass |
| whitefont | DOCX | Pass | Pass | Pass | Pass |
| comment | DOCX | Fail | Fail | Fail | Fail |
| custom-xml | DOCX | Fail | Fail | Fail | Fail |
| annotation | PDF | Pass | Pass | Fail | Pass |
| invisible-text | PDF | Pass | Pass | Fail | Pass |
| javascript | PDF | Fail | Fail | Fail | Fail |
| xmp-metadata | PDF | Pass | Fail | Fail | Fail |
| zero-width | TXT | Pass | Pass | Fail | Pass |
| homoglyph | TXT | Pass | Pass | Pass | Pass |
| bidi-override | TXT | Pass | Pass | Pass | Pass |
| html-comment | Markdown | Pass | Pass | Fail | Pass |
Summary by Framework
| Framework | Pass | Fail | Pass Rate |
|---|---|---|---|
| LangChain | 13 | 4 | 76.5% |
| Haystack | 9 | 8 | 52.9% |
| LlamaIndex | 8 | 9 | 47.1% |
| Unstructured | 4 | 13 | 23.5% |
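The summary figures follow directly from the survival matrix. The sketch below recomputes them; the `MATRIX` encoding (one P/F letter per framework) and the `summarize` helper are illustrative choices, not part of hemlock.

```python
FRAMEWORKS = ("LangChain", "LlamaIndex", "Unstructured", "Haystack")

# (technique, format) -> one P/F letter per framework, in FRAMEWORKS order.
MATRIX = {
    ("comment", "HTML"): "FFFF",
    ("invisible-div", "HTML"): "PFFF",
    ("aria-hidden", "HTML"): "PFFP",
    ("css-hide", "HTML"): "PFFF",
    ("metadata", "DOCX"): "PFFF",
    ("fontzero", "DOCX"): "PPPP",
    ("whitefont", "DOCX"): "PPPP",
    ("comment", "DOCX"): "FFFF",
    ("custom-xml", "DOCX"): "FFFF",
    ("annotation", "PDF"): "PPFP",
    ("invisible-text", "PDF"): "PPFP",
    ("javascript", "PDF"): "FFFF",
    ("xmp-metadata", "PDF"): "PFFF",
    ("zero-width", "TXT"): "PPFP",
    ("homoglyph", "TXT"): "PPPP",
    ("bidi-override", "TXT"): "PPPP",
    ("html-comment", "Markdown"): "PPFP",
}


def summarize(matrix):
    """Return {framework: (passes, fails, pass rate in percent)}."""
    total = len(matrix)
    summary = {}
    for index, name in enumerate(FRAMEWORKS):
        passed = sum(row[index] == "P" for row in matrix.values())
        summary[name] = (passed, total - passed, round(100 * passed / total, 1))
    return summary


summary = summarize(MATRIX)
for name, (passed, failed, rate) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: {passed} pass, {failed} fail, {rate}%")
```

Keeping the matrix in one machine-readable place like this makes it easy to re-derive the summary whenever a cell changes after a library upgrade.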
Key Takeaways
Universal Survivors
Four techniques survive all four frameworks:
- fontzero (DOCX) --- 1-point font runs are extracted as w:t text by all loaders.
- whitefont (DOCX) --- White-on-white text is also within w:t elements and passes through.
- homoglyph (TXT) --- Visually similar Unicode codepoints are real characters; no framework strips them.
- bidi-override (TXT) --- BiDi control characters alter display order but the underlying text remains.
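The two DOCX entries come down to extractors collecting w:t text without reading the run properties next to it. The sketch below shows that with the standard library only; the `DOCUMENT_XML` fragment is hand-written for illustration (it is not hemlock output), and `extract_text` is a hypothetical stand-in for what the loaders are described as doing.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written fragment: one normal run, one 1-point run (w:sz is in
# half-points, so val="2" means 1 pt), and one white-colored run.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t>Normal run.</w:t></w:r>
      <w:r><w:rPr><w:sz w:val="2"/></w:rPr><w:t>Tiny run.</w:t></w:r>
      <w:r><w:rPr><w:color w:val="FFFFFF"/></w:rPr><w:t>White run.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def extract_text(document_xml: str) -> str:
    """Join every w:t element's text, ignoring w:rPr run properties."""
    root = ET.fromstring(document_xml)
    return " ".join(node.text or "" for node in root.iter(f"{{{W}}}t"))


extracted = extract_text(DOCUMENT_XML)
```

All three runs end up in `extracted` because nothing in the traversal looks at `w:sz` or `w:color`. An ingestion pipeline that wants to drop such runs would have to inspect `w:rPr` explicitly before accepting the text.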
Framework-Specific Opportunities
- LangChain only: invisible-div (HTML), css-hide (HTML), metadata (DOCX), xmp-metadata (PDF)
- LangChain + Haystack: aria-hidden (HTML)
- LangChain + LlamaIndex + Haystack: annotation (PDF), invisible-text (PDF), zero-width (TXT), html-comment (Markdown)
Universally Stripped
Four techniques fail across all frameworks:
- comment (HTML) --- Every framework strips HTML comments.
- comment (DOCX) --- No framework extracts word/comments.xml.
- javascript (PDF) --- No framework executes or extracts PDF JavaScript.
- custom-xml (DOCX) --- No framework extracts custom XML parts.
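The groupings in this section can be derived mechanically from the survival matrix by bucketing each technique under the set of frameworks it survives. The compact `ROWS` text encoding and the `group_by_survivors` helper below are illustrative, not part of hemlock.

```python
from collections import defaultdict

FRAMEWORKS = ("LangChain", "LlamaIndex", "Unstructured", "Haystack")

# technique, format, and one P/F letter per framework in FRAMEWORKS order.
ROWS = """
comment HTML FFFF
invisible-div HTML PFFF
aria-hidden HTML PFFP
css-hide HTML PFFF
metadata DOCX PFFF
fontzero DOCX PPPP
whitefont DOCX PPPP
comment DOCX FFFF
custom-xml DOCX FFFF
annotation PDF PPFP
invisible-text PDF PPFP
javascript PDF FFFF
xmp-metadata PDF PFFF
zero-width TXT PPFP
homoglyph TXT PPPP
bidi-override TXT PPPP
html-comment Markdown PPFP
"""


def group_by_survivors(rows: str) -> dict:
    """Map each tuple of surviving frameworks to the techniques in that bucket."""
    groups = defaultdict(list)
    for line in rows.strip().splitlines():
        technique, fmt, flags = line.split()
        survivors = tuple(fw for fw, flag in zip(FRAMEWORKS, flags) if flag == "P")
        groups[survivors].append(f"{technique} ({fmt})")
    return dict(groups)


groups = group_by_survivors(ROWS)
for survivors, techniques in groups.items():
    label = " + ".join(survivors) if survivors else "none"
    print(f"{label}: {', '.join(techniques)}")
```

The bucket keyed by all four frameworks holds the universal survivors, and the bucket keyed by the empty tuple holds the universally stripped techniques.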
Recommendations for Operators
Prioritize Universal Survivors
When the target's RAG stack is unknown, prioritize DOCX fontzero/whitefont and TXT homoglyph/bidi-override. These four techniques survive regardless of which framework processes the documents. Markdown html-comment is a reasonable fallback: it survives every framework except Unstructured.
Maximize Coverage with Format Diversity
Generate documents across multiple formats using the same payload. A target knowledge base that ingests both DOCX and TXT files gives you two independent survival paths.
Validate Before Deployment
Always run hemlock validate against crafted documents before inserting them into a target knowledge base. Framework behavior can change between library versions, and validation confirms survival against the current simulation.
LangChain-Specific Engagements
If you have confirmed the target uses LangChain, the full range of 13 passing techniques is available. The metadata (DOCX) and xmp-metadata (PDF) techniques are particularly useful because they embed the payload outside the document body entirely.
Haystack-Specific Engagements
Haystack's HTML converter retains aria-hidden content, making Haystack the only framework besides LangChain where aria-hidden survives. It is also the only HTML technique that survives Haystack, since invisible-div and css-hide are stripped. PDF annotation and invisible-text also survive because PyPDFToDocument extracts annotation /Contents text alongside content stream text.
Next Steps
- Validation Engine Overview --- How the validation pipeline works
- Validate API Reference --- Programmatic validation in Go
- Techniques Reference --- Detailed documentation for each technique