
Framework Comparison

Each RAG framework extracts text from documents differently. These differences determine which hiding techniques survive extraction and which are stripped. This page documents the extraction behavior of each framework by format and provides the complete survival matrix for all technique-framework combinations in the original 17-technique set.


LangChain

LangChain's document loaders are the most permissive of the four frameworks. They prioritize extracting as much text content as possible, which means many hiding techniques survive.

Extraction Behavior by Format

Format | Loader | Behavior
HTML | BSHTMLLoader | Uses BeautifulSoup.get_text() to strip tags. Retains text from display:none elements. Strips HTML comments.
DOCX | Docx2txtLoader | Extracts w:t elements from word/document.xml via docx2txt. Also reads docProps/core.xml metadata (title, subject, description). Does not extract comments or custom XML parts.
PDF | PyPDFLoader | Basic text extraction from content streams. Includes annotations (/Contents fields). XMP metadata is extracted in some configurations but not reliably in all.
TXT | TextLoader | Raw content pass-through. No transformation or normalization.
Markdown | TextLoader | Raw content pass-through. HTML comments within Markdown are preserved.

LangChain Key Insight

LangChain's HTML loader strips comments but does not strip content from hidden elements. This makes invisible-div, css-hide, and aria-hidden effective techniques. For DOCX, the metadata extraction path makes the metadata technique viable.
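The difference between comments and hidden elements can be sketched with the standard library alone. The example below is not BSHTMLLoader or BeautifulSoup code. It is a minimal tag stripper (TagStripper and extract_text are hypothetical names) that behaves like any extractor that ignores CSS: text inside a display:none element is kept, while comments are dropped.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text nodes only; it never evaluates CSS or attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every text node, including text inside hidden elements.
        if data.strip():
            self.chunks.append(data.strip())

    # handle_comment keeps its default no-op behavior, so comments are dropped.


def extract_text(html: str) -> str:
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<p>Visible paragraph.</p>"
    '<div style="display:none">Hidden div text.</div>'
    "<!-- comment text -->"
)
print(extract_text(sample))
```

An extractor has to inspect style attributes or resolve stylesheets to remove the hidden div. Plain tag stripping does neither.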


LlamaIndex

LlamaIndex uses html2text for HTML conversion, which applies more aggressive stripping than LangChain's BeautifulSoup approach. DOCX and PDF extraction is similar to LangChain's but covers less metadata.

Extraction Behavior by Format

Format | Loader | Behavior
HTML | SimpleDirectoryReader | Uses html2text which strips comments and hidden elements (display:none, visibility:hidden). More aggressive than get_text().
DOCX | SimpleDirectoryReader | Extracts w:t text content from word/document.xml. Does not extract metadata from docProps/core.xml, comments, or custom XML.
PDF | SimpleDirectoryReader | Text extraction with annotation support. Generally includes /Contents annotation text alongside stream text.
TXT | SimpleDirectoryReader | Raw content pass-through.
Markdown | SimpleDirectoryReader | Raw content pass-through.

LlamaIndex Key Insight

The html2text library strips hidden elements, which makes invisible-div fail under LlamaIndex. css-hide uses class-based hiding with zero font size rather than inline display:none, but the validation results in the Survival Matrix record it as failing under LlamaIndex as well, so no HTML hiding technique survives this framework.


Unstructured.io

Unstructured applies the most aggressive sanitization. It is designed for production data pipelines where clean text extraction is the priority, which means it actively strips many of the hiding vectors that other frameworks miss.

Extraction Behavior by Format

Format | Loader | Behavior
HTML | partition_html | Strips comments, hidden elements, aria-hidden elements, and most CSS-hidden content. The most thorough HTML sanitizer of the four.
DOCX | partition_docx | Extracts only w:t text from word/document.xml. Strips comments, metadata, and custom XML entirely. Font-size-zero and white-font runs in w:t elements still pass through because the extractor does not evaluate run properties.
PDF | partition_pdf | Text extraction only. Strips annotations, metadata, and XMP data. Only content stream text survives.
TXT | partition_text | Raw text with Unicode normalization. Zero-width characters are stripped. Homoglyphs survive because they are ordinary visible codepoints, and BiDi override characters survive because the normalization does not remove them.
Markdown | partition_md | Converts Markdown to HTML and partitions the result as HTML, so HTML comments within Markdown are stripped (see html-comment in the Survival Matrix).

Unstructured Aggressiveness

Unstructured strips the widest range of hiding techniques. Only techniques that embed payloads within the visible text layer---fontzero, whitefont, homoglyph, and bidi-override---reliably survive.
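The TXT behavior described above can be illustrated with a small sketch. This is not Unstructured's code: the ZERO_WIDTH set and the strip_zero_width helper are assumptions chosen for illustration, and the real normalization may cover a different set of codepoints. The sketch shows why a filter aimed at zero-width characters leaves homoglyphs and BiDi overrides untouched.

```python
# Assumed set of zero-width codepoints; the real pipeline may differ.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> str:
    """Remove only the codepoints listed in ZERO_WIDTH."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


zero_width_sample = "re\u200bport"     # zero-width space inside a word
homoglyph_sample = "d\u0430ta"         # Cyrillic U+0430 instead of Latin "a"
bidi_sample = "\u202eabc\u202c"        # right-to-left override, then pop

for label, sample in [
    ("zero-width", zero_width_sample),
    ("homoglyph", homoglyph_sample),
    ("bidi-override", bidi_sample),
]:
    cleaned = strip_zero_width(sample)
    print(label, "changed" if cleaned != sample else "unchanged")
```

Catching the other two classes would take extra passes, such as a mixed-script check for homoglyphs and an explicit filter for the BiDi control range.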


Haystack

Haystack uses format-specific converters (HTMLToDocument, DocxToDocument, PyPDFToDocument, TextFileToDocument). Its HTML handling falls between LlamaIndex and Unstructured in aggressiveness: it strips comments and hidden elements but does not strip aria-hidden content.

Extraction Behavior by Format

Format | Converter | Behavior
HTML | HTMLToDocument | Uses trafilatura-style extraction. Strips HTML comments and elements with display:none / visibility:hidden. Does not strip aria-hidden content (unlike Unstructured).
DOCX | DocxToDocument | Extracts w:t text from word/document.xml via python-docx. Does not extract metadata, comments, or custom XML parts.
PDF | PyPDFToDocument | Text extraction via PyPDF. Includes annotation /Contents text alongside content stream text. Does not extract XMP metadata.
TXT | TextFileToDocument | Raw content pass-through. No transformation or normalization.
Markdown | TextFileToDocument | Raw content pass-through. HTML comments within Markdown are preserved.

Haystack Key Insight

Haystack occupies a middle ground: its HTML handling is more aggressive than LangChain (strips hidden elements) but less aggressive than Unstructured (retains aria-hidden). This keeps aria-hidden viable under Haystack even though LlamaIndex and Unstructured strip it. DOCX extraction matches LlamaIndex behavior, and PDF extraction matches LangChain except that XMP metadata is not extracted.


Survival Matrix

The following table shows whether each technique's payload survives extraction by each framework. Results are based on hemlock's validation engine, which simulates the extraction behavior documented above.

Technique | Format | LangChain | LlamaIndex | Unstructured | Haystack
comment | HTML | Fail | Fail | Fail | Fail
invisible-div | HTML | Pass | Fail | Fail | Fail
aria-hidden | HTML | Pass | Fail | Fail | Pass
css-hide | HTML | Pass | Fail | Fail | Fail
metadata | DOCX | Pass | Fail | Fail | Fail
fontzero | DOCX | Pass | Pass | Pass | Pass
whitefont | DOCX | Pass | Pass | Pass | Pass
comment | DOCX | Fail | Fail | Fail | Fail
custom-xml | DOCX | Fail | Fail | Fail | Fail
annotation | PDF | Pass | Pass | Fail | Pass
invisible-text | PDF | Pass | Pass | Fail | Pass
javascript | PDF | Fail | Fail | Fail | Fail
xmp-metadata | PDF | Pass | Fail | Fail | Fail
zero-width | TXT | Pass | Pass | Fail | Pass
homoglyph | TXT | Pass | Pass | Pass | Pass
bidi-override | TXT | Pass | Pass | Pass | Pass
html-comment | Markdown | Pass | Pass | Fail | Pass

Summary by Framework

Framework | Pass | Fail | Pass Rate
LangChain | 13 | 4 | 76.5%
Haystack | 9 | 8 | 52.9%
LlamaIndex | 8 | 9 | 47.1%
Unstructured | 4 | 13 | 23.5%

Key Takeaways

Universal Survivors

Four techniques survive all four frameworks:

  1. fontzero (DOCX) --- 1-point font runs are extracted as w:t text by all loaders.
  2. whitefont (DOCX) --- White-on-white text is also within w:t elements and passes through.
  3. homoglyph (TXT) --- Visually similar Unicode codepoints are real characters; no framework strips them.
  4. bidi-override (TXT) --- BiDi control characters alter display order but the underlying text remains.
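The two DOCX entries come down to extractors reading w:t elements without looking at the neighboring run properties. The sketch below assumes a simplified word/document.xml fragment and a hypothetical extract_runs helper. It is not taken from any of the four frameworks, but it shows the same pattern: w:sz (font size in half-points) and w:color are present in the XML and never consulted.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified stand-in for word/document.xml with three runs.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t>Normal run.</w:t></w:r>
      <w:r><w:rPr><w:sz w:val="2"/></w:rPr><w:t>Tiny run.</w:t></w:r>
      <w:r><w:rPr><w:color w:val="FFFFFF"/></w:rPr><w:t>White run.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def extract_runs(xml_text: str) -> list:
    root = ET.fromstring(xml_text)
    # Only w:t elements are read; the sibling w:rPr (size, color) is ignored.
    return [t.text for t in root.iter(f"{{{W}}}t")]


print(extract_runs(DOCUMENT_XML))
```

Filtering these runs would require reading each w:rPr and deciding on thresholds for size and color contrast. Plain text extraction has no reason to do that.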

Framework-Specific Opportunities

  • LangChain only: invisible-div (HTML), css-hide (HTML), metadata (DOCX), xmp-metadata (PDF)
  • LangChain + Haystack: aria-hidden (HTML)
  • LangChain + LlamaIndex + Haystack: annotation (PDF), invisible-text (PDF), zero-width (TXT), html-comment (Markdown)

Universally Stripped

Four techniques fail across all frameworks:

  1. comment (HTML) --- Every framework strips HTML comments.
  2. comment (DOCX) --- No framework extracts word/comments.xml.
  3. javascript (PDF) --- No framework executes or extracts PDF JavaScript.
  4. custom-xml (DOCX) --- No framework extracts custom XML parts.
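The summary counts and the groupings above can be re-derived from the matrix itself. The sketch below transcribes the 17 rows as P/F strings in the column order LangChain, LlamaIndex, Unstructured, Haystack. The helper names summarize and group_by_survivors are hypothetical and are not part of hemlock.

```python
FRAMEWORKS = ("LangChain", "LlamaIndex", "Unstructured", "Haystack")

# (technique, format, results in FRAMEWORKS order), copied from the matrix.
MATRIX = [
    ("comment", "HTML", "FFFF"),
    ("invisible-div", "HTML", "PFFF"),
    ("aria-hidden", "HTML", "PFFP"),
    ("css-hide", "HTML", "PFFF"),
    ("metadata", "DOCX", "PFFF"),
    ("fontzero", "DOCX", "PPPP"),
    ("whitefont", "DOCX", "PPPP"),
    ("comment", "DOCX", "FFFF"),
    ("custom-xml", "DOCX", "FFFF"),
    ("annotation", "PDF", "PPFP"),
    ("invisible-text", "PDF", "PPFP"),
    ("javascript", "PDF", "FFFF"),
    ("xmp-metadata", "PDF", "PFFF"),
    ("zero-width", "TXT", "PPFP"),
    ("homoglyph", "TXT", "PPPP"),
    ("bidi-override", "TXT", "PPPP"),
    ("html-comment", "Markdown", "PPFP"),
]


def summarize(matrix):
    """Return {framework: (passes, fails, pass rate in percent)}."""
    total = len(matrix)
    summary = {}
    for i, name in enumerate(FRAMEWORKS):
        passed = sum(1 for _, _, results in matrix if results[i] == "P")
        summary[name] = (passed, total - passed, round(100 * passed / total, 1))
    return summary


def group_by_survivors(matrix):
    """Group techniques by the exact set of frameworks they survive."""
    groups = {}
    for technique, fmt, results in matrix:
        names = tuple(n for n, r in zip(FRAMEWORKS, results) if r == "P")
        groups.setdefault(names, []).append(f"{technique} ({fmt})")
    return groups


for name, (passed, failed, rate) in summarize(MATRIX).items():
    print(f"{name}: {passed} pass, {failed} fail, {rate}%")

for names, techniques in group_by_survivors(MATRIX).items():
    print(", ".join(names) or "none", "->", "; ".join(techniques))
```

Keeping the matrix as data makes it easier to catch a prose claim that has drifted from the table, since any count or grouping can be recomputed rather than maintained by hand.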

Recommendations for Operators

Prioritize Universal Survivors

When the target's RAG stack is unknown, prioritize DOCX fontzero/whitefont and TXT homoglyph/bidi-override. These techniques survive regardless of which framework processes the documents. Markdown html-comment is a secondary option: it survives every framework except Unstructured.

Maximize Coverage with Format Diversity

Generate documents across multiple formats using the same payload. A target knowledge base that ingests both DOCX and TXT files gives you two independent survival paths.

Validate Before Deployment

Always run hemlock validate against crafted documents before inserting them into a target knowledge base. Framework behavior can change between library versions, and validation confirms survival against the current simulation.

LangChain-Specific Engagements

If you have confirmed the target uses LangChain, the full range of 13 passing techniques is available. The metadata (DOCX) and xmp-metadata (PDF) techniques are particularly useful because they embed the payload outside the document body entirely.

Haystack-Specific Engagements

Haystack's HTML converter retains aria-hidden content, making it the only framework besides LangChain where aria-hidden survives. It is the sole HTML-based option under Haystack, because comment, invisible-div, and css-hide are all stripped. PDF annotation and invisible-text also survive due to PyPDF's annotation inclusion.


Next Steps