HTML Techniques¶
hemlock provides ten hiding techniques for HTML documents. They range from trivial comment injection through CSS-class-based concealment and semantic markup abuse to adversarial chunking, phantom content generation, and visual camouflage. HTML is a common ingestion format for RAG pipelines that scrape web pages, crawl documentation sites, or load exported knowledge base articles.
Technique Overview¶
| Technique | Stealth | Description |
|---|---|---|
| comment | 30 | Hidden HTML comment containing payload |
| invisible-div | 55 | Invisible div with display:none and offscreen positioning |
| aria-hidden | 70 | aria-hidden="true" span with offscreen CSS |
| css-hide | 75 | Class-based font-size:0; color:transparent |
| microdata | 60 | Payload in schema.org microdata meta tag |
| chunk-boundary | 65 | Payload split across <p> tags at chunk-size intervals |
| offscreen | 80 | position:absolute;left:-9999px div |
| color-transparent | 85 | color:transparent; user-select:none text |
| noscript | 60 | Payload in <noscript> tag |
| camouflage | 80 | Text behind a 1×1 transparent image via CSS z-index |
comment¶
How It Works¶
The simplest HTML hiding technique. The payload is placed directly inside an HTML comment (<!-- ... -->) within the document body. Comments are part of the DOM but are not rendered by browsers.
hemlock generates a valid HTML5 document with the cover text in a <p> tag and the payload in a comment immediately after:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Document</title>
</head>
<body>
<p>Visible cover text content goes here.</p>
<!-- INJECTED PAYLOAD TEXT -->
</body>
</html>
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | BeautifulSoup.get_text() strips all comments before text extraction |
| LlamaIndex | | html2text removes comments during conversion |
| Unstructured | | Comments are stripped in the initial sanitization pass |
| Haystack | | Strips comments during HTML extraction |
Low effectiveness
HTML comments are stripped by every major RAG framework. This technique is useful only as a baseline for testing or when targeting custom pipelines with minimal sanitization.
CLI Example¶
Detection Notes¶
Detection
- Search for <!-- and --> patterns in raw HTML
- Any non-trivial text inside comments warrants inspection
- Trivial to detect with regex: <!--[\s\S]*?-->
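The regex check above can be sketched in a few lines of Python. This is a minimal illustration rather than part of hemlock: the function name and the min_words threshold are assumptions chosen for the example, and the threshold only stands in for "non-trivial text".

```python
import re

# Same pattern as the regex above, with a capture group around the comment body.
COMMENT_RE = re.compile(r"<!--([\s\S]*?)-->")


def find_suspicious_comments(raw_html: str, min_words: int = 3) -> list[str]:
    """Return comment bodies containing at least `min_words` words.

    `min_words` is a hypothetical threshold, not a hemlock setting.
    """
    bodies = (match.group(1).strip() for match in COMMENT_RE.finditer(raw_html))
    return [body for body in bodies if len(body.split()) >= min_words]


sample = "<p>Cover text.</p><!-- nav --><!-- INJECTED PAYLOAD TEXT -->"
print(find_suspicious_comments(sample))
```

Short structural comments such as the "nav" marker fall below the threshold, so only the longer comment body is reported.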
invisible-div¶
How It Works¶
The payload is placed inside a <div> element hidden through multiple redundant CSS properties applied as inline styles. The combination ensures the content is invisible in any browser rendering:
<div style="display:none;position:absolute;left:-9999px;height:0;width:0;overflow:hidden">
INJECTED PAYLOAD TEXT
</div>
The key insight is that many HTML-to-text converters strip tags without evaluating CSS. They walk the DOM, extract text nodes, and concatenate them, regardless of whether the parent element is visually hidden.
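To make the tag-stripping behaviour concrete, here is a small sketch built on Python's standard html.parser module. It is a stand-in for a CSS-unaware converter, not the code of any framework listed below; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collects every text node and ignores all styling, like a plain tag stripper."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.parts.append(text)


def extract_text(html: str) -> str:
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


doc = (
    "<p>Visible cover text content goes here.</p>"
    '<div style="display:none;position:absolute;left:-9999px">'
    "INJECTED PAYLOAD TEXT</div>"
)
print(extract_text(doc))
```

Because the extractor never looks at the style attribute, the hidden div's text ends up in the output next to the visible paragraph.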
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Strips tags but retains text from display:none elements |
| LlamaIndex | | html2text evaluates hidden elements and strips them |
| Unstructured | | Explicitly strips elements with display:none and visibility:hidden |
| Haystack | | Strips elements with display:none during extraction |
Why LangChain is vulnerable
LangChain's default HTML loader uses BeautifulSoup.get_text(), which extracts the text content of every element regardless of CSS styling. It performs no CSS evaluation, so display:none content passes through as if it were visible.
CLI Example¶
hemlock craft \
--format html \
--technique invisible-div \
--payload exfiltrate \
--topic "employee onboarding guide" \
--output ./output
Detection Notes¶
Detection
- Search for inline styles containing display:none, visibility:hidden, or offscreen positioning (left:-9999px)
- Inspect elements with height:0, width:0, or overflow:hidden
- Regex pattern: style="[^"]*display\s*:\s*none[^"]*"
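A scanner for these inline-style signals might look like the sketch below. The signal labels and the exact patterns are assumptions made for illustration; overflow:hidden is left out because it is common in legitimate layouts, and only double-quoted style attributes are handled.

```python
import re

# Patterns for the signals listed above; the dictionary keys are hypothetical labels.
HIDING_PATTERNS = {
    "display-none": re.compile(r"display\s*:\s*none", re.I),
    "visibility-hidden": re.compile(r"visibility\s*:\s*hidden", re.I),
    "offscreen": re.compile(r"(?<![\w-])left\s*:\s*-\d{3,}px", re.I),
    "zero-size": re.compile(r"(?<![\w-])(?:height|width)\s*:\s*0(?![.\d])", re.I),
}
STYLE_ATTR_RE = re.compile(r'style\s*=\s*"([^"]*)"', re.I)


def scan_inline_styles(raw_html: str) -> list[tuple[str, list[str]]]:
    """Return (style value, matched signal labels) for each suspicious style attribute."""
    findings = []
    for match in STYLE_ATTR_RE.finditer(raw_html):
        style = match.group(1)
        signals = [name for name, pat in HIDING_PATTERNS.items() if pat.search(style)]
        if signals:
            findings.append((style, signals))
    return findings


sample = (
    '<p style="color:#333">Cover text.</p>'
    '<div style="display:none;position:absolute;left:-9999px;'
    'height:0;width:0;overflow:hidden">INJECTED PAYLOAD TEXT</div>'
)
for style, signals in scan_inline_styles(sample):
    print(signals, "<-", style)
```

Reporting which signals fired, rather than a single yes/no, makes it easier to see when several redundant hiding properties are stacked on one element.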
aria-hidden¶
How It Works¶
The payload is embedded in a <span> element marked with aria-hidden="true" and positioned offscreen via CSS. The aria-hidden attribute signals assistive technologies to ignore the element, and the CSS ensures no visual rendering:
<span aria-hidden="true"
style="position:absolute;left:-9999px;width:1px;height:1px;overflow:hidden">
INJECTED PAYLOAD TEXT
</span>
This technique is more robust against simple sanitizers than invisible-div because some pipelines check for display:none specifically but do not evaluate aria-hidden or offscreen positioning patterns.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Does not evaluate aria-hidden attributes; extracts all text nodes |
| LlamaIndex | | Strips hidden elements including aria-hidden content |
| Unstructured | | Explicitly strips aria-hidden="true" elements |
| Haystack | | Does not strip aria-hidden content; retains text nodes |
CLI Example¶
hemlock craft \
--format html \
--technique aria-hidden \
--payload redirect \
--count 10 \
--output ./output
Detection Notes¶
Detection
- Search for aria-hidden="true" attributes containing non-accessibility-related text
- Flag any aria-hidden elements with substantial text content (legitimate uses typically hide decorative icons or duplicate labels)
- Offscreen positioning (left:-9999px) combined with tiny dimensions (1px) is a strong signal
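One way to flag aria-hidden elements with substantial text is sketched below using the standard html.parser module. The scanner class, the min_chars threshold, and the assumption that non-void tags are properly closed are all illustrative choices, not hemlock behaviour.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AriaHiddenScanner(HTMLParser):
    """Collects the text of aria-hidden="true" subtrees (assumes closed tags)."""

    def __init__(self, min_chars: int = 20) -> None:
        super().__init__()
        self.min_chars = min_chars   # hypothetical cutoff for "substantial" text
        self.depth = 0               # nesting depth inside an aria-hidden subtree
        self.buffer: list[str] = []
        self.findings: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("aria-hidden") == "true":
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            text = " ".join("".join(self.buffer).split())
            if len(text) >= self.min_chars:
                self.findings.append(text)

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def scan_aria_hidden(raw_html: str, min_chars: int = 20) -> list[str]:
    scanner = AriaHiddenScanner(min_chars)
    scanner.feed(raw_html)
    scanner.close()
    return scanner.findings


sample = (
    '<p>Cover text.</p><i aria-hidden="true">x</i>'
    '<span aria-hidden="true" style="position:absolute;left:-9999px">'
    "INJECTED PAYLOAD TEXT GOES HERE</span>"
)
print(scan_aria_hidden(sample))
```

The length cutoff is what separates a decorative icon from hidden prose, so it would need tuning against real pages.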
css-hide¶
How It Works¶
One of the higher-stealth HTML techniques (stealth score 75). Instead of inline styles, the hiding rules are defined in a <style> block in the document <head>. The payload lives in a <span> with a CSS class that zeroes out font size, makes the color transparent, collapses line height, and sets opacity to 0:
<head>
<style>.hemlock-x{font-size:0;color:transparent;line-height:0;opacity:0;}</style>
</head>
<body>
<p>Visible cover text content goes here.</p>
<span class="hemlock-x">INJECTED PAYLOAD TEXT</span>
</body>
This approach survives sanitizers that strip inline style attributes but do not evaluate <style> blocks against element classes. The text content remains in the DOM as a normal text node.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Strips tags but does not evaluate CSS classes; text content passes through |
| LlamaIndex | | More aggressive stripping catches class-hidden content |
| Unstructured | | Strips elements identified as hidden through CSS analysis |
| Haystack | | Evaluates CSS and strips class-hidden content |
Why this technique evades simple filters
Many HTML sanitizers focus on inline style attributes or specific elements like <script> and <style>. The css-hide technique separates the hiding mechanism (CSS class in <style>) from the payload container (plain <span> with a class), making it harder for simple attribute-based filters to catch.
CLI Example¶
Detection Notes¶
Detection
- Parse <style> blocks and identify classes that set font-size:0, color:transparent, opacity:0, or line-height:0
- Cross-reference those classes against elements containing text content
- A static analysis tool that resolves CSS selectors against the DOM is required for reliable detection
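The parse-then-cross-reference steps above can be approximated with the standard library alone. The sketch below is deliberately limited: it resolves only simple single-class rules such as .name{...}, assumes non-void tags are closed, and uses hypothetical function and class names. A real selector engine would be needed for reliable coverage.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
STYLE_BLOCK_RE = re.compile(r"<style[^>]*>(.*?)</style>", re.I | re.S)
# Assumption: only simple class rules like ".name{...}" are resolved.
CLASS_RULE_RE = re.compile(r"\.([\w-]+)\s*\{([^}]*)\}")
# Zero font-size, opacity, or line-height, or a transparent text color.
HIDING_DECL_RE = re.compile(
    r"(?<![\w-])(?:(?:font-size|opacity|line-height)\s*:\s*0(?![.\d])"
    r"|color\s*:\s*transparent)",
    re.I,
)


def hiding_classes(raw_html: str) -> set[str]:
    """Return class names whose <style> rule contains a hiding declaration."""
    found = set()
    for block in STYLE_BLOCK_RE.findall(raw_html):
        for name, body in CLASS_RULE_RE.findall(block):
            if HIDING_DECL_RE.search(body):
                found.add(name)
    return found


class HiddenClassText(HTMLParser):
    """Collects text nodes located inside an element carrying a flagged class."""

    def __init__(self, flagged: set[str]) -> None:
        super().__init__()
        self.flagged = flagged
        self.stack: list[bool] = []   # per open element: inside a flagged subtree?
        self.findings: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        own = set((dict(attrs).get("class") or "").split())
        inherited = bool(self.stack) and self.stack[-1]
        self.stack.append(inherited or bool(own & self.flagged))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] and data.strip():
            self.findings.append(data.strip())


def find_class_hidden_text(raw_html: str) -> list[str]:
    scanner = HiddenClassText(hiding_classes(raw_html))
    scanner.feed(raw_html)
    scanner.close()
    return scanner.findings


doc = """<html><head>
<style>.hemlock-x{font-size:0;color:transparent;line-height:0;opacity:0;}</style>
</head><body>
<p>Visible cover text content goes here.</p>
<span class="hemlock-x">INJECTED PAYLOAD TEXT</span>
</body></html>"""
print(find_class_hidden_text(doc))
```

Splitting the work into a CSS pass and a DOM pass mirrors the two-step procedure in the list: first find the classes that hide, then find the text that wears them.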
microdata¶
How It Works¶
Schema.org microdata is a standardized way to annotate HTML content with machine-readable metadata. The payload is embedded in a <meta> tag's content attribute inside a <div> marked with itemscope and itemtype attributes pointing to a schema.org type. The <meta> tag uses itemprop="description" to carry the payload:
<div itemscope itemtype="https://schema.org/Article">
<meta itemprop="description" content="INJECTED PAYLOAD TEXT">
<p>Visible cover text content goes here.</p>
</div>
The microdata structure is semantically valid HTML and appears as legitimate structured data markup. Search engines and web crawlers expect to see schema.org annotations, making this technique blend in with standard web development practices. The <meta> tag with itemprop is not rendered visually by browsers.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | BeautifulSoup.get_text() extracts meta content attributes in some configurations |
| LlamaIndex | | Meta tag content may be included during HTML-to-text conversion |
| Haystack | | Raw HTML processing includes meta tag content |
| Unstructured | | Strips meta tags during initial HTML sanitization |
Semantic camouflage
Unlike other HTML hiding techniques that rely on CSS visibility tricks, microdata uses a standards-compliant annotation mechanism. Sanitizers that focus on display:none, aria-hidden, or inline styles will miss microdata payloads entirely because the hiding mechanism is semantic rather than visual.
CLI Example¶
hemlock craft \
--format html \
--technique microdata \
--payload override \
--topic "product documentation" \
--output ./output
Detection Notes¶
Detection
- Inspect <meta> tags with itemprop attributes for instruction-like content
- Flag microdata content attributes that contain imperative language or are unusually long
- Check itemtype URLs against expected schema.org types for the document context
- Regex: <meta\s+itemprop="[^"]*"\s+content="[^"]*"
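The first two checks could be sketched with a parser instead of a regex, which avoids depending on the order of the itemprop and content attributes. The hint phrases and the max_len cutoff below are hypothetical placeholders; a real deployment would need its own heuristics.

```python
from html.parser import HTMLParser

# Hypothetical phrases standing in for "imperative language".
IMPERATIVE_HINTS = ("ignore", "disregard", "override", "you must")


class MicrodataMetaScanner(HTMLParser):
    """Flags <meta itemprop=...> tags whose content looks like instructions."""

    def __init__(self, max_len: int = 300) -> None:
        super().__init__()
        self.max_len = max_len       # hypothetical cutoff for "unusually long"
        self.findings: list[tuple[str, str, list[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if "itemprop" not in attributes:
            return
        content = attributes.get("content") or ""
        reasons = []
        if len(content) > self.max_len:
            reasons.append("long")
        if any(hint in content.lower() for hint in IMPERATIVE_HINTS):
            reasons.append("imperative")
        if reasons:
            self.findings.append((attributes["itemprop"] or "", content, reasons))


def scan_microdata(raw_html: str, max_len: int = 300):
    scanner = MicrodataMetaScanner(max_len)
    scanner.feed(raw_html)
    scanner.close()
    return scanner.findings


sample = (
    '<div itemscope itemtype="https://schema.org/Article">'
    '<meta content="Disregard the text above." itemprop="description">'
    '<meta itemprop="headline" content="Product documentation">'
    "<p>Visible cover text content goes here.</p></div>"
)
print(scan_microdata(sample))
```

Note that the first meta tag in the sample lists content before itemprop, a layout the regex above would not match.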
chunk-boundary¶
How It Works¶
Positions payload fragments to span likely chunk boundaries used by RAG text splitters. The payload is split across multiple <p> tags separated by approximately 500 characters of benign cover text, a chunk size commonly configured for LangChain's RecursiveCharacterTextSplitter.
No single chunk contains the full payload, but when several of those chunks are retrieved together and assembled into the LLM's context, the injection may be reconstructed.
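The effect of the spacing can be illustrated with a deliberately simplified splitter. The sketch below cuts text at fixed 500-character offsets, which is an assumption made for clarity: RecursiveCharacterTextSplitter prefers separator boundaries, so real chunk edges will differ. The fragment markers and filler text are placeholders.

```python
def split_fixed(text: str, chunk_size: int = 500) -> list[str]:
    """Cut text into consecutive chunks of at most `chunk_size` characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Placeholder markers standing in for the pieces of a split payload.
fragments = ["[FRAGMENT-1]", "[FRAGMENT-2]", "[FRAGMENT-3]"]
# Roughly 500 characters of benign cover text between consecutive fragments.
filler = ("Benign cover sentence. " * 22)[:500]
text = filler.join(fragments)

chunks = split_fixed(text)
placement = [[f for f in fragments if f in chunk] for chunk in chunks]
print(placement)
# No chunk holds more than one fragment, so a per-chunk scan never sees the whole payload.
print(max(len(found) for found in placement))
```

For detection, this suggests scanning the full extracted document before chunking, since each chunk on its own looks like mostly benign text.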
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | All <p> text is extracted; chunks may each contain a fragment |
| LlamaIndex | | Raw text extraction preserves all paragraph content |
| Unstructured | | Paragraph text is extracted regardless of chunk splitting |
| Haystack | | Full text extraction preserves fragment positions |
CLI Example¶
hemlock craft --format html --technique chunk-boundary --payload override --count 1 --output ./cb-test
offscreen¶
How It Works¶
Similar to invisible-div but uses position:absolute;left:-9999px positioning instead of display:none. This technique exploits the fact that some sanitizers check for display:none but not offscreen positioning. The payload remains in the DOM and is extracted by text parsers, but is positioned far outside the visible viewport.
As a redundant hiding layer, it also applies font-size:0; overflow:hidden; height:0; width:0; opacity:0.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | BeautifulSoup.get_text() reads all DOM text regardless of CSS positioning |
| LlamaIndex | | html2text uses a stricter rendering model |
| Unstructured | | Strips hidden elements by inspecting style attributes |
| Haystack | | Basic text extraction does not evaluate CSS positioning |
CLI Example¶
color-transparent¶
How It Works¶
Uses color:transparent; user-select:none to make payload text invisible. Unlike css-hide, which uses class-based styling, this technique uses inline styles. The text still occupies layout space, but it is painted in a fully transparent color and cannot be selected, so it stays invisible to a reader while remaining in the DOM for extraction.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Text extraction ignores color and select properties |
| LlamaIndex | | Stricter HTML to text conversion |
| Unstructured | | Style-aware sanitization removes transparent text |
| Haystack | | Basic extraction does not filter by text color |
CLI Example¶
hemlock craft --format html --technique color-transparent --payload override --count 1 --output ./ct-test
noscript¶
How It Works¶
Places the payload inside a <noscript> tag. In JavaScript-enabled browsers, <noscript> content is not rendered. However, RAG document loaders typically parse HTML without JavaScript execution, so the <noscript> content is extracted as regular text.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | BeautifulSoup.get_text() extracts <noscript> content |
| LlamaIndex | | Text conversion includes <noscript> blocks |
| Unstructured | | Strips <noscript> elements during sanitization |
| Haystack | | Basic text extraction includes all element text |
CLI Example¶
camouflage¶
How It Works¶
Positions the payload text behind a small inline image using CSS z-index layering. The payload sits in a <span> at z-index:-1 behind a transparent PNG at z-index:1 (a 1×1 image that the markup stretches to 200×40 through its width and height attributes), making it invisible to users but present in the DOM text flow where RAG extractors read it.
<span class="hemlock-camo">
<img src="data:image/png;base64,..." alt="" width="200" height="40">
<span class="hemlock-camo-text">INJECTED PAYLOAD TEXT</span>
</span>
The payload <span> also uses font-size:0; line-height:0; color:transparent; overflow:hidden for additional concealment.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | BeautifulSoup.get_text() extracts all text including hidden spans |
| LlamaIndex | | Text conversion includes <span> text content |
| Unstructured | | Tag stripping retains text content from all elements |
| Haystack | | Basic text extraction includes all element text |
CLI Example¶
hemlock craft --format html --technique camouflage --payload override --count 1 --output ./camo-test
Survival Matrix¶
| Technique | Stealth | LangChain | LlamaIndex | Haystack | Unstructured |
|---|---|---|---|---|---|
| comment | 30 | | | | |
| invisible-div | 55 | | | | |
| aria-hidden | 70 | | | | |
| css-hide | 75 | | | | |
| microdata | 60 | | | | |
| chunk-boundary | 65 | | | | |
| offscreen | 80 | | | | |
| color-transparent | 85 | | | | |
| noscript | 60 | | | | |
| camouflage | 80 | | | | |
New technique categories
The chunk-boundary technique exploits how text splitters divide documents before embedding. The offscreen and color-transparent techniques generate "phantom chunks" that are extracted but invisible. The noscript technique exploits the gap between browser rendering and parser extraction. The camouflage technique uses CSS z-index layering to hide text behind an image. It survives all four frameworks because tag stripping preserves DOM text content.