Techniques Overview

hemlock ships 63 hiding techniques spread across 11 document formats, each designed to embed prompt injection payloads in locations that survive RAG pipeline text extraction. Techniques range from trivial HTML comments (stealth 30) to LSB steganography in PNG images (stealth 90), and their effectiveness varies significantly depending on which framework ingests the document.

This section provides a technical reference for every technique: how it works at the byte level, which extraction pipelines it defeats, and what defenders should look for.

Master Technique Matrix

The table below summarizes every technique, its stealth score, and whether the hidden payload survives extraction by each of the four major RAG frameworks.

| Format | Technique | Stealth | Description | LangChain | LlamaIndex | Unstructured | Haystack |
|---|---|---|---|---|---|---|---|
| HTML | comment | 30 | Hidden HTML comment | | | | |
| HTML | invisible-div | 55 | display:none div with offscreen positioning | | | | |
| HTML | aria-hidden | 70 | aria-hidden="true" span with offscreen CSS | | | | |
| HTML | css-hide | 75 | Class-based font-size:0; color:transparent | | | | |
| HTML | microdata | 60 | Payload in schema.org microdata meta tag | | | | |
| HTML | chunk-boundary | 65 | Payload split across <p> tags at chunk-size intervals | | | | |
| HTML | offscreen | 80 | position:absolute;left:-9999px div | | | | |
| HTML | color-transparent | 85 | color:transparent; user-select:none text | | | | |
| HTML | noscript | 60 | Payload in <noscript> tag | | | | |
| DOCX | metadata | 60 | Payload in docProps/core.xml Dublin Core fields | | | | |
| DOCX | metadata-distributed | 70 | Payload split across subject, description, keywords, category | | | | |
| DOCX | fontzero | 80 | 1pt font w:r run in document body | | | | |
| DOCX | whitefont | 70 | White text on white background | | | | |
| DOCX | comment | 50 | Word comment in comments.xml | | | | |
| DOCX | custom-xml | 65 | Custom XML data part in the ZIP archive | | | | |
| DOCX | chunk-boundary | 60 | Payload in separate <w:p> elements with filler between | | | | |
| DOCX | hidden-paragraph | 75 | <w:vanish/> paragraph (Word's hidden text flag) | | | | |
| PDF | annotation | 65 | Near-invisible text annotation | | | | |
| PDF | invisible-text | 75 | 0.5pt white text in content stream | | | | |
| PDF | javascript | 40 | PDF JavaScript action containing payload | | | | |
| PDF | xmp-metadata | 60 | XMP metadata stream (Subject, Keywords, Author) | | | | |
| PDF | xmp-distributed | 70 | Payload split across multiple XMP properties | | | | |
| PDF | chunk-boundary | 55 | Payload on separate pages | | | | |
| PDF | offpage | 70 | Text placed at coordinates beyond page dimensions | | | | |
| TXT | zero-width | 85 | Zero-width Unicode character encoding | | | | |
| TXT | homoglyph | 80 | Cyrillic lookalike substitution with hidden payload | | | | |
| TXT | bidi-override | 70 | RTL override characters to hide payload direction | | | | |
| TXT | chunk-boundary | 45 | Payload fragments separated by ~512 chars of benign text | | | | |
| Markdown | html-comment | 35 | HTML comments in .md files | | | | |
| Markdown | frontmatter | 55 | Payload in YAML front matter metadata | | | | |
| Markdown | link-title | 65 | Payload distributed across link title attributes | | | | |
| Markdown | image-alt | 60 | Payload distributed across image alt text | | | | |
| Markdown | chunk-boundary | 50 | Payload fragments in separate heading sections | | | | |
| RTF | metadata | 55 | Payload in RTF \info block properties | | | | |
| RTF | fontzero | 75 | Zero-point font group invisible to readers | | | | |
| RTF | comment | 40 | Payload in RTF \*\annotation group | | | | |
| EPUB | metadata | 60 | Payload in OPF metadata (dc:description) | | | | |
| EPUB | metadata-distributed | 65 | Payload split across multiple OPF <meta> tags | | | | |
| EPUB | css-hide | 70 | CSS-hidden span with zero font size | | | | |
| EPUB | comment | 35 | XHTML comment in chapter body | | | | |
| EPUB | aria-hidden | 65 | aria-hidden="true" span in XHTML content | | | | |
| EPUB | toc | 55 | Payload hidden in EPUB table-of-contents <navPoint> labels | | | | |
| CSV | extra-column | 45 | Payload in extra _metadata column | | | | |
| CSV | bom-prefix | 50 | Payload after UTF-8 BOM in Notes field | | | | |
| CSV | formula-injection | 60 | =CONCATENATE() formula reconstructing payload from cell references | | | | |
| JSON | metadata-key | 55 | Payload in _metadata.description JSON key | | | | |
| JSON | unicode-escape | 70 | Payload encoded as \uXXXX sequences | | | | |
| XLSX | hidden-sheet | 75 | Payload on a hidden worksheet | | | | |
| XLSX | metadata | 60 | Payload in docProps/core.xml properties | | | | |
| XLSX | comment | 50 | Payload in cell comment | | | | |
| XLSX | fontzero | 80 | Zero-point font cell invisible to readers | | | | |
| Image | text-chunk | 55 | Payload in PNG tEXt Description chunk | | | | |
| Image | xmp-metadata | 60 | Payload in XMP dc:description via PNG iTXt chunk | | | | |
| Image | multi-chunk | 65 | Payload distributed across Title, Author, Description, Comment | | | | |
| Image | steganographic | 90 | Payload encoded in LSB of pixel RGB channels | | | | |

Legend: Survives   Stripped   Uncertain / pipeline-dependent
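
On the defender side, the TXT zero-width and bidi-override rows both rely on Unicode format characters that render as nothing. The sketch below is one possible pre-ingestion check, not part of hemlock: the helper name `find_format_chars` is hypothetical, and it assumes that flagging every character in Unicode category Cf is an acceptable first pass. It will also flag benign characters such as the soft hyphen, and it does not detect homoglyph substitution.

```python
import unicodedata


def find_format_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, codepoint, name) for each Unicode format character (category Cf).

    Category Cf covers zero-width characters (U+200B, U+200C, U+200D, U+2060, U+FEFF)
    and bidirectional controls such as U+202E RIGHT-TO-LEFT OVERRIDE.
    """
    hits = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            hits.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits


# Benign sample text containing one zero-width space and one RTL override.
sample = "Quarterly\u200b report\u202e for review"
for index, codepoint, name in find_format_chars(sample):
    print(index, codepoint, name)
```

A scan like this only reports positions; whether to strip, quarantine, or reject the document is a policy decision for the ingestion pipeline.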

Understanding Stealth Scores

Stealth scores range from 0 to 100 and reflect how difficult a technique is to detect through casual inspection or automated scanning:

| Range | Rating | Meaning |
|---|---|---|
| 0-39 | Low | Easily detected by basic sanitizers or manual review |
| 40-59 | Moderate | Requires format-aware inspection to detect |
| 60-79 | High | Survives most casual review; needs targeted detection tooling |
| 80-100 | Very High | Extremely difficult to detect without specialized analysis |

Scores are assigned based on visual invisibility, resistance to automated stripping, and the likelihood that a human reviewer examining the raw file would notice the payload.
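
The rating bands can be expressed as a small lookup. This is an illustrative sketch only: `stealth_rating` is a hypothetical helper, not a hemlock API, and it assumes scores are integers.

```python
def stealth_rating(score: int) -> str:
    """Map a 0-100 stealth score to its rating band."""
    if not 0 <= score <= 100:
        raise ValueError(f"stealth score must be between 0 and 100, got {score}")
    if score < 40:
        return "Low"
    if score < 60:
        return "Moderate"
    if score < 80:
        return "High"
    return "Very High"


# Scores taken from the technique matrix: HTML comment and Image steganographic.
print(stealth_rating(30))
print(stealth_rating(90))
```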

Stealth is not the same as survival

A technique can have a high stealth score (hard to detect) yet low survival (stripped by most frameworks). The zero-width technique scores 85 for stealth because its characters render as nothing in most text editors, yet Unstructured strips them during Unicode normalization. Conversely, fontzero in DOCX survives extraction by all four frameworks because every loader reads w:t elements regardless of font size.
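
To make the DOCX case concrete, the sketch below reads `word/document.xml` the way a size-agnostic extractor might, then shows how a defender could flag tiny runs instead. It uses only the Python standard library and is not the code of any of the four frameworks. The helper names and the 2pt threshold are assumptions for illustration, and the demo archive holds only `word/document.xml`, so it is a parsing fixture rather than a complete DOCX file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def extract_runs(docx_bytes: bytes) -> list[tuple[str, Optional[float]]]:
    """Return (text, font size in points or None) for each w:r run."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    runs = []
    for run in root.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        size = run.find("w:rPr/w:sz", NS)
        value = size.get(f"{{{W}}}val") if size is not None else None
        # w:sz stores the size in half-points, so w:val="2" means 1pt.
        points = int(value) / 2 if value is not None else None
        runs.append((text, points))
    return runs


def naive_text(runs: list[tuple[str, Optional[float]]]) -> str:
    """Join all run text, ignoring formatting, as a size-agnostic loader would."""
    return " ".join(text for text, _ in runs)


def suspicious_runs(runs: list[tuple[str, Optional[float]]], max_points: float = 2.0) -> list[str]:
    """Return text of runs whose explicit font size is at or below max_points (assumed threshold)."""
    return [text for text, points in runs if points is not None and points <= max_points]


def make_demo_docx() -> bytes:
    """Build an in-memory archive with one normal run and one 1pt run (benign placeholder text)."""
    body = (
        f'<w:document xmlns:w="{W}"><w:body><w:p>'
        "<w:r><w:t>Visible sentence.</w:t></w:r>"
        '<w:r><w:rPr><w:sz w:val="2"/></w:rPr><w:t>tiny-run placeholder</w:t></w:r>'
        "</w:p></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
    return buf.getvalue()


runs = extract_runs(make_demo_docx())
print(naive_text(runs))       # the 1pt run appears alongside the visible text
print(suspicious_runs(runs))  # only the 1pt run is flagged
```

The check only sees sizes set directly on a run; sizes inherited from paragraph or character styles in `styles.xml` would need additional resolution.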

Format Deep Dives

Each format page covers the full technical detail for its techniques, including generated markup, framework-specific extraction behavior, CLI examples, and detection guidance.