Techniques Overview
hemlock ships 55 hiding techniques across 11 document formats, each designed to embed prompt injection payloads in locations that survive RAG pipeline text extraction. Techniques range from trivial HTML comments (stealth 30) to LSB steganography in PNG images (stealth 90), and whether a payload survives depends heavily on which framework ingests the document.
This section provides a technical reference for every technique: how it works at the byte level, which extraction pipelines it defeats, and what defenders should look for.
Master Technique Matrix
The table below summarizes every technique, its stealth score, and whether the hidden payload survives extraction by each of the four major RAG frameworks.
| Format | Technique | Stealth | Description | LangChain | LlamaIndex | Unstructured | Haystack |
|---|---|---|---|---|---|---|---|
| HTML | comment | 30 | Hidden HTML comment | | | | |
| HTML | invisible-div | 55 | `display:none` div with offscreen positioning | | | | |
| HTML | aria-hidden | 70 | `aria-hidden="true"` span with offscreen CSS | | | | |
| HTML | css-hide | 75 | Class-based `font-size:0; color:transparent` | | | | |
| HTML | microdata | 60 | Payload in schema.org microdata meta tag | | | | |
| HTML | chunk-boundary | 65 | Payload split across `<p>` tags at chunk-size intervals | | | | |
| HTML | offscreen | 80 | `position:absolute;left:-9999px` div | | | | |
| HTML | color-transparent | 85 | `color:transparent; user-select:none` text | | | | |
| HTML | noscript | 60 | Payload in `<noscript>` tag | | | | |
| DOCX | metadata | 60 | Payload in `docProps/core.xml` Dublin Core fields | | | | |
| DOCX | metadata-distributed | 70 | Payload split across subject, description, keywords, category | | | | |
| DOCX | fontzero | 80 | 1pt font `w:r` run in document body | | | | |
| DOCX | whitefont | 70 | White text on white background | | | | |
| DOCX | comment | 50 | Word comment in `comments.xml` | | | | |
| DOCX | custom-xml | 65 | Custom XML data part in the ZIP archive | | | | |
| DOCX | chunk-boundary | 60 | Payload in separate `<w:p>` elements with filler between | | | | |
| DOCX | hidden-paragraph | 75 | `<w:vanish/>` paragraph (Word's hidden text flag) | | | | |
| PDF | annotation | 65 | Near-invisible text annotation | | | | |
| PDF | invisible-text | 75 | 0.5pt white text in content stream | | | | |
| PDF | javascript | 40 | PDF JavaScript action containing payload | | | | |
| PDF | xmp-metadata | 60 | XMP metadata stream (Subject, Keywords, Author) | | | | |
| PDF | xmp-distributed | 70 | Payload split across multiple XMP properties | | | | |
| PDF | chunk-boundary | 55 | Payload on separate pages | | | | |
| PDF | offpage | 70 | Text placed at coordinates beyond page dimensions | | | | |
| TXT | zero-width | 85 | Zero-width Unicode character encoding | | | | |
| TXT | homoglyph | 80 | Cyrillic lookalike substitution with hidden payload | | | | |
| TXT | bidi-override | 70 | RTL override characters to hide payload direction | | | | |
| TXT | chunk-boundary | 45 | Payload fragments separated by ~512 chars of benign text | | | | |
| Markdown | html-comment | 35 | HTML comments in `.md` files | | | | |
| Markdown | frontmatter | 55 | Payload in YAML front matter metadata | | | | |
| Markdown | link-title | 65 | Payload distributed across link title attributes | | | | |
| Markdown | image-alt | 60 | Payload distributed across image alt text | | | | |
| Markdown | chunk-boundary | 50 | Payload fragments in separate heading sections | | | | |
| RTF | metadata | 55 | Payload in RTF `\info` block properties | | | | |
| RTF | fontzero | 75 | Zero-point font group invisible to readers | | | | |
| RTF | comment | 40 | Payload in RTF `\*\annotation` group | | | | |
| EPUB | metadata | 60 | Payload in OPF metadata (`dc:description`) | | | | |
| EPUB | metadata-distributed | 65 | Payload split across multiple OPF `<meta>` tags | | | | |
| EPUB | css-hide | 70 | CSS-hidden span with zero font size | | | | |
| EPUB | comment | 35 | XHTML comment in chapter body | | | | |
| EPUB | aria-hidden | 65 | `aria-hidden="true"` span in XHTML content | | | | |
| EPUB | toc | 55 | Payload hidden in EPUB table-of-contents `<navPoint>` labels | | | | |
| CSV | extra-column | 45 | Payload in extra `_metadata` column | | | | |
| CSV | bom-prefix | 50 | Payload after UTF-8 BOM in Notes field | | | | |
| CSV | formula-injection | 60 | `=CONCATENATE()` formula reconstructing payload from cell references | | | | |
| JSON | metadata-key | 55 | Payload in `_metadata.description` JSON key | | | | |
| JSON | unicode-escape | 70 | Payload encoded as `\uXXXX` sequences | | | | |
| XLSX | hidden-sheet | 75 | Payload on a hidden worksheet | | | | |
| XLSX | metadata | 60 | Payload in `docProps/core.xml` properties | | | | |
| XLSX | comment | 50 | Payload in cell comment | | | | |
| XLSX | fontzero | 80 | Zero-point font cell invisible to readers | | | | |
| Image | text-chunk | 55 | Payload in PNG `tEXt` Description chunk | | | | |
| Image | xmp-metadata | 60 | Payload in XMP `dc:description` via PNG `iTXt` chunk | | | | |
| Image | multi-chunk | 65 | Payload distributed across Title, Author, Description, Comment | | | | |
| Image | steganographic | 90 | Payload encoded in LSB of pixel RGB channels | | | | |
Legend: Survives, Stripped, Uncertain / pipeline-dependent
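On the defensive side, the zero-width and bidi-override rows in the TXT group both depend on Unicode format characters that render as nothing. A minimal sketch of a pre-ingestion check is shown here, assuming the text has already been decoded to a Python `str`. The function name `find_invisible_chars` and the `kind` labels are illustrative and are not part of hemlock.

```python
import unicodedata

# Bidirectional embedding, override, and isolate controls.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def find_invisible_chars(text):
    """Return (index, codepoint, kind) for each Unicode format character."""
    findings = []
    for index, ch in enumerate(text):
        # General category "Cf" covers zero-width spaces and joiners,
        # byte order marks, and bidirectional controls.
        if unicodedata.category(ch) != "Cf":
            continue
        kind = "bidi-control" if ch in BIDI_CONTROLS else "zero-width/format"
        findings.append((index, f"U+{ord(ch):04X}", kind))
    return findings


sample = "Quarterly\u200b report\u202e for review"
for finding in find_invisible_chars(sample):
    print(finding)
```

Format characters also occur in legitimate text (soft hyphens, joiners in emoji sequences and some scripts), so a hit is a reason to inspect the file rather than proof of an injected payload. This check does not cover the homoglyph technique, which uses visible lookalike letters.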
Understanding Stealth Scores
Stealth scores range from 0 to 100 and reflect how difficult a technique is to detect through casual inspection or automated scanning:
| Range | Rating | Meaning |
|---|---|---|
| 0--39 | Low | Easily detected by basic sanitizers or manual review |
| 40--59 | Moderate | Requires format-aware inspection to detect |
| 60--79 | High | Survives most casual review; needs targeted detection tooling |
| 80--100 | Very High | Extremely difficult to detect without specialized analysis |
Scores are assigned based on visual invisibility, resistance to automated stripping, and the likelihood that a human reviewer examining the raw file would notice the payload.
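The rating bands above can be expressed as a small lookup, which is handy when sorting or filtering techniques by rating. This is a sketch only: `stealth_rating` is a hypothetical helper name and is not a documented hemlock function.

```python
def stealth_rating(score):
    """Map a 0-100 stealth score to the rating band from the table above."""
    if not 0 <= score <= 100:
        raise ValueError(f"stealth score out of range: {score}")
    if score < 40:
        return "Low"
    if score < 60:
        return "Moderate"
    if score < 80:
        return "High"
    return "Very High"


# Scores taken from the technique matrix.
for name, score in [("HTML comment", 30), ("CSV extra-column", 45),
                    ("DOCX metadata", 60), ("Image steganographic", 90)]:
    print(f"{name}: {score} ({stealth_rating(score)})")
```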
Stealth is not the same as survival
A technique can have a high stealth score (hard to detect) but low survival (stripped by most frameworks). The zero-width technique scores 85 for stealth because it is invisible in any text editor, yet Unstructured strips it during Unicode normalization. Conversely, fontzero in DOCX survives all four frameworks because every loader extracts w:t elements regardless of font size.
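Because loaders read `w:t` text without looking at run formatting, a defender has to inspect that formatting directly. The following sketch uses only the Python standard library and flags runs whose direct properties set a very small font or the `w:vanish` flag. The function name `find_suspicious_runs` and the 2pt default threshold are assumptions made for illustration, and the check ignores formatting inherited from styles.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
VAL = f"{{{W_NS}}}val"


def find_suspicious_runs(docx_path, max_half_points=4):
    """Return (reason, text) for runs that are hidden or set in a tiny font.

    w:sz values are in half-points, so the default threshold of 4 means 2pt.
    Only direct run properties are checked, not style inheritance.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    findings = []
    for run in root.iter(f"{{{W_NS}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        props = run.find("w:rPr", NS)
        if not text or props is None:
            continue
        vanish = props.find("w:vanish", NS)
        # <w:vanish/> with no value means "on"; explicit false values disable it.
        if vanish is not None and vanish.get(VAL) not in ("0", "false", "off"):
            findings.append(("vanish", text))
            continue
        size = props.find("w:sz", NS)
        value = size.get(VAL) if size is not None else None
        if value is not None and value.isdigit() and int(value) <= max_half_points:
            findings.append(("tiny-font", text))
    return findings


if __name__ == "__main__":
    # Usage: python scan_docx.py file1.docx file2.docx
    for path in sys.argv[1:]:
        for reason, text in find_suspicious_runs(path):
            print(f"{path}: {reason}: {text!r}")
```

A fuller scanner would also resolve style definitions in `word/styles.xml` and compare font color against the background to catch white-on-white text.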
Format Deep Dives
Each format page covers the full technical detail for its techniques, including generated markup, framework-specific extraction behavior, CLI examples, and detection guidance.
- HTML Techniques -- 9 techniques targeting web content and HTML-based knowledge bases
- DOCX Techniques -- 8 techniques exploiting the Office Open XML ZIP structure
- PDF Techniques -- 7 techniques leveraging PDF content streams, annotations, and metadata
- TXT Techniques -- 4 Unicode-based techniques for plain text files
- Markdown Techniques -- 5 techniques targeting raw Markdown ingestion
- RTF Techniques -- 3 techniques exploiting legacy Rich Text Format structure
- EPUB Techniques -- 6 techniques targeting EPUB XHTML and OPF metadata
- CSV Techniques -- 3 techniques for comma-separated value data
- JSON Techniques -- 2 techniques exploiting JSON key structures and encoding
- XLSX Techniques -- 4 techniques targeting Excel spreadsheet internals
- Image Techniques -- 4 techniques for PNG metadata injection and steganography