PDF Techniques¶
hemlock provides seven hiding techniques for PDF files, exploiting annotations, content stream rendering, JavaScript actions, and XMP metadata. PDF extraction is notoriously inconsistent across libraries, making it a fertile ground for hiding payloads that survive some pipelines but not others.
PDF internals primer
A PDF file is a tree of objects: pages, fonts, content streams, annotations, and metadata. Text is drawn by operators inside content streams (BT...ET blocks). Annotations, JavaScript actions, and XMP metadata live in separate object trees. Different extraction libraries walk different subsets of this tree, creating the divergent behavior hemlock exploits.
Technique Overview¶
| Technique | Stealth | Description |
|---|---|---|
annotation |
65 | Near-invisible text annotation (1pt white cell) |
invisible-text |
75 | 0.5pt white text in the content stream |
javascript |
40 | Payload embedded in a PDF JavaScript action |
xmp-metadata |
60 | Payload in PDF XMP metadata (Subject, Keywords, Author) |
xmp-distributed |
70 | Payload split across 4 XMP metadata fields |
chunk-boundary |
55 | Fragments on separate pages with filler pages between |
offpage |
70 | Payload at negative coordinates outside visible area |
annotation¶
How It Works¶
The payload is placed as a tiny, white-on-white text cell positioned in the top-left corner of the first page. hemlock uses gofpdf to render a 1pt font cell at coordinates (0.5mm, 0.5mm) with white text color (rgb(255,255,255)):
Position: (0.5, 0.5) mm from top-left
Font: Helvetica 1pt
Color: rgb(255, 255, 255) — white on white
Cell: 1mm x 1mm
Content: INJECTED PAYLOAD TEXT
The visible cover text is rendered normally at 12pt Helvetica in black. The payload cell is imperceptible to human readers but lives in the page content stream where text extractors can find it.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Basic text extraction includes annotations and small text | |
| LlamaIndex | Generally includes annotations alongside text content | |
| Unstructured | Strips annotations and metadata; text-only extraction | |
| Haystack | Includes annotation /Contents text alongside content stream |
CLI Example¶
hemlock craft \
--format pdf \
--technique annotation \
--payload override \
--topic "vendor evaluation criteria" \
--output ./output
Detection Notes¶
Detection
- Extract all text objects from the PDF and flag any with font sizes below 2pt
- Look for text color values matching the page background color
- Use a PDF inspector (e.g.,
qpdf --show-objects) to enumerate all text drawing operators and check for suspicious coordinates or sizes
invisible-text¶
How It Works¶
The highest-stealth PDF technique. The payload is rendered as a standard text object in the page content stream with a 0.5pt font and white color, positioned at the bottom margin of the page (10mm from left, 290mm from top on A4):
Position: (10, 290) mm — bottom margin area
Font: Helvetica 0.5pt
Color: rgb(255, 255, 255) — white on white
Content: INJECTED PAYLOAD TEXT
Unlike annotations, this text is part of the main content stream. Any library that does basic BT/ET text extraction will find it. The tiny font size and white color make it invisible to readers, and the bottom-margin position ensures it does not overlap visible content.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Extracts all text from content streams regardless of size or color | |
| LlamaIndex | Standard text extraction includes all content stream text | |
| Unstructured | More aggressive extraction may filter very small text | |
| Haystack | Extracts all content stream text regardless of render mode |
Content stream vs. annotation
The invisible-text technique is more robust than annotation because the payload lives in the primary content stream rather than a separate annotation object. Some extraction libraries explicitly skip annotations but always process the main content stream.
CLI Example¶
Detection Notes¶
Detection
- Parse the content stream and flag text objects with font sizes below 2pt
- Check for white-on-white text (text color matching background or
rgb(255,255,255)) - Compare visually rendered page content against extracted text; discrepancies indicate hidden text
- Tools:
pdftotext -layout,mutool draw, or custompikepdfscripts
javascript¶
How It Works¶
The payload is embedded inside a PDF JavaScript action. hemlock stores the payload as a JavaScript string variable using gofpdf.SetJavascript():
This JavaScript is stored in the document catalog's OpenAction or Names tree. PDF readers that support JavaScript may execute it on document open, and some extraction libraries parse JavaScript actions as part of their content extraction.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Uncertain; depends on PDF library version and configuration | |
| LlamaIndex | Uncertain; some versions may extract JavaScript content | |
| Unstructured | Does not extract JavaScript actions | |
| Haystack | Does not extract JavaScript actions |
Low reliability
PDF JavaScript extraction is highly inconsistent. Most modern PDF extraction libraries focus on content streams and ignore JavaScript actions. This technique is primarily useful for targeting legacy systems or custom extractors that parse the full PDF object tree.
CLI Example¶
Detection Notes¶
Detection
- Search for
/JSand/JavaScriptkeys in the PDF object tree - Flag any document-level JavaScript actions, especially those containing string literals
- Many organizations block PDF JavaScript entirely; this technique may be caught by existing security policies
- Command:
strings document.pdf | grep -i javascript
xmp-metadata¶
How It Works¶
The payload is embedded in PDF metadata fields (Subject, Keywords, and Author) using gofpdf's metadata API. These fields are stored in the PDF's Info dictionary and optionally in an XMP metadata stream:
/Info <<
/Subject (INJECTED PAYLOAD TEXT)
/Keywords (INJECTED PAYLOAD TEXT)
/Author (INJECTED PAYLOAD TEXT)
>>
Some extraction pipelines read metadata fields as part of their document processing, treating them as additional context for the RAG index.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | May extract metadata depending on loader configuration | |
| LlamaIndex | Metadata extraction is version-dependent | |
| Unstructured | Strips metadata; extracts only visible text content | |
| Haystack | Does not reliably extract XMP metadata |
CLI Example¶
hemlock craft \
--format pdf \
--technique xmp-metadata \
--payload redirect \
--topic "board meeting minutes" \
--output ./output
Detection Notes¶
Detection
- Extract PDF metadata using
pdfinfoorexiftooland inspect Subject, Keywords, and Author fields - Flag documents where metadata fields contain instruction-like text or are unusually long
- Command:
pdfinfo document.pdforexiftool document.pdf
xmp-distributed¶
How It Works¶
The payload is split into four word-based parts and distributed across multiple PDF metadata fields using gofpdf's metadata API:
/Info <<
/Subject (first quarter of payload)
/Keywords (second quarter of payload)
/Author (third quarter of payload)
/Creator (fourth quarter of payload)
>>
Unlike the standard xmp-metadata technique that places the full payload in each field, xmp-distributed ensures no single metadata field contains the complete injection. The visible cover text is rendered normally on page 1.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | extractPDFText reads content stream, not XMP metadata |
|
| LlamaIndex | Reads BT/ET text operators, not metadata | |
| Unstructured | Strips metadata; extracts only visible text content | |
| Haystack | Does not extract XMP metadata |
Targets metadata-aware pipelines
Like xmp-metadata, the distributed variant targets custom extraction pipelines that merge metadata fields. The distribution makes individual fields appear less suspicious to field-level scanners.
CLI Example¶
hemlock craft \
--format pdf \
--technique xmp-distributed \
--payload authority \
--topic "regulatory compliance summary" \
--output ./output
chunk-boundary¶
How It Works¶
The payload is split into three character-based parts, with each fragment rendered on a separate PDF page as 1pt white text. Between fragment pages, filler pages with visible 12pt black reference text are inserted, creating natural page boundaries that PDF text splitters use to segment documents:
Page 1: Cover text (12pt black, visible)
Page 2: Filler text (12pt black, visible)
Page 3: Fragment 1 (1pt white, invisible)
Page 4: Filler text (12pt black, visible)
Page 5: Fragment 2 (1pt white, invisible)
Page 6: Filler text (12pt black, visible)
Page 7: Fragment 3 (1pt white, invisible)
Each fragment lives in the content stream of its page, so any library that performs per-page text extraction will encounter it. The filler pages push fragments across retrieval chunk boundaries.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Extracts text from all pages including hidden text | |
| LlamaIndex | Per-page extraction includes all content stream text | |
| Haystack | Reads all pages sequentially | |
| Unstructured | May filter very small text on individual pages |
CLI Example¶
hemlock craft \
--format pdf \
--technique chunk-boundary \
--payload exfiltrate \
--topic "quarterly revenue analysis" \
--output ./output
offpage¶
How It Works¶
The payload is rendered at coordinates far outside the visible page area (-500, -500), making it invisible even if the PDF is viewed at maximum zoom. The payload is 4pt white text placed at extreme negative coordinates, but it remains part of the page content stream:
Visible page area: (0,0) to (210,297) mm (A4)
Payload position: (-500, -500) mm — far outside visible bounds
Font: Helvetica 4pt
Color: rgb(255, 255, 255) — white
Text extraction libraries process the entire content stream regardless of coordinates, so the payload is captured alongside visible text.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Content stream extraction ignores coordinates | |
| LlamaIndex | Extracts all text operators regardless of position | |
| Haystack | Reads content stream sequentially | |
| Unstructured | Aggressive extraction may skip out-of-bounds text |
Coordinate-based hiding
Unlike invisible-text (which uses tiny font + white color on the visible page), offpage hides the payload by placing it entirely outside the rendered area. This bypasses detection rules that scan for small/white text within page bounds.
CLI Example¶
hemlock craft \
--format pdf \
--technique offpage \
--payload redirect \
--topic "board meeting minutes" \
--output ./output
Detection Notes¶
Detection
- Parse the content stream and flag text objects with coordinates outside page
MediaBoxbounds - Check for
TmorTdoperators with extreme negative values - Compare extracted text against visually rendered page content
Survival Matrix¶
| Technique | Stealth | LangChain | LlamaIndex | Haystack | Unstructured |
|---|---|---|---|---|---|
annotation |
65 | ||||
invisible-text |
75 | ||||
javascript |
40 | ||||
xmp-metadata |
60 | ||||
xmp-distributed |
70 | ||||
chunk-boundary |
55 | ||||
offpage |
70 |
PDF extraction is inherently unreliable
PDF text extraction varies significantly across library versions, configurations, and even document structure. The survival results above represent typical behavior, but specific versions of PyPDF2, pdfminer, pymupdf, or pdfplumber may behave differently. Always validate against your specific target pipeline using hemlock validate.