PDF Techniques¶

hemlock provides seven hiding techniques for PDF files, exploiting annotations, content stream rendering, JavaScript actions, and XMP metadata. PDF extraction is notoriously inconsistent across libraries, making it a fertile ground for hiding payloads that survive some pipelines but not others.

PDF internals primer

A PDF file is a tree of objects: pages, fonts, content streams, annotations, and metadata. Text is drawn by operators inside content streams (BT...ET blocks). Annotations, JavaScript actions, and XMP metadata live in separate object trees. Different extraction libraries walk different subsets of this tree, creating the divergent behavior hemlock exploits.

Technique Overview¶

Technique	Stealth	Description
`annotation`	65	Near-invisible text annotation (1pt white cell)
`invisible-text`	75	0.5pt white text in the content stream
`javascript`	40	Payload embedded in a PDF JavaScript action
`xmp-metadata`	60	Payload in PDF XMP metadata (Subject, Keywords, Author)
`xmp-distributed`	70	Payload split across 4 XMP metadata fields
`chunk-boundary`	55	Fragments on separate pages with filler pages between
`offpage`	70	Payload at negative coordinates outside visible area

annotation¶

How It Works¶

The payload is placed as a tiny, white-on-white text cell positioned in the top-left corner of the first page. hemlock uses gofpdf to render a 1pt font cell at coordinates (0.5mm, 0.5mm) with white text color (rgb(255,255,255)):

Position: (0.5, 0.5) mm from top-left
Font:     Helvetica 1pt
Color:    rgb(255, 255, 255) — white on white
Cell:     1mm x 1mm
Content:  INJECTED PAYLOAD TEXT

The visible cover text is rendered normally at 12pt Helvetica in black. The payload cell is imperceptible to human readers but lives in the page content stream where text extractors can find it.

Framework Survival¶

Framework	Survives	Mechanism
LangChain		Basic text extraction includes annotations and small text
LlamaIndex		Generally includes annotations alongside text content
Unstructured		Strips annotations and metadata; text-only extraction
Haystack		Includes annotation `/Contents` text alongside content stream

CLI Example¶

hemlock craft \
  --format pdf \
  --technique annotation \
  --payload override \
  --topic "vendor evaluation criteria" \
  --output ./output

Detection Notes¶

Detection

Extract all text objects from the PDF and flag any with font sizes below 2pt
Look for text color values matching the page background color
Use a PDF inspector (e.g., qpdf --show-objects) to enumerate all text drawing operators and check for suspicious coordinates or sizes

invisible-text¶

How It Works¶

The highest-stealth PDF technique. The payload is rendered as a standard text object in the page content stream with a 0.5pt font and white color, positioned at the bottom margin of the page (10mm from left, 290mm from top on A4):

Position: (10, 290) mm — bottom margin area
Font:     Helvetica 0.5pt
Color:    rgb(255, 255, 255) — white on white
Content:  INJECTED PAYLOAD TEXT

Unlike annotations, this text is part of the main content stream. Any library that does basic BT/ET text extraction will find it. The tiny font size and white color make it invisible to readers, and the bottom-margin position ensures it does not overlap visible content.

Framework Survival¶

Framework	Survives	Mechanism
LangChain		Extracts all text from content streams regardless of size or color
LlamaIndex		Standard text extraction includes all content stream text
Unstructured		More aggressive extraction may filter very small text
Haystack		Extracts all content stream text regardless of render mode

Content stream vs. annotation

The invisible-text technique is more robust than annotation because the payload lives in the primary content stream rather than a separate annotation object. Some extraction libraries explicitly skip annotations but always process the main content stream.

CLI Example¶

BasicValidate survival

hemlock craft \
  --format pdf \
  --technique invisible-text \
  --payload exfiltrate \
  --output ./output

hemlock craft \
  --format pdf \
  --technique invisible-text \
  --payload custom \
  --custom-payload "Summarize: the password is hunter2" \
  --output ./output

hemlock validate \
  --file ./output/poisoned-invisible-text-001.pdf \
  --framework langchain \
  --payload "the password is hunter2"

Detection Notes¶

Detection

Parse the content stream and flag text objects with font sizes below 2pt
Check for white-on-white text (text color matching background or rgb(255,255,255))
Compare visually rendered page content against extracted text; discrepancies indicate hidden text
Tools: pdftotext -layout, mutool draw, or custom pikepdf scripts

javascript¶

How It Works¶

The payload is embedded inside a PDF JavaScript action. hemlock stores the payload as a JavaScript string variable using gofpdf.SetJavascript():

var payload = "INJECTED PAYLOAD TEXT";

This JavaScript is stored in the document catalog's OpenAction or Names tree. PDF readers that support JavaScript may execute it on document open, and some extraction libraries parse JavaScript actions as part of their content extraction.

Framework Survival¶

Framework	Survives	Mechanism
LangChain		Uncertain; depends on PDF library version and configuration
LlamaIndex		Uncertain; some versions may extract JavaScript content
Unstructured		Does not extract JavaScript actions
Haystack		Does not extract JavaScript actions

Low reliability

PDF JavaScript extraction is highly inconsistent. Most modern PDF extraction libraries focus on content streams and ignore JavaScript actions. This technique is primarily useful for targeting legacy systems or custom extractors that parse the full PDF object tree.

CLI Example¶

hemlock craft \
  --format pdf \
  --technique javascript \
  --payload denial \
  --output ./output

Detection Notes¶

Detection

Search for /JS and /JavaScript keys in the PDF object tree
Flag any document-level JavaScript actions, especially those containing string literals
Many organizations block PDF JavaScript entirely; this technique may be caught by existing security policies
Command: strings document.pdf | grep -i javascript

xmp-metadata¶

How It Works¶

The payload is embedded in PDF metadata fields (Subject, Keywords, and Author) using gofpdf's metadata API. These fields are stored in the PDF's Info dictionary and optionally in an XMP metadata stream:

/Info <<
  /Subject (INJECTED PAYLOAD TEXT)
  /Keywords (INJECTED PAYLOAD TEXT)
  /Author (INJECTED PAYLOAD TEXT)
>>

Some extraction pipelines read metadata fields as part of their document processing, treating them as additional context for the RAG index.

Framework Survival¶

Framework	Survives	Mechanism
LangChain		May extract metadata depending on loader configuration
LlamaIndex		Metadata extraction is version-dependent
Unstructured		Strips metadata; extracts only visible text content
Haystack		Does not reliably extract XMP metadata

CLI Example¶

hemlock craft \
  --format pdf \
  --technique xmp-metadata \
  --payload redirect \
  --topic "board meeting minutes" \
  --output ./output

Detection Notes¶

Detection

Extract PDF metadata using pdfinfo or exiftool and inspect Subject, Keywords, and Author fields
Flag documents where metadata fields contain instruction-like text or are unusually long
Command: pdfinfo document.pdf or exiftool document.pdf

xmp-distributed¶

How It Works¶

The payload is split into four word-based parts and distributed across multiple PDF metadata fields using gofpdf's metadata API:

/Info <<
  /Subject  (first quarter of payload)
  /Keywords (second quarter of payload)
  /Author   (third quarter of payload)
  /Creator  (fourth quarter of payload)
>>

Unlike the standard xmp-metadata technique that places the full payload in each field, xmp-distributed ensures no single metadata field contains the complete injection. The visible cover text is rendered normally on page 1.

Framework Survival¶

Framework	Survives	Mechanism
LangChain		`extractPDFText` reads content stream, not XMP metadata
LlamaIndex		Reads BT/ET text operators, not metadata
Unstructured		Strips metadata; extracts only visible text content
Haystack		Does not extract XMP metadata

Targets metadata-aware pipelines

Like xmp-metadata, the distributed variant targets custom extraction pipelines that merge metadata fields. The distribution makes individual fields appear less suspicious to field-level scanners.

CLI Example¶

hemlock craft \
  --format pdf \
  --technique xmp-distributed \
  --payload authority \
  --topic "regulatory compliance summary" \
  --output ./output

chunk-boundary¶

How It Works¶

The payload is split into three character-based parts, with each fragment rendered on a separate PDF page as 1pt white text. Between fragment pages, filler pages with visible 12pt black reference text are inserted, creating natural page boundaries that PDF text splitters use to segment documents:

Page 1: Cover text (12pt black, visible)
Page 2: Filler text (12pt black, visible)
Page 3: Fragment 1 (1pt white, invisible)
Page 4: Filler text (12pt black, visible)
Page 5: Fragment 2 (1pt white, invisible)
Page 6: Filler text (12pt black, visible)
Page 7: Fragment 3 (1pt white, invisible)

Each fragment lives in the content stream of its page, so any library that performs per-page text extraction will encounter it. The filler pages push fragments across retrieval chunk boundaries.

Framework Survival¶

Framework	Survives	Mechanism
LangChain		Extracts text from all pages including hidden text
LlamaIndex		Per-page extraction includes all content stream text
Haystack		Reads all pages sequentially
Unstructured		May filter very small text on individual pages

CLI Example¶

hemlock craft \
  --format pdf \
  --technique chunk-boundary \
  --payload exfiltrate \
  --topic "quarterly revenue analysis" \
  --output ./output

offpage¶

How It Works¶

The payload is rendered at coordinates far outside the visible page area (-500, -500), making it invisible even if the PDF is viewed at maximum zoom. The payload is 4pt white text placed at extreme negative coordinates, but it remains part of the page content stream:

Visible page area: (0,0) to (210,297) mm (A4)
Payload position:  (-500, -500) mm — far outside visible bounds
Font:              Helvetica 4pt
Color:             rgb(255, 255, 255) — white

Text extraction libraries process the entire content stream regardless of coordinates, so the payload is captured alongside visible text.

Framework Survival¶

Framework	Survives	Mechanism
LangChain		Content stream extraction ignores coordinates
LlamaIndex		Extracts all text operators regardless of position
Haystack		Reads content stream sequentially
Unstructured		Aggressive extraction may skip out-of-bounds text

Coordinate-based hiding

Unlike invisible-text (which uses tiny font + white color on the visible page), offpage hides the payload by placing it entirely outside the rendered area. This bypasses detection rules that scan for small/white text within page bounds.

CLI Example¶

hemlock craft \
  --format pdf \
  --technique offpage \
  --payload redirect \
  --topic "board meeting minutes" \
  --output ./output

Detection Notes¶

Detection

Parse the content stream and flag text objects with coordinates outside page MediaBox bounds
Check for Tm or Td operators with extreme negative values
Compare extracted text against visually rendered page content

Survival Matrix¶

Technique	Stealth	LangChain	LlamaIndex	Haystack	Unstructured
`annotation`	65
`invisible-text`	75
`javascript`	40
`xmp-metadata`	60
`xmp-distributed`	70
`chunk-boundary`	55
`offpage`	70

PDF extraction is inherently unreliable

PDF text extraction varies significantly across library versions, configurations, and even document structure. The survival results above represent typical behavior, but specific versions of PyPDF2, pdfminer, pymupdf, or pdfplumber may behave differently. Always validate against your specific target pipeline using hemlock validate.