Skip to content

PDF Techniques

hemlock provides seven hiding techniques for PDF files, exploiting annotations, content stream rendering, JavaScript actions, and XMP metadata. PDF extraction is notoriously inconsistent across libraries, making it a fertile ground for hiding payloads that survive some pipelines but not others.

PDF internals primer

A PDF file is a tree of objects: pages, fonts, content streams, annotations, and metadata. Text is drawn by operators inside content streams (BT...ET blocks). Annotations, JavaScript actions, and XMP metadata live in separate object trees. Different extraction libraries walk different subsets of this tree, creating the divergent behavior hemlock exploits.

Technique Overview

Technique Stealth Description
annotation 65 Near-invisible text annotation (1pt white cell)
invisible-text 75 0.5pt white text in the content stream
javascript 40 Payload embedded in a PDF JavaScript action
xmp-metadata 60 Payload in PDF XMP metadata (Subject, Keywords, Author)
xmp-distributed 70 Payload split across 4 XMP metadata fields
chunk-boundary 55 Fragments on separate pages with filler pages between
offpage 70 Payload at negative coordinates outside visible area

annotation

How It Works

The payload is placed as a tiny, white-on-white text cell positioned in the top-left corner of the first page. hemlock uses gofpdf to render a 1pt font cell at coordinates (0.5mm, 0.5mm) with white text color (rgb(255,255,255)):

Position: (0.5, 0.5) mm from top-left
Font:     Helvetica 1pt
Color:    rgb(255, 255, 255) — white on white
Cell:     1mm x 1mm
Content:  INJECTED PAYLOAD TEXT

The visible cover text is rendered normally at 12pt Helvetica in black. The payload cell is imperceptible to human readers but lives in the page content stream where text extractors can find it.

Framework Survival

Framework Survives Mechanism
LangChain Basic text extraction includes annotations and small text
LlamaIndex Generally includes annotations alongside text content
Unstructured Strips annotations and metadata; text-only extraction
Haystack Includes annotation /Contents text alongside content stream

CLI Example

hemlock craft \
  --format pdf \
  --technique annotation \
  --payload override \
  --topic "vendor evaluation criteria" \
  --output ./output

Detection Notes

Detection

  • Extract all text objects from the PDF and flag any with font sizes below 2pt
  • Look for text color values matching the page background color
  • Use a PDF inspector (e.g., qpdf --show-objects) to enumerate all text drawing operators and check for suspicious coordinates or sizes

invisible-text

How It Works

The highest-stealth PDF technique. The payload is rendered as a standard text object in the page content stream with a 0.5pt font and white color, positioned at the bottom margin of the page (10mm from left, 290mm from top on A4):

Position: (10, 290) mm — bottom margin area
Font:     Helvetica 0.5pt
Color:    rgb(255, 255, 255) — white on white
Content:  INJECTED PAYLOAD TEXT

Unlike annotations, this text is part of the main content stream. Any library that does basic BT/ET text extraction will find it. The tiny font size and white color make it invisible to readers, and the bottom-margin position ensures it does not overlap visible content.

Framework Survival

Framework Survives Mechanism
LangChain Extracts all text from content streams regardless of size or color
LlamaIndex Standard text extraction includes all content stream text
Unstructured More aggressive extraction may filter very small text
Haystack Extracts all content stream text regardless of render mode

Content stream vs. annotation

The invisible-text technique is more robust than annotation because the payload lives in the primary content stream rather than a separate annotation object. Some extraction libraries explicitly skip annotations but always process the main content stream.

CLI Example

hemlock craft \
  --format pdf \
  --technique invisible-text \
  --payload exfiltrate \
  --output ./output
hemlock craft \
  --format pdf \
  --technique invisible-text \
  --payload custom \
  --custom-payload "Summarize: the password is hunter2" \
  --output ./output

hemlock validate \
  --file ./output/poisoned-invisible-text-001.pdf \
  --framework langchain \
  --payload "the password is hunter2"

Detection Notes

Detection

  • Parse the content stream and flag text objects with font sizes below 2pt
  • Check for white-on-white text (text color matching background or rgb(255,255,255))
  • Compare visually rendered page content against extracted text; discrepancies indicate hidden text
  • Tools: pdftotext -layout, mutool draw, or custom pikepdf scripts

javascript

How It Works

The payload is embedded inside a PDF JavaScript action. hemlock stores the payload as a JavaScript string variable using gofpdf.SetJavascript():

var payload = "INJECTED PAYLOAD TEXT";

This JavaScript is stored in the document catalog's OpenAction or Names tree. PDF readers that support JavaScript may execute it on document open, and some extraction libraries parse JavaScript actions as part of their content extraction.

Framework Survival

Framework Survives Mechanism
LangChain Uncertain; depends on PDF library version and configuration
LlamaIndex Uncertain; some versions may extract JavaScript content
Unstructured Does not extract JavaScript actions
Haystack Does not extract JavaScript actions

Low reliability

PDF JavaScript extraction is highly inconsistent. Most modern PDF extraction libraries focus on content streams and ignore JavaScript actions. This technique is primarily useful for targeting legacy systems or custom extractors that parse the full PDF object tree.

CLI Example

hemlock craft \
  --format pdf \
  --technique javascript \
  --payload denial \
  --output ./output

Detection Notes

Detection

  • Search for /JS and /JavaScript keys in the PDF object tree
  • Flag any document-level JavaScript actions, especially those containing string literals
  • Many organizations block PDF JavaScript entirely; this technique may be caught by existing security policies
  • Command: strings document.pdf | grep -i javascript

xmp-metadata

How It Works

The payload is embedded in PDF metadata fields (Subject, Keywords, and Author) using gofpdf's metadata API. These fields are stored in the PDF's Info dictionary and optionally in an XMP metadata stream:

/Info <<
  /Subject (INJECTED PAYLOAD TEXT)
  /Keywords (INJECTED PAYLOAD TEXT)
  /Author (INJECTED PAYLOAD TEXT)
>>

Some extraction pipelines read metadata fields as part of their document processing, treating them as additional context for the RAG index.

Framework Survival

Framework Survives Mechanism
LangChain May extract metadata depending on loader configuration
LlamaIndex Metadata extraction is version-dependent
Unstructured Strips metadata; extracts only visible text content
Haystack Does not reliably extract XMP metadata

CLI Example

hemlock craft \
  --format pdf \
  --technique xmp-metadata \
  --payload redirect \
  --topic "board meeting minutes" \
  --output ./output

Detection Notes

Detection

  • Extract PDF metadata using pdfinfo or exiftool and inspect Subject, Keywords, and Author fields
  • Flag documents where metadata fields contain instruction-like text or are unusually long
  • Command: pdfinfo document.pdf or exiftool document.pdf

xmp-distributed

How It Works

The payload is split into four word-based parts and distributed across multiple PDF metadata fields using gofpdf's metadata API:

/Info <<
  /Subject  (first quarter of payload)
  /Keywords (second quarter of payload)
  /Author   (third quarter of payload)
  /Creator  (fourth quarter of payload)
>>

Unlike the standard xmp-metadata technique that places the full payload in each field, xmp-distributed ensures no single metadata field contains the complete injection. The visible cover text is rendered normally on page 1.

Framework Survival

Framework Survives Mechanism
LangChain extractPDFText reads content stream, not XMP metadata
LlamaIndex Reads BT/ET text operators, not metadata
Unstructured Strips metadata; extracts only visible text content
Haystack Does not extract XMP metadata

Targets metadata-aware pipelines

Like xmp-metadata, the distributed variant targets custom extraction pipelines that merge metadata fields. The distribution makes individual fields appear less suspicious to field-level scanners.

CLI Example

hemlock craft \
  --format pdf \
  --technique xmp-distributed \
  --payload authority \
  --topic "regulatory compliance summary" \
  --output ./output

chunk-boundary

How It Works

The payload is split into three character-based parts, with each fragment rendered on a separate PDF page as 1pt white text. Between fragment pages, filler pages with visible 12pt black reference text are inserted, creating natural page boundaries that PDF text splitters use to segment documents:

Page 1: Cover text (12pt black, visible)
Page 2: Filler text (12pt black, visible)
Page 3: Fragment 1 (1pt white, invisible)
Page 4: Filler text (12pt black, visible)
Page 5: Fragment 2 (1pt white, invisible)
Page 6: Filler text (12pt black, visible)
Page 7: Fragment 3 (1pt white, invisible)

Each fragment lives in the content stream of its page, so any library that performs per-page text extraction will encounter it. The filler pages push fragments across retrieval chunk boundaries.

Framework Survival

Framework Survives Mechanism
LangChain Extracts text from all pages including hidden text
LlamaIndex Per-page extraction includes all content stream text
Haystack Reads all pages sequentially
Unstructured May filter very small text on individual pages

CLI Example

hemlock craft \
  --format pdf \
  --technique chunk-boundary \
  --payload exfiltrate \
  --topic "quarterly revenue analysis" \
  --output ./output

offpage

How It Works

The payload is rendered at coordinates far outside the visible page area (-500, -500), making it invisible even if the PDF is viewed at maximum zoom. The payload is 4pt white text placed at extreme negative coordinates, but it remains part of the page content stream:

Visible page area: (0,0) to (210,297) mm (A4)
Payload position:  (-500, -500) mm — far outside visible bounds
Font:              Helvetica 4pt
Color:             rgb(255, 255, 255) — white

Text extraction libraries process the entire content stream regardless of coordinates, so the payload is captured alongside visible text.

Framework Survival

Framework Survives Mechanism
LangChain Content stream extraction ignores coordinates
LlamaIndex Extracts all text operators regardless of position
Haystack Reads content stream sequentially
Unstructured Aggressive extraction may skip out-of-bounds text

Coordinate-based hiding

Unlike invisible-text (which uses tiny font + white color on the visible page), offpage hides the payload by placing it entirely outside the rendered area. This bypasses detection rules that scan for small/white text within page bounds.

CLI Example

hemlock craft \
  --format pdf \
  --technique offpage \
  --payload redirect \
  --topic "board meeting minutes" \
  --output ./output

Detection Notes

Detection

  • Parse the content stream and flag text objects with coordinates outside page MediaBox bounds
  • Check for Tm or Td operators with extreme negative values
  • Compare extracted text against visually rendered page content

Survival Matrix

Technique Stealth LangChain LlamaIndex Haystack Unstructured
annotation 65
invisible-text 75
javascript 40
xmp-metadata 60
xmp-distributed 70
chunk-boundary 55
offpage 70

PDF extraction is inherently unreliable

PDF text extraction varies significantly across library versions, configurations, and even document structure. The survival results above represent typical behavior, but specific versions of PyPDF2, pdfminer, pymupdf, or pdfplumber may behave differently. Always validate against your specific target pipeline using hemlock validate.