EPUB Techniques

hemlock provides six hiding techniques for EPUB documents. EPUB files are ZIP archives containing XHTML chapters, so they are susceptible to the same injection techniques as HTML documents, plus hiding in OPF metadata and NCX navigation. EPUB is common in digital publishing pipelines and knowledge base exports.
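To make the container layout concrete, the following is a minimal sketch that builds a tiny EPUB-like archive in memory and lists its members with Python's standard zipfile module. The helper names (build_demo_epub, list_epub_members) and the placeholder file contents are assumptions for illustration; they are not part of hemlock.

```python
import io
import zipfile


def build_demo_epub() -> bytes:
    """Build a minimal EPUB-like ZIP archive in memory (placeholder contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry is written first and stored uncompressed.
        archive.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip")
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/content.opf", "<package/>")   # OPF metadata
        archive.writestr("OEBPS/toc.ncx", "<ncx/>")           # NCX navigation
        archive.writestr("OEBPS/chapter1.xhtml", "<html/>")   # chapter content
    return buffer.getvalue()


def list_epub_members(epub_bytes: bytes) -> list[str]:
    """Return the archive member names of an EPUB, in archive order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return archive.namelist()


if __name__ == "__main__":
    for name in list_epub_members(build_demo_epub()):
        print(name)
```

Each technique on this page targets one of these members: the OPF file, the NCX file, or an XHTML chapter.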

Technique Overview

Technique Stealth Description
metadata 60 Payload in OPF metadata (dc:description)
css-hide 70 CSS-hidden span with zero font size
comment 35 XHTML comment in chapter body
aria-hidden 65 aria-hidden="true" span in XHTML content
metadata-distributed 65 Payload split across 4 OPF Dublin Core fields
toc 55 Payload in NCX table-of-contents navPoint label

metadata

How It Works

The payload is embedded in the Dublin Core dc:description field of the OEBPS/content.opf file. E-readers do not render this metadata as part of the book's content, but metadata-aware loaders extract it.
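As a rough illustration of what a metadata-aware loader sees, the sketch below reads dc:description values from an OPF document with the standard library. The sample OPF content and the function name read_descriptions are assumptions for illustration; real loaders may parse the package differently.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical OPF content; the description text is a harmless placeholder.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Book</dc:title>
    <dc:description>Text that never appears on a rendered page</dc:description>
  </metadata>
</package>
"""


def read_descriptions(opf_xml: str) -> list[str]:
    """Return the text of every dc:description element in an OPF document."""
    root = ET.fromstring(opf_xml)
    return [(el.text or "").strip() for el in root.iter(f"{{{DC_NS}}}description")]


if __name__ == "__main__":
    print(read_descriptions(SAMPLE_OPF))
```

A pipeline that indexes this field alongside chapter text will ingest whatever it contains, so reviewing the extracted metadata is a reasonable first check.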

Framework Survival

Framework Survives Mechanism
LangChain Extracts OPF metadata alongside chapter text
LlamaIndex Reads XHTML chapters only
Unstructured Reads XHTML chapters only, strips metadata
Haystack Reads XHTML chapters only

CLI Example

hemlock craft --format epub --technique metadata --payload override --output ./output

css-hide

How It Works

The payload is placed in a <span> with a CSS class that sets font-size:0 and color:transparent. The text is present in the XHTML DOM but invisible to readers.
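The sketch below shows why CSS-hidden text survives a naive extractor: a parser that collects text nodes without evaluating stylesheets returns the hidden span alongside the visible paragraph. The class name hp, the sample chapter, and the extractor are assumptions for illustration and do not reproduce any framework's actual loader.

```python
from html.parser import HTMLParser

# Hypothetical chapter; "hp" is an illustrative class name.
SAMPLE_CHAPTER = (
    "<html><head><style>.hp{font-size:0;color:transparent}</style></head>"
    "<body><p>Visible paragraph.</p>"
    '<span class="hp">Hidden span text.</span></body></html>'
)


class NaiveTextExtractor(HTMLParser):
    """Collects every text node, skipping only <style> and <script> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # CSS is never evaluated, so visually hidden text is kept.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(xhtml: str) -> str:
    parser = NaiveTextExtractor()
    parser.feed(xhtml)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    print(extract_text(SAMPLE_CHAPTER))
```

Extractors that drop hidden elements need to resolve class rules or inline styles first, which is more work than plain tag stripping.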

Framework Survival

Framework Survives Mechanism
LangChain Retains display:none/CSS-hidden content
LlamaIndex html2text strips hidden elements
Unstructured Aggressively strips hidden elements
Haystack Strips hidden elements

CLI Example

hemlock craft --format epub --technique css-hide --payload override --output ./output

comment

How It Works

The payload is placed inside an XHTML comment (<!-- ... -->) within a chapter file. Comments are part of the markup but not rendered.
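A short sketch of why this technique rarely survives: Python's default ElementTree parser discards comments when building the tree, so text extraction never sees them, while a parser that handles comment events can still surface them for review. The sample markup and function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical chapter body with a placeholder comment.
SAMPLE_BODY = "<body><p>Visible paragraph.</p><!-- comment text --></body>"


def text_via_elementtree(xhtml: str) -> str:
    """Extract text with the default parser, which drops comments."""
    root = ET.fromstring(xhtml)
    return "".join(root.itertext()).strip()


class CommentCollector(HTMLParser):
    """Records the body of every comment encountered."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data.strip())


def find_comments(xhtml: str) -> list[str]:
    collector = CommentCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.comments


if __name__ == "__main__":
    print(text_via_elementtree(SAMPLE_BODY))
    print(find_comments(SAMPLE_BODY))
```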

Framework Survival

Framework Survives Mechanism
LangChain HTML comments stripped during tag stripping
LlamaIndex html2text removes comments
Unstructured Comments stripped in sanitization pass
Haystack Comments stripped

CLI Example

hemlock craft --format epub --technique comment --payload override --output ./output

aria-hidden

How It Works

The payload is placed in a <span aria-hidden="true"> element in the XHTML chapter. This accessibility attribute signals the content should be hidden from assistive technology, but some extractors do not strip it.
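The difference between extractors that keep and extractors that strip aria-hidden content can be sketched as a single flag on a recursive text walk. This is an illustrative sketch with assumed names (collect_text, extract) and sample markup, not the logic of any particular framework.

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter fragment with placeholder text.
SAMPLE_FRAGMENT = (
    "<body><p>Visible paragraph."
    '<span aria-hidden="true">Hidden span text.</span>'
    " Trailing text.</p></body>"
)


def collect_text(element, skip_aria_hidden: bool) -> list[str]:
    """Walk the tree and gather text, optionally skipping aria-hidden subtrees."""
    parts = []
    hidden = skip_aria_hidden and element.get("aria-hidden") == "true"
    if not hidden:
        if element.text:
            parts.append(element.text)
        for child in element:
            parts.extend(collect_text(child, skip_aria_hidden))
            # Tail text belongs to the parent, so it is kept even if the child is skipped.
            if child.tail:
                parts.append(child.tail)
    return parts


def extract(xhtml: str, skip_aria_hidden: bool) -> str:
    return "".join(collect_text(ET.fromstring(xhtml), skip_aria_hidden))


if __name__ == "__main__":
    print(extract(SAMPLE_FRAGMENT, skip_aria_hidden=False))
    print(extract(SAMPLE_FRAGMENT, skip_aria_hidden=True))
```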

Framework Survival

Framework Survives Mechanism
LangChain Does not strip aria-hidden content
LlamaIndex html2text strips hidden elements
Unstructured Specifically strips aria-hidden elements
Haystack Does not strip aria-hidden (unlike Unstructured)

CLI Example

hemlock craft --format epub --technique aria-hidden --payload override --output ./output

metadata-distributed

How It Works

The payload is split on word boundaries into four parts, and each part is placed in a different OPF Dublin Core metadata field in OEBPS/content.opf:

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description>First quarter of payload</dc:description>
  <dc:subject>Second quarter of payload</dc:subject>
  <dc:rights>Third quarter of payload</dc:rights>
  <dc:publisher>Fourth quarter of payload</dc:publisher>
</metadata>

No single metadata element contains the complete payload. The visible cover text is stored as XHTML paragraphs in the chapter file.
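The word-based split can be sketched as follows. How hemlock actually divides the text is not documented here, so the even-split rule below is an assumption, as are the function names. The sketch also shows that joining the fields in order recovers the original text, which is why a review that inspects fields one at a time can miss it.

```python
def split_into_parts(text: str, parts: int = 4) -> list[str]:
    """Split text on whitespace into `parts` chunks of near-equal word count."""
    words = text.split()
    base, extra = divmod(len(words), parts)
    chunks, start = [], 0
    for index in range(parts):
        # The first `extra` chunks take one additional word each.
        size = base + (1 if index < extra else 0)
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks


def reassemble(chunks: list[str]) -> str:
    """Join non-empty chunks back into a single string."""
    return " ".join(chunk for chunk in chunks if chunk)


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten"
    pieces = split_into_parts(sample)
    print(pieces)
    print(reassemble(pieces) == sample)
```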

Framework Survival

Framework Survives Mechanism
LangChain Reads XHTML chapters, not all OPF metadata fields
LlamaIndex Reads XHTML chapters only
Unstructured Reads XHTML chapters only, strips metadata
Haystack Reads XHTML chapters only

CLI Example

hemlock craft --format epub --technique metadata-distributed --payload authority --output ./output

toc

How It Works

The payload is hidden in the EPUB NCX (Navigation Control file for XML) table-of-contents file. hemlock creates an OEBPS/toc.ncx with the payload embedded in a <navLabel> text element:

<navMap>
  <navPoint id="navpoint-1" playOrder="1">
    <navLabel><text>Chapter 1</text></navLabel>
    <content src="chapter1.xhtml"/>
  </navPoint>
  <navPoint id="navpoint-2" playOrder="2">
    <navLabel><text>INJECTED PAYLOAD TEXT</text></navLabel>
    <content src="chapter1.xhtml"/>
  </navPoint>
</navMap>

EPUB readers display NCX navigation labels in sidebar panels, but the text is unlikely to draw attention among many navigation entries. Text extractors that parse NCX XML directly will encounter the payload.

Framework Survival

Framework Survives Mechanism
LangChain Reads XHTML chapters, not NCX navigation
LlamaIndex Reads XHTML chapters only
Unstructured Reads XHTML chapters only
Haystack Reads XHTML chapters only

NCX targets custom extractors

No major RAG framework extracts NCX navigation labels by default. This technique targets custom EPUB processing pipelines that parse the full package structure, including navigation metadata.

CLI Example

hemlock craft --format epub --technique toc --payload override --output ./output

Survival Matrix

Technique Stealth LangChain LlamaIndex Haystack Unstructured
metadata 60
css-hide 70
comment 35
aria-hidden 65
metadata-distributed 65
toc 55

EPUB extraction is chapter-focused

Most RAG frameworks extract only XHTML chapter content from EPUBs, ignoring OPF metadata and NCX navigation. The metadata technique targets the subset of frameworks that also read OPF metadata, while css-hide and aria-hidden sit inside chapter content and depend on an extractor that keeps hidden elements.