EPUB Techniques
hemlock provides six hiding techniques for EPUB documents. EPUB files are ZIP archives containing XHTML chapters, making them susceptible to the same HTML-based injection techniques plus OPF metadata hiding. EPUB is common in digital publishing pipelines and knowledge base exports.
Technique Overview
| Technique | Stealth | Description |
|---|---|---|
| metadata | 60 | Payload in OPF metadata (dc:description) |
| css-hide | 70 | CSS-hidden span with zero font size |
| comment | 35 | XHTML comment in chapter body |
| aria-hidden | 65 | aria-hidden="true" span in XHTML content |
| metadata-distributed | 65 | Payload split across 4 OPF Dublin Core fields |
| toc | 55 | Payload in NCX table-of-contents navPoint label |
metadata
How It Works
The payload is embedded in the OEBPS/content.opf file within Dublin Core metadata fields (dc:description). This metadata is not rendered in e-readers but is extracted by metadata-aware loaders.
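As a rough illustration of what a metadata-aware loader might see, the sketch below reads every dc:description value from the OPF file inside an EPUB archive using only the Python standard library. The function names and the demo archive are hypothetical; the demo contains only the OPF file, so it is not a complete EPUB, and the code is not taken from hemlock or from any framework's loader.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used by OPF metadata elements.
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_dc_description(epub_bytes: bytes, opf_path: str = "OEBPS/content.opf") -> list[str]:
    """Return the text of every dc:description element in the OPF file."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        root = ET.fromstring(archive.read(opf_path))
    return [el.text or "" for el in root.iter(f"{{{DC_NS}}}description")]


def build_demo_epub() -> bytes:
    """Build an in-memory ZIP holding only a minimal OPF file (hypothetical sample)."""
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="2.0">'
        f'<metadata xmlns:dc="{DC_NS}">'
        "<dc:title>Demo</dc:title>"
        "<dc:description>hidden instruction text</dc:description>"
        "</metadata></package>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("OEBPS/content.opf", opf)
    return buffer.getvalue()


if __name__ == "__main__":
    print(read_dc_description(build_demo_epub()))
```

A real EPUB locates its OPF file through META-INF/container.xml; the fixed path here simply mirrors the OEBPS/content.opf location named above.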
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Extracts OPF metadata alongside chapter text |
| LlamaIndex | | Reads XHTML chapters only |
| Unstructured | | Reads XHTML chapters only, strips metadata |
| Haystack | | Reads XHTML chapters only |
CLI Example
css-hide
How It Works
The payload is placed in a <span> with a CSS class that sets font-size:0 and color:transparent. The text is present in the XHTML DOM but invisible to readers.
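To see why zero-size text can survive extraction, consider an extractor that strips tags without evaluating CSS: it treats the hidden span as ordinary text. The sketch below is a minimal example under that assumption. The class name `hp`, the sample chapter, and the `TextOnly` parser are hypothetical and do not reproduce hemlock's markup or any framework's loader.

```python
from html.parser import HTMLParser

# Hypothetical chapter: the span is styled to be invisible but stays in the DOM.
CHAPTER = """<html xmlns="http://www.w3.org/1999/xhtml"><head>
<style>.hp { font-size:0; color:transparent; }</style></head>
<body><p>Visible cover text.</p><span class="hp">hidden payload</span></body></html>"""


class TextOnly(HTMLParser):
    """Collect text nodes while ignoring tags and never evaluating CSS."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than zero while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(xhtml: str) -> str:
    """Return the chapter text as a CSS-unaware tag stripper would see it."""
    parser = TextOnly()
    parser.feed(xhtml)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(extract_text(CHAPTER))
```

An extractor that wants to drop this content has to inspect class or style attributes explicitly, since nothing in the text nodes themselves marks them as hidden.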
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Retains display:none/CSS-hidden content |
| LlamaIndex | | html2text strips hidden elements |
| Unstructured | | Aggressively strips hidden elements |
| Haystack | | Strips hidden elements |
CLI Example
comment
How It Works
The payload is placed inside an XHTML comment (<!-- ... -->) within a chapter file. Comments are part of the markup but not rendered.
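Comments remain available to any parser that chooses to look at them. As a small sketch for auditing a chapter file, the standard-library `HTMLParser` exposes a `handle_comment` hook; the `CommentCollector` class and `find_comments` helper below are hypothetical names, and the sample markup is illustrative only.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Record the body of every comment encountered in the markup."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->".
        self.comments.append(data.strip())


def find_comments(xhtml: str) -> list[str]:
    """Return all comment bodies found in an XHTML chapter string."""
    collector = CommentCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.comments


if __name__ == "__main__":
    sample = "<body><p>Chapter text.</p><!-- payload here --></body>"
    print(find_comments(sample))
```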
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | HTML comments stripped during tag stripping |
| LlamaIndex | | html2text removes comments |
| Unstructured | | Comments stripped in sanitization pass |
| Haystack | | Comments stripped |
CLI Example
aria-hidden
How It Works
The payload is placed in a <span aria-hidden="true"> element in the XHTML chapter. This accessibility attribute signals the content should be hidden from assistive technology, but some extractors do not strip it.
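One way to check a chapter for this pattern is to walk the XHTML tree and collect the text of every element that carries aria-hidden="true". The sketch below assumes the chapter is well-formed XML, which XHTML should be; the `aria_hidden_text` helper and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def aria_hidden_text(xhtml: str) -> list[str]:
    """Return the text content of each element marked aria-hidden="true"."""
    root = ET.fromstring(xhtml)
    return [
        "".join(el.itertext()).strip()
        for el in root.iter()
        if el.get("aria-hidden") == "true"
    ]


if __name__ == "__main__":
    sample = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        '<p>Visible text.<span aria-hidden="true">payload</span></p>'
        "</body></html>"
    )
    print(aria_hidden_text(sample))
```

An extractor that only calls `itertext()` on the whole body, without this attribute check, would return the span's text along with the visible paragraph.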
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Does not strip aria-hidden content |
| LlamaIndex | | html2text strips hidden elements |
| Unstructured | | Specifically strips aria-hidden elements |
| Haystack | | Does not strip aria-hidden (unlike Unstructured) |
CLI Example
metadata-distributed
How It Works
The payload is split on word boundaries into four parts, and each part is stored in a different OPF Dublin Core metadata field in OEBPS/content.opf:

```xml
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description>First quarter of payload</dc:description>
  <dc:subject>Second quarter of payload</dc:subject>
  <dc:rights>Third quarter of payload</dc:rights>
  <dc:publisher>Fourth quarter of payload</dc:publisher>
</metadata>
```
No single metadata element contains the complete payload. The visible cover text is stored as XHTML paragraphs in the chapter file.
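A consumer that concatenates these fields ends up with the full payload again. The sketch below reassembles the parts in the field order shown above (description, subject, rights, publisher). The `reassemble` helper, the `FIELD_ORDER` constant, and the placeholder text are hypothetical, and the order is an assumption taken from the snippet rather than from hemlock's implementation.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumed order, copied from the OPF snippet above.
FIELD_ORDER = ("description", "subject", "rights", "publisher")


def reassemble(opf_xml: str) -> str:
    """Join the text of the four Dublin Core fields in FIELD_ORDER."""
    root = ET.fromstring(opf_xml)
    parts = []
    for name in FIELD_ORDER:
        el = root.find(f".//{{{DC_NS}}}{name}")
        if el is not None and el.text:
            parts.append(el.text.strip())
    return " ".join(parts)


SAMPLE_OPF = f"""<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="{DC_NS}">
    <dc:description>alpha beta</dc:description>
    <dc:subject>gamma delta</dc:subject>
    <dc:rights>epsilon zeta</dc:rights>
    <dc:publisher>eta theta</dc:publisher>
  </metadata>
</package>"""

if __name__ == "__main__":
    print(reassemble(SAMPLE_OPF))
```

Scanning each field in isolation would only ever see a two-word fragment here, which is the point of distributing the payload.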
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Reads XHTML chapters, not all OPF metadata fields |
| LlamaIndex | | Reads XHTML chapters only |
| Unstructured | | Reads XHTML chapters only, strips metadata |
| Haystack | | Reads XHTML chapters only |
CLI Example
toc
How It Works
The payload is hidden in the EPUB NCX (Navigation Center eXtended) table-of-contents file. hemlock creates an OEBPS/toc.ncx with the payload embedded in a <navLabel> text element:

```xml
<navMap>
  <navPoint id="navpoint-1" playOrder="1">
    <navLabel><text>Chapter 1</text></navLabel>
    <content src="chapter1.xhtml"/>
  </navPoint>
  <navPoint id="navpoint-2" playOrder="2">
    <navLabel><text>INJECTED PAYLOAD TEXT</text></navLabel>
    <content src="chapter1.xhtml"/>
  </navPoint>
</navMap>
```
EPUB readers display NCX navigation labels in sidebar panels, but the text is unlikely to draw attention among many navigation entries. Text extractors that parse NCX XML directly will encounter the payload.
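For reference, a custom extractor that parses the NCX directly might look something like the sketch below, which lists the label of every navPoint. It assumes the NCX root declares the standard DAISY namespace; the `nav_labels` helper and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element of standard NCX files.
NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"


def nav_labels(ncx_xml: str) -> list[str]:
    """Return the navLabel text of each navPoint in document order."""
    root = ET.fromstring(ncx_xml)
    labels = []
    for point in root.iter(f"{{{NCX_NS}}}navPoint"):
        text_el = point.find(f"{{{NCX_NS}}}navLabel/{{{NCX_NS}}}text")
        if text_el is not None and text_el.text:
            labels.append(text_el.text.strip())
    return labels


SAMPLE_NCX = f"""<ncx xmlns="{NCX_NS}" version="2005-1">
  <navMap>
    <navPoint id="navpoint-1" playOrder="1">
      <navLabel><text>Chapter 1</text></navLabel>
      <content src="chapter1.xhtml"/>
    </navPoint>
    <navPoint id="navpoint-2" playOrder="2">
      <navLabel><text>INJECTED PAYLOAD TEXT</text></navLabel>
      <content src="chapter1.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""

if __name__ == "__main__":
    print(nav_labels(SAMPLE_NCX))
```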
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Reads XHTML chapters, not NCX navigation |
| LlamaIndex | | Reads XHTML chapters only |
| Unstructured | | Reads XHTML chapters only |
| Haystack | | Reads XHTML chapters only |
NCX targets custom extractors
No major RAG framework extracts NCX navigation labels by default. This technique targets custom EPUB processing pipelines that parse the full package structure, including navigation metadata.
CLI Example
Survival Matrix
| Technique | Stealth | LangChain | LlamaIndex | Haystack | Unstructured |
|---|---|---|---|---|---|
| metadata | 60 | | | | |
| css-hide | 70 | | | | |
| comment | 35 | | | | |
| aria-hidden | 65 | | | | |
| metadata-distributed | 65 | | | | |
| toc | 55 | | | | |
EPUB extraction is chapter-focused
Most RAG frameworks extract only XHTML chapter content from EPUBs, ignoring OPF metadata and NCX navigation. The metadata technique targets the frameworks that also read OPF metadata, while css-hide and aria-hidden target the frameworks that keep hidden spans when converting chapter XHTML to text.