RTF Techniques
hemlock provides five hiding techniques for Rich Text Format documents. RTF is a legacy format still encountered in enterprise document pipelines, email attachments, and older knowledge management systems.
Technique Overview
| Technique | Stealth | Description |
|---|---|---|
| metadata | 55 | Payload in RTF \info block properties |
| fontzero | 75 | Zero-point font group invisible to readers |
| comment | 40 | Payload in RTF \*\annotation group |
| fonttable | 65 | Payload as font name in \fonttbl group |
| white-text | 70 | White-on-white text via color table |
metadata
How It Works
The payload is embedded inside the RTF {\info ...} block, which holds document metadata properties. This block is parsed by metadata-aware extractors but not rendered as visible text.
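The structure described above can be illustrated with a minimal sketch. It assumes the payload is stored in the `\doccomm` (document comments) property; the source does not say which `\info` property hemlock writes, and the function names here are hypothetical. The payload is a harmless canary string, as used in a test fixture.

```python
import re
from typing import Optional


def rtf_escape(text: str) -> str:
    # Backslash and braces have syntactic meaning in RTF, so escape them.
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def build_rtf_with_info(body: str, payload: str) -> str:
    # Assumption: the payload goes in \doccomm; hemlock may use another property.
    return (
        r"{\rtf1\ansi\deff0"
        r"{\fonttbl{\f0 Arial;}}"
        r"{\info{\doccomm " + rtf_escape(payload) + "}}"
        r"\f0\fs24 " + rtf_escape(body) + r"\par}"
    )


def read_doccomm(rtf: str) -> Optional[str]:
    # Simplified lookup: only handles values without braces or escapes.
    match = re.search(r"\{\\doccomm ([^{}]*)\}", rtf)
    return match.group(1) if match else None


doc = build_rtf_with_info("Quarterly report.", "CANARY-1234")
print(read_doccomm(doc))
```

A metadata-aware extractor would surface the canary through a lookup like `read_doccomm`, while a body-only extractor never reaches the `\info` group.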
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Yes | Extracts body text and info block metadata |
| LlamaIndex | No | Extracts body text only |
| Unstructured | No | Extracts body text only, strips metadata |
| Haystack | No | Extracts body text only |
CLI Example
fontzero
How It Works
The payload is placed in a font group with zero-point font size ({\fs0 ...}). The text exists in the RTF body but is invisible at zero font size. Basic extractors that read all text content will capture it.
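To see why a basic extractor captures the zero-point run, here is a rough sketch. It builds a tiny fixture with a canary string in a `{\fs0 ...}` group, then applies a deliberately naive extractor that drops control words and braces. Both function names are hypothetical, and the extractor is an illustration only, not how any of the listed frameworks parse RTF.

```python
import re


def build_fontzero_rtf(body: str, payload: str) -> str:
    # Assumes body and payload are plain text without braces or backslashes.
    return r"{\rtf1\ansi\fs24 " + body + r"{\fs0 " + payload + r"}\par}"


def naive_text(rtf: str) -> str:
    # Remove control words such as \fs0 or \par (with one optional trailing
    # space), then remove group braces. Font size is never consulted.
    without_controls = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", rtf)
    return without_controls.replace("{", "").replace("}", "").strip()


doc = build_fontzero_rtf("Visible text.", "CANARY-FS0")
print(naive_text(doc))
```

Because the extractor discards `\fs0` along with every other formatting control word, the canary ends up in the output next to the visible text.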
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Yes | Extracts all body text regardless of font size |
| LlamaIndex | Yes | Extracts all body text regardless of font size |
| Unstructured | Yes | Extracts all body text regardless of font size |
| Haystack | Yes | Extracts all body text regardless of font size |
CLI Example
comment
How It Works
The payload is embedded as an RTF annotation group ({\*\annotation ...}). Annotations are typically not rendered by RTF viewers but may be extracted by document processing tools.
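The `\*` prefix marks a destination that a reader may skip if it does not recognize it. The sketch below shows, under that assumption, how a parser that drops every `{\*...}` group would lose an annotation payload. The helper name is hypothetical and the brace tracking is simplified; it is not taken from any of the frameworks listed.

```python
def strip_ignorable_groups(rtf: str) -> str:
    # Drop every group that opens with {\* while tracking nested braces.
    out = []
    i = 0
    while i < len(rtf):
        if rtf.startswith("{\\*", i):
            depth = 0
            while i < len(rtf):
                ch = rtf[i]
                if ch == "\\" and i + 1 < len(rtf):
                    i += 2  # skip the backslash and the character it escapes
                    continue
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        i += 1  # consume the closing brace of the group
                        break
                i += 1
        else:
            out.append(rtf[i])
            i += 1
    return "".join(out)


doc = r"{\rtf1\ansi Visible text.{\*\annotation CANARY-ANN}\par}"
print(strip_ignorable_groups(doc))
```

After stripping, only the visible body remains, which matches the "not extracted" behavior a parser of this kind would show.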
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | No | Annotation groups not extracted |
| LlamaIndex | No | Annotation groups not extracted |
| Unstructured | No | Annotation groups stripped |
| Haystack | No | Annotation groups not extracted |
CLI Example
fonttable
How It Works
The payload is embedded as a font name entry in the RTF {\fonttbl} group. Font table entries define the fonts available in the document but are not rendered as body text. The payload appears as {\f1 PAYLOAD;} in the font definitions. Most text extractors skip font table contents entirely.
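As an illustration of the `{\f1 PAYLOAD;}` layout, the following sketch builds a font table with a canary entry and lists the declared font names. The function names are hypothetical. The sketch also assumes the payload contains no semicolons, braces, or backslashes, since a semicolon terminates a font name in RTF.

```python
import re
from typing import List


def build_rtf_with_font_payload(body: str, payload: str) -> str:
    # A semicolon ends a font entry, and braces or backslashes would change
    # the group structure, so this sketch rejects them instead of escaping.
    if any(ch in payload for ch in ";{}\\"):
        raise ValueError("payload must not contain ';', '{', '}' or a backslash")
    return (
        r"{\rtf1\ansi\deff0"
        r"{\fonttbl{\f0 Arial;}{\f1 " + payload + ";}}"
        r"\f0\fs24 " + body + r"\par}"
    )


def font_names(rtf: str) -> List[str]:
    # Simplified: matches entries of the form {\fN Name;} only.
    return re.findall(r"\{\\f\d+ ([^;{}]*);\}", rtf)


doc = build_rtf_with_font_payload("Visible text.", "CANARY-FONT")
print(font_names(doc))
```

Only the `\f0` font is applied to the body, so the second entry is declared but never used for rendering.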
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | No | Font table entries not extracted as text |
| LlamaIndex | No | Font table entries not extracted |
| Unstructured | No | Font table entries stripped |
| Haystack | No | Font table entries not extracted |
CLI Example
white-text
How It Works
The payload is placed in the RTF body using {\cf1 ...}, where color index 1 is defined as white (255,255,255) in the color table. The text is invisible in rendered output (white on a white background) but is captured by any text extractor that ignores formatting attributes.
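The color-table indirection can be sketched as follows. The code builds a fixture whose color index 1 is white, then resolves which `\cfN` runs point at a white entry, which is the kind of check a reviewer might use to spot such text. All function names are hypothetical, and the parsing is simplified to flat, unescaped runs.

```python
import re
from typing import List


def build_white_text_rtf(body: str, payload: str) -> str:
    # Index 0 is the empty "auto" entry before the first ';', index 1 is white.
    return (
        r"{\rtf1\ansi"
        r"{\colortbl;\red255\green255\blue255;}"
        + body + r"{\cf1 " + payload + r"}\par}"
    )


def white_color_indexes(rtf: str) -> List[int]:
    table = re.search(r"\{\\colortbl([^}]*)\}", rtf)
    if not table:
        return []
    entries = table.group(1).split(";")[:-1]  # drop the piece after the last ';'
    white = []
    for index, entry in enumerate(entries):
        rgb = re.fullmatch(r"\s*\\red(\d+)\\green(\d+)\\blue(\d+)\s*", entry)
        if rgb and all(int(value) == 255 for value in rgb.groups()):
            white.append(index)
    return white


def white_runs(rtf: str) -> List[str]:
    # Collect the text of every {\cfN ...} group whose color index is white.
    white = set(white_color_indexes(rtf))
    runs = re.findall(r"\{\\cf(\d+) ([^{}]*)\}", rtf)
    return [text for index, text in runs if int(index) in white]


doc = build_white_text_rtf("Visible text.", "CANARY-WHITE")
print(white_color_indexes(doc), white_runs(doc))
```

An extractor that never resolves `\cf1` against the color table treats the run like any other body text.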
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Yes | Extracts all body text regardless of color |
| LlamaIndex | Yes | Extracts all body text regardless of color |
| Unstructured | Yes | Extracts all body text regardless of color |
| Haystack | Yes | Extracts all body text regardless of color |
CLI Example
Survival Matrix
| Technique | Stealth | LangChain | LlamaIndex | Haystack | Unstructured |
|---|---|---|---|---|---|
| metadata | 55 | Yes | No | No | No |
| fontzero | 75 | Yes | Yes | Yes | Yes |
| comment | 40 | No | No | No | No |
| fonttable | 65 | No | No | No | No |
| white-text | 70 | Yes | Yes | Yes | Yes |
RTF extraction varies by technique type
Body text techniques (fontzero, white-text) survive all frameworks since extractors process all text runs. Techniques that place the payload outside the body, whether in structural groups (fonttable) or in metadata and annotation groups (metadata, comment), survive only when a framework's RTF parser extracts that non-body content.