TXT Techniques¶
hemlock provides five hiding techniques for plain text files, including four Unicode-based methods and one structural technique. These exploit the gap between what humans see in a text editor and what byte-level processing extracts. Plain text is the simplest format, but Unicode provides a surprisingly rich attack surface.
Why plain text matters
Many RAG pipelines ingest .txt files as raw content with no parsing step. This means every byte reaches the embedding model. The challenge is hiding the payload so it is invisible to human reviewers while remaining intact after any Unicode normalization the framework performs.
Technique Overview¶
| Technique | Stealth | Description |
|---|---|---|
| `zero-width` | 85 | Payload encoded as zero-width Unicode characters |
| `homoglyph` | 80 | Cyrillic lookalike substitution with hidden payload |
| `bidi-override` | 70 | RTL override characters to hide payload direction |
| `diacritical` | 85 | Payload as combining diacritical marks on cover text |
| `chunk-boundary` | 45 | Payload fragments separated by ~512 chars of filler |
zero-width¶
How It Works¶
The highest-stealth technique across all formats. The payload is binary-encoded using four zero-width Unicode characters that are completely invisible in any text editor or renderer:
| Character | Codepoint | Role |
|---|---|---|
| ZW Space | `U+200B` | Binary 0 |
| ZW Non-Joiner | `U+200C` | Binary 1 |
| ZW Joiner | `U+200D` | Byte separator |
| ZW No-Break Space | `U+FEFF` | Start/end marker |
Each byte of the payload is encoded as an 8-character binary string using ZW Space for 0 and ZW Non-Joiner for 1, with ZW Joiner separating bytes. The encoded sequence is wrapped in ZW No-Break Space markers and inserted at the midpoint of the cover text.
The resulting file looks identical to the original cover text in every viewer. The payload is only visible when inspecting raw bytes or codepoints.
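For reviewers who want to inspect a suspicious file, the layout described above can be reversed with a few lines of Python. This is a minimal sketch, not hemlock's own code: the function name `decode_zero_width` is hypothetical, and it assumes each byte is written most-significant bit first, which the description does not state.

```python
from typing import Optional

ZW_ZERO = "\u200b"  # ZW Space: binary 0
ZW_ONE = "\u200c"   # ZW Non-Joiner: binary 1
ZW_SEP = "\u200d"   # ZW Joiner: byte separator
ZW_MARK = "\ufeff"  # ZW No-Break Space: start/end marker


def decode_zero_width(text: str) -> Optional[bytes]:
    """Return the bytes hidden between the first two U+FEFF markers, or None."""
    start = text.find(ZW_MARK)
    if start == -1:
        return None
    end = text.find(ZW_MARK, start + 1)
    if end == -1:
        return None
    out = bytearray()
    for group in text[start + 1:end].split(ZW_SEP):
        if not group:
            continue  # tolerate a leading or trailing separator
        if len(group) != 8 or set(group) - {ZW_ZERO, ZW_ONE}:
            return None  # region does not follow the documented layout
        # Assumption: bits are stored most-significant first.
        bits = "".join("1" if ch == ZW_ONE else "0" for ch in group)
        out.append(int(bits, 2))
    return bytes(out)
```

Calling `decode_zero_width(open("file.txt", encoding="utf-8").read())` returns `None` for files without a marker pair. A file that begins with a byte-order mark would need that first `U+FEFF` removed before decoding.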
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Yes | Raw passthrough; no Unicode normalization |
| LlamaIndex | Yes | Raw passthrough; no Unicode normalization |
| Unstructured | No | Strips zero-width characters during Unicode normalization |
| Haystack | Yes | Raw passthrough; no Unicode normalization |
Unstructured's zero-width stripping
Unstructured explicitly strips the following zero-width characters: U+200B, U+200C, U+200D, U+FEFF, U+200E, U+200F, U+2060, U+2061, U+2062, U+2063, U+2064, and U+180E. This completely destroys the encoded payload.
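The same stripping step can be applied as a defensive pre-processing pass in pipelines that otherwise read text files raw. The sketch below is not Unstructured's implementation; it only removes the twelve codepoints listed above, and the name `strip_invisible` is hypothetical.

```python
# Codepoints taken from the list above; nothing else is removed.
STRIPPED = (
    "\u200b\u200c\u200d\ufeff\u200e\u200f"
    "\u2060\u2061\u2062\u2063\u2064\u180e"
)
_STRIP_TABLE = {ord(ch): None for ch in STRIPPED}


def strip_invisible(text: str) -> str:
    """Delete the listed invisible codepoints and leave all other text intact."""
    return text.translate(_STRIP_TABLE)
```

Because the table contains only those codepoints, bidi overrides such as U+202E pass through unchanged.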
CLI Example¶
Detection Notes¶
Detection
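- Scan for zero-width Unicode codepoints (`U+200B` through `U+200D`, `U+FEFF`) in text files
- Flag files where the byte count significantly exceeds the visible character count
- Python one-liner: `len(text.encode()) != len(text)` is a quick heuristic, but it is true for any non-ASCII text, so treat a hit as a prompt for closer inspection rather than proof
- Use `cat -v` or a hex editor to reveal invisible characters
- Command: `xxd -g 1 file.txt | grep -E 'e2 80 (8b|8c|8d)|ef bb bf'` (`-g 1` prints each byte separately so the pattern can match; sequences that wrap across two output lines are missed)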
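The codepoint scan from the first bullet can be sketched in Python as follows. The function names are hypothetical, and the run-length check is an added heuristic rather than something hemlock documents: a single ZW Joiner is common in emoji sequences, while a long unbroken run of zero-width characters is much harder to explain.

```python
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0xFEFF}


def find_zero_width(text: str) -> list:
    """Return (index, 'U+XXXX') for every zero-width codepoint in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in ZERO_WIDTH
    ]


def longest_zero_width_run(text: str) -> int:
    """Length of the longest run of consecutive zero-width codepoints."""
    longest = current = 0
    for ch in text:
        current = current + 1 if ord(ch) in ZERO_WIDTH else 0
        longest = max(longest, current)
    return longest
```

Under the encoding described earlier, even a one-byte payload produces a run of at least ten zero-width characters (two markers plus eight bits), so a small threshold on `longest_zero_width_run` separates it from ordinary emoji joiners.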
homoglyph¶
How It Works¶
This technique replaces selected ASCII characters in the cover text with visually identical Unicode characters from the Cyrillic block. The payload is then appended as a zero-width-encoded suffix.
hemlock's homoglyph mapping replaces 19 characters:
| ASCII | Cyrillic Replacement | Codepoint |
|---|---|---|
| `a` | Cyrillic a | U+0430 |
| `c` | Cyrillic es | U+0441 |
| `e` | Cyrillic ie | U+0435 |
| `o` | Cyrillic o | U+043E |
| `p` | Cyrillic er | U+0440 |
| `s` | Cyrillic dze | U+0455 |
| `x` | Cyrillic ha | U+0445 |
| `y` | Cyrillic u | U+0443 |
| `A` | Cyrillic A | U+0410 |
| `B` | Cyrillic Ve | U+0412 |
| `C` | Cyrillic Es | U+0421 |
| `E` | Cyrillic Ie | U+0415 |
| `H` | Cyrillic En | U+041D |
| `K` | Cyrillic Ka | U+041A |
| `M` | Cyrillic Em | U+041C |
| `O` | Cyrillic O | U+041E |
| `P` | Cyrillic Er | U+0420 |
| `T` | Cyrillic Te | U+0422 |
| `X` | Cyrillic Ha | U+0425 |
The homoglyph substitution serves two purposes: it acts as a secondary payload channel (the substitution itself can encode information), and it provides a carrier for the zero-width-encoded primary payload appended after a U+FEFF marker.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Yes | Raw passthrough; no Unicode normalization or homoglyph detection |
| LlamaIndex | Yes | Raw passthrough; homoglyphs are valid Unicode characters |
| Unstructured | Yes | Strips zero-width chars but homoglyphs survive as valid text |
| Haystack | Yes | Raw passthrough; homoglyphs are valid Unicode characters |
Why homoglyphs survive Unstructured
Unstructured's Unicode normalization strips zero-width characters but does not perform confusable detection. Standard normalization forms would not help either: NFKC does not map Cyrillic letters to their Latin lookalikes. As a result, the zero-width-encoded payload suffix is stripped, while the homoglyph substitutions themselves pass through as valid Unicode text.
CLI Example¶
```bash
hemlock craft \
  --format txt \
  --technique homoglyph \
  --payload exfiltrate \
  --topic "API documentation" \
  --output ./output
```
Detection Notes¶
Detection
- Run Unicode confusable detection (ICU `uspoof` or the Python `confusables` library) to flag mixed-script text
- Check for Cyrillic codepoints in documents that should be purely ASCII or Latin
- Flag documents where identical-looking characters have different codepoints
- Python: `any(0x0400 <= ord(c) <= 0x04FF for c in text)` detects Cyrillic presence in nominally Latin text
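A block-range check flags every document that legitimately contains Russian or Ukrainian text. A narrower signal is a single word that mixes Latin and Cyrillic letters. The sketch below uses only the standard library and reads the script from the first word of each character's Unicode name; the function names are hypothetical, and a production check would use a dedicated confusables library instead.

```python
import re
import unicodedata


def scripts_in(word: str) -> set:
    """Collect the script prefix of each letter's Unicode name, e.g. 'LATIN'."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


def mixed_script_words(text: str) -> list:
    """Return words that contain both Latin and Cyrillic letters."""
    return [
        word
        for word in re.findall(r"\w+", text)
        if {"LATIN", "CYRILLIC"} <= scripts_in(word)
    ]
```

A word made up entirely of substituted letters contains no Latin characters and is not flagged by this check, so it complements the Cyrillic-presence test rather than replacing it.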
bidi-override¶
How It Works¶
The payload is reversed and inserted between Unicode bidirectional override characters at the midpoint of the cover text. The Right-to-Left Override (U+202E) character forces subsequent text to render right-to-left, and the Pop Directional Formatting (U+202C) character restores normal direction.
Because the payload is reversed in the byte stream and the RLO character causes right-to-left rendering, the visual appearance in most text renderers shows garbled or reordered text. In many editors, the bidi control characters themselves are invisible, and the payload text may render in an unexpected location or be hidden within the surrounding text flow.
The raw byte stream, however, contains the full payload (reversed). When a text extractor reads the file as a raw byte sequence without bidi processing, the reversed payload is present and can be matched.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Yes | Raw passthrough; bidi control characters are preserved |
| LlamaIndex | Yes | Raw passthrough; no bidi processing |
| Unstructured | Yes | Zero-width stripping does not remove bidi overrides (U+202E, U+202C) |
| Haystack | Yes | Raw passthrough; bidi control characters are preserved |
Bidi survival in Unstructured
Unstructured's zero-width character stripping targets specific codepoints (U+200B–U+200D, U+FEFF, etc.) but does not strip bidi control characters (U+202E, U+202C). The payload content survives intact, though in reversed byte order.
CLI Example¶
```bash
hemlock craft \
  --format txt \
  --technique bidi-override \
  --payload redirect \
  --count 3 \
  --output ./output
```
Detection Notes¶
Detection
- Scan for bidi control characters: `U+202E` (RLO), `U+202C` (PDF), `U+202A` (LRE), `U+202B` (RLE)
- Flag any text file containing directional override characters in a context where mixed-direction text is not expected
- Command: `grep -P '\x{202E}|\x{202C}' file.txt` (with PCRE support)
- Bidi control characters in English-language documents are almost always suspicious
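Where `grep -P` is unavailable, the same scan is short in Python. This sketch reports the line and column of each control character. It covers the four codepoints listed above and, as an addition not taken from the list, the Left-to-Right Override (U+202D) and the isolate controls U+2066 to U+2069, which can reorder text in the same way. The function name is hypothetical.

```python
BIDI_CONTROLS = {
    "\u202a": "LRE",
    "\u202b": "RLE",
    "\u202c": "PDF",
    "\u202d": "LRO",  # addition: not in the list above
    "\u202e": "RLO",
    "\u2066": "LRI",  # additions: isolate controls
    "\u2067": "RLI",
    "\u2068": "FSI",
    "\u2069": "PDI",
}


def find_bidi_controls(text: str) -> list:
    """Return (line, column, abbreviation) for each bidi control, 1-based."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, BIDI_CONTROLS[ch]))
    return hits
```

An empty result for an English-language document is the expected case; any hit is worth opening in a hex viewer.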
diacritical¶
How It Works¶
The payload is binary-encoded as Unicode combining diacritical marks (U+0300–U+036F) stacked on cover text characters. Each payload byte is encoded as 8 combining marks (grave accent = bit 0, acute accent = bit 1), with circumflex marks as byte separators. Tilde and macron marks delimit the encoded region.
| Character | Codepoint | Role |
|---|---|---|
| Combining Grave | `U+0300` | Binary 0 |
| Combining Acute | `U+0301` | Binary 1 |
| Combining Circumflex | `U+0302` | Byte separator |
| Combining Tilde | `U+0303` | Start marker |
| Combining Macron | `U+0304` | End marker |
The encoded sequence is inserted at the midpoint of the cover text, immediately after a base character. Since combining marks render as diacritics stacked on the preceding character, the result looks like slightly accented text in most editors — but many renderers collapse or hide the marks entirely.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Yes | Raw passthrough; combining marks preserved |
| LlamaIndex | Yes | Raw passthrough; no Unicode normalization |
| Unstructured | No | Strips all combining marks in U+0300–U+036F range |
| Haystack | Yes | Raw passthrough; no Unicode normalization |
Unstructured's combining mark stripping
Unstructured explicitly strips the entire combining diacritical marks range (U+0300–U+036F) during Unicode normalization. This completely destroys the encoded payload. The technique survives all other frameworks via raw byte passthrough.
CLI Example¶
Detection Notes¶
Detection
- Scan for high density of combining marks (U+0300–U+036F) in text that should not contain them
- Flag text where combining marks appear in clusters of 8+ consecutive marks
- Compare the ratio of combining marks to base characters — legitimate diacritics rarely exceed 1:1
- Command: `python3 -c "import unicodedata; print(sum(1 for c in open('file.txt').read() if unicodedata.category(c) == 'Mn'))"`
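The cluster and ratio checks from the list can be combined into one report. This is a sketch with a hypothetical function name. It counts only the U+0300–U+036F block, and it computes the ratio against all other characters, whitespace included, which is a simplification of "base characters".

```python
import re

# Eight or more consecutive marks from the combining diacritical marks block.
MARK_RUN = re.compile(r"[\u0300-\u036f]{8,}")


def combining_mark_report(text: str) -> dict:
    """Summarise combining-mark density and long runs of marks."""
    marks = sum(1 for ch in text if "\u0300" <= ch <= "\u036f")
    others = len(text) - marks
    if others:
        ratio = marks / others
    else:
        ratio = float("inf") if marks else 0.0
    runs = [(m.start(), len(m.group())) for m in MARK_RUN.finditer(text)]
    return {"marks": marks, "ratio": ratio, "runs": runs}
```

Ordinary accented text written in decomposed form yields a low ratio and no runs, while the encoding described above places at least eight marks in a row for every payload byte.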
chunk-boundary¶
How It Works¶
The payload is split into three character-based parts and written as plain text with approximately 512 characters of benign filler text between each fragment. The filler consists of generic reference material text that looks natural in any knowledge base document:
```text
[Cover text]
[Fragment 1 of payload]
General reference material compiled from verified sources.
This section contains supplementary information about standards
and regulatory requirements that support the primary content...
[~512 characters of filler]
[Fragment 2 of payload]
[~512 characters of filler]
[Fragment 3 of payload]
```
The filler blocks are designed to push fragments across chunk boundaries used by text splitters (typically 500-1000 character windows), distributing the payload across multiple retrieval chunks.
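The effect of the filler on chunking can be illustrated with a toy fixed-window splitter. This is a simplified model with hypothetical names and placeholder markers in place of real fragments. Production splitters usually break on separators and add overlap, so actual chunk assignments will differ.

```python
def split_fixed(text: str, size: int = 512) -> list:
    """Split text into consecutive windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunks_containing(text: str, fragments: list, size: int = 512) -> dict:
    """Map each fragment to the indexes of the chunks that contain it whole."""
    chunks = split_fixed(text, size)
    return {
        frag: [i for i, chunk in enumerate(chunks) if frag in chunk]
        for frag in fragments
    }


# Placeholder markers stand in for the three fragments.
doc = "[F1]" + "x" * 512 + "[F2]" + "x" * 512 + "[F3]"
print(chunks_containing(doc, ["[F1]", "[F2]", "[F3]"]))
# → {'[F1]': [0], '[F2]': [1], '[F3]': [2]}
```

With 512 characters of filler and a 512-character window, each marker lands in a different chunk. A fragment that straddles a window edge appears in no chunk as a whole, which this model reports as an empty list.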
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | Yes | Raw passthrough; plain text read as-is |
| LlamaIndex | Yes | Raw passthrough; no processing |
| Unstructured | Yes | Raw passthrough; plain text survives all extraction |
| Haystack | Yes | Raw passthrough |
CLI Example¶
```bash
hemlock craft \
  --format txt \
  --technique chunk-boundary \
  --payload override \
  --topic "employee handbook FAQ" \
  --output ./output
```
Detection Notes¶
Detection
- Look for repeated filler text patterns between content sections
- Flag documents where text sections appear structurally repetitive
- Compare the semantic content of sections — payload fragments may be topically inconsistent with surrounding filler
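The first check can be approximated by counting verbatim repeats of whole text blocks. The sketch below assumes that blocks are separated by blank lines and that the filler repeats exactly; neither is guaranteed, so an empty result does not clear a document. The function name and the `min_len` threshold are hypothetical.

```python
from collections import Counter


def repeated_blocks(text: str, min_len: int = 40) -> dict:
    """Return blank-line-separated blocks that occur more than once.

    Whitespace inside each block is collapsed before comparison, and
    blocks shorter than min_len characters are ignored.
    """
    blocks = [" ".join(block.split()) for block in text.split("\n\n")]
    counts = Counter(block for block in blocks if len(block) >= min_len)
    return {block: n for block, n in counts.items() if n > 1}
```

Long blocks that repeat verbatim are unusual in hand-written reference material, so any entry in the result is a reasonable prompt for the semantic comparison described in the last bullet.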
Survival Matrix¶
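| Technique | Stealth | LangChain | LlamaIndex | Haystack | Unstructured |
|---|---|---|---|---|---|
| `zero-width` | 85 | Yes | Yes | Yes | No |
| `homoglyph` | 80 | Yes | Yes | Yes | Yes |
| `bidi-override` | 70 | Yes | Yes | Yes | Yes |
| `diacritical` | 85 | Yes | Yes | Yes | No |
| `chunk-boundary` | 45 | Yes | Yes | Yes | Yes |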
TXT techniques and Unstructured
homoglyph, bidi-override, and chunk-boundary survive all four frameworks. If you know the target uses Unstructured, avoid zero-width and diacritical — both will be completely stripped (zero-width characters and combining marks respectively).