Markdown Techniques¶
hemlock provides five hiding techniques for Markdown files. While Markdown is a lightweight format with limited structural complexity, its prevalence in knowledge bases, documentation sites, and RAG pipelines makes it a relevant attack surface.
Technique Overview¶
| Technique | Stealth | Description |
|---|---|---|
| `html-comment` | 35 | Payload hidden in HTML comments within Markdown |
| `frontmatter` | 55 | Payload injected in YAML front matter metadata |
| `link-title` | 65 | Payload distributed across link title attributes |
| `image-alt` | 60 | Payload distributed across image alt text attributes |
| `chunk-boundary` | 50 | Fragments in separate `##` heading sections with filler |
html-comment¶
How It Works¶
HTML comments are valid syntax inside Markdown documents. Markdown renderers (GitHub, MkDocs, Jekyll, etc.) strip comments during HTML conversion, but raw file readers -- including most RAG loaders in their default configuration -- read the file as-is and pass comments through to the embedding model.
hemlock distributes the payload across multiple HTML comments interspersed between paragraphs of cover text. The payload is split into roughly equal chunks, with one chunk placed before the first paragraph and subsequent chunks inserted between paragraphs:
<!-- First chunk of payload text -->
This is the first paragraph of visible cover text that discusses
the topic naturally.
<!-- Second chunk of payload text -->
This is the second paragraph continuing the discussion with
additional legitimate content.
<!-- Third chunk of payload text -->
Final paragraph wrapping up the visible document content.
The splitting strategy serves two purposes: it makes the comments look more natural if someone inspects the raw file, and it distributes the payload across the document so that chunking strategies that split on paragraph boundaries are more likely to capture at least part of the injection in each chunk.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw passthrough; Markdown files are read as plain text |
| LlamaIndex | | Raw passthrough; no Markdown parsing or comment stripping |
| Unstructured | | Applies text-level processing that may strip comments (it also strips zero-width characters) |
| Haystack | | Raw passthrough; Markdown treated as plain text |
Raw passthrough is the key
LangChain and LlamaIndex treat .md files identically to .txt files -- the entire file content, including HTML comments, is passed through as the document text. There is no Markdown-to-HTML conversion step in the default loaders. This is why a simple technique like HTML comments is highly effective.
Rendered vs. raw
If the target pipeline renders Markdown to HTML before extraction (e.g., using a Markdown library followed by an HTML loader), the comments will be stripped during the HTML conversion step. This technique specifically targets pipelines that ingest raw Markdown files.
CLI Example¶
Detection Notes¶
Detection
- Search for `<!--` and `-->` patterns in Markdown source files
- Flag HTML comments that contain instruction-like text, URLs, or encoded content
- Regex: `<!--[\s\S]*?-->`
- In CI/CD pipelines, add a lint rule that flags HTML comments in documentation Markdown files
- Compare rendered output against raw source; any content present in raw but absent in rendered is suspicious
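The first three checks above can be sketched in a few lines of Python. This is a minimal illustration, not part of hemlock: the function names and the keyword list in `SUSPICIOUS_RE` are assumptions chosen for the example, and a real lint rule would need a broader heuristic.

```python
import re

# Same pattern as the regex above: non-greedy, spans newlines.
COMMENT_RE = re.compile(r"<!--([\s\S]*?)-->")

# Assumed heuristic: a few instruction-like keywords, or any URL.
SUSPICIOUS_RE = re.compile(
    r"\b(ignore|disregard|override|system prompt|instructions?)\b|https?://",
    re.IGNORECASE,
)


def find_html_comments(markdown_text):
    """Return (line_number, comment_body) for every HTML comment in the text."""
    results = []
    for match in COMMENT_RE.finditer(markdown_text):
        # Line number of the opening <!-- (1-based).
        line_number = markdown_text.count("\n", 0, match.start()) + 1
        results.append((line_number, match.group(1).strip()))
    return results


def flag_suspicious_comments(markdown_text):
    """Return only the comments whose body matches the assumed heuristic."""
    return [
        (line, body)
        for line, body in find_html_comments(markdown_text)
        if SUSPICIOUS_RE.search(body)
    ]
```

Because the payload is split across several comments, a single fragment may look harmless on its own. Joining all comment bodies returned by `find_html_comments` and applying the same heuristic to the joined string is a reasonable extra step.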
frontmatter¶
How It Works¶
Many Markdown-based systems (Jekyll, Hugo, MkDocs, Obsidian) use YAML front matter, a block delimited by `---` lines at the top of the file, for metadata such as title, date, and description. RAG document loaders that ingest raw Markdown files typically include the front matter in extracted text, even though it is not rendered as body content.
hemlock injects the payload into the description field of a YAML front matter block, followed by the cover text as the document body:
---
description: "INJECTED PAYLOAD TEXT"
---
This is the first paragraph of visible cover text that discusses
the topic naturally.
This is the second paragraph continuing the discussion with
additional legitimate content.
The front matter block looks entirely natural in any Markdown file that uses metadata headers. The payload appears as a standard description field that a human reviewer would likely skim over.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw passthrough; front matter included in extracted text |
| LlamaIndex | | Raw passthrough; may separately parse front matter as metadata but still includes it |
| Haystack | | Raw passthrough; YAML block treated as document content |
| Unstructured | | May strip or separately parse front matter metadata |
CLI Example¶
hemlock craft \
--format markdown \
--technique frontmatter \
--payload override \
--topic "project documentation" \
--output ./output
Detection Notes¶
Detection
- Inspect YAML front matter fields for instruction-like text, especially `description`, `summary`, or `abstract` fields
- Flag front matter values that contain imperative language or are unusually long
- Regex (case-insensitive; the alternation must be grouped, otherwise each keyword matches on its own anywhere in the file): `^---\s*\n[\s\S]*?description:\s*"[^"]*(?:ignore|disregard|system)[^"]*"\s*\n[\s\S]*?---`
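A field-by-field check is often easier to maintain than a single regex. The sketch below is an illustration under stated assumptions: it uses a deliberately simple `key: value` line parser instead of a real YAML parser, and the names `extract_front_matter`, `check_front_matter`, and the `MAX_VALUE_LENGTH` threshold are hypothetical.

```python
import re

# Leading block delimited by --- lines at the very start of the file.
FRONT_MATTER_RE = re.compile(r"\A---[ \t]*\n([\s\S]*?)\n---[ \t]*(?:\n|\Z)")

FIELDS_TO_CHECK = {"description", "summary", "abstract"}
IMPERATIVE_RE = re.compile(r"\b(ignore|disregard|system)\b", re.IGNORECASE)
MAX_VALUE_LENGTH = 300  # assumed threshold for "unusually long"


def extract_front_matter(markdown_text):
    """Return top-level 'key: value' pairs from a leading front matter block.

    Simplified on purpose: nested or multi-line YAML values are ignored.
    """
    match = FRONT_MATTER_RE.match(markdown_text)
    if not match:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        if line.startswith((" ", "\t", "#")):
            continue  # skip nested entries and comments
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip().strip("\"'")
    return fields


def check_front_matter(markdown_text):
    """Return (field, reason) pairs for values that look suspicious."""
    findings = []
    for key, value in extract_front_matter(markdown_text).items():
        if key in FIELDS_TO_CHECK and IMPERATIVE_RE.search(value):
            findings.append((key, "instruction-like text"))
        if len(value) > MAX_VALUE_LENGTH:
            findings.append((key, "unusually long value"))
    return findings
```

A production check would more likely parse the block with a YAML library so that folded and multi-line values are also inspected.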
link-title¶
How It Works¶
Markdown link syntax supports an optional title attribute: [text](url "title"). The title is rendered as a tooltip on hover in HTML but is present in the raw Markdown source. RAG loaders that ingest raw Markdown include the title text in the extracted content.
hemlock distributes the payload across multiple link title attributes interspersed with the cover text. The payload is split into roughly equal chunks using a splitPayload() helper, with each chunk placed in the title attribute of a separate link:
This is the first paragraph of visible cover text discussing
the topic in a natural way.
For more details, see [resource](https://example.com/ref-1 "first chunk of payload").
This is the second paragraph continuing the discussion with
more context about the topic.
Additional reading: [documentation](https://example.com/ref-2 "second chunk of payload").
The links look like ordinary reference links with helpful tooltips. A human reviewer would need to inspect each title attribute and reassemble the chunks to discover the payload.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw passthrough; title attributes included in text |
| LlamaIndex | | Raw passthrough; link syntax preserved in extracted text |
| Haystack | | Raw passthrough; full Markdown syntax ingested |
| Unstructured | | Title attributes survive text extraction |
High survival rate
Link title attributes are part of standard Markdown syntax and are not considered hidden content by any major RAG framework. This makes link-title one of the most robust Markdown techniques.
CLI Example¶
hemlock craft \
--format markdown \
--technique link-title \
--payload exfiltrate \
--topic "API reference documentation" \
--output ./output
Detection Notes¶
Detection
- Extract all link title attributes and inspect their concatenated content for instruction-like text
- Flag links where title text does not match the link context or contains imperative language
- Regex for link titles: `\[.*?\]\(.*?\s+"(.*?)"\)`
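As a rough sketch of the first check, the regex above can be used to collect every title and join the fragments so that a split payload is inspected as a whole. The function names and the keyword list are assumptions made for this example.

```python
import re

# Regex from the detection notes: captures the quoted title of an inline link.
LINK_TITLE_RE = re.compile(r'\[.*?\]\(.*?\s+"(.*?)"\)')

# Assumed heuristic for instruction-like language.
IMPERATIVE_RE = re.compile(
    r"\b(ignore|disregard|override|system prompt)\b", re.IGNORECASE
)


def extract_link_titles(markdown_text):
    """Return the title attribute of every inline link, in document order."""
    return LINK_TITLE_RE.findall(markdown_text)


def joined_link_titles(markdown_text):
    """Concatenate all titles so fragments split across links read as one string."""
    return " ".join(extract_link_titles(markdown_text))


def titles_look_suspicious(markdown_text):
    """True if the concatenated titles match the assumed heuristic."""
    return bool(IMPERATIVE_RE.search(joined_link_titles(markdown_text)))
```

The pattern only covers inline links with double-quoted titles. Reference-style links and single-quoted titles would need additional patterns.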
image-alt¶
How It Works¶
Markdown image syntax uses alt text to describe the image: `![alt text](image-url)`. Alt text is included in the raw Markdown source and extracted by RAG loaders. Since images in documentation frequently have descriptive alt text, payloads hidden here blend in naturally.
hemlock distributes the payload across multiple image alt text attributes, with each chunk placed on a separate image reference:
This is the first paragraph of visible cover text discussing
the topic in detail.
![first chunk of payload](...)
This is the second paragraph continuing the discussion with
additional relevant context.
![second chunk of payload](...)
The images reference plausible filenames, and the alt text appears to be image descriptions. A reviewer would need to concatenate all alt text values to recover the full payload.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw passthrough; alt text included in extracted content |
| LlamaIndex | | Raw passthrough; image syntax preserved in text |
| Haystack | | Raw passthrough; full Markdown syntax ingested |
| Unstructured | | Alt text survives as document content |
Universal survival
Like link-title, image alt text is standard Markdown syntax that no framework strips. The technique survives all major RAG loaders.
CLI Example¶
hemlock craft \
--format markdown \
--technique image-alt \
--payload redirect \
--topic "product documentation" \
--output ./output
Detection Notes¶
Detection
- Extract all image alt text values and inspect their concatenated content
- Flag alt text that contains instruction-like language rather than image descriptions
- Regex for image alt text: `!\[(.*?)\]\(.*?\)`
- Compare alt text content to the surrounding document topic; mismatches are suspicious
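The topic-mismatch check can be approximated with a simple word-overlap score. This is a crude sketch under assumptions: the function names are hypothetical, "topic" is reduced to shared words of four or more letters, and any threshold applied to the score would need tuning on real documents.

```python
import re

# Regex from the detection notes: captures the alt text of an inline image.
IMAGE_RE = re.compile(r"!\[(.*?)\]\(.*?\)")

# Assumed tokenizer: lowercase words of at least four letters.
WORD_RE = re.compile(r"[a-z]{4,}")


def extract_alt_texts(markdown_text):
    """Return the alt text of every inline image, in document order."""
    return IMAGE_RE.findall(markdown_text)


def alt_text_overlap(markdown_text):
    """Fraction of distinct alt-text words that also occur in the body text.

    Returns None when there is no alt text to score. Values near 0 suggest
    the alt text is unrelated to the rest of the document.
    """
    alt_words = set(WORD_RE.findall(" ".join(extract_alt_texts(markdown_text)).lower()))
    if not alt_words:
        return None
    # Remove the images so alt text is not counted as body text.
    body = IMAGE_RE.sub(" ", markdown_text).lower()
    body_words = set(WORD_RE.findall(body))
    return len(alt_words & body_words) / len(alt_words)
```

A low score is only a signal for manual review, since legitimate alt text can also use words that never appear in the body.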
chunk-boundary¶
How It Works¶
The payload is split into three character-based parts and embedded in separate Markdown heading sections. Each fragment appears within its own ## Section block, with filler paragraphs of benign reference text between sections:
# Reference Document
Introduction paragraph with cover text about the topic...
## Section 1
First third of payload text.
Supplementary reference material compiled from verified sources.
This section contains supplementary information about standards
and regulatory requirements...
## Section 2
Second third of payload text.
Additional supplementary reference material...
## Section 3
Final third of payload text.
Markdown heading boundaries (##) are natural break points for heading-aware text splitters (e.g., LangChain's MarkdownHeaderTextSplitter), so payload fragments are distributed across separate retrieval chunks.
Framework Survival¶
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw passthrough; heading sections preserved |
| LlamaIndex | | Raw passthrough; all text included |
| Haystack | | Raw passthrough; full Markdown syntax ingested |
| Unstructured | | Heading sections preserved as content |
CLI Example¶
hemlock craft \
--format markdown \
--technique chunk-boundary \
--payload exfiltrate \
--topic "API reference documentation" \
--output ./output
Detection Notes¶
Detection
- Flag documents where heading sections contain topically inconsistent content
- Look for repeated filler text patterns between sections
- Check if heading-delimited sections contain instruction-like language
Survival Matrix¶
| Technique | Stealth | LangChain | LlamaIndex | Haystack | Unstructured |
|---|---|---|---|---|---|
| `html-comment` | 35 | | | | |
| `frontmatter` | 55 | | | | |
| `link-title` | 65 | | | | |
| `image-alt` | 60 | | | | |
| `chunk-boundary` | 50 | | | | |
Consider TXT techniques for Markdown files
Since most RAG frameworks treat Markdown as plain text, the TXT techniques (zero-width, homoglyph, bidi-override) also work on .md files. If you need higher stealth than HTML comments provide, generate a .md file using the txt format with zero-width or homoglyph encoding and rename the output file extension. The payload will survive LangChain and LlamaIndex identically.
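On the detection side, the same observation means Markdown files are worth scanning for invisible characters. The sketch below covers only zero-width and bidirectional control characters; homoglyph detection is out of scope here. The character sets and the function name are assumptions chosen for the example, not an exhaustive list.

```python
# Common zero-width characters (assumed, non-exhaustive set).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# Unicode bidirectional embedding, override, and isolate controls.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_invisible_characters(text):
    """Return (line, column, kind, codepoint) for each invisible character found."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ZERO_WIDTH:
                findings.append((line_no, col, "zero-width", f"U+{ord(ch):04X}"))
            elif ch in BIDI_CONTROLS:
                findings.append((line_no, col, "bidi-control", f"U+{ord(ch):04X}"))
    return findings
```

A U+FEFF at the very start of a file is usually a legitimate byte order mark, so a first-position hit can reasonably be ignored.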