CSV Techniques

hemlock provides five hiding techniques for CSV files. CSV is a ubiquitous data format in data pipelines, spreadsheet exports, and ETL workflows. Because CSV is plain text, hiding payloads requires structural tricks rather than format-level concealment.

Technique Overview

Technique          Stealth  Description
extra-column       45       Payload in extra _metadata column
bom-prefix         50       Payload after UTF-8 BOM in Notes field
formula-injection  60       =CONCATENATE() formula reassembles payload fragments
quoted-field       55       Payload in multiline quoted CSV field
header-inject      50       Payload as a CSV header cell name

extra-column

How It Works

The payload is placed as a cell value under an additional _metadata column appended to each row. The cover text is split into sections across the primary columns. Many CSV consumers ignore unexpected columns, so the payload may pass through unnoticed.
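The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not hemlock's implementation: the column names ID, Title, and Content, the placeholder payload, and the helper names `build_extra_column_csv` and `read_known_columns` are all hypothetical.

```python
import csv
import io

# Hypothetical cover rows and placeholder payload (assumptions, not hemlock output).
COVER_ROWS = [
    {"ID": "1", "Title": "Reference Document", "Content": "Cover text, part one."},
    {"ID": "2", "Title": "Supporting Data", "Content": "Cover text, part two."},
]
PAYLOAD = "PLACEHOLDER PAYLOAD"


def build_extra_column_csv(rows, payload):
    """Write the cover rows and append a _metadata column holding the payload."""
    buf = io.StringIO()
    fieldnames = ["ID", "Title", "Content", "_metadata"]
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "_metadata": payload})
    return buf.getvalue()


def read_known_columns(text, wanted=("ID", "Title", "Content")):
    """A schema-aware consumer that keeps only the columns it expects."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: row[key] for key in wanted} for row in reader]


csv_text = build_extra_column_csv(COVER_ROWS, PAYLOAD)
structured = read_known_columns(csv_text)
```

Here `csv_text` still contains the payload in every row, while `structured` does not. That is the gap the technique relies on: a raw-text extractor sees the extra column, and a consumer that selects known columns never looks at it.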

Framework Survival

Framework Survives Mechanism
LangChain Raw text pass-through, all columns visible
LlamaIndex Raw text pass-through
Unstructured Raw text pass-through
Haystack Raw text pass-through

CLI Example

hemlock craft --format csv --technique extra-column --payload override --output ./output

bom-prefix

How It Works

The CSV file begins with a UTF-8 BOM (the bytes \xEF\xBB\xBF) followed by standard CSV content, with the payload embedded in a Notes field. Most text editors do not display the BOM, and parsers that do not strip it may behave unexpectedly, for example by treating it as part of the first header name.
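A minimal Python sketch of the BOM behavior follows. The column layout and placeholder payload are assumptions made for illustration; only the BOM bytes and the Notes field come from the description above.

```python
import csv
import io

PAYLOAD = "PLACEHOLDER PAYLOAD"  # placeholder, not a real hemlock payload
BOM = b"\xef\xbb\xbf"            # UTF-8 byte order mark

# Build a small CSV (hypothetical columns) and prepend the BOM to its bytes.
body = io.StringIO()
writer = csv.writer(body, lineterminator="\n")
writer.writerow(["ID", "Title", "Content", "Notes"])
writer.writerow(["1", "Reference Document", "Cover text", PAYLOAD])
raw = BOM + body.getvalue().encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM as U+FEFF at the start of the first header.
naive_header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# The utf-8-sig codec strips a leading BOM when one is present.
clean_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))
```

With plain UTF-8 decoding the first column name becomes "\ufeffID" rather than "ID", which is one way a BOM can make a lookup by column name fail while the Notes payload still passes through as text.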

Framework Survival

Framework Survives Mechanism
LangChain Raw text pass-through
LlamaIndex Raw text pass-through
Unstructured Raw text pass-through
Haystack Raw text pass-through

CLI Example

hemlock craft --format csv --technique bom-prefix --payload override --output ./output

formula-injection

How It Works

The payload is split by character position into four fragments, which are stored in the reference columns Ref_A through Ref_D. A =CONCATENATE() formula in the Notes column reassembles the full payload from these fragments:

ID,Title,Content,Notes,Ref_A,Ref_B,Ref_C,Ref_D
1,"Reference Document","Cover text...","=CONCATENATE(E2,F2,G2,H2)","first quarter","second quarter","third quarter","fourth quarter"
2,"Supporting Data","Additional content...","","","","",""

Spreadsheet-based RAG extractors that evaluate formulas will reconstruct the full payload. Even without formula evaluation, raw CSV text extraction will capture the individual fragments in the reference columns.
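The split-and-reassemble idea can be sketched in Python as follows. This is a hedged illustration, not hemlock's code: how hemlock chooses fragment boundaries is not documented here, so the near-equal slicing in `split_into_four` is an assumption, and `build_rows` and `evaluate_notes` are hypothetical helper names.

```python
import csv
import io

PAYLOAD = "PLACEHOLDER PAYLOAD"  # placeholder, not a real hemlock payload


def split_into_four(payload):
    """Split the payload into four contiguous slices of near-equal length."""
    size = -(-len(payload) // 4)  # ceiling division
    return [payload[i * size:(i + 1) * size] for i in range(4)]


def build_rows(payload):
    """Header plus one data row whose Notes cell references Ref_A..Ref_D."""
    header = ["ID", "Title", "Content", "Notes", "Ref_A", "Ref_B", "Ref_C", "Ref_D"]
    first = ["1", "Reference Document", "Cover text...",
             "=CONCATENATE(E2,F2,G2,H2)", *split_into_four(payload)]
    return [header, first]


def evaluate_notes(rows):
    """Mimic what a spreadsheet would compute for the Notes cell in row 2."""
    header, first = rows[0], rows[1]
    names = ("Ref_A", "Ref_B", "Ref_C", "Ref_D")
    return "".join(first[header.index(name)] for name in names)


# Round-trip through CSV text to confirm the fragments survive serialization.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(build_rows(PAYLOAD))
csv_text = buf.getvalue()
parsed = list(csv.reader(io.StringIO(csv_text)))
reassembled = evaluate_notes(parsed)
```

The formula cell itself is stored as a plain string in the CSV. Only a consumer that evaluates formulas produces the joined value, which `evaluate_notes` stands in for here.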

Framework Survival

Framework Survives Mechanism
LangChain Raw CSV text includes formula strings and all columns
Unstructured Raw text pass-through includes all fields
Haystack Raw text pass-through
LlamaIndex CSV extraction returns empty for formula-based content

Formula evaluation vs. raw extraction

The =CONCATENATE() formula is a secondary attack vector. The primary payload delivery is through the raw reference column values, which survive any extractor that passes the raw CSV text through. Spreadsheet applications (Excel, Google Sheets) that evaluate the formula will also reconstruct the payload in the Notes cell.

CLI Example

hemlock craft --format csv --technique formula-injection --payload exfiltrate --output ./output

Detection Notes

Detection

  • Flag CSV cells beginning with =, +, -, or @ as potential formula injection
  • Inspect reference/metadata columns for instruction-like content
  • Check for CONCATENATE or similar formulas whose references reconstruct content from multiple cells
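The first check in the list above could look roughly like this in Python. It is a minimal sketch rather than a complete detector: the function name `find_formula_cells` and the sample data are hypothetical, and the prefix test will also flag legitimate values such as negative numbers.

```python
import csv
import io

# Leading characters that spreadsheet applications may interpret as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def find_formula_cells(text):
    """Return (row_number, column_number, value) for cells that start like a formula."""
    hits = []
    for r, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        for c, cell in enumerate(row, start=1):
            # Ignore leading whitespace before testing the first character.
            if cell.lstrip().startswith(FORMULA_PREFIXES):
                hits.append((r, c, cell))
    return hits


sample = 'ID,Notes,Ref_A\n1,"=CONCATENATE(C2,C2)",part\n2,plain,-5\n'
hits = find_formula_cells(sample)
```

On the sample, the formula cell in row 2 is flagged, and so is the plain value -5 in row 3. A real pipeline would likely need an allow-list for numeric cells to keep false positives manageable.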

quoted-field

How It Works

The payload is hidden inside a multiline quoted CSV field. The payload text sits on its own lines within a single cell, separated from the visible content by blank lines. CSV parsers that correctly handle RFC 4180 quoted fields will include the full cell content in extraction; simpler line-based parsers may split or lose the embedded text.
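The difference between the two parser styles can be shown with a short Python sketch. The column names and placeholder payload are assumptions, and the line-splitting parser is a deliberately naive stand-in for "simpler line-based parsers", not any specific framework's loader.

```python
import csv
import io

PAYLOAD = "PLACEHOLDER PAYLOAD"  # placeholder, not a real hemlock payload

# One cell: visible text, a blank line, then the payload on its own line.
cell = "Visible cover text.\n\n" + PAYLOAD

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["ID", "Content"])
writer.writerow(["1", cell])  # the writer quotes the field because it contains newlines
csv_text = buf.getvalue()

# RFC 4180-aware parsing keeps the multiline field as a single cell.
rfc_rows = list(csv.reader(io.StringIO(csv_text)))

# Naive line-based parsing breaks the same record into several fragments.
naive_rows = [line.split(",") for line in csv_text.splitlines()]
```

The quote-aware reader yields two rows with the payload intact inside the Content cell, while the naive split yields four "rows", one of them empty and one ending in a stray quote character.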

Framework Survival

Framework Survives Mechanism
LangChain Raw text pass-through includes full quoted fields
LlamaIndex May not correctly preserve multiline quoted fields
Unstructured Raw text pass-through
Haystack Raw text pass-through

CLI Example

hemlock craft --format csv --technique quoted-field --payload override --output ./output

header-inject

How It Works

The payload is used as the name of a column in the CSV header row. Most consumers treat headers as column labels rather than data, so the payload is visible in raw text extraction but may be ignored by structured CSV importers that only process data rows.
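A small Python sketch makes the raw-versus-structured distinction concrete. The other column names and the placeholder payload are assumptions for illustration, and `csv.DictReader` stands in for a generic structured importer.

```python
import csv
import io

PAYLOAD = "PLACEHOLDER PAYLOAD"  # placeholder, not a real hemlock payload

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["ID", "Title", PAYLOAD])          # payload used as a column name
writer.writerow(["1", "Reference Document", ""])   # data row carries no payload
csv_text = buf.getvalue()

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Values a data-row-only importer would index.
data_values = [value for row in rows for value in row.values()]
# Column labels, which many importers treat as schema rather than content.
labels = reader.fieldnames
```

The payload appears in `csv_text` and in `labels`, but not in `data_values`, so whether it reaches downstream text depends on whether the extractor emits the header row.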

Framework Survival

Framework Survives Mechanism
LangChain Raw text pass-through includes header row
LlamaIndex May use headers as schema, not content
Unstructured Raw text pass-through
Haystack Raw text pass-through

CLI Example

hemlock craft --format csv --technique header-inject --payload override --output ./output

Survival Matrix

Technique Stealth LangChain LlamaIndex Haystack Unstructured
extra-column 45
bom-prefix 50
formula-injection 60
quoted-field 55
header-inject 50

CSV is universally extracted

CSV files are plain text, so most techniques survive extraction in all major RAG frameworks. The differences arise in how each framework handles formula evaluation, multiline fields, and column selection.