CSV Techniques
hemlock provides five hiding techniques for CSV files. CSV is a ubiquitous data format in data pipelines, spreadsheet exports, and ETL workflows. Because CSV is plain text, hiding payloads requires structural tricks rather than format-level concealment.
Technique Overview
| Technique | Stealth | Description |
|---|---|---|
| extra-column | 45 | Payload in extra _metadata column |
| bom-prefix | 50 | Payload after UTF-8 BOM in Notes field |
| formula-injection | 60 | =CONCATENATE() formula reassembles payload fragments |
| quoted-field | 55 | Payload in multiline quoted CSV field |
| header-inject | 50 | Payload as a CSV header cell name |
extra-column
How It Works
The payload is placed as a cell value under an additional _metadata column appended to each row. The cover text is split into sections across the primary columns. Many CSV consumers ignore unexpected columns, so the payload may pass through unnoticed.
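The layout described above can be sketched with Python's standard `csv` module. This is an illustration rather than hemlock's implementation: the function name `build_extra_column_csv`, the `ID`/`Title`/`Content` primary columns, the placeholder payload, and the choice to fill only the first row's `_metadata` cell are all assumptions.

```python
import csv
import io


def build_extra_column_csv(sections, payload):
    """Return CSV text with a trailing _metadata column (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ID", "Title", "Content", "_metadata"])
    for i, (title, content) in enumerate(sections, start=1):
        # Assumption: the payload sits in the first data row only;
        # later rows leave the _metadata cell empty.
        writer.writerow([i, title, content, payload if i == 1 else ""])
    return buf.getvalue()


sections = [
    ("Reference Document", "Cover text, part one."),
    ("Supporting Data", "Cover text, part two."),
]
csv_text = build_extra_column_csv(sections, "PLACEHOLDER-PAYLOAD")

# A consumer that selects only the columns it expects never sees the payload,
rows = list(csv.DictReader(io.StringIO(csv_text)))
known_only = [{k: r[k] for k in ("ID", "Title", "Content")} for r in rows]

# while raw text extraction keeps every column, including _metadata.
payload_in_raw_text = "PLACEHOLDER-PAYLOAD" in csv_text
```

The contrast between `known_only` and `payload_in_raw_text` is the point of the technique: column-selecting consumers drop the payload, raw text pass-through does not.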
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw text pass-through, all columns visible |
| LlamaIndex | | Raw text pass-through |
| Unstructured | | Raw text pass-through |
| Haystack | | Raw text pass-through |
CLI Example
bom-prefix
How It Works
The CSV file begins with a UTF-8 BOM (\xEF\xBB\xBF) followed by the standard CSV content. The payload is embedded in a Notes field. The BOM is invisible in most text editors and may cause some parsers to behave unexpectedly.
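To make the BOM behaviour concrete, here is a minimal sketch using only the standard library. The helper name `build_bom_csv`, the column names, and the placeholder payload are assumptions for illustration, not hemlock's actual output format.

```python
import codecs
import csv
import io


def build_bom_csv(cover, payload):
    """Return CSV bytes prefixed with a UTF-8 BOM (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ID", "Title", "Content", "Notes"])
    writer.writerow([1, "Reference Document", cover, payload])
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    return codecs.BOM_UTF8 + buf.getvalue().encode("utf-8")


data = build_bom_csv("Cover text.", "PLACEHOLDER-PAYLOAD")

# Decoding as plain UTF-8 keeps the BOM as U+FEFF, attached to the first header.
naive_header = next(csv.reader(io.StringIO(data.decode("utf-8"))))

# The utf-8-sig codec strips the BOM before parsing.
clean_header = next(csv.reader(io.StringIO(data.decode("utf-8-sig"))))
```

A parser that decodes with plain `utf-8` sees a first column named `"\ufeffID"` instead of `"ID"`, which is one way a BOM can make a consumer behave unexpectedly.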
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw text pass-through |
| LlamaIndex | | Raw text pass-through |
| Unstructured | | Raw text pass-through |
| Haystack | | Raw text pass-through |
CLI Example
formula-injection
How It Works
The payload is split into four character-based parts and distributed across reference columns (Ref_A through Ref_D). A =CONCATENATE() formula in the Notes column reconstructs the full payload from these fragments:
ID,Title,Content,Notes,Ref_A,Ref_B,Ref_C,Ref_D
1,"Reference Document","Cover text...","=CONCATENATE(E2,F2,G2,H2)","first quarter","second quarter","third quarter","fourth quarter"
2,"Supporting Data","Additional content...","","","","",""
Spreadsheet-based RAG extractors that evaluate formulas will reconstruct the full payload. Even without formula evaluation, raw CSV text extraction will capture the individual fragments in the reference columns.
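The split-and-reassemble idea can be sketched as follows. The exact splitting rule hemlock uses is not stated here, so the ceiling-division split in `split_four` is an assumption, as are the helper names and the placeholder payload.

```python
import csv
import io


def split_four(payload):
    """Split text into four contiguous character-based parts (assumed rule)."""
    step = -(-len(payload) // 4)  # ceiling division
    return [payload[i * step:(i + 1) * step] for i in range(4)]


def build_formula_csv(cover, payload):
    """Return CSV text in the layout shown above (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(
        ["ID", "Title", "Content", "Notes", "Ref_A", "Ref_B", "Ref_C", "Ref_D"]
    )
    # Columns E..H of spreadsheet row 2 hold the fragments the formula joins.
    writer.writerow(
        [1, "Reference Document", cover, "=CONCATENATE(E2,F2,G2,H2)",
         *split_four(payload)]
    )
    return buf.getvalue()


csv_text = build_formula_csv("Cover text...", "PLACEHOLDER-PAYLOAD")

# Without formula evaluation, the fragments are still readable in order.
row = list(csv.reader(io.StringIO(csv_text)))[1]
reassembled = "".join(row[4:8])
```

Joining `Ref_A` through `Ref_D` in column order gives the same string a spreadsheet would produce in the Notes cell, which is why the fragments alone are enough for a raw text extractor.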
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw CSV text includes formula strings and all columns |
| Unstructured | | Raw text pass-through includes all fields |
| Haystack | | Raw text pass-through |
| LlamaIndex | | CSV extraction returns empty for formula-based content |
Formula evaluation vs. raw extraction
The =CONCATENATE() formula is a secondary attack vector. The primary payload delivery is through the raw reference column values, which survive any extractor that passes the raw CSV text through. Spreadsheet applications (Excel, Google Sheets) that evaluate the formula will reconstruct the payload in the Notes cell.
CLI Example
Detection Notes
Detection
- Flag CSV cells beginning with `=`, `+`, `-`, or `@` as potential formula injection
- Inspect reference/metadata columns for instruction-like content
- Check `CONCATENATE` or similar formula references that reconstruct content from multiple cells
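The first and third checks above could be approximated with a small scanner like the one below. It is a rough sketch, not a hemlock feature: `scan_csv`, the finding labels, and the list of function names in `FORMULA_CALL` are assumptions, and the instruction-like content check is left out because it needs more than pattern matching.

```python
import csv
import io
import re

FORMULA_PREFIXES = ("=", "+", "-", "@")
# Assumed list of "similar" functions; extend it to match your environment.
FORMULA_CALL = re.compile(r"\b(CONCATENATE|CONCAT|TEXTJOIN)\s*\(", re.IGNORECASE)


def scan_csv(text):
    """Return (row, column, reason) tuples for suspicious cells, 1-indexed."""
    findings = []
    reader = csv.reader(io.StringIO(text, newline=""))
    for r, row in enumerate(reader, start=1):
        for c, cell in enumerate(row, start=1):
            value = cell.lstrip()
            if value.startswith(FORMULA_PREFIXES):
                findings.append((r, c, "formula-prefix"))
            if FORMULA_CALL.search(value):
                findings.append((r, c, "formula-reference"))
    return findings


sample = 'ID,Notes,Ref_A\n1,"=CONCATENATE(C2,C2)",fragment\n2,plain text,more\n'
findings = scan_csv(sample)
```

The prefix check also matches legitimate values such as negative numbers (`-5`), so findings are best treated as candidates for review rather than confirmed injections.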
quoted-field
How It Works
The payload is hidden inside a multiline quoted CSV field. The payload text sits on its own lines within a single cell, separated from the visible content by blank lines. CSV parsers that correctly handle RFC 4180 quoted fields will include the full cell content in extraction; simpler line-based parsers may split or lose the embedded text.
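The difference between an RFC 4180-aware parser and a line-based one can be seen in a short sketch. The helper `build_quoted_field_csv`, the column names, and the single blank separator line are assumptions for illustration.

```python
import csv
import io


def build_quoted_field_csv(visible, payload):
    """Return CSV text whose Content cell spans several lines (hypothetical)."""
    # One blank line separates the visible text from the payload (assumed).
    cell = f"{visible}\n\n{payload}"
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ID", "Title", "Content"])
    # csv.writer quotes the cell automatically because it contains newlines.
    writer.writerow([1, "Reference Document", cell])
    return buf.getvalue()


csv_text = build_quoted_field_csv("Cover text.", "PLACEHOLDER-PAYLOAD")

# RFC 4180-aware parsing: one header record, one data record, full cell kept.
records = list(csv.reader(io.StringIO(csv_text, newline="")))

# Naive line-based parsing: every physical line becomes its own "row".
naive = [line.split(",") for line in csv_text.splitlines()]
```

Here `records` holds two rows with the payload inside the third cell of the second row, while `naive` holds four fragments, with the payload stranded on a line of its own.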
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw text pass-through includes full quoted fields |
| LlamaIndex | | May not correctly preserve multiline quoted fields |
| Unstructured | | Raw text pass-through |
| Haystack | | Raw text pass-through |
CLI Example
header-inject
How It Works
The payload is used as the name of a CSV header column. Most consumers use headers as column labels rather than data, so the payload is visible in raw text extraction but may be ignored by structured CSV importers that only process data rows.
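A minimal sketch of this behaviour, assuming a hypothetical `build_header_inject_csv` helper, illustrative column names, and a placeholder payload:

```python
import csv
import io


def build_header_inject_csv(rows, payload):
    """Return CSV text whose last header cell is the payload (hypothetical)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ID", "Title", "Content", payload])
    for row in rows:
        # The injected column carries no data, so its cells stay empty.
        writer.writerow([*row, ""])
    return buf.getvalue()


csv_text = build_header_inject_csv(
    [(1, "Reference Document", "Cover text.")], "PLACEHOLDER-PAYLOAD"
)

# A structured importer that keeps only data values never emits the header text.
records = list(csv.DictReader(io.StringIO(csv_text)))
data_values = [value for record in records for value in record.values()]

# Raw text extraction includes the header row, and with it the payload.
payload_in_raw_text = "PLACEHOLDER-PAYLOAD" in csv_text
```

With `csv.DictReader`, the payload appears only as a dictionary key, so an importer that reads values alone skips it, whereas any raw text pass-through carries it along.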
Framework Survival
| Framework | Survives | Mechanism |
|---|---|---|
| LangChain | | Raw text pass-through includes header row |
| LlamaIndex | | May use headers as schema, not content |
| Unstructured | | Raw text pass-through |
| Haystack | | Raw text pass-through |
CLI Example
Survival Matrix
| Technique | Stealth | LangChain | LlamaIndex | Haystack | Unstructured |
|---|---|---|---|---|---|
| extra-column | 45 | | | | |
| bom-prefix | 50 | | | | |
| formula-injection | 60 | | | | |
| quoted-field | 55 | | | | |
| header-inject | 50 | | | | |
CSV is universally extracted
CSV files are plain text, so most techniques survive all major RAG frameworks. The differences arise in how frameworks handle formula evaluation, multiline fields, and column selection.