Architecture¶
This page documents hemlock's internal architecture, package layout, data flow, and the reasoning behind key design decisions. It also provides step-by-step guides for extending hemlock with new formats and techniques.
Package Layout¶
graph TD
subgraph "cmd/hemlock"
CLI["CLI (Cobra commands)<br/>craft, batch, validate,<br/>list-techniques, list-payloads"]
end
subgraph "pkg/craft"
Craft["craft.go<br/>Craft(), ListTechniques()"]
Opts["options.go<br/>CraftOptions"]
Doc["document.go<br/>Document, TechniqueInfo"]
Cover["covertext.go<br/>Cover text templates"]
Opt["optimize.go<br/>CEM optimizer"]
OptGen["optimize_genetic.go<br/>Genetic (DIGA) optimizer"]
OptWB["optimize_whitebox.go<br/>Whitebox gradient optimizer"]
ScoreInj["score_injection.go<br/>Reward model HTTP client"]
Score["score.go<br/>Stealth scoring"]
end
subgraph "pkg/payloads"
Pay["payloads.go<br/>ListPayloads(), GetPayload(),<br/>ResolvePayload()"]
OV["override.go"]
EX["exfiltrate.go"]
RE["redirect.go"]
DE["denial.go"]
MS["multistage.go"]
AU["authority.go"]
AD["adaptive.go"]
end
subgraph "pkg/formats"
HTML["html/<br/>comment, invisible-div,<br/>aria-hidden, css-hide,<br/>microdata, chunk-boundary,<br/>offscreen, color-transparent,<br/>noscript"]
DOCX["docx/<br/>metadata, fontzero,<br/>whitefont, comment, custom-xml,<br/>metadata-distributed,<br/>chunk-boundary, hidden-paragraph"]
PDF["pdf/<br/>annotation, invisible-text,<br/>javascript, xmp-metadata,<br/>xmp-distributed, chunk-boundary,<br/>offpage"]
TXT["txt/<br/>zero-width, homoglyph,<br/>bidi-override, chunk-boundary"]
MD["markdown/<br/>html-comment, frontmatter,<br/>link-title, image-alt,<br/>chunk-boundary"]
RTF["rtf/<br/>metadata, fontzero,<br/>comment"]
EPUB["epub/<br/>metadata, css-hide,<br/>comment, aria-hidden,<br/>metadata-distributed, toc"]
CSV["csv/<br/>extra-column, bom-prefix,<br/>formula-injection"]
JSON["json/<br/>metadata-key,<br/>unicode-escape"]
XLSX["xlsx/<br/>hidden-sheet, metadata,<br/>comment, fontzero"]
IMG["image/<br/>text-chunk, xmp-metadata,<br/>multi-chunk, steganographic"]
end
subgraph "pkg/validate"
Val["validate.go<br/>Validate(), ValidateFile()"]
LC["langchain.go"]
LI["llamaindex.go"]
UN["unstructured.go"]
HS["haystack.go"]
Help["helpers.go<br/>28 shared extraction utils"]
end
CLI --> Craft
Craft --> Opts
Craft --> Doc
Craft --> Cover
Craft --> Pay
Craft --> HTML & DOCX & PDF & TXT & MD & RTF & EPUB & CSV & JSON & XLSX & IMG
CLI --> Val
Pay --> OV & EX & RE & DE & MS & AU
Val --> LC & LI & UN & HS
LC --> Help
LI --> Help
UN --> Help
HS --> Help
Directory Structure¶
hemlock/
+-- cmd/
|   +-- hemlock/                  # CLI entry point (Cobra root + subcommands)
+-- pkg/
|   +-- craft/                    # High-level orchestration
|   |   +-- craft.go              # Craft(), ListTechniques()
|   |   +-- options.go            # CraftOptions struct
|   |   +-- document.go           # Document, TechniqueInfo structs
|   |   +-- covertext.go          # Cover text generation
|   |   +-- optimize.go           # CEM optimizer (with injection score blending)
|   |   +-- optimize_genetic.go   # Genetic (DIGA) optimizer
|   |   +-- optimize_whitebox.go  # Whitebox gradient optimizer
|   |   +-- score_injection.go    # Reward model HTTP client
|   |   +-- score.go              # Stealth scoring
|   +-- payloads/                 # Payload registry
|   |   +-- payloads.go           # ListPayloads(), GetPayload(), ResolvePayload()
|   |   +-- override.go           # Override category (10 variants)
|   |   +-- exfiltrate.go         # Exfiltrate category (10 variants)
|   |   +-- redirect.go           # Redirect category (10 variants)
|   |   +-- denial.go             # Denial category (10 variants)
|   |   +-- multistage.go         # Multistage category (20 variants)
|   |   +-- authority.go          # Authority category (10 variants)
|   |   +-- adaptive.go           # Model-family payload adaptation
|   +-- embed/                    # Embedding providers
|   |   +-- embed.go              # Provider interface
|   |   +-- ollama.go             # Ollama embedding provider
|   |   +-- openai.go             # OpenAI embedding provider
|   +-- formats/                  # Per-format document generators (11 formats, 57 techniques)
|   |   +-- html/                 # HTML generation (10 techniques)
|   |   +-- docx/                 # DOCX generation (8 techniques)
|   |   +-- pdf/                  # PDF generation (7 techniques)
|   |   +-- txt/                  # TXT generation (5 techniques)
|   |   +-- markdown/             # Markdown generation (5 techniques)
|   |   +-- rtf/                  # RTF generation (3 techniques)
|   |   +-- epub/                 # EPUB generation (6 techniques)
|   |   +-- csv/                  # CSV generation (3 techniques)
|   |   +-- json/                 # JSON generation (2 techniques)
|   |   +-- xlsx/                 # XLSX generation (4 techniques)
|   |   +-- image/                # Image/PNG generation (4 techniques)
|   +-- validate/                 # Framework extraction simulation
|       +-- validate.go           # Validate(), ValidateFile(), framework dispatch
|       +-- langchain.go          # LangChain extraction simulation
|       +-- llamaindex.go         # LlamaIndex extraction simulation
|       +-- unstructured.go       # Unstructured.io extraction simulation
|       +-- haystack.go           # Haystack extraction simulation
|       +-- helpers.go            # 28 shared HTML/XML/PDF/RTF/EPUB parsing utilities
+-- test/
|   +-- integration/              # End-to-end craft-then-validate tests
+-- docs/                         # MkDocs documentation
+-- testdata/                     # Test fixtures
+-- Makefile                      # Build, test, docs targets
+-- mkdocs.yml                    # Documentation site configuration
Data Flow¶
Document Generation¶
sequenceDiagram
participant User
participant CLI
participant Craft
participant Payloads
participant Format
participant Disk
User->>CLI: hemlock craft --format docx --payload override
CLI->>Craft: Craft(CraftOptions)
Craft->>Craft: Apply defaults (count=5, framework=generic)
Craft->>Craft: Generate cover text from topic
Craft->>Craft: Resolve technique list
loop For each technique
loop For each variant (0..count-1)
Craft->>Payloads: ResolvePayload(category, custom, variantIndex)
Payloads-->>Craft: Payload text
Craft->>Format: Generate(payload, coverText, technique)
Format-->>Craft: Document bytes
Craft->>Craft: Build Document struct
end
end
alt OutputDir is set
Craft->>Disk: Write files to OutputDir
end
Craft-->>CLI: []Document
CLI->>User: Summary output
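The nested loop in the diagram can be sketched in Go as follows. This is a minimal illustration, not hemlock's actual implementation: the `Document` fields, the `resolveFunc` type (a simplified stand-in for `ResolvePayload`, which also takes a category and custom payload), and the `craftAll` name are all assumptions made for the example. Only the `Generate(payload, coverText, technique)` shape and the default count of 5 come from this page.

```go
package main

import "fmt"

// Document is a simplified stand-in for the struct in document.go;
// the real field set is not shown on this page.
type Document struct {
	Technique string
	Variant   int
	Content   []byte
}

// generateFunc mirrors the Generate signature every format package exports.
type generateFunc func(payload, coverText, technique string) ([]byte, error)

// resolveFunc is a hypothetical, reduced form of payloads.ResolvePayload.
type resolveFunc func(variant int) (string, error)

// craftAll produces one document per technique per variant index in [0, count).
func craftAll(techniques []string, count int, coverText string,
	resolve resolveFunc, generate generateFunc) ([]Document, error) {
	if count <= 0 {
		count = 5 // default shown in the sequence diagram
	}
	docs := make([]Document, 0, len(techniques)*count)
	for _, tech := range techniques {
		for v := 0; v < count; v++ {
			payload, err := resolve(v)
			if err != nil {
				return nil, fmt.Errorf("resolve variant %d: %w", v, err)
			}
			content, err := generate(payload, coverText, tech)
			if err != nil {
				return nil, fmt.Errorf("generate %s: %w", tech, err)
			}
			docs = append(docs, Document{Technique: tech, Variant: v, Content: content})
		}
	}
	return docs, nil
}

func main() {
	resolve := func(v int) (string, error) { return fmt.Sprintf("payload-%d", v), nil }
	generate := func(p, c, t string) ([]byte, error) { return []byte(t + ":" + p + ":" + c), nil }
	docs, err := craftAll([]string{"comment", "css-hide"}, 2, "cover", resolve, generate)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(docs), "documents generated")
}
```

Passing the resolver and generator as function values keeps the loop testable without touching the payload registry or any format package.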
Batch + Validate Pipeline¶
The batch command generates documents across all formats and techniques, writing a .hemlock-manifest.json alongside them. The validate --dir command reads the manifest and validates every file against all four frameworks:
sequenceDiagram
participant User
participant CLI
participant Craft
participant Validate
participant Disk
User->>CLI: hemlock batch --payload override --output-dir ./out
CLI->>Craft: Craft() per format
Craft->>Disk: Write files + .hemlock-manifest.json
User->>CLI: hemlock validate --dir ./out
CLI->>Disk: Read .hemlock-manifest.json
loop For each file in manifest
loop For each framework
CLI->>Validate: ValidateFile(file, framework, payload)
Validate-->>CLI: ValidationResult
end
end
CLI->>User: JSON results
Validation Engine¶
The validation engine simulates how four RAG frameworks extract text from documents. Each framework file (langchain.go, llamaindex.go, unstructured.go, haystack.go) contains a dispatch function that routes by file format, then calls format-specific extraction logic.
Framework Extraction Behavior¶
The simulation was validated against live pipelines and matched in 131 of 142 comparisons (92.3% accuracy). Testing surfaced the following framework-specific behaviors:
| Behavior | LangChain | LlamaIndex | Unstructured | Haystack |
|---|---|---|---|---|
| HTML parsing | Strips tags, decodes entities | Returns raw HTML (no parsing) | Strips tags, preserves hidden/aria-hidden text | Strips tags, strips aria-hidden |
| DOCX extraction | w:t elements only (no metadata) | w:t elements + metadata | Full XML text content | DOCXToDocument converter |
| PDF text | BT/ET blocks + FlateDecode decompression | BT/ET blocks + FlateDecode decompression | BT/ET blocks only (pdfminer filters invisible text) | BT/ET blocks + FlateDecode decompression |
| RTF extraction | Raw string content | Raw string content | Body text, strips annotation/info groups | Raw string content |
| EPUB handling | Chapter XHTML via langchainHTML | Chapter XHTML via llamaindexEPUBHTML (stricter) | Chapter XHTML via unstructuredHTML | Returns empty (no native converter) |
| XLSX handling | Shared strings only (no metadata) | Returns empty (PandasExcelReader crashes) | XML text content | Returns empty (no native converter) |
| CSV handling | Text content | Returns empty (PandasCSVReader fails) | Text content | Text content |
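As a rough illustration of the per-framework dispatch described above, the sketch below routes a few formats the way the Haystack column of the table describes. The function name `extractHaystackSketch` and its signature are assumptions for this example; the real dispatch function in haystack.go may differ, and only the formats with trivial behavior are covered here.

```go
package main

import (
	"fmt"
	"strings"
)

// extractHaystackSketch routes by format name, following the Haystack column
// of the table above. Name and signature are hypothetical.
func extractHaystackSketch(format string, data []byte) (string, error) {
	switch strings.ToLower(format) {
	case "csv", "rtf":
		// Table: CSV is "Text content", RTF is "Raw string content".
		return string(data), nil
	case "epub", "xlsx":
		// Table: "Returns empty (no native converter)".
		return "", nil
	default:
		// HTML, DOCX, and PDF need real parsing and are omitted from this sketch.
		return "", fmt.Errorf("haystack sketch: unsupported format %q", format)
	}
}

func main() {
	for _, f := range []string{"csv", "epub", "pdf"} {
		out, err := extractHaystackSketch(f, []byte("name,notes"))
		fmt.Printf("%s: %q, err=%v\n", f, out, err)
	}
}
```

Returning an empty string with a nil error for EPUB and XLSX distinguishes "the framework extracts nothing" from "the simulator cannot handle this format".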
Helper Functions¶
The helpers.go file contains 28 shared utilities organized by domain:
- HTML: stripHTMLTags, stripHTMLComments, exposeHTMLComments, decodeHTMLEntities, normalizeWhitespace, stripHiddenElements, stripAriaHidden, stripElementsWithAttribute, extractTagName
- XML/DOCX: extractWTElements, extractXMLTextContent, extractTElements, readZipFile
- PDF: extractPDFText (with FlateDecode decompression), extractPDFTextSimple (raw BT/ET only), decompressPDFStreams, extractPDFStringOperands, extractParenthesized, findMatchingParen
- RTF: extractRTFBodyText, extractRTFMetadata, extractRTFAnnotations, stripRTFGroup
- EPUB: extractEPUBParts, extractEPUBChapterText
- XLSX: extractXLSXSharedStrings, extractXLSXComments, extractXLSXMetadata
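To give a feel for what string-based extraction looks like, here is one possible way a helper such as extractWTElements could be written. This is a guess at the approach, not the code in helpers.go: the `extractWTElementsSketch` name, the regular expression, and the choice to concatenate runs without a separator are all assumptions. XML entities are left undecoded.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wtRe matches <w:t> and <w:t attr="..."> but not <w:tab/>, <w:tbl>, or <w:tc>,
// because the tag name must be followed directly by whitespace or ">".
var wtRe = regexp.MustCompile(`(?s)<w:t(?:\s[^>]*)?>(.*?)</w:t>`)

// extractWTElementsSketch concatenates the text of every <w:t> element
// found in a WordprocessingML document part.
func extractWTElementsSketch(documentXML string) string {
	var sb strings.Builder
	for _, m := range wtRe.FindAllStringSubmatch(documentXML, -1) {
		sb.WriteString(m[1]) // m[1] is the captured element text
	}
	return sb.String()
}

func main() {
	part := `<w:p><w:r><w:t>Hello</w:t></w:r>` +
		`<w:r><w:tab/><w:t xml:space="preserve"> world</w:t></w:r></w:p>`
	fmt.Printf("%q\n", extractWTElementsSketch(part))
}
```

A regular expression is enough here because `<w:t>` elements hold only character data; a full XML parser would be the safer choice for nested content.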
Design Decisions¶
Why No DOCX Library?¶
DOCX files are ZIP archives containing XML. hemlock constructs these archives directly using Go's archive/zip and string templates for the XML parts. This approach:
- Eliminates a dependency. No need for a Go DOCX library that may not support the low-level XML manipulation needed for techniques like custom-xml and fontzero.
- Provides precise control. Techniques like fontzero require placing a <w:r> run with specific <w:rPr> properties. Direct XML construction makes this straightforward.
- Keeps the output minimal. The generated DOCX files contain only the XML parts needed, without template bloat.
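The approach can be sketched with the standard library alone. The example below assembles a three-part DOCX archive with `archive/zip` and string templates. It is an illustration under assumptions: the `buildDocxSketch` and `escapeXML` names are hypothetical, and the `<w:sz w:val="1"/>` run property is only a plausible stand-in for whatever hemlock's fontzero technique actually emits.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
)

const contentTypes = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>`

const rootRels = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>`

// documentTmpl holds one visible paragraph and one run with a tiny font size.
// The w:sz value is an assumption, not hemlock's confirmed output.
const documentTmpl = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p><w:r><w:t>%s</w:t></w:r></w:p>
<w:p><w:r><w:rPr><w:sz w:val="1"/></w:rPr><w:t>%s</w:t></w:r></w:p>
</w:body>
</w:document>`

// escapeXML makes arbitrary text safe to place inside an XML element.
func escapeXML(s string) string {
	var buf bytes.Buffer
	_ = xml.EscapeText(&buf, []byte(s)) // writing to a bytes.Buffer cannot fail
	return buf.String()
}

// buildDocxSketch writes the minimal set of parts into a ZIP archive.
func buildDocxSketch(payload, coverText string) ([]byte, error) {
	parts := []struct{ name, body string }{
		{"[Content_Types].xml", contentTypes},
		{"_rels/.rels", rootRels},
		{"word/document.xml", fmt.Sprintf(documentTmpl, escapeXML(coverText), escapeXML(payload))},
	}
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, p := range parts {
		w, err := zw.Create(p.name)
		if err != nil {
			return nil, fmt.Errorf("create %s: %w", p.name, err)
		}
		if _, err := w.Write([]byte(p.body)); err != nil {
			return nil, fmt.Errorf("write %s: %w", p.name, err)
		}
	}
	if err := zw.Close(); err != nil {
		return nil, fmt.Errorf("close archive: %w", err)
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := buildDocxSketch("hidden <payload>", "Visible cover text")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("built a %d-byte archive\n", len(data))
}
```

Because every part is a plain string, a technique only needs to swap in a different `word/document.xml` template or add another part to the list.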
Why gofpdf?¶
PDF is a complex binary format that is impractical to generate from scratch. gofpdf is a mature, well-tested Go library for PDF creation. It is the only external dependency outside the CLI framework (Cobra).
Why Cobra?¶
Cobra is the standard CLI framework in the Go ecosystem. It provides subcommand routing, flag parsing, help generation, and shell completion. The library API does not depend on Cobra; it is used only in the cmd/hemlock package.
Why Simulated Validation?¶
Running the real Python frameworks (LangChain, LlamaIndex, Unstructured, Haystack) would require Python, virtual environments, and their dependency trees. hemlock's validation engine replicates the extraction behavior in pure Go, which:
- Keeps hemlock a single static binary. No Python runtime needed.
- Runs anywhere Go compiles. Cross-platform without Python portability concerns.
- Starts quickly. No interpreter startup or library import overhead.
The tradeoff is that edge cases may diverge from the real frameworks. The Confidence field in ValidationResult communicates this uncertainty. Accuracy is verified against live pipelines via the hemlock-lab test harness.
Why Separate Format Packages?¶
Each format package is independent and self-contained. This design:
- Allows each format to evolve without affecting others.
- Makes it easy to add new formats without modifying existing code.
- Enables direct usage of format packages when the craft orchestration layer is not needed.
Why HTTP for Injection Scoring?¶
The joint optimization reward model runs as a Python FastAPI server (reward_server.py) that the Go optimizers query via HTTP POST. The alternative would be embedding the ML model directly in Go via ONNX Runtime or CGo. The HTTP approach was chosen because:
- The ML stack stays in Python. PyTorch, scikit-learn, and sentence-transformers have no mature Go equivalents. Maintaining two ML implementations would double the maintenance burden.
- Overhead is negligible. A local HTTP call adds ~10ms per candidate, compared to ~200ms for embedding computation and ~2s for Ollama model inference. The optimizer is not bottlenecked by scoring.
- The server can be shared. Multiple hemlock instances or optimizer runs can query the same reward server.
- Backward compatibility. When InjectionWeight is 0 (the default), no HTTP calls are made. The reward server is only needed when joint optimization is explicitly enabled.
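A client along these lines might look like the sketch below. The JSON request and response shapes, the `rewardClient` type, and the linear blending formula are assumptions for illustration; this page does not specify the reward server's API or how the optimizers combine scores. The one behavior taken from the text is that a weight of 0 skips the HTTP call entirely.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Hypothetical wire format; the real reward_server.py schema is not documented here.
type scoreRequest struct {
	Text string `json:"text"`
}
type scoreResponse struct {
	Score float64 `json:"score"`
}

type rewardClient struct {
	URL  string
	HTTP *http.Client
}

// score POSTs the candidate text and returns the server's injection score.
func (c *rewardClient) score(text string) (float64, error) {
	body, err := json.Marshal(scoreRequest{Text: text})
	if err != nil {
		return 0, err
	}
	resp, err := c.HTTP.Post(c.URL, "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("reward server returned %s", resp.Status)
	}
	var out scoreResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return out.Score, nil
}

// blendedScore mixes a base score with the injection score.
// With injectionWeight == 0 the client is never touched, so it may be nil.
func blendedScore(c *rewardClient, text string, base, injectionWeight float64) (float64, error) {
	if injectionWeight == 0 {
		return base, nil
	}
	inj, err := c.score(text)
	if err != nil {
		return 0, err
	}
	// Linear blend: an assumed formula, chosen only for illustration.
	return (1-injectionWeight)*base + injectionWeight*inj, nil
}

func main() {
	// A fake reward server stands in for the Python process.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_ = json.NewEncoder(w).Encode(scoreResponse{Score: 0.4})
	}))
	defer srv.Close()

	c := &rewardClient{URL: srv.URL, HTTP: srv.Client()}
	s, err := blendedScore(c, "candidate text", 0.8, 0.5)
	fmt.Printf("blended=%.2f err=%v\n", s, err)
}
```

Keeping the zero-weight check ahead of any network access is what lets existing workflows run without a reward server.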
Known Limitations¶
The remaining 7.7% accuracy gap comes from two sources:
- TXT Unicode techniques (10 drifts): bidi-override, homoglyph, and zero-width use Unicode transformations. hemlock's validate uses semantic matching (e.g., reversing bidi text, decoding zero-width sequences) while the raw extraction test uses literal string search. The validator is correct; the test methodology diverges.
- EPUB css-hide on LlamaIndex (1 drift): CSS class-based hiding (<style>.x{font-size:0}</style> + <span class="x">payload</span>) requires parsing <style> blocks and matching class selectors, which is beyond the scope of string-based extraction simulation.
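The difference between literal search and semantic matching can be shown with a small sketch. It assumes two simple encodings that may not match hemlock's actual techniques: zero-width characters interleaved into the payload, and a payload stored reversed between bidi control characters. The function names are hypothetical, and homoglyph handling is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// stripInvisible removes zero-width and bidi control characters.
func stripInvisible(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\u200B', '\u200C', '\u200D', '\uFEFF', // zero-width characters
			'\u202A', '\u202B', '\u202C', '\u202D', '\u202E': // bidi controls
			return -1 // a negative value drops the rune
		}
		return r
	}, s)
}

// reverseRunes reverses a string by code point.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// literalMatch is what a raw extraction test does.
func literalMatch(extracted, payload string) bool {
	return strings.Contains(extracted, payload)
}

// semanticMatch normalizes first, then accepts the payload in either direction.
func semanticMatch(extracted, payload string) bool {
	clean := stripInvisible(extracted)
	return strings.Contains(clean, payload) || strings.Contains(clean, reverseRunes(payload))
}

func main() {
	samples := map[string]string{
		"zero-width": "ig\u200Bno\u200Bre",
		"bidi":       "\u202Eerongi\u202C",
	}
	for name, text := range samples {
		fmt.Printf("%s: literal=%v semantic=%v\n",
			name, literalMatch(text, "ignore"), semanticMatch(text, "ignore"))
	}
}
```

Both samples fail the literal check and pass the semantic one, which is the kind of disagreement counted as a drift above.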
Adding a New Format¶
To add a new document format:
Step 1: Create the Package¶
pkg/formats/newformat/
+-- newformat.go        # Techniques() and Generate()
+-- technique1.go       # First technique implementation
+-- newformat_test.go   # Tests
Step 2: Implement the Interface¶
Every format package exports the same two functions:
package newformat

import "fmt"

// Techniques lists the technique names this format supports.
func Techniques() []string {
	return []string{"technique-name"}
}

// Generate dispatches to the implementation for the requested technique.
func Generate(payload, coverText, technique string) ([]byte, error) {
	switch technique {
	case "technique-name":
		return generateTechnique(payload, coverText)
	default:
		return nil, fmt.Errorf("newformat: unknown technique %q", technique)
	}
}
Step 3: Register in craft.go¶
Add the format to the generators map in pkg/craft/craft.go:
import "github.com/professor-moody/hemlock/pkg/formats/newformat"

var generators = map[string]formatGenerator{
	// ... existing formats ...
	"newformat": {newformat.Techniques, newformat.Generate, ".ext"},
}
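The map literal above implies a three-field struct: a technique lister, a generator, and a file extension. A self-contained sketch of how such a registry could be defined and queried follows. The field names, the `generateFor` helper, and the demo functions are assumptions; only the positional shape of the entry comes from the snippet above.

```go
package main

import "fmt"

// formatGenerator bundles what the registry needs to know about one format.
// Field names are hypothetical; the order matches the map literal above.
type formatGenerator struct {
	techniques func() []string
	generate   func(payload, coverText, technique string) ([]byte, error)
	ext        string
}

// demoTechniques and demoGenerate stand in for a real format package.
func demoTechniques() []string { return []string{"technique-name"} }

func demoGenerate(payload, coverText, technique string) ([]byte, error) {
	if technique != "technique-name" {
		return nil, fmt.Errorf("newformat: unknown technique %q", technique)
	}
	return []byte(coverText + "\n" + payload), nil
}

var generators = map[string]formatGenerator{
	"newformat": {demoTechniques, demoGenerate, ".ext"},
}

// generateFor looks up the format and returns the document bytes
// together with the file extension to use when writing it.
func generateFor(format, technique, payload, coverText string) ([]byte, string, error) {
	g, ok := generators[format]
	if !ok {
		return nil, "", fmt.Errorf("unknown format %q", format)
	}
	data, err := g.generate(payload, coverText, technique)
	if err != nil {
		return nil, "", err
	}
	return data, g.ext, nil
}

func main() {
	data, ext, err := generateFor("newformat", "technique-name", "payload", "cover")
	fmt.Printf("%d bytes, ext=%q, err=%v\n", len(data), ext, err)
}
```

Storing function values rather than an interface keeps each format package free of any dependency on the craft package.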
Step 4: Add Stealth Scores and Descriptions¶
Add entries to stealthScore() and techniqueDescription() in pkg/craft/craft.go for the new format and techniques.
Step 5: Add Validation Support¶
Add extraction functions in pkg/validate/ for each framework:
- langchain.go: Add the case to extractLangChain
- llamaindex.go: Add the case to extractLlamaIndex
- unstructured.go: Add the case to extractUnstructured
- haystack.go: Add the case to extractHaystack
Update detectFormat() in validate.go to recognize the new extension.
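For orientation, extension-based detection could be as simple as the sketch below. The real detectFormat() signature and its extension table are not shown on this page, so the `detectFormatSketch` name, the return type, and every mapping here (including `.md`, `.htm`, and `.png`) are assumptions.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extToFormat maps lowercase file extensions to format names.
// The entries are illustrative guesses, not hemlock's actual table.
var extToFormat = map[string]string{
	".html": "html", ".htm": "html", ".docx": "docx", ".pdf": "pdf",
	".txt": "txt", ".md": "markdown", ".rtf": "rtf", ".epub": "epub",
	".csv": "csv", ".json": "json", ".xlsx": "xlsx", ".png": "image",
	".ext": "newformat", // the entry a new format would add
}

// detectFormatSketch resolves a path to a format name, ignoring case.
func detectFormatSketch(path string) (string, error) {
	ext := strings.ToLower(filepath.Ext(path))
	if format, ok := extToFormat[ext]; ok {
		return format, nil
	}
	return "", fmt.Errorf("unrecognized extension %q", ext)
}

func main() {
	for _, p := range []string{"out/report.DOCX", "sample.ext", "archive.bin"} {
		format, err := detectFormatSketch(p)
		fmt.Printf("%s -> %q, err=%v\n", p, format, err)
	}
}
```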
Step 6: Add Tests and Documentation¶
- Write tests in pkg/formats/newformat/newformat_test.go
- Add a technique documentation page in docs/techniques/newformat.md
- Update the nav in mkdocs.yml
Adding a New Technique¶
To add a new hiding technique to an existing format (e.g., data-attribute for HTML):
Step 1: Create the Implementation File¶
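// pkg/formats/html/dataattr.go
package html

import "fmt"

// generateDataAttribute hides the payload in a data-* attribute on <body>.
// Both arguments are interpolated verbatim, without HTML escaping.
func generateDataAttribute(payload, coverText string) ([]byte, error) {
	doc := fmt.Sprintf(`<!DOCTYPE html>
<html>
<body data-context="%s">
<p>%s</p>
</body>
</html>`, payload, coverText)
	return []byte(doc), nil
}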
Step 2: Register in the Format Package¶
Update Techniques() and Generate() in pkg/formats/html/html.go:
func Techniques() []string {
	return []string{
		"comment", "invisible-div", "aria-hidden", "css-hide",
		// ... remaining existing techniques ...
		"data-attribute", // new
	}
}

func Generate(payload, coverText, technique string) ([]byte, error) {
	switch technique {
	// ... existing cases ...
	case "data-attribute":
		return generateDataAttribute(payload, coverText)
	default:
		return nil, fmt.Errorf("html: unknown technique %q", technique)
	}
}
Step 3: Add Metadata in craft.go¶
Add stealth score and description entries in pkg/craft/craft.go.
Step 4: Update Validation¶
If the new technique has different extraction behavior, update the relevant framework simulation functions in pkg/validate/.
Step 5: Test¶
Write tests that generate a document with the new technique and validate it against all four frameworks.
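The shape of such a check is sketched below for the data-attribute example from Step 1. The two extractors are deliberately crude stand-ins for the real simulations in pkg/validate/: following the HTML row of the extraction table, LlamaIndex is modeled as returning raw HTML and the other three as stripping tags. The `survivingFrameworks` helper and the extractor names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var tagRe = regexp.MustCompile(`<[^>]*>`)

type extractFunc func(doc []byte) string

// rawExtract models a framework that returns HTML without parsing it.
func rawExtract(doc []byte) string { return string(doc) }

// stripTagsExtract models a framework that drops tags and their attributes.
func stripTagsExtract(doc []byte) string {
	return tagRe.ReplaceAllString(string(doc), " ")
}

// Simplified per-framework behavior for HTML, per the extraction table.
var extractors = map[string]extractFunc{
	"langchain":    stripTagsExtract,
	"llamaindex":   rawExtract,
	"unstructured": stripTagsExtract,
	"haystack":     stripTagsExtract,
}

// survivingFrameworks lists the frameworks whose extracted text
// still contains the payload, in sorted order.
func survivingFrameworks(doc []byte, payload string) []string {
	var hits []string
	for name, extract := range extractors {
		if strings.Contains(extract(doc), payload) {
			hits = append(hits, name)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	doc := []byte(`<html><body data-context="PAYLOAD"><p>cover text</p></body></html>`)
	fmt.Println("payload survives in:", survivingFrameworks(doc, "PAYLOAD"))
}
```

In a real test the extractors would be replaced by calls into the validation package, with one assertion per framework about whether the payload is expected to survive.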
Next Steps¶
- Contributing Guide: Code style, testing requirements, and contribution workflow
- Go API Reference: Package documentation and usage examples