Architecture

This page documents hemlock's internal architecture, package layout, data flow, and the reasoning behind key design decisions. It also provides step-by-step guides for extending hemlock with new formats and techniques.


Package Layout

graph TD
    subgraph "cmd/hemlock"
        CLI["CLI (Cobra commands)<br/>craft, batch, validate,<br/>list-techniques, list-payloads"]
    end

    subgraph "pkg/craft"
        Craft["craft.go<br/>Craft(), ListTechniques()"]
        Opts["options.go<br/>CraftOptions"]
        Doc["document.go<br/>Document, TechniqueInfo"]
        Cover["covertext.go<br/>Cover text templates"]
        Opt["optimize.go<br/>CEM optimizer"]
        OptGen["optimize_genetic.go<br/>Genetic (DIGA) optimizer"]
        OptWB["optimize_whitebox.go<br/>Whitebox gradient optimizer"]
        ScoreInj["score_injection.go<br/>Reward model HTTP client"]
        Score["score.go<br/>Stealth scoring"]
    end

    subgraph "pkg/payloads"
        Pay["payloads.go<br/>ListPayloads(), GetPayload(),<br/>ResolvePayload()"]
        OV["override.go"]
        EX["exfiltrate.go"]
        RE["redirect.go"]
        DE["denial.go"]
        MS["multistage.go"]
        AU["authority.go"]
        AD["adaptive.go"]
    end

    subgraph "pkg/formats"
        HTML["html/<br/>comment, invisible-div,<br/>aria-hidden, css-hide,<br/>microdata, chunk-boundary,<br/>offscreen, color-transparent,<br/>noscript"]
        DOCX["docx/<br/>metadata, fontzero,<br/>whitefont, comment, custom-xml,<br/>metadata-distributed,<br/>chunk-boundary, hidden-paragraph"]
        PDF["pdf/<br/>annotation, invisible-text,<br/>javascript, xmp-metadata,<br/>xmp-distributed, chunk-boundary,<br/>offpage"]
        TXT["txt/<br/>zero-width, homoglyph,<br/>bidi-override, chunk-boundary"]
        MD["markdown/<br/>html-comment, frontmatter,<br/>link-title, image-alt,<br/>chunk-boundary"]
        RTF["rtf/<br/>metadata, fontzero,<br/>comment"]
        EPUB["epub/<br/>metadata, css-hide,<br/>comment, aria-hidden,<br/>metadata-distributed, toc"]
        CSV["csv/<br/>extra-column, bom-prefix,<br/>formula-injection"]
        JSON["json/<br/>metadata-key,<br/>unicode-escape"]
        XLSX["xlsx/<br/>hidden-sheet, metadata,<br/>comment, fontzero"]
        IMG["image/<br/>text-chunk, xmp-metadata,<br/>multi-chunk, steganographic"]
    end

    subgraph "pkg/validate"
        Val["validate.go<br/>Validate(), ValidateFile()"]
        LC["langchain.go"]
        LI["llamaindex.go"]
        UN["unstructured.go"]
        HS["haystack.go"]
        Help["helpers.go<br/>28 shared extraction utils"]
    end

    CLI --> Craft
    Craft --> Opts
    Craft --> Doc
    Craft --> Cover
    Craft --> Pay
    Craft --> HTML & DOCX & PDF & TXT & MD & RTF & EPUB & CSV & JSON & XLSX & IMG
    CLI --> Val
    Pay --> OV & EX & RE & DE & MS & AU
    Val --> LC & LI & UN & HS
    LC --> Help
    LI --> Help
    UN --> Help
    HS --> Help

Directory Structure

hemlock/
+-- cmd/
|   +-- hemlock/          # CLI entry point (Cobra root + subcommands)
+-- pkg/
|   +-- craft/            # High-level orchestration
|   |   +-- craft.go      # Craft(), ListTechniques()
|   |   +-- options.go    # CraftOptions struct
|   |   +-- document.go   # Document, TechniqueInfo structs
|   |   +-- covertext.go  # Cover text generation
|   |   +-- optimize.go   # CEM optimizer (with injection score blending)
|   |   +-- optimize_genetic.go  # Genetic (DIGA) optimizer
|   |   +-- optimize_whitebox.go # Whitebox gradient optimizer
|   |   +-- score_injection.go   # Reward model HTTP client
|   |   +-- score.go      # Stealth scoring
|   +-- payloads/         # Payload registry
|   |   +-- payloads.go   # ListPayloads(), GetPayload(), ResolvePayload()
|   |   +-- override.go   # Override category (10 variants)
|   |   +-- exfiltrate.go # Exfiltrate category (10 variants)
|   |   +-- redirect.go   # Redirect category (10 variants)
|   |   +-- denial.go     # Denial category (10 variants)
|   |   +-- multistage.go # Multistage category (20 variants)
|   |   +-- authority.go  # Authority category (10 variants)
|   |   +-- adaptive.go   # Model-family payload adaptation
|   +-- embed/            # Embedding providers
|   |   +-- embed.go      # Provider interface
|   |   +-- ollama.go     # Ollama embedding provider
|   |   +-- openai.go     # OpenAI embedding provider
|   +-- formats/          # Per-format document generators (11 formats, 57 techniques)
|   |   +-- html/         # HTML generation (10 techniques)
|   |   +-- docx/         # DOCX generation (8 techniques)
|   |   +-- pdf/          # PDF generation (7 techniques)
|   |   +-- txt/          # TXT generation (5 techniques)
|   |   +-- markdown/     # Markdown generation (5 techniques)
|   |   +-- rtf/          # RTF generation (3 techniques)
|   |   +-- epub/         # EPUB generation (6 techniques)
|   |   +-- csv/          # CSV generation (3 techniques)
|   |   +-- json/         # JSON generation (2 techniques)
|   |   +-- xlsx/         # XLSX generation (4 techniques)
|   |   +-- image/        # Image/PNG generation (4 techniques)
|   +-- validate/         # Framework extraction simulation
|       +-- validate.go   # Validate(), ValidateFile(), framework dispatch
|       +-- langchain.go  # LangChain extraction simulation
|       +-- llamaindex.go # LlamaIndex extraction simulation
|       +-- unstructured.go # Unstructured.io extraction simulation
|       +-- haystack.go   # Haystack extraction simulation
|       +-- helpers.go    # 28 shared HTML/XML/PDF/RTF/EPUB parsing utilities
+-- test/
|   +-- integration/      # End-to-end craft-then-validate tests
+-- docs/                 # MkDocs documentation
+-- testdata/             # Test fixtures
+-- Makefile              # Build, test, docs targets
+-- mkdocs.yml            # Documentation site configuration

Data Flow

Document Generation

sequenceDiagram
    participant User
    participant CLI
    participant Craft
    participant Payloads
    participant Format
    participant Disk

    User->>CLI: hemlock craft --format docx --payload override
    CLI->>Craft: Craft(CraftOptions)
    Craft->>Craft: Apply defaults (count=5, framework=generic)
    Craft->>Craft: Generate cover text from topic
    Craft->>Craft: Resolve technique list

    loop For each technique
        loop For each variant (0..count-1)
            Craft->>Payloads: ResolvePayload(category, custom, variantIndex)
            Payloads-->>Craft: Payload text
            Craft->>Format: Generate(payload, coverText, technique)
            Format-->>Craft: Document bytes
            Craft->>Craft: Build Document struct
        end
    end

    alt OutputDir is set
        Craft->>Disk: Write files to OutputDir
    end

    Craft-->>CLI: []Document
    CLI->>User: Summary output
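
The nested loop in the diagram can be sketched in a few lines of Go. This is a minimal, self-contained illustration and not the code in craft.go: the `document` struct, `craftAll`, `toyResolve`, and `toyGenerate` are hypothetical names, and the `ResolvePayload` signature is assumed from the call shown in the diagram. Only the `Generate(payload, coverText, technique)` signature and the default count of 5 come from this page.

```go
package main

import "fmt"

// generateFunc mirrors the Generate(payload, coverText, technique) signature
// that every format package exports.
type generateFunc func(payload, coverText, technique string) ([]byte, error)

// resolveFunc stands in for payloads.ResolvePayload. Its real signature is
// not shown on this page, so this one is an assumption.
type resolveFunc func(category, custom string, variant int) string

// document is a hypothetical, trimmed-down stand-in for craft.Document.
type document struct {
	Technique string
	Variant   int
	Content   []byte
}

// craftAll runs the loop from the diagram: every technique, then every
// variant index from 0 to count-1.
func craftAll(techniques []string, count int, category, cover string,
	resolve resolveFunc, gen generateFunc) ([]document, error) {
	if count <= 0 {
		count = 5 // default count shown in the diagram
	}
	docs := make([]document, 0, len(techniques)*count)
	for _, tech := range techniques {
		for v := 0; v < count; v++ {
			payload := resolve(category, "", v)
			content, err := gen(payload, cover, tech)
			if err != nil {
				return nil, fmt.Errorf("technique %s variant %d: %w", tech, v, err)
			}
			docs = append(docs, document{Technique: tech, Variant: v, Content: content})
		}
	}
	return docs, nil
}

// toyResolve and toyGenerate are toy stand-ins so the sketch runs without hemlock.
func toyResolve(category, custom string, variant int) string {
	if custom != "" {
		return custom
	}
	return fmt.Sprintf("%s-payload-%d", category, variant)
}

func toyGenerate(payload, coverText, technique string) ([]byte, error) {
	return []byte(technique + "|" + coverText + "|" + payload), nil
}

func main() {
	docs, err := craftAll([]string{"comment", "css-hide"}, 2, "override", "cover",
		toyResolve, toyGenerate)
	if err != nil {
		panic(err)
	}
	for _, d := range docs {
		fmt.Printf("%s #%d: %s\n", d.Technique, d.Variant, d.Content)
	}
}
```

The output size is the number of techniques multiplied by the count, which is worth keeping in mind when a format has many techniques.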

Batch + Validate Pipeline

The batch command generates documents across all formats and techniques, writing a .hemlock-manifest.json alongside them. The validate --dir command reads the manifest and validates every file against all four frameworks:

sequenceDiagram
    participant User
    participant CLI
    participant Craft
    participant Validate
    participant Disk

    User->>CLI: hemlock batch --payload override --output-dir ./out
    CLI->>Craft: Craft() per format
    Craft->>Disk: Write files + .hemlock-manifest.json

    User->>CLI: hemlock validate --dir ./out
    CLI->>Disk: Read .hemlock-manifest.json
    loop For each file in manifest
        loop For each framework
            CLI->>Validate: ValidateFile(file, framework, payload)
            Validate-->>CLI: ValidationResult
        end
    end
    CLI->>User: JSON results

Validation Engine

The validation engine simulates how four RAG frameworks extract text from documents. Each framework file (langchain.go, llamaindex.go, unstructured.go, haystack.go) contains a dispatch function that routes by file format, then calls format-specific extraction logic.
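
As a rough illustration of the dispatch pattern, the sketch below routes by format and falls through to an error for unknown formats. `extractSketch` and `stripTags` are hypothetical names, and the argument list is an assumption; the real dispatch functions may be shaped differently. The "returns empty" branch is modeled on a framework that has no native converter for a format.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe matches a simple tag-like span such as <p> or </b>.
var tagRe = regexp.MustCompile(`<[^>]*>`)

// stripTags removes tags and collapses whitespace. It is a deliberately crude
// stand-in for helpers such as stripHTMLTags and normalizeWhitespace.
func stripTags(s string) string {
	return strings.Join(strings.Fields(tagRe.ReplaceAllString(s, " ")), " ")
}

// extractSketch is a hypothetical dispatch function for one framework.
func extractSketch(format string, data []byte) (string, error) {
	switch format {
	case "html":
		return stripTags(string(data)), nil
	case "txt", "csv":
		return string(data), nil
	case "epub", "xlsx":
		// Modeled on a framework with no native converter for the format.
		return "", nil
	default:
		return "", fmt.Errorf("extract: unsupported format %q", format)
	}
}

func main() {
	for _, format := range []string{"html", "csv", "epub", "odt"} {
		text, err := extractSketch(format, []byte("<p>Hello <b>world</b></p>"))
		fmt.Printf("%-5s text=%q err=%v\n", format, text, err)
	}
}
```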

Framework Extraction Behavior

The simulation has been validated against live pipelines at 92.3% accuracy (131 of 142 comparisons match). Key framework-specific behaviors discovered through testing:

| Behavior | LangChain | LlamaIndex | Unstructured | Haystack |
|---|---|---|---|---|
| HTML parsing | Strips tags, decodes entities | Returns raw HTML (no parsing) | Strips tags, preserves hidden/aria-hidden text | Strips tags, strips aria-hidden |
| DOCX extraction | w:t elements only (no metadata) | w:t elements + metadata | Full XML text content | DOCXToDocument converter |
| PDF text | BT/ET blocks + FlateDecode decompression | BT/ET blocks + FlateDecode decompression | BT/ET blocks only (pdfminer filters invisible text) | BT/ET blocks + FlateDecode decompression |
| RTF extraction | Raw string content | Raw string content | Body text, strips annotation/info groups | Raw string content |
| EPUB handling | Chapter XHTML via langchainHTML | Chapter XHTML via llamaindexEPUBHTML (stricter) | Chapter XHTML via unstructuredHTML | Returns empty (no native converter) |
| XLSX handling | Shared strings only (no metadata) | Returns empty (PandasExcelReader crashes) | XML text content | Returns empty (no native converter) |
| CSV handling | Text content | Returns empty (PandasCSVReader fails) | Text content | Text content |

Helper Functions

The helpers.go file contains 28 shared utilities organized by domain:

  • HTML: stripHTMLTags, stripHTMLComments, exposeHTMLComments, decodeHTMLEntities, normalizeWhitespace, stripHiddenElements, stripAriaHidden, stripElementsWithAttribute, extractTagName
  • XML/DOCX: extractWTElements, extractXMLTextContent, extractTElements, readZipFile
  • PDF: extractPDFText (with FlateDecode decompression), extractPDFTextSimple (raw BT/ET only), decompressPDFStreams, extractPDFStringOperands, extractParenthesized, findMatchingParen
  • RTF: extractRTFBodyText, extractRTFMetadata, extractRTFAnnotations, stripRTFGroup
  • EPUB: extractEPUBParts, extractEPUBChapterText
  • XLSX: extractXLSXSharedStrings, extractXLSXComments, extractXLSXMetadata
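
To give a feel for what a string-based helper in this style might look like, here is a sketch of w:t extraction with a regular expression. `extractWTSketch` is a hypothetical name and the body is an assumption, not the actual extractWTElements. Note that text in a run with a tiny font size is still returned, because the extraction ignores run properties.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// wtRe matches <w:t>...</w:t>, with or without attributes such as
// xml:space="preserve". It does not match <w:tab/> or <w:tbl>.
var wtRe = regexp.MustCompile(`(?s)<w:t(?:\s[^>]*)?>(.*?)</w:t>`)

// extractWTSketch returns the text of every w:t element, joined by spaces.
// html.UnescapeString is used as a shortcut for decoding XML entities.
func extractWTSketch(documentXML string) string {
	var parts []string
	for _, m := range wtRe.FindAllStringSubmatch(documentXML, -1) {
		parts = append(parts, html.UnescapeString(m[1]))
	}
	return strings.Join(parts, " ")
}

// sampleXML holds one normal run and one run with a very small font size.
const sampleXML = `<w:p><w:r><w:t>Visible text</w:t></w:r>` +
	`<w:r><w:rPr><w:sz w:val="2"/></w:rPr>` +
	`<w:t xml:space="preserve">tiny &amp; hidden</w:t></w:r></w:p>`

func main() {
	fmt.Println(extractWTSketch(sampleXML))
}
```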

Design Decisions

Why No DOCX Library?

DOCX files are ZIP archives containing XML. hemlock constructs these archives directly using Go's archive/zip and string templates for the XML parts. This approach:

  • Eliminates a dependency. No need for a Go DOCX library that may not support the low-level XML manipulation needed for techniques like custom-xml and fontzero.
  • Provides precise control. Techniques like fontzero require placing a <w:r> run with specific <w:rPr> properties. Direct XML construction makes this straightforward.
  • Keeps the output minimal. The generated DOCX files contain only the XML parts needed, without template bloat.
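
The approach can be illustrated with a minimal sketch that writes three XML parts into a ZIP archive. The part names and namespace URIs follow the Open Packaging Conventions; everything else is an assumption for illustration: `buildDOCX`, `escapeXML`, and `partNames` are hypothetical names, and the tiny `w:sz` value is only one plausible way a font-size technique could be expressed.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
)

const contentTypesXML = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>`

const rootRelsXML = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>`

// documentXML has two runs: visible cover text and a run with a tiny font.
const documentXML = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p><w:r><w:t>%s</w:t></w:r></w:p>
<w:p><w:r><w:rPr><w:sz w:val="2"/></w:rPr><w:t>%s</w:t></w:r></w:p>
</w:body>
</w:document>`

// escapeXML escapes text for use inside an XML element.
func escapeXML(s string) string {
	var b bytes.Buffer
	_ = xml.EscapeText(&b, []byte(s)) // writes to a bytes.Buffer do not fail
	return b.String()
}

// buildDOCX assembles the archive from string templates (hypothetical helper).
func buildDOCX(coverText, payload string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	parts := []struct{ name, body string }{
		{"[Content_Types].xml", contentTypesXML},
		{"_rels/.rels", rootRelsXML},
		{"word/document.xml", fmt.Sprintf(documentXML, escapeXML(coverText), escapeXML(payload))},
	}
	for _, p := range parts {
		w, err := zw.Create(p.name)
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(w, p.body); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// partNames lists the archive entries in the order they were written.
func partNames(docx []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(docx), int64(len(docx)))
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(zr.File))
	for _, f := range zr.File {
		names = append(names, f.Name)
	}
	return names, nil
}

func main() {
	docx, err := buildDOCX("Quarterly report", "hidden <payload>")
	if err != nil {
		panic(err)
	}
	names, _ := partNames(docx)
	fmt.Println(len(docx), "bytes:", names)
}
```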

Why gofpdf?

PDF is a complex binary format that is impractical to generate from scratch. gofpdf is a mature, well-tested Go library for PDF creation. It is the only external dependency outside the CLI framework (Cobra).

Why Cobra?

Cobra is the standard CLI framework in the Go ecosystem. It provides subcommand routing, flag parsing, help generation, and shell completion. The library API does not depend on Cobra; it is used only in the cmd/hemlock package.

Why Simulated Validation?

Running the real Python frameworks (LangChain, LlamaIndex, Unstructured, Haystack) would require Python, virtual environments, and their dependency trees. hemlock's validation engine replicates the extraction behavior in pure Go, which:

  • Keeps hemlock a single static binary. No Python runtime needed.
  • Runs anywhere Go compiles. Cross-platform without Python portability concerns.
  • Executes instantly. No interpreter startup or library import overhead.

The tradeoff is that edge cases may diverge from the real frameworks. The Confidence field in ValidationResult communicates this uncertainty. Accuracy is verified against live pipelines via the hemlock-lab test harness.

Why Separate Format Packages?

Each format package is independent and self-contained. This design:

  • Allows each format to evolve without affecting others.
  • Makes it easy to add new formats without modifying existing code.
  • Enables direct usage of format packages when the craft orchestration layer is not needed.

Why HTTP for Injection Scoring?

The joint optimization reward model runs as a Python FastAPI server (reward_server.py) that the Go optimizers query via HTTP POST. The alternative would be embedding the ML model directly in Go via ONNX Runtime or cgo. The HTTP approach was chosen because:

  • The ML stack stays in Python. PyTorch, scikit-learn, and sentence-transformers have no mature Go equivalents. Maintaining two ML implementations would double the maintenance burden.
  • Overhead is negligible. A local HTTP call adds ~10ms per candidate, compared to ~200ms for embedding computation and ~2s for Ollama model inference. The optimizer is not bottlenecked by scoring.
  • The server can be shared. Multiple hemlock instances or optimizer runs can query the same reward server.
  • Backward compatibility. When InjectionWeight is 0 (the default), no HTTP calls are made. The reward server is only needed when joint optimization is explicitly enabled.
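
The gating behavior in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the code in score_injection.go: the JSON field names (`text`, `score`), the URL, the linear blending formula, and all function names are hypothetical. An in-memory fake transport stands in for reward_server.py so the example runs without a server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// scoreRequest and scoreResponse use an assumed JSON schema.
type scoreRequest struct {
	Text string `json:"text"`
}

type scoreResponse struct {
	Score float64 `json:"score"`
}

// injectionScore POSTs the candidate text and decodes the returned score.
func injectionScore(client *http.Client, url, text string) (float64, error) {
	body, err := json.Marshal(scoreRequest{Text: text})
	if err != nil {
		return 0, err
	}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("reward server returned %s", resp.Status)
	}
	var out scoreResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return out.Score, nil
}

// blendedScore skips the HTTP call entirely when the weight is zero.
// The linear blend is an assumption about how the scores are combined.
func blendedScore(client *http.Client, url, text string, base, weight float64) (float64, error) {
	if weight == 0 {
		return base, nil
	}
	inj, err := injectionScore(client, url, text)
	if err != nil {
		return 0, err
	}
	return (1-weight)*base + weight*inj, nil
}

// fakeReward is an in-memory RoundTripper that always answers with score 1.0.
type fakeReward struct{ calls int }

func (f *fakeReward) RoundTrip(r *http.Request) (*http.Response, error) {
	f.calls++
	return &http.Response{
		StatusCode: http.StatusOK,
		Status:     "200 OK",
		Header:     http.Header{"Content-Type": []string{"application/json"}},
		Body:       io.NopCloser(strings.NewReader(`{"score": 1.0}`)),
		Request:    r,
	}, nil
}

// demo returns the blended score and how many HTTP calls were made.
func demo(weight float64) (float64, int, error) {
	fake := &fakeReward{}
	client := &http.Client{Transport: fake}
	s, err := blendedScore(client, "http://reward.invalid/score", "candidate", 0.5, weight)
	return s, fake.calls, err
}

func main() {
	for _, w := range []float64{0, 0.5} {
		score, calls, err := demo(w)
		fmt.Printf("weight=%.1f score=%.2f calls=%d err=%v\n", w, score, calls, err)
	}
}
```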

Known Limitations

The remaining 7.7% accuracy gap (11 of 142 comparisons) comes from two sources:

  • TXT Unicode techniques (10 drifts): bidi-override, homoglyph, and zero-width use Unicode transformations. hemlock's validator uses semantic matching (e.g., reversing bidi text, decoding zero-width sequences), while the raw extraction test uses literal string search. The validator is correct; it is the test methodology that diverges.
  • EPUB css-hide on LlamaIndex (1 drift): CSS class-based hiding (<style>.x{font-size:0}</style> + <span class="x">payload</span>) requires parsing <style> blocks and matching class selectors, which is beyond the scope of string-based extraction simulation.
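
The difference between literal and semantic matching in the first bullet can be shown with a small sketch. The encodings here are assumptions chosen for illustration (zero-width spaces interleaved between characters, and reversed text after a right-to-left override); the actual transformations hemlock applies are not described on this page, and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// invisible lists zero-width and bidi control code points to ignore.
var invisible = map[rune]bool{
	'\u200b': true, '\u200c': true, '\u200d': true, '\u2060': true,
	'\ufeff': true, '\u202e': true, '\u202c': true,
}

// interleaveZW inserts a zero-width space between every pair of runes.
func interleaveZW(s string) string {
	var b strings.Builder
	for i, r := range s {
		if i > 0 {
			b.WriteRune('\u200b')
		}
		b.WriteRune(r)
	}
	return b.String()
}

// reverse returns s with its runes in reverse order.
func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// stripInvisible drops zero-width and bidi control characters.
func stripInvisible(s string) string {
	return strings.Map(func(r rune) rune {
		if invisible[r] {
			return -1
		}
		return r
	}, s)
}

// literalMatch is what a raw extraction test does.
func literalMatch(text, payload string) bool {
	return strings.Contains(text, payload)
}

// semanticMatch normalizes first, then also tries the reversed payload.
func semanticMatch(text, payload string) bool {
	clean := stripInvisible(text)
	return strings.Contains(clean, payload) || strings.Contains(clean, reverse(payload))
}

func main() {
	payload := "ignore previous"
	zw := "Report. " + interleaveZW(payload)
	bidi := "Report. \u202e" + reverse(payload) + "\u202c"
	fmt.Println(literalMatch(zw, payload), semanticMatch(zw, payload))
	fmt.Println(literalMatch(bidi, payload), semanticMatch(bidi, payload))
}
```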

Adding a New Format

To add a new document format:

Step 1: Create the Package

pkg/formats/newformat/
+-- newformat.go      # Techniques() and Generate()
+-- technique1.go     # First technique implementation
+-- newformat_test.go # Tests

Step 2: Implement the Interface

Every format package exports the same two functions:

package newformat

import "fmt"

func Techniques() []string {
    return []string{"technique-name"}
}

func Generate(payload, coverText, technique string) ([]byte, error) {
    switch technique {
    case "technique-name":
        return generateTechnique(payload, coverText)
    default:
        return nil, fmt.Errorf("newformat: unknown technique %q", technique)
    }
}

Step 3: Register in craft.go

Add the format to the generators map in pkg/craft/craft.go:

import "github.com/professor-moody/hemlock/pkg/formats/newformat"

var generators = map[string]formatGenerator{
    // ... existing formats ...
    "newformat": {newformat.Techniques, newformat.Generate, ".ext"},
}
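
The positional literal above implies a three-field struct. A minimal sketch of what that type and a lookup helper could look like follows; the field names and `generateFor` are assumptions, since only the field order (techniques function, generate function, extension) is visible from the literal.

```go
package main

import "fmt"

// formatGenerator groups what craft needs to know about a format.
// Field names are hypothetical; only the order is taken from the literal.
type formatGenerator struct {
	techniques func() []string
	generate   func(payload, coverText, technique string) ([]byte, error)
	ext        string
}

// newformatTechniques and newformatGenerate stand in for the newformat package.
func newformatTechniques() []string { return []string{"technique-name"} }

func newformatGenerate(payload, coverText, technique string) ([]byte, error) {
	if technique != "technique-name" {
		return nil, fmt.Errorf("newformat: unknown technique %q", technique)
	}
	return []byte(coverText + "\n" + payload), nil
}

var generators = map[string]formatGenerator{
	"newformat": {newformatTechniques, newformatGenerate, ".ext"},
}

// generateFor looks up the format and returns the bytes plus file extension.
func generateFor(format, technique, payload, coverText string) ([]byte, string, error) {
	g, ok := generators[format]
	if !ok {
		return nil, "", fmt.Errorf("unknown format %q", format)
	}
	data, err := g.generate(payload, coverText, technique)
	if err != nil {
		return nil, "", err
	}
	return data, g.ext, nil
}

func main() {
	fmt.Println(generators["newformat"].techniques())
	data, ext, err := generateFor("newformat", "technique-name", "payload", "cover")
	fmt.Printf("%q %s %v\n", data, ext, err)
}
```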

Step 4: Add Stealth Scores and Descriptions

Add entries to stealthScore() and techniqueDescription() in pkg/craft/craft.go for the new format and techniques.

Step 5: Add Validation Support

Add extraction functions in pkg/validate/ for each framework:

  • langchain.go: Add the case to extractLangChain
  • llamaindex.go: Add the case to extractLlamaIndex
  • unstructured.go: Add the case to extractUnstructured
  • haystack.go: Add the case to extractHaystack

Update detectFormat() in validate.go to recognize the new extension.
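
An extension-based lookup is one simple way such a function could work. The sketch below is an assumption about its shape, not the real detectFormat: the name `detectFormatSketch`, its signature, and the extension table (including `.md` and `.png`) are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extToFormat maps lowercase file extensions to format names.
// The entries are illustrative and may not match hemlock's actual table.
var extToFormat = map[string]string{
	".html": "html", ".htm": "html",
	".docx": "docx", ".pdf": "pdf", ".txt": "txt",
	".md": "markdown", ".rtf": "rtf", ".epub": "epub",
	".csv": "csv", ".json": "json", ".xlsx": "xlsx",
	".png": "image",
	".ext": "newformat", // the new format's extension goes here
}

// detectFormatSketch resolves a path to a format name, ignoring case.
func detectFormatSketch(path string) (string, error) {
	ext := strings.ToLower(filepath.Ext(path))
	format, ok := extToFormat[ext]
	if !ok {
		return "", fmt.Errorf("validate: unrecognized extension %q", ext)
	}
	return format, nil
}

func main() {
	for _, p := range []string{"out/Report.DOCX", "sample.ext", "README"} {
		format, err := detectFormatSketch(p)
		fmt.Printf("%-16s format=%q err=%v\n", p, format, err)
	}
}
```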

Step 6: Add Tests and Documentation

  • Write tests in pkg/formats/newformat/newformat_test.go
  • Add a technique documentation page in docs/techniques/newformat.md
  • Update the nav in mkdocs.yml

Adding a New Technique

To add a new hiding technique to an existing format (e.g., data-attribute for HTML):

Step 1: Create the Implementation File

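// pkg/formats/html/dataattr.go
package html

import (
    "fmt"
    stdhtml "html" // aliased to avoid confusion with this package's own name
)

// generateDataAttribute hides the payload in a data-* attribute on <body>.
func generateDataAttribute(payload, coverText string) ([]byte, error) {
    // Escape the payload so quotes cannot terminate the attribute early.
    doc := fmt.Sprintf(`<!DOCTYPE html>
<html>
<body data-context="%s">
<p>%s</p>
</body>
</html>`, stdhtml.EscapeString(payload), coverText)
    return []byte(doc), nil
}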

Step 2: Register in the Format Package

Update Techniques() and Generate() in pkg/formats/html/html.go:

func Techniques() []string {
    return []string{
        // ... existing techniques ...
        "data-attribute",
    }
}

func Generate(payload, coverText, technique string) ([]byte, error) {
    switch technique {
    // ... existing cases ...
    case "data-attribute":
        return generateDataAttribute(payload, coverText)
    default:
        return nil, fmt.Errorf("html: unknown technique %q", technique)
    }
}

Step 3: Add Metadata in craft.go

Add stealth score and description entries in pkg/craft/craft.go.

Step 4: Update Validation

If the new technique has different extraction behavior, update the relevant framework simulation functions in pkg/validate/.
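
For the data-attribute example, extraction behavior does plausibly differ by framework: an extractor that returns raw HTML keeps the attribute value, while one that strips tags discards it. The sketch below illustrates that check with two toy extractors. `rawExtract`, `tagStrippingExtract`, `sampleDoc`, and `payloadSurvives` are hypothetical names, and the regular expression is a crude approximation of tag stripping.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe matches a simple tag-like span, including its attributes.
var tagRe = regexp.MustCompile(`<[^>]*>`)

// rawExtract models a framework that returns HTML without parsing it.
func rawExtract(doc string) string { return doc }

// tagStrippingExtract models a framework that removes tags and attributes.
func tagStrippingExtract(doc string) string {
	return strings.TrimSpace(tagRe.ReplaceAllString(doc, " "))
}

// sampleDoc builds a document shaped like the data-attribute technique output.
func sampleDoc(payload, coverText string) string {
	return fmt.Sprintf(`<!DOCTYPE html>
<html>
<body data-context="%s">
<p>%s</p>
</body>
</html>`, payload, coverText)
}

// payloadSurvives reports whether the payload is present after extraction.
func payloadSurvives(extract func(string) string, doc, payload string) bool {
	return strings.Contains(extract(doc), payload)
}

func main() {
	doc := sampleDoc("IGNORE-PRIOR", "Quarterly summary")
	fmt.Println("raw:", payloadSurvives(rawExtract, doc, "IGNORE-PRIOR"))
	fmt.Println("tag-stripping:", payloadSurvives(tagStrippingExtract, doc, "IGNORE-PRIOR"))
}
```

If the two results differ for a new technique, the corresponding framework simulations need a dedicated case.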

Step 5: Test

Write tests that generate a document with the new technique and validate it against all four frameworks.


Next Steps