
Corpus Management

A fuzzer is only as good as its seed corpus. Crucible takes a synthetic-first approach: generate seeds that exercise every parser path before supplementing with real-world models.

Synthetic-First Philosophy

Core Principle

Generate minimal, targeted seed files that cover every code path in the GGUF parser. Real models are large, slow to mutate, and exercise only the common case. Synthetic seeds are small, fast, and designed to stress edge cases from day one.

Real-world GGUF models exercise the happy path — standard architectures, common quantization types, typical vocabulary sizes. But bugs hide in the corners: unusual value types, rare tensor configurations, extreme alignment values. Synthetic seeds fill those gaps.

flowchart TD
    A[Corpus Strategy] --> B[Synthetic Seeds]
    A --> C[Real Model Seeds]
    B --> D[crucible-gen]
    D --> E[Minimal Valid Files]
    D --> F[Per-Value-Type Seeds]
    D --> G[Per-Tensor-Type Seeds]
    D --> H[Edge Case Seeds]
    C --> I[HuggingFace Downloads]
    I --> J[Small Quantized Models]
    E --> K[Corpus Directory]
    F --> K
    G --> K
    H --> K
    J --> K

What crucible-gen Produces

The crucible-gen tool creates a comprehensive set of synthetic GGUF files. Each seed is minimal — just large enough to exercise its target code path.

Minimal Valid Files

The baseline: the smallest possible valid GGUF file (seed_000.gguf). Contains only the GGUF header with zero metadata keys and zero tensors. Every fuzzing run starts from a structurally valid file.
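As a rough illustration of how small such a baseline can be, the sketch below builds a header-only GGUF file in memory. It assumes the standard little-endian GGUF header layout (magic, version, tensor count, metadata KV count) and format version 3; the helper name minimalGGUF is hypothetical and is not part of crucible-gen.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// minimalGGUF builds a header-only GGUF file: magic, version,
// tensor count, and metadata KV count, all little-endian.
// Version 3 is an assumption; the source does not state which
// version seed_000.gguf uses.
func minimalGGUF() []byte {
	var buf bytes.Buffer
	buf.WriteString("GGUF")                            // 4-byte magic
	binary.Write(&buf, binary.LittleEndian, uint32(3)) // format version
	binary.Write(&buf, binary.LittleEndian, uint64(0)) // tensor count
	binary.Write(&buf, binary.LittleEndian, uint64(0)) // metadata KV count
	return buf.Bytes()
}

func main() {
	seed := minimalGGUF()
	fmt.Printf("%d bytes: % x\n", len(seed), seed)
}
```

Under these assumptions the whole file is 24 bytes, so every mutation lands on a structural field.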

Per-Value-Type Seeds

One seed for each of the 14 metadata value types (seed_003.gguf through seed_016.gguf), ensuring the parser's type dispatch is fully covered:

| Seed | Value Type | Purpose |
|------|------------|---------|
| seed_003.gguf | UINT8 | Single-byte unsigned integer handling |
| seed_004.gguf | INT8 | Single-byte signed integer handling |
| seed_005.gguf | UINT16 | Two-byte unsigned integer handling |
| seed_006.gguf | INT16 | Two-byte signed integer handling |
| seed_007.gguf | FLOAT16 | Half-precision float handling |
| seed_008.gguf | UINT32 | Four-byte unsigned integer handling |
| seed_009.gguf | INT32 | Four-byte signed integer handling |
| seed_010.gguf | FLOAT32 | Single-precision float handling |
| seed_011.gguf | BOOL | Boolean value handling |
| seed_012.gguf | STRING | Variable-length string handling |
| seed_013.gguf | ARRAY | Array handling |
| seed_014.gguf | UINT64 | Eight-byte unsigned integer handling |
| seed_015.gguf | INT64 | Eight-byte signed integer handling |
| seed_016.gguf | FLOAT64 | Double-precision float handling |
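To make the type dispatch concrete, here is a minimal sketch of how a single metadata key-value pair is laid out on disk, assuming the GGUF encoding of a length-prefixed key, a uint32 type tag, then the value. The tag values 4 (UINT32) and 8 (STRING) follow the GGUF format; the function names are hypothetical and do not reflect how crucible-gen is written.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Value type tags from the GGUF format (subset).
const (
	typeUint32 uint32 = 4
	typeString uint32 = 8
)

// writeString emits a GGUF string: uint64 length, then the raw bytes.
func writeString(buf *bytes.Buffer, s string) {
	binary.Write(buf, binary.LittleEndian, uint64(len(s)))
	buf.WriteString(s)
}

// encodeUint32KV emits the key, the UINT32 type tag, then the 4-byte value.
func encodeUint32KV(key string, v uint32) []byte {
	var buf bytes.Buffer
	writeString(&buf, key)
	binary.Write(&buf, binary.LittleEndian, typeUint32)
	binary.Write(&buf, binary.LittleEndian, v)
	return buf.Bytes()
}

// encodeStringKV emits the key, the STRING type tag, then a
// length-prefixed string value.
func encodeStringKV(key, v string) []byte {
	var buf bytes.Buffer
	writeString(&buf, key)
	binary.Write(&buf, binary.LittleEndian, typeString)
	writeString(&buf, v)
	return buf.Bytes()
}

func main() {
	fmt.Printf("% x\n", encodeUint32KV("general.alignment", 32))
	fmt.Printf("% x\n", encodeStringKV("general.name", "x"))
}
```

A per-value-type seed is then a header with a KV count of 1 followed by one such record, which keeps each seed focused on a single branch of the parser's dispatch.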

Per-Tensor-Type Seeds

One seed for each common GGML quantization type (seed_017.gguf through seed_021.gguf), exercising the data size calculations for each block format:

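| Seed | Tensor Type | Block Size (elements) | Type Size (bytes per block) |
|------|-------------|-----------------------|-----------------------------|
| seed_017.gguf | F32 | 1 | 4 |
| seed_018.gguf | F16 | 1 | 2 |
| seed_019.gguf | Q4_0 | 32 | 18 |
| seed_020.gguf | Q4_1 | 32 | 20 |
| seed_021.gguf | Q8_0 | 32 | 34 |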
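The data size calculation these seeds exercise can be sketched as follows: a tensor's byte size is its element count divided by the block size, multiplied by the type size. The table values are taken from above; the blockFormat type and tensorBytes function are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// blockFormat describes how a tensor type packs elements into blocks.
type blockFormat struct {
	blockSize uint64 // elements per block
	typeSize  uint64 // bytes per block
}

// formats mirrors the block sizes and type sizes listed in the table.
var formats = map[string]blockFormat{
	"F32":  {1, 4},
	"F16":  {1, 2},
	"Q4_0": {32, 18},
	"Q4_1": {32, 20},
	"Q8_0": {32, 34},
}

// tensorBytes returns the number of data bytes for nElements of the
// given type. It rejects element counts that do not fill whole blocks.
func tensorBytes(typ string, nElements uint64) (uint64, error) {
	f, ok := formats[typ]
	if !ok {
		return 0, errors.New("unknown tensor type: " + typ)
	}
	if nElements%f.blockSize != 0 {
		return 0, errors.New("element count is not a multiple of the block size")
	}
	return nElements / f.blockSize * f.typeSize, nil
}

func main() {
	for _, typ := range []string{"F32", "F16", "Q4_0", "Q4_1", "Q8_0"} {
		n, err := tensorBytes(typ, 4096)
		fmt.Println(typ, n, err)
	}
}
```

For example, 4096 Q4_0 elements occupy 128 blocks of 18 bytes each, or 2304 bytes.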

Edge Case Seeds

Seeds designed to push boundary conditions:

A single seed (seed_022.gguf) with an array-of-arrays value — two levels of nesting with uint32 inner elements. Exercises recursive array parsing.
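A nested array value can be sketched as below, assuming the GGUF array encoding (uint32 element type, uint64 count, then the elements) applied recursively, with tag 9 for ARRAY and 4 for UINT32. The actual contents of seed_022.gguf are not documented here, so the values and the function name are illustrative only.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Value type tags from the GGUF format (subset).
const (
	typeUint32 uint32 = 4
	typeArray  uint32 = 9
)

// encodeNestedUint32Array encodes an array whose elements are
// themselves arrays of uint32. Each level is written as:
// element type (uint32), element count (uint64), elements.
func encodeNestedUint32Array(inner [][]uint32) []byte {
	var buf bytes.Buffer
	le := binary.LittleEndian
	binary.Write(&buf, le, typeArray)          // outer element type: ARRAY
	binary.Write(&buf, le, uint64(len(inner))) // outer element count
	for _, arr := range inner {
		binary.Write(&buf, le, typeUint32)       // inner element type: UINT32
		binary.Write(&buf, le, uint64(len(arr))) // inner element count
		for _, v := range arr {
			binary.Write(&buf, le, v)
		}
	}
	return buf.Bytes()
}

func main() {
	b := encodeNestedUint32Array([][]uint32{{1, 2}, {3, 4}})
	fmt.Printf("%d bytes: % x\n", len(b), b)
}
```

A parser that handles this value has to re-enter its array routine for every inner element, which is the recursion this seed is meant to reach.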

Seeds with general.alignment set to unusual but valid values:

| Alignment | Purpose |
|-----------|---------|
| 1 | No padding at all |
| 8 | Sub-default alignment |
| 16 | Half-default alignment |
| 32 | Default alignment (baseline) |
| 64 | Double-default alignment |
| 128 | Large alignment boundary |
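The effect of each alignment value on where tensor data begins can be sketched with a simple round-up helper. This is an illustration of the padding arithmetic, not Crucible's code; alignOffset is a hypothetical name.

```go
package main

import "fmt"

// alignOffset rounds offset up to the next multiple of alignment.
// An alignment of 0 is invalid; it is returned unchanged here only
// to avoid a division by zero in this sketch.
func alignOffset(offset, alignment uint64) uint64 {
	if alignment == 0 {
		return offset
	}
	return offset + (alignment-offset%alignment)%alignment
}

func main() {
	// Where would data start if the metadata section ended at byte 100?
	for _, a := range []uint64{1, 8, 16, 32, 64, 128} {
		fmt.Printf("alignment %3d -> data starts at %d\n", a, alignOffset(100, a))
	}
}
```

With an alignment of 1 the offset is left untouched, which is why that seed produces a file with no padding at all.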

Tensors with unusual but valid dimension configurations:

  • 1x1 — Single-element two-dimensional tensor
  • 1x1x1x1 — Maximum dimension count, minimum size
  • 4096 — One-dimensional tensor at a typical hidden layer size

Files containing multiple tensors with different quantization types in a single file, exercising per-tensor type dispatch and offset calculations.

Real Model Seeds

Synthetic seeds cover structure. Real models cover semantics — actual vocabulary sizes, real architecture metadata, production tensor shapes.

Recommended Models

Download small quantized models from HuggingFace to supplement the synthetic corpus. Models around 100MB strike the best balance between coverage and mutation speed.

# Example: download a small quantized model
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
    tinyllama-1.1b-chat-v1.0.Q2_K.gguf \
    --local-dir ./corpus/seeds/

Larger models are slower to parse, mutate, and serialize on every iteration. Keep real model seeds as small as possible.

CVE-Targeted Seeds

The --talos flag generates reconstruction seeds based on known vulnerability patterns from Cisco Talos and other researchers. These are written to corpus/reconstructed/:

crucible-gen --talos

| Seed | CVE Pattern | What It Exercises |
|------|-------------|-------------------|
| talos-1912-array-string-overflow.gguf | TALOS-2024-1912 | Array string length overflow |
| talos-1913-string-length-wrap.gguf | TALOS-2024-1913 | String length integer wrap |
| talos-1914-ndims-oob.gguf | TALOS-2024-1914 | Out-of-bounds dimension count |
| talos-1915-tensor-count-overflow.gguf | TALOS-2024-1915 | Tensor count integer overflow |
| talos-1916-kv-count-overflow.gguf | TALOS-2024-1916 | Metadata KV count overflow |
| databricks-array-size-wrap.gguf | Databricks | Array size integer wrap |
| databricks-type-index-oob.gguf | Databricks | Type index out-of-bounds |

These seeds prime the fuzzer to explore code paths around known bug patterns, increasing the chance of finding regressions or variants of previously patched vulnerabilities.
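Several of the patterns above share one root cause: a count read from the file is multiplied by an element size without an overflow check. The sketch below illustrates that bug class in general terms, contrasting a wrapping multiplication with a checked one. It is not a reconstruction of any specific advisory, and checkedArrayBytes is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math/bits"
)

// checkedArrayBytes multiplies an element count by an element size and
// reports whether the product fits in 64 bits. bits.Mul64 returns the
// full 128-bit product as (hi, lo); a non-zero hi means overflow.
func checkedArrayBytes(count, elemSize uint64) (uint64, bool) {
	hi, lo := bits.Mul64(count, elemSize)
	if hi != 0 {
		return 0, false
	}
	return lo, true
}

func main() {
	count := uint64(1) << 61 // hostile element count read from a file

	// The unchecked product wraps to 0, so a parser would allocate
	// nothing and then read or write far past the buffer.
	fmt.Println("unchecked:", count*8)

	_, ok := checkedArrayBytes(count, 8)
	fmt.Println("checked ok:", ok)
}
```

Seeds that place such counts directly in the file give the mutator a starting point near the wrap boundary instead of leaving it to reach that boundary by chance.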

Corpus Loading

The LoadCorpus function scans a directory for .gguf files and returns a Corpus that supports thread-safe random selection:

Corpus Usage
// Recursively load all .gguf files from the directory into memory.
// The result is named c so that it does not shadow the corpus package.
c, err := corpus.LoadCorpus("./corpus/seeds/")
if err != nil {
    log.Fatal(err)
}

// Thread-safe random selection using the provided *rand.Rand instance.
seed := c.Random(rng)

The Corpus struct holds parsed GGUF files in memory for fast access. Random selection is performed with the fuzzer's seeded RNG to maintain deterministic reproducibility.
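One way such a structure could provide both thread safety and reproducibility is sketched below. The memCorpus type, its fields, and pickSequence are assumptions made for illustration; Crucible's actual Corpus struct may be organized differently. The mutex serializes access to the seed list and to the caller's RNG, and a fixed RNG seed yields the same sequence of picks.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// memCorpus is a hypothetical in-memory corpus guarded by a mutex.
type memCorpus struct {
	mu    sync.Mutex
	seeds [][]byte
}

// Add appends a seed to the corpus.
func (c *memCorpus) Add(seed []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.seeds = append(c.seeds, seed)
}

// Random returns one seed chosen with the caller's RNG, or nil if the
// corpus is empty. The lock also serializes use of rng across callers
// that share it through this method.
func (c *memCorpus) Random(rng *rand.Rand) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.seeds) == 0 {
		return nil
	}
	return c.seeds[rng.Intn(len(c.seeds))]
}

// pickSequence draws n seeds using a fresh RNG with the given seed.
func pickSequence(c *memCorpus, rngSeed int64, n int) []string {
	rng := rand.New(rand.NewSource(rngSeed))
	picks := make([]string, 0, n)
	for i := 0; i < n; i++ {
		picks = append(picks, string(c.Random(rng)))
	}
	return picks
}

func main() {
	c := &memCorpus{}
	for _, s := range []string{"seed-a", "seed-b", "seed-c"} {
		c.Add([]byte(s))
	}
	fmt.Println(pickSequence(c, 42, 5))
	fmt.Println(pickSequence(c, 42, 5)) // same RNG seed, same picks
}
```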

Minimization

Hash-Based Deduplication

Every file in the corpus is identified by its SHA-256 hash. When new crash inputs are added back to the corpus, duplicates are detected and skipped automatically.

corpus/
├── seeds/
│   ├── seed_000.gguf
│   ├── seed_012.gguf
│   └── tinyllama-q2.gguf
└── crashes/
    ├── abc123de.gguf      ← named by hash prefix
    └── f9871a02.gguf
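
The deduplication step can be sketched with the standard library's SHA-256 implementation. The eight-hex-character file name prefix is inferred from the listing above, and hashName and dedup are hypothetical helpers rather than Crucible's API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashName derives a crash file name from the first 4 bytes
// (8 hex characters) of the input's SHA-256 digest.
func hashName(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:4]) + ".gguf"
}

// dedup keeps the first occurrence of each distinct input,
// comparing inputs by their full SHA-256 digest.
func dedup(inputs [][]byte) [][]byte {
	seen := make(map[[32]byte]bool)
	var unique [][]byte
	for _, in := range inputs {
		sum := sha256.Sum256(in)
		if seen[sum] {
			continue // already in the corpus
		}
		seen[sum] = true
		unique = append(unique, in)
	}
	return unique
}

func main() {
	inputs := [][]byte{[]byte("abc"), []byte("def"), []byte("abc")}
	for _, in := range dedup(inputs) {
		fmt.Println(hashName(in))
	}
}
```

Comparing on the full digest while naming by prefix keeps file names short without treating two inputs as duplicates merely because their prefixes collide.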

Coverage-Guided Minimization

MinimizeWithCoverage uses an LLVM-instrumented harness to collect per-seed edge coverage and then applies a greedy set-cover algorithm to select the smallest seed set that covers all observed edges. Seeds that produce zero coverage (e.g. due to early crashes or instrumentation gaps) are preserved after hash-based dedup rather than silently dropped.

result, err := corpus.MinimizeWithCoverage(seeds, "./harness-binary")

MinimizeWithCoverage falls back to hash-based deduplication when no harness is provided.
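
The greedy set-cover step described above can be sketched as follows. It assumes coverage has already been collected as a list of edge IDs per seed, with no repeated IDs within a seed; the seedCoverage type and greedyMinimize function are hypothetical and do not show how MinimizeWithCoverage is implemented.

```go
package main

import "fmt"

// seedCoverage pairs a seed name with the edge IDs it reached.
// Edge IDs are assumed to be unique within one seed.
type seedCoverage struct {
	name  string
	edges []uint32
}

// greedyMinimize repeatedly picks the seed that covers the most
// still-uncovered edges until every observed edge is covered.
// Seeds with zero coverage are kept rather than dropped.
func greedyMinimize(seeds []seedCoverage) []string {
	uncovered := make(map[uint32]bool)
	for _, s := range seeds {
		for _, e := range s.edges {
			uncovered[e] = true
		}
	}
	picked := make([]bool, len(seeds))
	var keep []string
	for len(uncovered) > 0 {
		best, bestGain := -1, 0
		for i, s := range seeds {
			if picked[i] {
				continue
			}
			gain := 0
			for _, e := range s.edges {
				if uncovered[e] {
					gain++
				}
			}
			if gain > bestGain {
				best, bestGain = i, gain
			}
		}
		// Every uncovered edge belongs to an unpicked seed, so best >= 0.
		picked[best] = true
		keep = append(keep, seeds[best].name)
		for _, e := range seeds[best].edges {
			delete(uncovered, e)
		}
	}
	for i, s := range seeds {
		if !picked[i] && len(s.edges) == 0 {
			keep = append(keep, s.name) // preserve zero-coverage seeds
		}
	}
	return keep
}

func main() {
	seeds := []seedCoverage{
		{"a.gguf", []uint32{1, 2, 3}},
		{"b.gguf", []uint32{2, 3}},
		{"c.gguf", []uint32{4}},
		{"d.gguf", nil},
	}
	fmt.Println(greedyMinimize(seeds))
}
```

In this example b.gguf is dropped because a.gguf already covers its edges, while d.gguf survives despite contributing no coverage. Greedy selection gives an approximation of the smallest covering set, not a guaranteed optimum.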

Fuzzer Dictionary

The file corpus/gguf.dict contains 260+ known-interesting byte patterns organized into categories that the mutation engine can insert into string fields and raw data sections:

corpus/gguf.dict (excerpt)
# Magic and structural patterns
"GGUF"
"\x46\x55\x47\x47"

# Common metadata keys (60+ entries)
"general.architecture"
"general.alignment"
"general.name"
"tokenizer.ggml.model"
"tokenizer.ggml.tokens"
"llama.attention.head_count"
"llama.rope.freq_base"

# Architecture values (106+ entries)
"llama"
"falcon"
"gpt2"
"mpt"
"phi3"
"qwen2"
"gemma"
"command-r"

# ggml_type enum values (26 entries)
"\x00\x00\x00\x00"  # GGML_TYPE_F32
"\x01\x00\x00\x00"  # GGML_TYPE_F16
"\x02\x00\x00\x00"  # GGML_TYPE_Q4_0
"\x1e\x00\x00\x00"  # GGML_TYPE_COUNT (sentinel)

# Boundary values
"\xff\xff\xff\xff"
"\xff\xff\xff\xff\xff\xff\xff\xff"

The dictionary categories include:

| Category | Count | Purpose |
|----------|-------|---------|
| Architecture values | 106 | Every general.architecture variant (llama, falcon, gpt2, phi3, etc.) |
| Metadata keys | 60+ | KV store keys across all architectures and tokenizer types |
| ggml_type enums | 26 | All quantization type values including boundary sentinels |
| GGUF value types | 13 | Metadata value type indicators |
| Structural patterns | 20+ | Magic bytes, alignment, boundary values, format strings |
| Injection patterns | 15+ | Path traversal, format strings, integer boundaries |

The dictionary supplements random mutation with targeted patterns known to trigger specific bug classes — type confusion from invalid enum values, parsing errors from unexpected architecture strings, and integer boundary conditions in tensor size calculations.
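
A loader for entries in the style of the excerpt above might look like the sketch below. It assumes one double-quoted token per line, \xNN escapes that happen to match Go string-literal syntax, and # comments either on their own line or after the token. parseDict is a hypothetical helper, not Crucible's parser, and strconv.QuotedPrefix requires Go 1.17 or later.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseDict extracts the byte tokens from dictionary text.
// Blank lines and lines starting with '#' are skipped; anything
// after the closing quote (such as a trailing comment) is ignored.
func parseDict(text string) ([][]byte, error) {
	var tokens [][]byte
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		quoted, err := strconv.QuotedPrefix(line)
		if err != nil {
			return nil, fmt.Errorf("malformed entry %q: %w", line, err)
		}
		tok, err := strconv.Unquote(quoted)
		if err != nil {
			return nil, fmt.Errorf("bad escape in %q: %w", line, err)
		}
		tokens = append(tokens, []byte(tok))
	}
	return tokens, sc.Err()
}

func main() {
	dict := `# Magic and structural patterns
"GGUF"
"\x1e\x00\x00\x00"  # GGML_TYPE_COUNT (sentinel)
"\xff\xff\xff\xff"`
	tokens, err := parseDict(dict)
	if err != nil {
		panic(err)
	}
	for _, t := range tokens {
		fmt.Printf("% x\n", t)
	}
}
```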

Additional Dictionaries

Crucible includes format-specific dictionaries for non-GGUF attack surfaces:

| Dictionary | Harnesses | Content |
|------------|-----------|---------|
| grammar.dict | grammar, grammar-compile | GBNF operators, rule syntax, grammar keywords |
| jinja.dict | jinja, chat-template | Jinja syntax tokens, template directives, filter names |
| json-schema.dict | json-schema | JSON Schema keywords ($ref, oneOf, pattern, etc.) |
| pytorch.dict | torchscript, torch-load | PyTorch magic bytes, TorchScript opcodes, pickle tokens |
| rpc.dict | rpc, rpc-commands, rpc-race | RPC command opcodes, protocol framing bytes |
| server.dict | server | HTTP methods, JSON content types, API endpoint paths |
| tflite.dict | tflite | TFLite FlatBuffer identifiers, operator codes |

Each dictionary is automatically selected by the --dict flag when the harness name or corpus directory matches the target surface.