Corpus Management
A fuzzer is only as good as its seed corpus. Crucible takes a synthetic-first approach: generate seeds that exercise every parser path before supplementing with real-world models.
Synthetic-First Philosophy
Core Principle
Generate minimal, targeted seed files that cover every code path in the GGUF parser. Real models are large, slow to mutate, and exercise only the common case. Synthetic seeds are small, fast, and designed to stress edge cases from day one.
Real-world GGUF models exercise the happy path — standard architectures, common quantization types, typical vocabulary sizes. But bugs hide in the corners: unusual value types, rare tensor configurations, extreme alignment values. Synthetic seeds fill those gaps.
flowchart TD
A[Corpus Strategy] --> B[Synthetic Seeds]
A --> C[Real Model Seeds]
B --> D[crucible-gen]
D --> E[Minimal Valid Files]
D --> F[Per-Value-Type Seeds]
D --> G[Per-Tensor-Type Seeds]
D --> H[Edge Case Seeds]
C --> I[HuggingFace Downloads]
I --> J[Small Quantized Models]
E --> K[Corpus Directory]
F --> K
G --> K
H --> K
J --> K
What crucible-gen Produces
The crucible-gen tool creates a comprehensive set of synthetic GGUF files. Each seed is minimal — just large enough to exercise its target code path.
Minimal Valid Files
The baseline is the smallest possible valid GGUF file (seed_000.gguf). It contains only the GGUF header, with zero metadata keys and zero tensors, which gives every fuzzing run a structurally valid starting point.
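A header-only file is small enough to build by hand. The sketch below assumes the GGUF version 3 little-endian layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata KV count); `minimalGGUF` is a hypothetical helper for illustration, not part of crucible-gen.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// minimalGGUF returns the bytes of a header-only GGUF file.
// Assumption: GGUF version 3, little-endian byte order.
func minimalGGUF() []byte {
	buf := make([]byte, 24)
	copy(buf[0:4], "GGUF")                       // magic
	binary.LittleEndian.PutUint32(buf[4:8], 3)   // format version
	binary.LittleEndian.PutUint64(buf[8:16], 0)  // tensor count
	binary.LittleEndian.PutUint64(buf[16:24], 0) // metadata KV count
	return buf
}

func main() {
	data := minimalGGUF()
	fmt.Printf("%d bytes: % x\n", len(data), data)
}
```

Under these assumptions the whole file is 24 bytes, so a mutator can cover it exhaustively at very little cost.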
Per-Value-Type Seeds
One seed for each of the 14 metadata value types (seed_003.gguf through seed_016.gguf), ensuring the parser's type dispatch is fully covered:
| Seed | Value Type | Purpose |
|---|---|---|
| seed_003.gguf | UINT8 | Single-byte unsigned integer handling |
| seed_004.gguf | INT8 | Single-byte signed integer handling |
| seed_005.gguf | UINT16 | Two-byte unsigned integer handling |
| seed_006.gguf | INT16 | Two-byte signed integer handling |
| seed_007.gguf | FLOAT16 | Half-precision float handling |
| seed_008.gguf | UINT32 | Four-byte unsigned integer handling |
| seed_009.gguf | INT32 | Four-byte signed integer handling |
| seed_010.gguf | FLOAT32 | Single-precision float handling |
| seed_011.gguf | BOOL | Boolean value handling |
| seed_012.gguf | STRING | Variable-length string handling |
| seed_013.gguf | ARRAY | Array handling |
| seed_014.gguf | UINT64 | Eight-byte unsigned integer handling |
| seed_015.gguf | INT64 | Eight-byte signed integer handling |
| seed_016.gguf | FLOAT64 | Double-precision float handling |
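To make the shape of one such seed concrete, here is a minimal sketch of encoding a single metadata entry. It assumes the upstream GGUF layout (key as a uint64 length plus bytes, then a uint32 value type ID, then the value) and the upstream type ID 4 for UINT32. The helper names are hypothetical, and the type ID is unrelated to Crucible's seed numbering.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ggufTypeUint32 is the UINT32 value type ID in the upstream GGUF spec (assumption).
const ggufTypeUint32 = 4

// appendString appends a GGUF string: uint64 length followed by the raw bytes.
func appendString(b []byte, s string) []byte {
	b = binary.LittleEndian.AppendUint64(b, uint64(len(s)))
	return append(b, s...)
}

// uint32KV encodes one metadata entry: key string, value type ID, value.
func uint32KV(key string, v uint32) []byte {
	b := appendString(nil, key)
	b = binary.LittleEndian.AppendUint32(b, ggufTypeUint32)
	return binary.LittleEndian.AppendUint32(b, v)
}

func main() {
	kv := uint32KV("general.alignment", 32)
	fmt.Printf("%d bytes: % x\n", len(kv), kv)
}
```

A per-type seed would place one entry like this after a header whose KV count is 1, changing only the type ID and value encoding from seed to seed.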
Per-Tensor-Type Seeds
One seed for each common GGML quantization type (seed_017.gguf through seed_021.gguf), exercising the data size calculations for each block format:
| Seed | Tensor Type | Block Size (elements) | Type Size (bytes per block) |
|---|---|---|---|
| seed_017.gguf | F32 | 1 | 4 |
| seed_018.gguf | F16 | 1 | 2 |
| seed_019.gguf | Q4_0 | 32 | 18 |
| seed_020.gguf | Q4_1 | 32 | 20 |
| seed_021.gguf | Q8_0 | 32 | 34 |
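The size calculation these seeds exercise is: number of elements divided by block size, times type size. The sketch below uses the values from the table; `tensorBytes` and `formats` are hypothetical names, and the function deliberately ignores overflow for very large element counts.

```go
package main

import (
	"errors"
	"fmt"
)

// blockFormat describes how a tensor type packs elements into blocks.
type blockFormat struct {
	blockSize uint64 // elements per block
	typeSize  uint64 // bytes per block
}

// formats mirrors the table above.
var formats = map[string]blockFormat{
	"F32":  {1, 4},
	"F16":  {1, 2},
	"Q4_0": {32, 18},
	"Q4_1": {32, 20},
	"Q8_0": {32, 34},
}

// tensorBytes returns the data size in bytes for nElements of the given type.
func tensorBytes(typ string, nElements uint64) (uint64, error) {
	f, ok := formats[typ]
	if !ok {
		return 0, fmt.Errorf("unknown tensor type %q", typ)
	}
	if nElements%f.blockSize != 0 {
		return 0, errors.New("element count is not a multiple of the block size")
	}
	return nElements / f.blockSize * f.typeSize, nil
}

func main() {
	for _, typ := range []string{"F32", "F16", "Q4_0", "Q4_1", "Q8_0"} {
		n, err := tensorBytes(typ, 4096)
		fmt.Println(typ, n, err)
	}
}
```

For example, 4096 Q4_0 elements form 128 blocks of 18 bytes, or 2304 bytes in total.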
Edge Case Seeds
Seeds designed to push boundary conditions:
A single seed (seed_022.gguf) with an array-of-arrays value — two levels of nesting with uint32 inner elements. Exercises recursive array parsing.
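The recursion such a seed exercises can be sketched as a walker that measures an array value. This is an illustration only: it assumes the upstream GGUF array layout (uint32 element type, uint64 count, then elements) and upstream type IDs 4 (UINT32) and 9 (ARRAY). The depth limit and all function names are choices made for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Value type IDs from the upstream GGUF spec (assumption).
const (
	typeUint32 = 4
	typeArray  = 9
)

const maxDepth = 4 // arbitrary nesting limit chosen for this sketch

// skipArray returns how many bytes the array value at the start of b occupies.
// Only UINT32 elements and nested arrays are handled.
func skipArray(b []byte, depth int) (int, error) {
	if depth > maxDepth {
		return 0, errors.New("array nesting too deep")
	}
	if len(b) < 12 {
		return 0, errors.New("truncated array header")
	}
	elemType := binary.LittleEndian.Uint32(b[0:4])
	count := binary.LittleEndian.Uint64(b[4:12])
	pos := 12
	switch elemType {
	case typeUint32:
		// Divide instead of multiplying so a huge count cannot overflow.
		if count > uint64(len(b)-pos)/4 {
			return 0, errors.New("truncated uint32 elements")
		}
		return pos + int(count)*4, nil
	case typeArray:
		for i := uint64(0); i < count; i++ {
			n, err := skipArray(b[pos:], depth+1)
			if err != nil {
				return 0, err
			}
			pos += n
		}
		return pos, nil
	default:
		return 0, fmt.Errorf("unsupported element type %d", elemType)
	}
}

// arrayHeader encodes the element type and count that start every array.
func arrayHeader(elemType uint32, count uint64) []byte {
	b := binary.LittleEndian.AppendUint32(nil, elemType)
	return binary.LittleEndian.AppendUint64(b, count)
}

// sample builds [[1, 2], [3]] as an array of uint32 arrays.
func sample() []byte {
	b := arrayHeader(typeArray, 2)
	b = append(b, arrayHeader(typeUint32, 2)...)
	b = binary.LittleEndian.AppendUint32(b, 1)
	b = binary.LittleEndian.AppendUint32(b, 2)
	b = append(b, arrayHeader(typeUint32, 1)...)
	return binary.LittleEndian.AppendUint32(b, 3)
}

func main() {
	b := sample()
	n, err := skipArray(b, 0)
	fmt.Println(n, len(b), err)
}
```

Both the depth limit and the division-based bounds check are the kinds of guards that a nested-array seed gives the fuzzer a chance to probe.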
Seeds with general.alignment set to unusual but valid values:
| Alignment | Purpose |
|---|---|
| 1 | No padding at all |
| 8 | Sub-default alignment |
| 16 | Half-default alignment |
| 32 | Default alignment (baseline) |
| 64 | Double-default alignment |
| 128 | Large alignment boundary |
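Each alignment value changes how far an offset is rounded up before tensor data begins. A minimal sketch of that rounding, assuming a nonzero alignment (`alignOffset` is a hypothetical name):

```go
package main

import "fmt"

// alignOffset rounds offset up to the next multiple of alignment.
// It assumes alignment > 0 and that the rounded value fits in uint64.
func alignOffset(offset, alignment uint64) uint64 {
	return offset + (alignment-offset%alignment)%alignment
}

func main() {
	for _, a := range []uint64{1, 8, 16, 32, 64, 128} {
		fmt.Printf("alignment %3d: offset 100 -> %d\n", a, alignOffset(100, a))
	}
}
```

With alignment 1 the offset is returned unchanged, which is why that seed produces a file with no padding at all.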
Tensors with unusual but valid dimension configurations:
- 1x1 — Single-element two-dimensional tensor
- 1x1x1x1 — Maximum dimension count, minimum size
- 4096 — One-dimensional tensor at a typical hidden layer size
Files containing multiple tensors with different quantization types in a single file, exercising per-tensor type dispatch and offset calculations.
Real Model Seeds
Synthetic seeds cover structure. Real models cover semantics — actual vocabulary sizes, real architecture metadata, production tensor shapes.
Recommended Models
Download small quantized models from HuggingFace to supplement the synthetic corpus. Models around 100MB strike the best balance between coverage and mutation speed.
# Example: download a small quantized model
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
    tinyllama-1.1b-chat-v1.0.Q2_K.gguf \
    --local-dir ./corpus/seeds/
Larger models are slower to parse, mutate, and serialize on every iteration. Keep real model seeds as small as possible.
CVE-Targeted Seeds
The --talos flag generates reconstruction seeds based on known vulnerability patterns from Cisco Talos and other researchers. These are written to corpus/reconstructed/:
| Seed | CVE Pattern | What It Exercises |
|---|---|---|
| talos-1912-array-string-overflow.gguf | TALOS-2024-1912 | Array string length overflow |
| talos-1913-string-length-wrap.gguf | TALOS-2024-1913 | String length integer wrap |
| talos-1914-ndims-oob.gguf | TALOS-2024-1914 | Out-of-bounds dimension count |
| talos-1915-tensor-count-overflow.gguf | TALOS-2024-1915 | Tensor count integer overflow |
| talos-1916-kv-count-overflow.gguf | TALOS-2024-1916 | Metadata KV count overflow |
| databricks-array-size-wrap.gguf | Databricks | Array size integer wrap |
| databricks-type-index-oob.gguf | Databricks | Type index out-of-bounds |
These seeds prime the fuzzer to explore code paths around known bug patterns, increasing the chance of finding regressions or variants of previously patched vulnerabilities.
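Several of these patterns share one root cause: an attacker-controlled count is multiplied by an element size without checking for wraparound. The sketch below shows the general shape of a guarded computation. It is a generic illustration with hypothetical names, not the fix applied in any particular project.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"math/bits"
)

var errOverflow = errors.New("size computation overflows uint64")

// checkedMul returns a*b, or an error if the product does not fit in uint64.
func checkedMul(a, b uint64) (uint64, error) {
	hi, lo := bits.Mul64(a, b)
	if hi != 0 {
		return 0, errOverflow
	}
	return lo, nil
}

// allocSize validates a declared element count against the bytes left in the input.
func allocSize(count, elemSize, remaining uint64) (uint64, error) {
	n, err := checkedMul(count, elemSize)
	if err != nil {
		return 0, err
	}
	if n > remaining {
		return 0, errors.New("declared size exceeds remaining input")
	}
	return n, nil
}

func main() {
	fmt.Println(allocSize(4, 16, 1024))
	// A count chosen so that count*16 wraps past 2^64.
	fmt.Println(allocSize(math.MaxUint64/8+1, 16, 1024))
}
```

A reconstruction seed supplies the second kind of input: a count that looks plausible on its own but wraps once it is scaled.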
Corpus Loading
The LoadCorpus function scans a directory for .gguf files and provides thread-safe random selection:
c, err := corpus.LoadCorpus("./corpus/seeds/") // (1)
if err != nil {
	log.Fatal(err)
}
seed := c.Random(rng) // (2)

1. Recursively loads all .gguf files from the directory into memory.
2. Thread-safe random selection using the provided *rand.Rand instance.
The Corpus struct holds parsed GGUF files in memory for fast access. Random selection is performed with the fuzzer's seeded RNG to maintain deterministic reproducibility.
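As a rough picture of what such a loader involves, here is a self-contained sketch. Only the names `LoadCorpus`, `Corpus`, and `Random` come from the text above. The signatures, the mutex, and the use of raw bytes in place of parsed GGUF files are assumptions made for this example, and `writeDemo` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"io/fs"
	"math/rand"
	"os"
	"path/filepath"
	"strings"
	"sync"
)

// Corpus holds seed files in memory. Raw bytes stand in for parsed GGUF files.
type Corpus struct {
	mu    sync.Mutex
	seeds [][]byte
}

// LoadCorpus walks dir recursively and reads every .gguf file it finds.
func LoadCorpus(dir string) (*Corpus, error) {
	c := &Corpus{}
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || !strings.HasSuffix(d.Name(), ".gguf") {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		c.seeds = append(c.seeds, data)
		return nil
	})
	if err != nil {
		return nil, err
	}
	if len(c.seeds) == 0 {
		return nil, fmt.Errorf("no .gguf files under %s", dir)
	}
	return c, nil
}

// Random picks a seed using rng. The mutex serializes access to rng,
// because *rand.Rand is not safe for concurrent use on its own.
func (c *Corpus) Random(rng *rand.Rand) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.seeds[rng.Intn(len(c.seeds))]
}

// writeDemo creates a throwaway directory with one seed and one unrelated file.
func writeDemo() string {
	dir, err := os.MkdirTemp("", "corpus")
	if err != nil {
		panic(err)
	}
	for name, body := range map[string]string{"seed_000.gguf": "GGUF", "notes.txt": "ignored"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o644); err != nil {
			panic(err)
		}
	}
	return dir
}

func main() {
	dir := writeDemo()
	defer os.RemoveAll(dir)
	c, err := LoadCorpus(dir)
	if err != nil {
		panic(err)
	}
	rng := rand.New(rand.NewSource(42)) // fixed seed keeps selection reproducible
	fmt.Printf("%q\n", c.Random(rng))
}
```

Because `filepath.WalkDir` visits entries in lexical order, the in-memory seed order is stable across runs, which a seeded RNG needs in order to reproduce the same selections.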
Minimization
Hash-Based Deduplication
Every file in the corpus is identified by its SHA-256 hash. When new crash inputs are added back to the corpus, duplicates are detected and skipped automatically.
corpus/
├── seeds/
│   ├── seed_000.gguf
│   ├── seed_012.gguf
│   └── tinyllama-q2.gguf
└── crashes/
    ├── abc123de.gguf   ← named by hash prefix
    └── f9871a02.gguf
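Hash-based deduplication can be sketched in a few lines. The example below keys inputs by their full SHA-256 digest and derives a file name from the first eight hex characters, matching the listing above; the `entry` type and `dedup` function are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// entry pairs an input with its hash-prefix file name.
type entry struct {
	Name string
	Data []byte
}

// dedup keeps the first occurrence of each distinct input, keyed by SHA-256.
func dedup(inputs [][]byte) []entry {
	seen := make(map[[sha256.Size]byte]bool)
	var out []entry
	for _, in := range inputs {
		sum := sha256.Sum256(in)
		if seen[sum] {
			continue // exact duplicate, skip it
		}
		seen[sum] = true
		// Four bytes of digest give the eight hex characters used in the name.
		out = append(out, entry{hex.EncodeToString(sum[:4]) + ".gguf", in})
	}
	return out
}

func main() {
	inputs := [][]byte{[]byte("crash-a"), []byte("crash-b"), []byte("crash-a")}
	for _, e := range dedup(inputs) {
		fmt.Println(e.Name, len(e.Data))
	}
}
```

Identity is decided by the full digest. The short prefix is only a display name, so two distinct inputs could in principle share a file name and would need a longer prefix.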
Coverage-Guided Minimization
MinimizeWithCoverage uses an LLVM-instrumented harness to collect per-seed edge coverage, then applies a greedy set-cover algorithm to select a small seed set that still covers every observed edge. Greedy selection approximates the minimum set but does not guarantee it. Seeds that produce zero coverage (for example, due to early crashes or instrumentation gaps) are deduplicated by hash and kept rather than silently dropped.
When no harness is provided, MinimizeWithCoverage falls back to hash-based deduplication.
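The greedy step can be illustrated as follows. This sketch takes coverage as a ready-made map from seed name to edge IDs; how the real harness reports coverage is not described here, so the input shape and the `minimize` function are assumptions. Zero-coverage seeds are kept, as described above.

```go
package main

import (
	"fmt"
	"sort"
)

// minimize keeps every zero-coverage seed, then repeatedly picks the seed
// that covers the most edges not yet covered, until no seed adds anything.
// Each edge list is assumed to contain no duplicates.
func minimize(coverage map[string][]uint32) []string {
	names := make([]string, 0, len(coverage))
	for n := range coverage {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic order and tie-breaking

	var picked []string
	for _, n := range names {
		if len(coverage[n]) == 0 {
			picked = append(picked, n) // preserved rather than dropped
		}
	}

	covered := make(map[uint32]bool)
	for {
		best, bestGain := "", 0
		for _, n := range names {
			gain := 0
			for _, e := range coverage[n] {
				if !covered[e] {
					gain++
				}
			}
			if gain > bestGain {
				best, bestGain = n, gain
			}
		}
		if bestGain == 0 {
			return picked
		}
		for _, e := range coverage[best] {
			covered[e] = true
		}
		picked = append(picked, best)
	}
}

func main() {
	cov := map[string][]uint32{
		"a.gguf": {1, 2, 3},
		"b.gguf": {3, 4},
		"c.gguf": {1, 2},
		"d.gguf": {},
	}
	fmt.Println(minimize(cov))
}
```

In this example `c.gguf` is dropped because `a.gguf` already covers its edges, while `d.gguf` survives despite contributing no coverage.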
Fuzzer Dictionary
The file corpus/gguf.dict contains 260+ known-interesting byte patterns organized into categories that the mutation engine can insert into string fields and raw data sections:
# Magic and structural patterns
"GGUF"
"\x46\x55\x47\x47"
# Common metadata keys (60+ entries)
"general.architecture"
"general.alignment"
"general.name"
"tokenizer.ggml.model"
"tokenizer.ggml.tokens"
"llama.attention.head_count"
"llama.rope.freq_base"
# Architecture values (106+ entries)
"llama"
"falcon"
"gpt2"
"mpt"
"phi3"
"qwen2"
"gemma"
"command-r"
# ggml_type enum values (26 entries)
"\x00\x00\x00\x00" # GGML_TYPE_F32
"\x01\x00\x00\x00" # GGML_TYPE_F16
"\x02\x00\x00\x00" # GGML_TYPE_Q4_0
"\x1e\x00\x00\x00" # GGML_TYPE_COUNT (sentinel)
# Boundary values
"\xff\xff\xff\xff"
"\xff\xff\xff\xff\xff\xff\xff\xff"
The dictionary categories include:
| Category | Count | Purpose |
|---|---|---|
| Architecture values | 106 | Every general.architecture variant (llama, falcon, gpt2, phi3, etc.) |
| Metadata keys | 60+ | KV store keys across all architectures and tokenizer types |
| ggml_type enums | 26 | All quantization type values including boundary sentinels |
| GGUF value types | 13 | Metadata value type indicators |
| Structural patterns | 20+ | Magic bytes, alignment, boundary values, format strings |
| Injection patterns | 15+ | Path traversal, format strings, integer boundaries |
The dictionary supplements random mutation with targeted patterns known to trigger specific bug classes — type confusion from invalid enum values, parsing errors from unexpected architecture strings, and integer boundary conditions in tensor size calculations.
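For readers who want to consume the dictionary from their own tooling, the following sketch decodes one line into raw bytes. It assumes the file follows the quoted-token layout shown in the excerpt (`#` comment lines, `\xNN` escapes, optional trailing comments). `parseEntry` is a hypothetical function, not Crucible's loader.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseEntry decodes one dictionary line into raw bytes.
// ok is false for blank lines and comment lines.
func parseEntry(line string) (token []byte, ok bool, err error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return nil, false, nil
	}
	if line[0] != '"' {
		return nil, false, fmt.Errorf("entry must start with a quote: %q", line)
	}
	for i := 1; i < len(line); i++ {
		switch c := line[i]; {
		case c == '"':
			return token, true, nil // anything after the closing quote is ignored
		case c != '\\':
			token = append(token, c)
		case i+3 < len(line) && line[i+1] == 'x':
			v, perr := strconv.ParseUint(line[i+2:i+4], 16, 8)
			if perr != nil {
				return nil, false, fmt.Errorf("bad hex escape in %q", line)
			}
			token = append(token, byte(v))
			i += 3 // skip 'x' and both hex digits
		case i+1 < len(line) && (line[i+1] == '\\' || line[i+1] == '"'):
			token = append(token, line[i+1])
			i++
		default:
			return nil, false, fmt.Errorf("unsupported escape in %q", line)
		}
	}
	return nil, false, fmt.Errorf("missing closing quote: %q", line)
}

func main() {
	for _, line := range []string{
		`# Magic and structural patterns`,
		`"GGUF"`,
		`"\x46\x55\x47\x47"`,
		`"\x02\x00\x00\x00" # GGML_TYPE_Q4_0`,
	} {
		tok, ok, err := parseEntry(line)
		fmt.Printf("%q ok=%v err=%v\n", tok, ok, err)
	}
}
```

Decoding to bytes up front lets a mutator splice tokens into binary fields as well as string fields.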
Additional Dictionaries
Crucible includes format-specific dictionaries for non-GGUF attack surfaces:
| Dictionary | Harnesses | Content |
|---|---|---|
| grammar.dict | grammar, grammar-compile | GBNF operators, rule syntax, grammar keywords |
| jinja.dict | jinja, chat-template | Jinja syntax tokens, template directives, filter names |
| json-schema.dict | json-schema | JSON Schema keywords ($ref, oneOf, pattern, etc.) |
| pytorch.dict | torchscript, torch-load | PyTorch magic bytes, TorchScript opcodes, pickle tokens |
| rpc.dict | rpc, rpc-commands, rpc-race | RPC command opcodes, protocol framing bytes |
| server.dict | server | HTTP methods, JSON content types, API endpoint paths |
| tflite.dict | tflite | TFLite FlatBuffer identifiers, operator codes |
Each dictionary is automatically selected by the --dict flag when the harness name or corpus directory matches the target surface.
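One simple way to picture the harness-name half of that matching is a lookup table built from the rows above. The sketch is illustrative only: `dictFor` is a hypothetical function, the fallback to gguf.dict is an assumption, and matching by corpus directory is not modeled.

```go
package main

import "fmt"

// dictByHarness maps each harness name from the table to its dictionary file.
var dictByHarness = map[string]string{
	"grammar":         "grammar.dict",
	"grammar-compile": "grammar.dict",
	"jinja":           "jinja.dict",
	"chat-template":   "jinja.dict",
	"json-schema":     "json-schema.dict",
	"torchscript":     "pytorch.dict",
	"torch-load":      "pytorch.dict",
	"rpc":             "rpc.dict",
	"rpc-commands":    "rpc.dict",
	"rpc-race":        "rpc.dict",
	"server":          "server.dict",
	"tflite":          "tflite.dict",
}

// dictFor returns the dictionary for a harness name.
// Falling back to gguf.dict for unlisted harnesses is an assumption.
func dictFor(harness string) string {
	if d, ok := dictByHarness[harness]; ok {
		return d
	}
	return "gguf.dict"
}

func main() {
	for _, h := range []string{"rpc-race", "chat-template", "gguf-load"} {
		fmt.Printf("%s -> %s\n", h, dictFor(h))
	}
}
```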