Corpus Management¶

A fuzzer is only as good as its seed corpus. Crucible takes a synthetic-first approach: generate seeds that exercise every parser path before supplementing with real-world models.

Synthetic-First Philosophy¶

Core Principle

Generate minimal, targeted seed files that cover every code path in the GGUF parser. Real models are large, slow to mutate, and exercise only the common case. Synthetic seeds are small, fast, and designed to stress edge cases from day one.

Real-world GGUF models exercise the happy path — standard architectures, common quantization types, typical vocabulary sizes. But bugs hide in the corners: unusual value types, rare tensor configurations, extreme alignment values. Synthetic seeds fill those gaps.

flowchart TD
    A[Corpus Strategy] --> B[Synthetic Seeds]
    A --> C[Real Model Seeds]
    B --> D[crucible-gen]
    D --> E[Minimal Valid Files]
    D --> F[Per-Value-Type Seeds]
    D --> G[Per-Tensor-Type Seeds]
    D --> H[Edge Case Seeds]
    C --> I[HuggingFace Downloads]
    I --> J[Small Quantized Models]
    E --> K[Corpus Directory]
    F --> K
    G --> K
    H --> K
    J --> K

What crucible-gen Produces¶

The crucible-gen tool creates a comprehensive set of synthetic GGUF files. Each seed is minimal — just large enough to exercise its target code path.

Minimal Valid Files¶

The baseline is the smallest possible valid GGUF file (seed_000.gguf). It contains only the GGUF header, with zero metadata keys and zero tensors, so every fuzzing run has a structurally valid file to start from.
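The header-only layout described above can be sketched in a few lines of Go. This is an illustration based on the public GGUF header layout (magic, version, tensor count, metadata count, all little-endian), not crucible-gen's actual code. The function name minimalGGUF and the choice of format version 3 are assumptions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// minimalGGUF returns a header-only GGUF file: no metadata pairs, no tensors.
// Hypothetical helper; version 3 is an assumption about the targeted format.
func minimalGGUF() []byte {
	var buf bytes.Buffer
	// Writes into a bytes.Buffer cannot fail for fixed-size values,
	// so the returned errors are ignored in this sketch.
	buf.WriteString("GGUF")                            // 4-byte magic
	binary.Write(&buf, binary.LittleEndian, uint32(3)) // format version
	binary.Write(&buf, binary.LittleEndian, uint64(0)) // tensor count
	binary.Write(&buf, binary.LittleEndian, uint64(0)) // metadata key-value count
	return buf.Bytes()
}

func main() {
	data := minimalGGUF()
	fmt.Printf("%d bytes: % x\n", len(data), data)
}
```

The result is 24 bytes long, which keeps mutation and parsing cost per iteration negligible.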

Per-Value-Type Seeds¶

One seed for each of the 14 metadata value types (seed_003.gguf through seed_016.gguf), ensuring the parser's type dispatch is fully covered:

Seed	Value Type	Purpose
`seed_003.gguf`	UINT8	Single-byte unsigned integer handling
`seed_004.gguf`	INT8	Single-byte signed integer handling
`seed_005.gguf`	UINT16	Two-byte unsigned integer handling
`seed_006.gguf`	INT16	Two-byte signed integer handling
`seed_007.gguf`	FLOAT16	Half-precision float handling
`seed_008.gguf`	UINT32	Four-byte unsigned integer handling
`seed_009.gguf`	INT32	Four-byte signed integer handling
`seed_010.gguf`	FLOAT32	Single-precision float handling
`seed_011.gguf`	BOOL	Boolean value handling
`seed_012.gguf`	STRING	Variable-length string handling
`seed_013.gguf`	ARRAY	Array handling
`seed_014.gguf`	UINT64	Eight-byte unsigned integer handling
`seed_015.gguf`	INT64	Eight-byte signed integer handling
`seed_016.gguf`	FLOAT64	Double-precision float handling
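As a rough illustration of what a per-value-type seed might contain, the sketch below builds a file with a single UINT32 metadata pair. It follows the public GGUF encoding (a length-prefixed key, a uint32 value-type ID, then the value) and is not taken from crucible-gen. The helper names, the key, and the value are assumptions chosen for the example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// ggufTypeUint32 is the UINT32 ID in the GGUF specification's value-type enum.
const ggufTypeUint32 uint32 = 4

// writeString appends a GGUF string: a uint64 byte length, then the raw bytes.
func writeString(buf *bytes.Buffer, s string) {
	binary.Write(buf, binary.LittleEndian, uint64(len(s)))
	buf.WriteString(s)
}

// uint32Seed builds a header plus one UINT32 metadata pair (hypothetical helper).
func uint32Seed(key string, value uint32) []byte {
	var buf bytes.Buffer
	buf.WriteString("GGUF")                            // magic
	binary.Write(&buf, binary.LittleEndian, uint32(3)) // version (assumed)
	binary.Write(&buf, binary.LittleEndian, uint64(0)) // tensor count
	binary.Write(&buf, binary.LittleEndian, uint64(1)) // metadata pair count
	writeString(&buf, key)                             // key
	binary.Write(&buf, binary.LittleEndian, ggufTypeUint32)
	binary.Write(&buf, binary.LittleEndian, value) // 4-byte payload
	return buf.Bytes()
}

func main() {
	data := uint32Seed("general.alignment", 32)
	fmt.Printf("%d bytes: % x\n", len(data), data)
}
```

Swapping the type ID and payload width is enough to produce the other fixed-width variants; STRING and ARRAY need their own length-prefixed encodings.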

Per-Tensor-Type Seeds¶

One seed for each common GGML tensor type (seed_017.gguf through seed_021.gguf): the unquantized F32 and F16 formats plus three block-quantized formats. Each seed exercises the data size calculation for its block format:

Seed	Tensor Type	Block Size (elements)	Type Size (bytes per block)
`seed_017.gguf`	F32	1	4 bytes
`seed_018.gguf`	F16	1	2 bytes
`seed_019.gguf`	Q4_0	32	18 bytes
`seed_020.gguf`	Q4_1	32	20 bytes
`seed_021.gguf`	Q8_0	32	34 bytes
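The size calculation these seeds target follows directly from the table: the number of blocks times the bytes per block. The Go sketch below uses the block and type sizes listed above. The function name tensorBytes, the string type names, and the overflow checks are assumptions for illustration, not the parser's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// typeTraits mirrors one row of the table above.
type typeTraits struct {
	blockSize uint64 // elements per block
	typeSize  uint64 // bytes per block
}

var traits = map[string]typeTraits{
	"F32":  {1, 4},
	"F16":  {1, 2},
	"Q4_0": {32, 18},
	"Q4_1": {32, 20},
	"Q8_0": {32, 34},
}

// tensorBytes returns the data size for a tensor of the given type and shape.
func tensorBytes(typ string, dims ...uint64) (uint64, error) {
	t, ok := traits[typ]
	if !ok {
		return 0, fmt.Errorf("unknown tensor type %q", typ)
	}
	if len(dims) == 0 {
		return 0, errors.New("tensor needs at least one dimension")
	}
	// Rows are stored block by block, so the first dimension must fill whole blocks.
	if dims[0]%t.blockSize != 0 {
		return 0, fmt.Errorf("first dimension %d is not a multiple of block size %d", dims[0], t.blockSize)
	}
	elems := uint64(1)
	for _, d := range dims {
		if d != 0 && elems > math.MaxUint64/d {
			return 0, errors.New("element count overflows uint64")
		}
		elems *= d
	}
	blocks := elems / t.blockSize
	if blocks > math.MaxUint64/t.typeSize {
		return 0, errors.New("byte size overflows uint64")
	}
	return blocks * t.typeSize, nil
}

func main() {
	for _, typ := range []string{"F32", "F16", "Q4_0", "Q4_1", "Q8_0"} {
		n, err := tensorBytes(typ, 4096)
		fmt.Println(typ, n, err)
	}
}
```

The two overflow checks mark the spots where a fuzzer-supplied dimension can wrap the arithmetic if a parser omits them.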

Edge Case Seeds¶

Seeds designed to push boundary conditions:

Nested Arrays

A single seed (seed_022.gguf) with an array-of-arrays value: two levels of nesting with uint32 inner elements. It exercises recursive array parsing.

Alignment Variants

Seeds with general.alignment set to unusual but valid values:

Alignment	Purpose
1	No padding at all
8	Sub-default alignment
16	Half-default alignment
32	Default alignment (baseline)
64	Double-default alignment
128	Large alignment boundary

Dimension Extremes

Tensors with unusual but valid dimension configurations:

1x1: Single-element two-dimensional tensor
1x1x1x1: Maximum dimension count, minimum size
4096: One-dimensional tensor at a typical hidden layer size

Multi-Tensor Mixed Types

Files containing multiple tensors with different quantization types in a single file, exercising per-tensor type dispatch and offset calculations.
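To make the alignment variants concrete, the sketch below rounds an offset up to the next multiple of general.alignment, which is how padding before tensor data is usually computed. The function name padTo and the sample offset of 100 are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// padTo rounds offset up to the next multiple of alignment.
// It does not guard against uint64 overflow for offsets near the maximum.
func padTo(offset, alignment uint64) (uint64, error) {
	if alignment == 0 {
		return 0, errors.New("alignment must be non-zero")
	}
	return offset + (alignment-offset%alignment)%alignment, nil
}

func main() {
	// The alignment values from the table above, applied to a sample offset.
	for _, a := range []uint64{1, 8, 16, 32, 64, 128} {
		padded, err := padTo(100, a)
		fmt.Println(a, padded, err)
	}
}
```

An alignment of 1 leaves every offset unchanged, which is why that seed is useful for catching code that assumes padding is always present.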

Real Model Seeds¶

Synthetic seeds cover structure. Real models cover semantics — actual vocabulary sizes, real architecture metadata, production tensor shapes.

Recommended Models

Download small quantized models from HuggingFace to supplement the synthetic corpus. Models around 100MB strike the best balance between coverage and mutation speed.

# Example: download a small quantized model
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
    tinyllama-1.1b-chat-v1.0.Q2_K.gguf \
    --local-dir ./corpus/seeds/

Larger models are slower to parse, mutate, and serialize on every iteration. Keep real model seeds as small as possible.

CVE-Targeted Seeds¶

The --talos flag generates reconstruction seeds based on known vulnerability patterns from Cisco Talos and other researchers. These are written to corpus/reconstructed/:

crucible-gen --talos

Seed	CVE Pattern	What It Exercises
`talos-1912-array-string-overflow.gguf`	TALOS-2024-1912	Array string length overflow
`talos-1913-string-length-wrap.gguf`	TALOS-2024-1913	String length integer wrap
`talos-1914-ndims-oob.gguf`	TALOS-2024-1914	Out-of-bounds dimension count
`talos-1915-tensor-count-overflow.gguf`	TALOS-2024-1915	Tensor count integer overflow
`talos-1916-kv-count-overflow.gguf`	TALOS-2024-1916	Metadata KV count overflow
`databricks-array-size-wrap.gguf`	Databricks	Array size integer wrap
`databricks-type-index-oob.gguf`	Databricks	Type index out-of-bounds

These seeds prime the fuzzer to explore code paths around known bug patterns, increasing the chance of finding regressions or variants of previously patched vulnerabilities.

Corpus Loading¶

The LoadCorpus function scans a directory for .gguf files and provides thread-safe random selection:

Corpus Usage

seedCorpus, err := corpus.LoadCorpus("./corpus/seeds/")  // (1)
if err != nil {
    log.Fatal(err)
}

seed := seedCorpus.Random(rng)  // (2)

(1) Recursively loads all .gguf files from the directory into memory.
(2) Thread-safe random selection using the provided *rand.Rand instance.

The Corpus struct holds parsed GGUF files in memory for fast access. Random selection uses the fuzzer's seeded RNG, so a run can be reproduced from its seed.
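One way to combine thread safety with a caller-supplied RNG is to guard each draw with a mutex. The sketch below shows that pattern. The seedPool type and pickN helper are hypothetical stand-ins and make no claim about how the Corpus struct is implemented.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// seedPool is a hypothetical stand-in for the Corpus struct.
type seedPool struct {
	mu    sync.Mutex
	names []string
}

// Random returns one seed name, or false if the pool is empty.
// The mutex serializes access to the pool and to rng, since a
// *rand.Rand created with rand.New is not safe for concurrent use.
func (p *seedPool) Random(rng *rand.Rand) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.names) == 0 {
		return "", false
	}
	return p.names[rng.Intn(len(p.names))], true
}

// pickN draws n seeds using a fresh RNG built from the given seed value.
func pickN(p *seedPool, seed int64, n int) []string {
	rng := rand.New(rand.NewSource(seed))
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		if name, ok := p.Random(rng); ok {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	pool := &seedPool{names: []string{"seed_000.gguf", "seed_012.gguf", "tinyllama-q2.gguf"}}
	first, second := pickN(pool, 42, 5), pickN(pool, 42, 5)
	fmt.Println(first)
	fmt.Println(fmt.Sprint(first) == fmt.Sprint(second)) // same RNG seed, same picks
}
```

The lock prevents data races, but when several goroutines share one RNG the order of draws depends on scheduling, so strict reproducibility still needs one RNG per worker or a single drawing goroutine.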

Minimization¶

Hash-Based Deduplication¶

Every file in the corpus is identified by its SHA-256 hash. When new crash inputs are added back to the corpus, duplicates are detected and skipped automatically.

corpus/
├── seeds/
│   ├── seed_000.gguf
│   ├── seed_012.gguf
│   └── tinyllama-q2.gguf
└── crashes/
    ├── abc123de.gguf      ← named by hash prefix
    └── f9871a02.gguf
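A minimal sketch of hash-based deduplication, assuming crash files are named after the first eight hex characters of their SHA-256 digest, as the listing above suggests. The dedupSet type and its Add method are hypothetical names, not Crucible's API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// dedupSet remembers the SHA-256 digest of every input it has accepted.
// It is not safe for concurrent use without external locking.
type dedupSet struct {
	seen map[[sha256.Size]byte]struct{}
}

func newDedupSet() *dedupSet {
	return &dedupSet{seen: make(map[[sha256.Size]byte]struct{})}
}

// Add reports the file name derived from the digest and whether the input was new.
// The full digest is compared, so two inputs sharing only a prefix are not merged.
func (d *dedupSet) Add(data []byte) (name string, added bool) {
	sum := sha256.Sum256(data)
	name = hex.EncodeToString(sum[:4]) + ".gguf" // 8 hex characters (assumed prefix length)
	if _, dup := d.seen[sum]; dup {
		return name, false
	}
	d.seen[sum] = struct{}{}
	return name, true
}

func main() {
	set := newDedupSet()
	for _, input := range [][]byte{[]byte("GGUF"), []byte("GGUF"), []byte("GGML")} {
		name, added := set.Add(input)
		fmt.Println(name, added)
	}
}
```

Comparing full digests while naming by prefix keeps file names short without risking a false duplicate on a prefix collision, though two distinct inputs could then map to the same file name.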

Coverage-Guided Minimization¶

MinimizeWithCoverage uses an LLVM-instrumented harness to collect per-seed edge coverage, then applies a greedy set-cover algorithm to select a small seed set that still covers every observed edge. Greedy set cover is an approximation, so the result is compact but not guaranteed to be the true minimum. Seeds that produce zero coverage (e.g. due to early crashes or instrumentation gaps) are kept, after hash-based dedup, rather than silently dropped.

result, err := corpus.MinimizeWithCoverage(seeds, "./harness-binary")

MinimizeWithCoverage falls back to hash-based dedup when no harness is provided.
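The greedy set-cover step can be sketched independently of the harness. The example below assumes coverage has already been collected as a map from seed name to edge IDs. The function name greedyCover and the data layout are assumptions, and the real MinimizeWithCoverage may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// greedyCover repeatedly picks the seed that covers the most still-uncovered
// edges. Seeds with no coverage at all are returned separately so the caller
// can keep them instead of dropping them.
func greedyCover(coverage map[string][]uint32) (picked, zeroCoverage []string) {
	names := make([]string, 0, len(coverage))
	uncovered := make(map[uint32]struct{})
	for name, edges := range coverage {
		names = append(names, name)
		for _, e := range edges {
			uncovered[e] = struct{}{}
		}
	}
	sort.Strings(names) // deterministic tie-breaking
	for _, name := range names {
		if len(coverage[name]) == 0 {
			zeroCoverage = append(zeroCoverage, name)
		}
	}
	for len(uncovered) > 0 {
		best, bestGain := "", 0
		for _, name := range names {
			counted := make(map[uint32]struct{}) // ignore repeated edge IDs
			for _, e := range coverage[name] {
				if _, open := uncovered[e]; open {
					counted[e] = struct{}{}
				}
			}
			if len(counted) > bestGain {
				best, bestGain = name, len(counted)
			}
		}
		// Every uncovered edge belongs to some seed, so best is never empty here.
		for _, e := range coverage[best] {
			delete(uncovered, e)
		}
		picked = append(picked, best)
	}
	return picked, zeroCoverage
}

func main() {
	picked, zero := greedyCover(map[string][]uint32{
		"a.gguf": {1, 2, 3},
		"b.gguf": {3, 4},
		"c.gguf": {4},
		"d.gguf": {1, 2},
		"e.gguf": {},
	})
	fmt.Println(picked, zero)
}
```

Ties are broken by name here. Breaking them by file size instead would favor smaller seeds, which matters for mutation speed.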

Fuzzer Dictionary¶

The file corpus/gguf.dict contains 240+ known-interesting byte patterns, organized into categories, which the mutation engine can insert into string fields and raw data sections:

corpus/gguf.dict (excerpt)

# Magic and structural patterns
"GGUF"
"\x46\x55\x47\x47"

# Common metadata keys (60+ entries)
"general.architecture"
"general.alignment"
"general.name"
"tokenizer.ggml.model"
"tokenizer.ggml.tokens"
"llama.attention.head_count"
"llama.rope.freq_base"

# Architecture values (106+ entries)
"llama"
"falcon"
"gpt2"
"mpt"
"phi3"
"qwen2"
"gemma"
"command-r"

# ggml_type enum values (26 entries)
"\x00\x00\x00\x00"  # GGML_TYPE_F32
"\x01\x00\x00\x00"  # GGML_TYPE_F16
"\x02\x00\x00\x00"  # GGML_TYPE_Q4_0
"\x1e\x00\x00\x00"  # GGML_TYPE_COUNT (sentinel)

# Boundary values
"\xff\xff\xff\xff"
"\xff\xff\xff\xff\xff\xff\xff\xff"

The dictionary categories include:

Category	Count	Purpose
Architecture values	106	Every `general.architecture` variant (llama, falcon, gpt2, phi3, etc.)
Metadata keys	60+	KV store keys across all architectures and tokenizer types
ggml_type enums	26	All quantization type values including boundary sentinels
GGUF value types	13	Metadata value type indicators
Structural patterns	20+	Magic bytes, alignment, boundary values, format strings
Injection patterns	15+	Path traversal, format strings, integer boundaries

The dictionary supplements random mutation with targeted patterns known to trigger specific bug classes — type confusion from invalid enum values, parsing errors from unexpected architecture strings, and integer boundary conditions in tensor size calculations.
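For readers who want to consume such a dictionary in their own tooling, the sketch below decodes one line of the excerpt into raw bytes. It assumes the quoting shown above (double-quoted tokens, \xNN escapes, # comments) and leans on Go's own string-literal rules, which accept a superset of those escapes. The function name parseDictLine is hypothetical, and the mutation engine's real loader may behave differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDictLine decodes one dictionary line into raw bytes.
// ok is false for blank lines and comment lines.
func parseDictLine(line string) (token []byte, ok bool, err error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return nil, false, nil
	}
	// Skip an optional name= prefix before the opening quote.
	if i := strings.IndexByte(line, '"'); i > 0 {
		line = line[i:]
	}
	// QuotedPrefix isolates the quoted token and ignores any trailing comment.
	quoted, err := strconv.QuotedPrefix(line)
	if err != nil {
		return nil, false, fmt.Errorf("malformed dictionary entry %q: %w", line, err)
	}
	s, err := strconv.Unquote(quoted)
	if err != nil {
		return nil, false, err
	}
	return []byte(s), true, nil
}

func main() {
	lines := []string{
		`# Magic and structural patterns`,
		`"GGUF"`,
		`"\x46\x55\x47\x47"`,
		`"\x1e\x00\x00\x00"  # GGML_TYPE_COUNT (sentinel)`,
	}
	for _, l := range lines {
		tok, ok, err := parseDictLine(l)
		fmt.Printf("ok=%v err=%v bytes=% x\n", ok, err, tok)
	}
}
```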

Additional Dictionaries¶

Crucible includes format-specific dictionaries for non-GGUF attack surfaces:

Dictionary	Harnesses	Content
`grammar.dict`	grammar, grammar-compile	GBNF operators, rule syntax, grammar keywords
`jinja.dict`	jinja, chat-template	Jinja syntax tokens, template directives, filter names
`json-schema.dict`	json-schema	JSON Schema keywords (`$ref`, `oneOf`, `pattern`, etc.)
`pytorch.dict`	torchscript, torch-load	PyTorch magic bytes, TorchScript opcodes, pickle tokens
`rpc.dict`	rpc, rpc-commands, rpc-race	RPC command opcodes, protocol framing bytes
`server.dict`	server	HTTP methods, JSON content types, API endpoint paths
`tflite.dict`	tflite	TFLite FlatBuffer identifiers, operator codes

The --dict flag selects the matching dictionary automatically when the harness name or corpus directory corresponds to one of these target surfaces.