GGUF Format¶
The GGUF (GPT-Generated Unified Format) is the standard binary format for storing quantized LLM weights. It is used by llama.cpp, Ollama, LM Studio, GPT4All, koboldcpp, and dozens of other tools in the local AI ecosystem.
Understanding this format is essential for structure-aware fuzzing — every field is a potential attack surface.
Binary Layout¶
```mermaid
block-beta
columns 1
A["Header (24 bytes)\nmagic | version | tensor_count | metadata_kv_count"]
B["Metadata KV Pairs\nkey_length | key | value_type | value\n(repeated metadata_kv_count times)"]
C["Tensor Info\nname_length | name | n_dims | dims[] | type | offset\n(repeated tensor_count times)"]
D["Alignment Padding\n(zero bytes to alignment boundary)"]
E["Tensor Data\n(raw weight data at specified offsets)"]
```

Header¶
The header is always exactly 24 bytes; all fields are little-endian:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | magic | Always the ASCII bytes `GGUF` (0x46554747 as a little-endian uint32) |
| 4 | 4 | version | Format version (currently 3) |
| 8 | 8 | tensor_count | Number of tensors in the file |
| 16 | 8 | metadata_kv_count | Number of metadata key-value pairs |
Metadata Value Types¶
Each metadata entry stores a key (UTF-8 string) and a typed value:
| ID | Type | Size (bytes) | Go Type |
|---|---|---|---|
| 0 | UINT8 | 1 | uint8 |
| 1 | INT8 | 1 | int8 |
| 2 | UINT16 | 2 | uint16 |
| 3 | INT16 | 2 | int16 |
| 4 | FLOAT16 | 2 | uint16 (raw bits) |
| 5 | UINT32 | 4 | uint32 |
| 6 | INT32 | 4 | int32 |
| 7 | FLOAT32 | 4 | float32 |
| 8 | BOOL | 1 | bool |
| 9 | STRING | variable | string |
| 10 | ARRAY | variable | ArrayValue |
| 11 | UINT64 | 8 | uint64 |
| 12 | INT64 | 8 | int64 |
| 13 | FLOAT64 | 8 | float64 |
Array Values¶
Arrays contain a nested type tag and a count, followed by the packed elements:

```
element_type: uint32    // One of the value types above
count:        uint64    // Number of elements
elements:     variable  // count × element data
```
Tensor Info¶
Each tensor entry describes a weight matrix:
| Field | Type | Description |
|---|---|---|
| name_length | uint64 | Length of tensor name in bytes |
| name | utf8 | Tensor name (e.g., `blk.0.attn_q.weight`) |
| n_dims | uint32 | Number of dimensions (1-4 per spec) |
| dims | uint64[n_dims] | Size of each dimension |
| type | uint32 | GGML quantization type |
| offset | uint64 | Byte offset from start of tensor data section |
The total data size for a tensor is calculated as:
```
n_elements = dims[0] × dims[1] × ... × dims[n_dims-1]
n_blocks   = ceil(n_elements / block_size)
data_size  = n_blocks × type_size
```
Alignment¶
Padding zeros are inserted between the tensor info section and tensor data to align to a boundary:
- Default: 32 bytes
- Configurable: via the `general.alignment` metadata key
Critical Attack Surface
The `general.alignment` key directly controls padding calculations. Setting it to 0 causes division-by-zero; setting it to UINT32_MAX causes massive memory allocation. This is where Cisco Talos found heap buffer overflows (CVSS 8.8) in llama.cpp.
Common Metadata Keys¶
| Key | Type | Description |
|---|---|---|
| `general.architecture` | STRING | Model architecture (llama, falcon, etc.) |
| `general.name` | STRING | Human-readable model name |
| `general.file_type` | UINT32 | Quantization file type |
| `general.alignment` | UINT32 | Byte alignment for padding |
| `general.quantization_version` | UINT32 | Quantization version |
| `tokenizer.ggml.model` | STRING | Tokenizer model type |
| `tokenizer.ggml.tokens` | ARRAY[STRING] | Token vocabulary |
| `tokenizer.ggml.scores` | ARRAY[FLOAT32] | Token scores |
Byte Order¶
All multi-byte fields are little-endian. The magic is stored on disk as the bytes 0x47 0x47 0x55 0x46, which read sequentially as "GGUF" (G=0x47, U=0x55, F=0x46); interpreted as a little-endian uint32, this is 0x46554747.