GGUF Format¶
The GGUF (GPT-Generated Unified Format) is the standard binary format for storing quantized LLM weights. It is used by llama.cpp, Ollama, LM Studio, GPT4All, koboldcpp, and dozens of other tools in the local AI ecosystem.
Understanding this format is essential for structure-aware fuzzing — every field is a potential attack surface.
Binary Layout¶
```mermaid
block-beta
columns 1
A["Header (24 bytes)\nmagic | version | tensor_count | metadata_kv_count"]
B["Metadata KV Pairs\nkey_length | key | value_type | value\n(repeated metadata_kv_count times)"]
C["Tensor Info\nname_length | name | n_dims | dims[] | type | offset\n(repeated tensor_count times)"]
D["Alignment Padding\n(zero bytes to alignment boundary)"]
E["Tensor Data\n(raw weight data at specified offsets)"]
```

Header¶
The header is always exactly 24 bytes; all fields are little-endian:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | magic | Always the ASCII bytes `GGUF` (0x46554747 as a little-endian uint32) |
| 4 | 4 | version | Format version (currently 3) |
| 8 | 8 | tensor_count | Number of tensors in the file |
| 16 | 8 | metadata_kv_count | Number of metadata key-value pairs |
Metadata Value Types¶
Each metadata entry stores a key (UTF-8 string) and a typed value:
| ID | Type | Size (bytes) | Go Type |
|---|---|---|---|
| 0 | UINT8 | 1 | uint8 |
| 1 | INT8 | 1 | int8 |
| 2 | UINT16 | 2 | uint16 |
| 3 | INT16 | 2 | int16 |
| 4 | FLOAT16 | 2 | uint16 (raw bits) |
| 5 | UINT32 | 4 | uint32 |
| 6 | INT32 | 4 | int32 |
| 7 | FLOAT32 | 4 | float32 |
| 8 | BOOL | 1 | bool |
| 9 | STRING | variable | string |
| 10 | ARRAY | variable | ArrayValue |
| 11 | UINT64 | 8 | uint64 |
| 12 | INT64 | 8 | int64 |
| 13 | FLOAT64 | 8 | float64 |
Array Values¶
Arrays contain a nested type tag and a count, followed by the packed elements:

```
element_type: uint32    // One of the value types above
count:        uint64    // Number of elements
elements:     variable  // count × element data
```
Tensor Info¶
Each tensor entry describes a weight matrix:
| Field | Type | Description |
|---|---|---|
| name_length | uint64 | Length of tensor name in bytes |
| name | utf8 | Tensor name (e.g., `blk.0.attn_q.weight`) |
| n_dims | uint32 | Number of dimensions (1-4 per spec) |
| dims | uint64[n_dims] | Size of each dimension |
| type | uint32 | GGML quantization type |
| offset | uint64 | Byte offset from start of tensor data section |
The total data size for a tensor is calculated as:
```
n_elements = dims[0] × dims[1] × ... × dims[n_dims-1]
n_blocks   = ceil(n_elements / block_size)
data_size  = n_blocks × type_size
```
Alignment¶
Padding zeros are inserted between the tensor info section and tensor data to align to a boundary:
- Default: 32 bytes
- Configurable: via the `general.alignment` metadata key
Critical Attack Surface
The `general.alignment` key directly controls padding calculations. Setting it to 0 causes division-by-zero; setting it to UINT32_MAX causes massive memory allocation. This is where Cisco Talos found heap buffer overflows (CVSS 8.8) in llama.cpp.
Common Metadata Keys¶
| Key | Type | Description |
|---|---|---|
| `general.architecture` | STRING | Model architecture (llama, falcon, etc.) |
| `general.name` | STRING | Human-readable model name |
| `general.file_type` | UINT32 | Quantization file type |
| `general.alignment` | UINT32 | Byte alignment for padding |
| `general.quantization_version` | UINT32 | Quantization version |
| `tokenizer.ggml.model` | STRING | Tokenizer model type |
| `tokenizer.ggml.tokens` | ARRAY[STRING] | Token vocabulary |
| `tokenizer.ggml.scores` | ARRAY[FLOAT32] | Token scores |
Byte Order¶
All multi-byte fields are little-endian. The magic is stored on disk as the bytes 0x47 0x47 0x55 0x46, which read sequentially as "GGUF" (G=0x47, U=0x55, F=0x46); interpreted as a little-endian uint32, this is 0x46554747.