GGUF Format

The GGUF (GPT-Generated Unified Format) is the standard binary format for storing quantized LLM weights. It is used by llama.cpp, Ollama, LM Studio, GPT4All, koboldcpp, and dozens of other tools in the local AI ecosystem.

Understanding this format is essential for structure-aware fuzzing — every field is a potential attack surface.

Binary Layout

block-beta
    columns 1
    A["Header (24 bytes)\nmagic | version | tensor_count | metadata_kv_count"]
    B["Metadata KV Pairs\nkey_length | key | value_type | value\n(repeated metadata_kv_count times)"]
    C["Tensor Info\nname_length | name | n_dims | dims[] | type | offset\n(repeated tensor_count times)"]
    D["Alignment Padding\n(zero bytes to alignment boundary)"]
    E["Tensor Data\n(raw weight data at specified offsets)"]

The header is always exactly 24 bytes, all fields little-endian:

| Offset | Size | Field | Description |
|--------|------|-------|-------------|
| 0 | 4 | magic | Always the ASCII bytes `GGUF` (`0x47 0x47 0x55 0x46` on disk) |
| 4 | 4 | version | Format version (currently 3) |
| 8 | 8 | tensor_count | Number of tensors in the file |
| 16 | 8 | metadata_kv_count | Number of metadata key-value pairs |
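
As a concrete reference point, here is a minimal Go sketch of reading and validating this header. The package, struct, and function names are illustrative, not taken from any particular parser:

```go
package gguf

import (
	"encoding/binary"
	"errors"
	"io"
)

// header mirrors the fixed 24-byte GGUF header described above.
type header struct {
	Magic           [4]byte // always "GGUF"
	Version         uint32  // currently 3
	TensorCount     uint64
	MetadataKVCount uint64
}

// readHeader reads and validates the 24-byte header; all fields are little-endian.
func readHeader(r io.Reader) (header, error) {
	var h header
	buf := make([]byte, 24)
	if _, err := io.ReadFull(r, buf); err != nil {
		return h, err
	}
	copy(h.Magic[:], buf[0:4])
	if string(h.Magic[:]) != "GGUF" {
		return h, errors.New("bad magic")
	}
	h.Version = binary.LittleEndian.Uint32(buf[4:8])
	h.TensorCount = binary.LittleEndian.Uint64(buf[8:16])
	h.MetadataKVCount = binary.LittleEndian.Uint64(buf[16:24])
	return h, nil
}
```

Note that tensor_count and metadata_kv_count are untrusted: they drive the loops over the sections that follow, so a parser should sanity-check them before allocating anything proportional to them.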

Metadata Value Types

Each metadata entry stores a key (UTF-8 string) and a typed value:

| ID | Type | Size (bytes) | Go Type |
|----|------|--------------|---------|
| 0 | UINT8 | 1 | uint8 |
| 1 | INT8 | 1 | int8 |
| 2 | UINT16 | 2 | uint16 |
| 3 | INT16 | 2 | int16 |
| 4 | UINT32 | 4 | uint32 |
| 5 | INT32 | 4 | int32 |
| 6 | FLOAT32 | 4 | float32 |
| 7 | BOOL | 1 | bool |
| 8 | STRING | variable | string |
| 9 | ARRAY | variable | ArrayValue |
| 10 | UINT64 | 8 | uint64 |
| 11 | INT64 | 8 | int64 |
| 12 | FLOAT64 | 8 | float64 |
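
Metadata keys and STRING values share one encoding: a uint64 byte length followed by that many bytes of UTF-8. A minimal reader sketch (standard-library imports omitted; the length cap is our own defensive choice, not part of the format):

```go
// readString reads a length-prefixed GGUF string (uint64 length + UTF-8 bytes).
func readString(r io.Reader) (string, error) {
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	const maxLen = 1 << 20 // arbitrary sanity cap: untrusted lengths drive allocations
	if n > maxLen {
		return "", fmt.Errorf("string length %d exceeds limit", n)
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}
```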

Array Values

Arrays contain a nested type tag and count:

element_type: uint32    // One of the value types above
count:        uint64    // Number of elements
elements:     variable  // count × element data
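
A sketch of reading that array header follows; the count field is a favorite fuzzing target because a naive parser allocates count elements before reading any of them (the cap below is ours, not the spec's):

```go
// readArrayHeader reads the nested element type and count of an ARRAY value.
func readArrayHeader(r io.Reader) (elemType uint32, count uint64, err error) {
	if err = binary.Read(r, binary.LittleEndian, &elemType); err != nil {
		return
	}
	if err = binary.Read(r, binary.LittleEndian, &count); err != nil {
		return
	}
	const maxElems = 1 << 24 // arbitrary sanity cap before allocating element storage
	if count > maxElems {
		err = fmt.Errorf("array count %d exceeds limit", count)
	}
	return
}
```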

Tensor Info

Each tensor entry describes a weight matrix:

| Field | Type | Description |
|-------|------|-------------|
| name_length | uint64 | Length of tensor name in bytes |
| name | utf8 | Tensor name (e.g., `blk.0.attn_q.weight`) |
| n_dims | uint32 | Number of dimensions (1-4 per spec) |
| dims | uint64[n_dims] | Size of each dimension |
| type | uint32 | GGML quantization type |
| offset | uint64 | Byte offset from start of tensor data section |
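
Putting the pieces together, one tensor descriptor can be read like this, reusing the readString helper sketched earlier (names are illustrative; the dimension check mirrors the 1-4 limit above):

```go
// tensorInfo mirrors one entry of the tensor info section.
type tensorInfo struct {
	Name   string
	Dims   []uint64
	Type   uint32 // GGML quantization type
	Offset uint64 // relative to the start of the tensor data section
}

func readTensorInfo(r io.Reader) (tensorInfo, error) {
	var t tensorInfo
	name, err := readString(r)
	if err != nil {
		return t, err
	}
	t.Name = name
	var nDims uint32
	if err := binary.Read(r, binary.LittleEndian, &nDims); err != nil {
		return t, err
	}
	if nDims < 1 || nDims > 4 { // spec allows 1-4 dimensions
		return t, fmt.Errorf("n_dims %d out of range", nDims)
	}
	t.Dims = make([]uint64, nDims)
	for i := range t.Dims {
		if err := binary.Read(r, binary.LittleEndian, &t.Dims[i]); err != nil {
			return t, err
		}
	}
	if err := binary.Read(r, binary.LittleEndian, &t.Type); err != nil {
		return t, err
	}
	if err := binary.Read(r, binary.LittleEndian, &t.Offset); err != nil {
		return t, err
	}
	return t, nil
}
```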

The total data size for a tensor is calculated as:

n_elements = dims[0] × dims[1] × ... × dims[n_dims-1]
n_blocks = ceil(n_elements / block_size)
data_size = n_blocks × type_size
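
Here block_size and type_size depend on the GGML quantization type: F32 stores one element per 4-byte block, for instance, while Q4_0 packs 32 elements into an 18-byte block. A sketch of the calculation with the overflow checks a hardened parser needs (the per-type table is deliberately tiny and illustrative; the authoritative values live in ggml):

```go
// Block geometry for a few GGML types (F32=0, F16=1, Q4_0=2); illustrative only.
var blockSizes = map[uint32]uint64{0: 1, 1: 1, 2: 32} // elements per block
var typeSizes = map[uint32]uint64{0: 4, 1: 2, 2: 18}  // bytes per block

// dataSize applies the formula above. Unchecked multiplication of
// attacker-controlled dims is a classic integer-overflow bug, so each step is guarded.
func dataSize(dims []uint64, ggmlType uint32) (uint64, error) {
	bs, ok1 := blockSizes[ggmlType]
	ts, ok2 := typeSizes[ggmlType]
	if !ok1 || !ok2 {
		return 0, fmt.Errorf("unknown GGML type %d", ggmlType)
	}
	nElems := uint64(1)
	for _, d := range dims {
		if d != 0 && nElems > math.MaxUint64/d {
			return 0, errors.New("element count overflows uint64")
		}
		nElems *= d
	}
	nBlocks := nElems / bs
	if nElems%bs != 0 {
		nBlocks++
	}
	if ts != 0 && nBlocks > math.MaxUint64/ts {
		return 0, errors.New("data size overflows uint64")
	}
	return nBlocks * ts, nil
}
```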

Alignment

Padding zeros are inserted between the tensor info section and tensor data to align to a boundary:

  • Default: 32 bytes
  • Configurable: via general.alignment metadata key

Critical Attack Surface

The general.alignment key directly controls the padding calculation. Setting it to 0 causes a division by zero; setting it to UINT32_MAX forces massive memory allocations. This is where Cisco Talos found heap buffer overflows (CVSS 8.8) in llama.cpp.
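
A hardened padding computation therefore validates the metadata-supplied alignment before using it. A sketch (the upper bound is an arbitrary harness-side choice, not part of the format):

```go
// alignOffset rounds offset up to the next multiple of align, as used for the
// padding between the tensor info section and the tensor data.
func alignOffset(offset, align uint64) (uint64, error) {
	const maxAlign = 1 << 16 // arbitrary sanity cap for a harness
	if align == 0 || align > maxAlign {
		// align == 0 is the division-by-zero case described above;
		// huge values force oversized paddings and allocations.
		return 0, fmt.Errorf("suspicious alignment %d", align)
	}
	pad := (align - offset%align) % align
	return offset + pad, nil
}
```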

Common Metadata Keys

| Key | Type | Description |
|-----|------|-------------|
| general.architecture | STRING | Model architecture (llama, falcon, etc.) |
| general.name | STRING | Human-readable model name |
| general.file_type | UINT32 | Quantization file type |
| general.alignment | UINT32 | Byte alignment for padding |
| general.quantization_version | UINT32 | Quantization version |
| tokenizer.ggml.model | STRING | Tokenizer model type |
| tokenizer.ggml.tokens | ARRAY[STRING] | Token vocabulary |
| tokenizer.ggml.scores | ARRAY[FLOAT32] | Token scores |
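
As an example of consuming these keys, a harness might resolve general.alignment like this, falling back to the 32-byte default noted above (the map type and the ignore-on-bad-value policy are our assumptions):

```go
// effectiveAlignment returns the alignment to use for padding: the value of
// general.alignment if present and plausible, otherwise the 32-byte default.
func effectiveAlignment(meta map[string]any) uint64 {
	v, ok := meta["general.alignment"]
	if !ok {
		return 32
	}
	if a, ok := v.(uint32); ok && a > 0 {
		return uint64(a)
	}
	return 32 // ignore ill-typed or zero values rather than trusting them
}
```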

Byte Order

All multi-byte fields are little-endian. The magic bytes are stored as 0x47 0x47 0x55 0x46 on disk, which reads sequentially as "GGUF" (G=0x47, G=0x47, U=0x55, F=0x46).