Changyu Lee

A Review of Llama 4 Architecture

Published at
2025/12/23
Last edited time
2025/12/23 03:46
Created
2025/12/23 02:26
Section
LLM
Status
Done
Series
Tags
Paper
AI summary
Llama 4 introduces a mixture of experts (MoE) architecture that enhances inference efficiency and deployment flexibility, utilizing 17B active parameters from a total of 400B. It employs a vision encoder based on MetaCLIP for improved multimodal processing. Key innovations include Rotary Positional Embedding (RoPE) for better position encoding, Grouped Query Attention (GQA) for reduced memory usage, and RMSNorm for stable training. The MoE architecture allows selective activation of parameters, improving performance while maintaining lower costs and latency. Additionally, MetaCLIP enhances dataset quality for better alignment in vision-language tasks.
Keywords
llama
LLM
RoPE
RMSNorm
Language
ENG
This article explores the key architectural innovations that make Llama 4 a significant advancement in large language model design, focusing on its mixture of experts approach and core technical components.
Llama 4 represents a major architectural advancement in large language models through its mixture of experts (MoE) design, which enables efficient scaling to 400B total parameters while activating only 17B parameters per token. The architecture alternates between dense and MoE layers, with each MoE layer containing 128 routed experts plus a shared expert, allowing for significant reductions in inference costs and latency while maintaining high model quality.
Key architectural components include:
Rotary Positional Embedding (RoPE): Encodes position information by rotating query and key vectors, providing better relative position modeling, computational efficiency through sparse matrix operations, and natural long-term decay properties that stabilize training
Grouped Query Attention (GQA): Reduces memory and computational costs by sharing key and value projections across groups of query heads, dramatically reducing KV cache size while maintaining better quality than extreme sharing approaches
RMSNorm: Simplifies layer normalization by normalizing only by root mean square magnitude without mean subtraction, providing computational efficiency and improved gradient stability
MoE Architecture: Conditionally activates only a subset of expert networks per token based on learned routing decisions, enabling massive model capacity with practical inference costs and allowing experts to specialize for different input types
Vision Encoder (MetaCLIP): Provides multimodal capabilities through specialized training with a frozen Llama language model for optimal visual-textual integration
These innovations collectively enable Llama 4 to achieve state-of-the-art performance while maintaining practical deployment characteristics, including the ability to run on a single NVIDIA H100 DGX host for straightforward scenarios or scale to distributed configurations for maximum efficiency.

A Review of Llama 4

The new Llama 4 models represent a significant architectural advancement through their implementation of a mixture of experts (MoE) architecture. This design choice enables substantial improvements in both inference efficiency and deployment flexibility.
The Llama 4 Maverick models exemplify this approach, featuring 17B active parameters while maintaining 400B total parameters. The architecture employs an alternating pattern of dense and MoE layers to optimize inference performance. Within the MoE layers, the system utilizes 128 routed experts alongside a shared expert. During inference, each token is processed by both the shared expert and exactly one of the 128 routed experts, ensuring that only a subset of parameters are activated at any given time.
This selective activation strategy yields significant practical benefits. By reducing the number of active parameters during inference, the architecture substantially lowers both serving costs and latency. The efficiency gains are remarkable: Llama 4 Maverick can run on a single NVIDIA H100 DGX host for straightforward deployment scenarios, while also supporting distributed inference configurations for maximum computational efficiency in production environments.

Vision Encoder of Llama 4

The Llama 4 vision encoder also represents an advancement in multimodal processing capabilities. Built upon the MetaCLIP architecture, this encoder was specifically trained in conjunction with a frozen Llama language model to ensure optimal adaptation between visual and textual representations.
This specialized training approach enables more effective integration of visual information into the language model's processing pipeline, resulting in improved performance on tasks requiring joint understanding of images and text. By fine-tuning the vision encoder while keeping the language model frozen, the architecture maintains the strong linguistic capabilities of Llama while developing robust visual understanding.

Components of Llama 4

RoPE

Transformer models rely on self-attention, which is inherently position-agnostic. Without explicit positional information, tokens are treated as an unordered set. Traditional positional encodings address this by adding position-dependent vectors to token embeddings, but this approach has limitations:
Poor extrapolation to longer sequence lengths
Weak modeling of relative positions
Incompatibility with certain efficient attention variants
Rotary Positional Embedding (RoPE) was proposed to overcome these issues by encoding position information directly into the attention mechanism itself, rather than into the embeddings additively.
RoPE injects positional information by rotating the query and key vectors in a position-dependent manner before computing attention.
In the 2D case, a query or key vector at position \(m\) is rotated by the angle \(m\theta\):
\[
f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} W_{\{q,k\}}\, x_m
\]
More generally, for a \(d\)-dimensional head, the rotation is applied to each pair of dimensions with its own frequency \(\theta_i\), using a block-diagonal rotation matrix \(R^{d}_{\Theta,m}\):
\[
f_{\{q,k\}}(x_m, m) = R^{d}_{\Theta,m} W_{\{q,k\}}\, x_m
\]
The key insight: the attention score depends only on the relative position \((m - n)\),
\[
\langle f_q(x_m, m),\ f_k(x_n, n) \rangle = g(x_m, x_n, m - n).
\]
Benefits of RoPE
Computationally efficient realization of the rotary matrix multiplication
Since the rotation matrix \(R^{d}_{\Theta,m}\) is highly sparse and block-diagonal, the rotation can be implemented without explicitly constructing the matrix (see the code sketch after this list). Instead, the vector is split into pairs of dimensions, and each pair is rotated using simple element-wise operations involving sine and cosine functions. Concretely, the rotated vector can be expressed as a combination of:
the original vector scaled by \(\cos(m\theta_i)\), and
a version of the vector with swapped and sign-flipped components scaled by \(\sin(m\theta_i)\).
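As an illustration, here is a minimal sketch of this pairwise rotation in PyTorch. It follows the interleaved-pair convention of the original RoPE paper; production Llama implementations typically precompute and cache the cos/sin tables and may apply the rotation with a "rotate-half" layout instead, but the arithmetic is the same.
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq, num_heads, head_dim) query or key tensor; head_dim must be even.
    # positions: (seq,) tensor of token positions m.
    head_dim = x.shape[-1]
    # Pairwise frequencies theta_i = base^(-2i/d), matching the decay schedule discussed below.
    inv_freq = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = positions.float()[:, None] * inv_freq[None, :]      # (seq, head_dim // 2)
    cos = angles.cos()[None, :, None, :]                         # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]                          # split into 2D pairs
    # Rotate each pair: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

# Example: rotate a batch of query vectors before the attention product
q = torch.randn(1, 8, 4, 64)                  # (batch, seq, heads, head_dim)
q_rot = apply_rope(q, torch.arange(8))
print(q_rot.shape)                            # torch.Size([1, 8, 4, 64])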
Long-term decay of RoPE
RoPE also contributes to faster and more stable training convergence through its mathematically structured way of encoding relative positions. When queries and keys are grouped into pairs of dimensions, the RoPE-modified inner product can be interpreted as a complex number multiplication. This reformulation reveals that the attention score is a weighted sum of complex exponentials of the relative position \((m - n)\).
Using this representation, the inner product can be transformed (via Abel transformation) into a form that reveals a long-range decay property: as the distance between tokens increases, high-frequency components contribute less. By choosing the rotation frequencies \(\theta_i = 10000^{-2i/d}\), the average magnitude of these terms decreases smoothly with distance.
This has two important consequences:
Inductive bias aligned with language: nearby tokens naturally influence each other more than distant ones.
Optimization stability: attention scores produce less noise for long-range interactions, keeping gradients well-behaved during training.
Empirically, this structured decay leads to faster convergence and better generalization, especially on long-context tasks. Unlike ad-hoc relative position biases, RoPE achieves this as a direct consequence of its geometric formulation—without extra parameters or heuristics.
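As a rough numerical illustration of this decay (not the exact Abel-transformation bound from the RoPE paper), one can check how the average magnitude of the summed rotations tends to shrink with relative distance, assuming a head dimension of 128 and the base 10000 used above:
import torch

d = 128                                                                  # assumed head dimension
theta = 10000.0 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)     # theta_i = 10000^(-2i/d)
for dist in [1, 8, 64, 512, 4096]:
    angles = dist * theta
    # |sum_i exp(j * dist * theta_i)| / (d/2): tends to shrink as the distance grows
    magnitude = torch.sqrt(angles.cos().sum() ** 2 + angles.sin().sum() ** 2) / (d // 2)
    print(f"relative distance {dist:5d} -> mean magnitude {magnitude.item():.3f}")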

Grouped Query Attention

Grouped Query Attention (GQA) is an optimization technique that reduces memory and computational costs in multi-head attention by sharing key and value projections across multiple query heads.
In standard Multi-Head Attention (MHA), each attention head has its own independent query (Q), key (K), and value (V) projection matrices. For a model with h heads and hidden dimension d, this results in:
h separate Q projections
h separate K projections
h separate V projections
This creates significant memory overhead, especially during inference when the KV cache must store keys and values for all previous tokens.
How GQA Works
GQA introduces a middle ground between MHA and Multi-Query Attention (MQA):
Multi-Query Attention (MQA): All query heads share a single set of keys and values (extreme sharing)
Grouped Query Attention (GQA): Query heads are divided into groups, and each group shares one set of keys and values
Multi-Head Attention (MHA): Each query head has its own keys and values (no sharing)
For example, if you have 32 query heads and use GQA with 8 groups:
You have 32 query projections (one per head)
You have only 8 key projections and 8 value projections (one per group)
Each group of 4 query heads shares the same K and V
Mathematical Formulation
Given input x and G groups where each group contains h/G query heads:
For group g:
Compute shared keys and values: \(K_g = x W_g^K, \quad V_g = x W_g^V\)
For each query head \(i\) in group \(g\): \(Q_i = x W_i^Q\)
Compute attention: \(\text{Attention}_i(Q_i, K_g, V_g) = \text{softmax}\!\left(\frac{Q_i K_g^T}{\sqrt{d_k}}\right) V_g\)
Benefits of GQA
Reduced KV cache size: During autoregressive inference, the memory required to store past keys and values is reduced by a factor of h/G, where G is the number of groups
Faster inference: Smaller KV cache means less memory bandwidth consumption, leading to faster generation speeds
Better quality than MQA: GQA maintains better model quality compared to extreme sharing in MQA, while still providing significant efficiency gains
Scalability: Enables deployment of larger models on memory-constrained hardware
GQA in Llama 4
Llama 4 employs GQA to balance computational efficiency with model expressiveness. This design choice is particularly important for the MoE architecture, where efficient attention mechanisms help maintain reasonable inference costs despite the large total parameter count. The grouped structure allows Llama 4 to:
Support longer context windows without proportional memory increase
Achieve faster inference speeds for both single-GPU and distributed deployments
Maintain high model quality comparable to full MHA
Here is a simplified code implementation:
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.num_queries_per_kv = num_heads // num_kv_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape

        # Project to Q, K, V
        q = self.q_proj(x).view(batch, seq_len, self.num_heads, self.head_dim)
        k = self.k_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim)
        v = self.v_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim)

        # Repeat K and V so each group of query heads shares the same K/V head
        k = k.repeat_interleave(self.num_queries_per_kv, dim=2)
        v = v.repeat_interleave(self.num_queries_per_kv, dim=2)

        # Move the head dimension forward: (batch, num_heads, seq, head_dim)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))

        # Compute (unmasked) scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn = torch.softmax(scores, dim=-1)
        output = torch.matmul(attn, v)

        # Concatenate heads and project back to the hidden size
        output = output.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.o_proj(output)
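For instance, instantiating the module with a hypothetical 32-query-head / 8-KV-head configuration (the same grouping used in the example above, not Llama 4's published configuration) shows the memory saving directly:
import torch

# Hypothetical sizes for illustration only
attn = GroupedQueryAttention(hidden_size=4096, num_heads=32, num_kv_heads=8)
x = torch.randn(2, 16, 4096)            # (batch, seq, hidden)
print(attn(x).shape)                     # torch.Size([2, 16, 4096])

# K/V projections emit 8 * 128 = 1024 channels instead of 32 * 128 = 4096,
# so the KV cache is 4x smaller than with full multi-head attention.
print(attn.k_proj.out_features, attn.q_proj.out_features)   # 1024 4096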

RMSNorm

Layer Normalization (LayerNorm) stabilizes training by normalizing activations. It centers the vector (subtracts the mean) and scales it (divides by the standard deviation). RMSNorm is simpler: it skips mean subtraction and only normalizes by the root mean square (RMS) magnitude.
This preserves directional information while controlling scale—often sufficient to keep deep Transformers stable.
Given an input activation vector \(a \in \mathbb{R}^n\), RMSNorm:
computes a single scalar RMS(a) measuring the vector's overall magnitude (average energy across dimensions),
rescales the vector by this value to control its magnitude, and
applies a learnable per-dimension scale parameter \(g\) (sometimes called the weight).
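In formula form, following the definition in the RMSNorm paper:
\[
\mathrm{RMS}(a) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^2},
\qquad
\bar{a}_i = \frac{a_i}{\mathrm{RMS}(a)}\, g_i
\]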
Key takeaway:
Direction stays the same—only the length is normalized (up to learned rescaling).

Backpropagation intuition (why gradients remain stable)

The normalization term is a single scalar computed from the entire vector, so each dimension's gradient is not fully independent. Changing one coordinate changes RMS(a), affecting all coordinates.
However, RMSNorm has a simpler dependency structure than LayerNorm because it avoids mean-centering. In practice, this often results in:
fewer coupled terms in the Jacobian,
more predictable gradient flow,
improved numerical stability in very deep networks.
The combination of RMSNorm's computational efficiency and gradient stability makes it particularly well-suited for the large-scale Llama 4 architecture, where even small improvements in training dynamics and inference speed compound significantly across billions of parameters.
Here is the code snippet:
import torch
import torch.nn as nn

class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x

MoE Architecture

Mixture of Experts (MoE) is a neural network architecture that improves efficiency by conditionally activating only a subset of model parameters for each input. Instead of routing every token through all parameters, MoE dynamically selects which "expert" networks to use.

Core Concept

A standard Transformer processes each token through the same feedforward network (FFN). MoE replaces this single FFN with:
Multiple expert networks: A collection of specialized FFNs (e.g., 128 experts)
A gating/router network: Decides which experts process each token
Sparse activation: Only a small subset of experts (e.g., top-1 or top-2) are activated per token
How MoE Works
For each input token x:
1. Router computes scores: A lightweight network calculates a score for each expert: \(s_i = \text{Router}(x) \cdot e_i\)
2. Top-k selection: Select the top-k experts with the highest scores (typically k=1 or k=2)
3. Weighted combination: The token is processed by the selected experts, and their outputs are combined using softmax-normalized router scores:
\[
y = \sum_{i \in \text{Top-k}} \frac{\exp(s_i)}{\sum_{j \in \text{Top-k}} \exp(s_j)} \cdot \text{Expert}_i(x)
\]
Benefits of MoE
Increased model capacity without proportional compute cost: You can have 400B total parameters but only activate 17B per token
Faster inference: Processing fewer parameters per token reduces latency and memory bandwidth
Specialization: Different experts can learn to handle different types of inputs (e.g., code vs. natural language, different domains)
Better scaling: MoE enables training much larger models that would be impractical with dense architectures
Challenges and Solutions
Load balancing: Some experts might be used much more than others, leading to inefficient utilization. Solution: add an auxiliary loss term that encourages balanced expert usage (a sketch follows this list).
Training instability: Router decisions can be noisy early in training. Solution: use techniques like router z-loss, dropout on router logits, or expert capacity limits.
Communication overhead in distributed training: Experts may be on different GPUs, requiring inter-GPU communication. Solution: expert parallelism strategies, local expert placement, or hybrid dense-MoE layers.
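As a sketch of the load-balancing idea, here is one common formulation of the auxiliary loss (the Switch-Transformer-style product of dispatch fraction and mean router probability; this is an illustrative assumption, not Llama 4's published objective). It consumes the router logits and top-1 expert assignments, which correspond to the `router_logits` and `selected_experts` tensors in the implementation further below:
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        selected_experts: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts); selected_experts: (num_tokens, top_k)
    probs = F.softmax(router_logits, dim=-1)
    # Fraction of tokens dispatched to each expert (based on the top-1 assignment)
    dispatch = F.one_hot(selected_experts[:, 0], num_experts).float().mean(dim=0)
    # Mean router probability assigned to each expert
    importance = probs.mean(dim=0)
    # Minimized when both distributions are uniform, i.e. experts are used evenly
    return num_experts * torch.sum(dispatch * importance)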
MoE Architecture in Llama 4
As described above, Llama 4 Maverick alternates dense and MoE layers; each MoE layer contains 128 routed experts plus a shared expert, and every token is processed by the shared expert and exactly one routed expert. Here's a simplified MoE implementation that mirrors this structure, with a configurable top-k router and an always-active shared expert:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, expert_size: int, top_k: int = 1):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Router network
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

        # Expert networks (simplified here as small two-layer MLPs)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, expert_size),
                nn.ReLU(),
                nn.Linear(expert_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

        # Shared expert (always active, as in Llama 4's MoE layers)
        self.shared_expert = nn.Sequential(
            nn.Linear(hidden_size, expert_size),
            nn.ReLU(),
            nn.Linear(expert_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        batch_size, seq_len, hidden_size = x.shape
        x_flat = x.view(-1, hidden_size)                      # (batch*seq, hidden)

        # Compute router scores
        router_logits = self.router(x_flat)                   # (batch*seq, num_experts)

        # Select top-k experts and normalize their scores
        router_weights, selected_experts = torch.topk(router_logits, self.top_k, dim=-1)
        router_weights = F.softmax(router_weights, dim=-1)

        # Initialize output
        output = torch.zeros_like(x_flat)

        # Process each token through its selected expert(s)
        for i in range(self.top_k):
            expert_idx = selected_experts[:, i]
            expert_weights = router_weights[:, i].unsqueeze(-1)
            for expert_id in range(self.num_experts):
                mask = (expert_idx == expert_id)
                if mask.any():
                    expert_output = self.experts[expert_id](x_flat[mask])
                    output[mask] += expert_weights[mask] * expert_output

        # Add the shared expert output (always active)
        output += self.shared_expert(x_flat)
        return output.view(batch_size, seq_len, hidden_size)
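A quick usage sketch with small, made-up sizes (in the real model the experts are full SwiGLU FFNs and routing is trained jointly with the balancing objectives discussed above):
import torch

# Toy sizes for illustration only
moe = MixtureOfExperts(hidden_size=64, num_experts=8, expert_size=256, top_k=1)
x = torch.randn(2, 10, 64)      # (batch, seq, hidden)
print(moe(x).shape)              # torch.Size([2, 10, 64])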
This MoE architecture is a key innovation that enables Llama 4 to achieve state-of-the-art performance while maintaining practical inference costs.

MetaCLIP

MetaCLIP is a data-centric approach to training vision-language models that improves upon CLIP by curating a higher-quality dataset through metadata-based filtering. Instead of using raw image-text pairs from the web, MetaCLIP applies algorithmic curation based on metadata signals (e.g., image captions, alt-text quality) to balance the dataset across concepts.
Original CLIP models showed impressive zero-shot capabilities, but their performance was constrained by several issues:
Noisy web-scale image–text pairs
Weak alignment between images and captions
Over-representation of shallow or templated text
Unclear scaling behavior with respect to data size
MetaCLIP addresses the following question:
How far can CLIP go if the data is carefully curated and scaled in a principled way?
MetaCLIP’s main contribution lies in how the training data is constructed.
Key principles include:
Removing image–text pairs with weak semantic alignment
Filtering out extremely short, generic, or non-descriptive captions
Prioritizing natural language descriptions over keyword-style tags
Maintaining diversity while avoiding dominance by frequent categories
The study shows that smaller but cleaner datasets can outperform much larger noisy datasets, especially in zero-shot settings.
This results in better alignment between visual and textual representations, leading to improved zero-shot classification and retrieval performance while using fewer training samples than traditional CLIP.
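As a rough sketch of the balancing idea only (the actual MetaCLIP pipeline matches captions against a large metadata vocabulary and uses per-entry sampling caps; the function and names below are an illustrative simplification, not the published algorithm):
import random
from collections import defaultdict

def balance_pairs(pairs, max_per_concept=20_000, seed=0):
    # pairs: list of (image_id, caption, matched_concepts) tuples; matching captions
    # against the metadata vocabulary is assumed to have happened upstream.
    random.seed(seed)
    by_concept = defaultdict(list)
    for idx, (_, _, concepts) in enumerate(pairs):
        for concept in concepts:
            by_concept[concept].append(idx)

    kept = set()
    for concept, indices in by_concept.items():
        # Subsample head (frequent) concepts; keep tail concepts in full,
        # which flattens the concept distribution across the dataset.
        if len(indices) > max_per_concept:
            indices = random.sample(indices, max_per_concept)
        kept.update(indices)

    return [pairs[i] for i in sorted(kept)]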

SwiGLU

SwiGLU (Swish-Gated Linear Unit) is a feed-forward network (FFN) activation variant used in modern Transformer architectures such as LLaMA. It belongs to the family of Gated Linear Units (GLU) and was introduced to improve the expressiveness and training behavior of Transformer FFNs without increasing computational cost.
SwiGLU replaces the sigmoid gate in GLU with the Swish (SiLU) activation:
\[
\text{SwiGLU}(x) = \text{Swish}(xW) \odot (xV)
\]
where the Swish (SiLU) activation function is defined as
\[
\text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
\]
or equivalently,
\[
\text{SiLU}(x) = x \cdot \text{sigmoid}(x)
\]
Derivative of Swish Activation
Given:
\[
\text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
\]
The derivative is computed using the product rule:
\[
\frac{d}{dx}\text{Swish}(x) = \sigma(x) + x \cdot \sigma'(x)
\]
Since the derivative of the sigmoid function is
\[
\sigma'(x) = \sigma(x)(1 - \sigma(x)),
\]
we can substitute this to get:
\[
\frac{d}{dx}\text{Swish}(x) = \sigma(x) + x \cdot \sigma(x)(1 - \sigma(x))
\]
Simplifying:
\[
\frac{d}{dx}\text{Swish}(x) = \sigma(x)\big(1 + x(1 - \sigma(x))\big)
\]
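A quick autograd check confirms this closed form matches PyTorch's built-in SiLU gradient:
import torch
import torch.nn.functional as F

# Numerical sanity check of the derivative above using autograd
x = torch.linspace(-4.0, 4.0, steps=9, requires_grad=True)
F.silu(x).sum().backward()

xd = x.detach()
analytic = torch.sigmoid(xd) * (1 + xd * (1 - torch.sigmoid(xd)))
print(torch.allclose(x.grad, analytic, atol=1e-6))   # True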
In Transformer FFN form:
\[
\text{FFN}_{\text{SwiGLU}}(x) = W_2 \big( \text{Swish}(xW_1) \odot (xW_3) \big)
\]
Important characteristics:
Two parallel linear projections
One branch passes through Swish
Element-wise gating before output projection
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)   # gate branch (passed through Swish)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)   # output projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)   # value branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN_SwiGLU(x) = W2(Swish(x W1) * (x W3))
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
