Changyu Lee

A Review of Llama 4 Architecture

Published at
2025/12/23
Last edited time
2025/12/23 03:46
Created
2025/12/23 02:26
Section
LLM
Status
Done
Series
Tags
Paper
AI summary
Llama 4 introduces a mixture of experts (MoE) architecture that enhances inference efficiency and deployment flexibility, utilizing 17B active parameters from a total of 400B. It employs a vision encoder based on MetaCLIP for improved multimodal processing. Key innovations include Rotary Positional Embedding (RoPE) for better position encoding, Grouped Query Attention (GQA) for reduced memory usage, and RMSNorm for stable training. The MoE architecture allows selective activation of parameters, improving performance while maintaining lower costs and latency. Additionally, MetaCLIP enhances dataset quality for better alignment in vision-language tasks.
Keywords
llama
LLM
RoPE
RMSNorm
Language
ENG
This article explores the key architectural innovations that make Llama 4 a significant advancement in large language model design, focusing on its mixture of experts approach and core technical components.
Llama 4 represents a major architectural advancement in large language models through its mixture of experts (MoE) design, which enables efficient scaling to 400B total parameters while activating only 17B parameters per token. The architecture alternates between dense and MoE layers, with each MoE layer containing 128 routed experts plus a shared expert, allowing for significant reductions in inference costs and latency while maintaining high model quality.
Key architectural components include:
Rotary Positional Embedding (RoPE): Encodes position information by rotating query and key vectors, providing better relative position modeling, computational efficiency through sparse matrix operations, and natural long-term decay properties that stabilize training
Grouped Query Attention (GQA): Reduces memory and computational costs by sharing key and value projections across groups of query heads, dramatically reducing KV cache size while maintaining better quality than extreme sharing approaches
RMSNorm: Simplifies layer normalization by normalizing only by root mean square magnitude without mean subtraction, providing computational efficiency and improved gradient stability
MoE Architecture: Conditionally activates only a subset of expert networks per token based on learned routing decisions, enabling massive model capacity with practical inference costs and allowing experts to specialize for different input types
Vision Encoder (MetaCLIP): Provides multimodal capabilities through specialized training with a frozen Llama language model for optimal visual-textual integration
These innovations collectively enable Llama 4 to achieve state-of-the-art performance while maintaining practical deployment characteristics, including the ability to run on a single NVIDIA H100 DGX host for straightforward scenarios or scale to distributed configurations for maximum efficiency.

A Review of Llama 4

The new Llama 4 models represent a significant architectural advancement through their implementation of a mixture of experts (MoE) architecture. This design choice enables substantial improvements in both inference efficiency and deployment flexibility.
The Llama 4 Maverick models exemplify this approach, featuring 17B active parameters while maintaining 400B total parameters. The architecture employs an alternating pattern of dense and MoE layers to optimize inference performance. Within the MoE layers, the system utilizes 128 routed experts alongside a shared expert. During inference, each token is processed by both the shared expert and exactly one of the 128 routed experts, ensuring that only a subset of parameters are activated at any given time.
This selective activation strategy yields significant practical benefits. By reducing the number of active parameters during inference, the architecture substantially lowers both serving costs and latency. The efficiency gains are remarkable: Llama 4 Maverick can run on a single NVIDIA H100 DGX host for straightforward deployment scenarios, while also supporting distributed inference configurations for maximum computational efficiency in production environments.

Vision Encoder of Llama 4

The Llama 4 vision encoder also represents an advancement in multimodal processing capabilities. Built upon the MetaCLIP architecture, this encoder was specifically trained in conjunction with a frozen Llama language model to ensure optimal adaptation between visual and textual representations.
This specialized training approach enables more effective integration of visual information into the language model's processing pipeline, resulting in improved performance on tasks requiring joint understanding of images and text. By fine-tuning the vision encoder while keeping the language model frozen, the architecture maintains the strong linguistic capabilities of Llama while developing robust visual understanding.

Components of Llama 4

RoPE

Transformer models rely on self-attention, which is inherently position-agnostic. Without explicit positional information, tokens are treated as an unordered set. Traditional positional encodings address this by adding position-dependent vectors to token embeddings, but this approach has limitations:
Poor extrapolation to longer sequence lengths
Weak modeling of relative positions
Incompatibility with certain efficient attention variants
Rotary Positional Embedding (RoPE) was proposed to overcome these issues by encoding position information directly into the attention mechanism itself, rather than into the embeddings additively.
RoPE injects positional information by rotating the query and key vectors in a position-dependent manner before computing attention.
In the 2D case, a query or key vector at position \(m\) is rotated by the angle \(m\theta\):
\[
f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} W_{\{q,k\}}\, x_m
\]
More generally, for a \(d\)-dimensional head, the rotation is applied to each pair of dimensions with its own frequency \(\theta_i\), using a block-diagonal rotation matrix \(R^{d}_{\Theta,m}\):
\[
f_{\{q,k\}}(x_m, m) = R^{d}_{\Theta,m} W_{\{q,k\}}\, x_m
\]
The key insight: the attention score depends only on the relative position \((m - n)\),
\[
\langle f_q(x_m, m),\ f_k(x_n, n) \rangle = g(x_m, x_n, m - n).
\]
Benefits of RoPE
Computationally efficient realization of the rotary matrix multiplication
Since the rotation matrix \(R^{d}_{\Theta,m}\) is highly sparse and block-diagonal, the rotation can be implemented without explicitly constructing the matrix (see the code sketch after this list). Instead, the vector is split into pairs of dimensions, and each pair is rotated using simple element-wise operations involving sine and cosine functions. Concretely, the rotated vector can be expressed as a combination of:
the original vector scaled by \(\cos(m\theta_i)\), and
a version of the vector with swapped and sign-flipped components scaled by \(\sin(m\theta_i)\).
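As an illustration, here is a minimal sketch of this pairwise rotation in PyTorch. It follows the interleaved-pair convention of the original RoPE paper; production Llama implementations typically precompute and cache the cos/sin tables and may apply the rotation with a "rotate-half" layout instead, but the arithmetic is the same.
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq, num_heads, head_dim) query or key tensor; head_dim must be even.
    # positions: (seq,) tensor of token positions m.
    head_dim = x.shape[-1]
    # Pairwise frequencies theta_i = base^(-2i/d), matching the decay schedule discussed below.
    inv_freq = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = positions.float()[:, None] * inv_freq[None, :]      # (seq, head_dim // 2)
    cos = angles.cos()[None, :, None, :]                         # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]                          # split into 2D pairs
    # Rotate each pair: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

# Example: rotate a batch of query vectors before the attention product
q = torch.randn(1, 8, 4, 64)                  # (batch, seq, heads, head_dim)
q_rot = apply_rope(q, torch.arange(8))
print(q_rot.shape)                            # torch.Size([1, 8, 4, 64])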
Long-term decay of RoPE
RoPE also contributes to faster and more stable training convergence through its mathematically structured way of encoding relative positions. When queries and keys are grouped into pairs of dimensions, the RoPE-modified inner product can be interpreted as a complex number multiplication. This reformulation reveals that the attention score is a weighted sum of complex exponentials of the relative position \((m - n)\).
Using this representation, the inner product can be transformed (via Abel transformation) into a form that reveals a long-range decay property: as the distance between tokens increases, high-frequency components contribute less. By choosing the rotation frequencies \(\theta_i = 10000^{-2i/d}\), the average magnitude of these terms decreases smoothly with distance.
This has two important consequences:
Inductive bias aligned with language: nearby tokens naturally influence each other more than distant ones.
Optimization stability: attention scores produce less noise for long-range interactions, keeping gradients well-behaved during training.
Empirically, this structured decay leads to faster convergence and better generalization, especially on long-context tasks. Unlike ad-hoc relative position biases, RoPE achieves this as a direct consequence of its geometric formulation—without extra parameters or heuristics.
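As a rough numerical illustration of this decay (not the exact Abel-transformation bound from the RoPE paper), one can check how the average magnitude of the summed rotations tends to shrink with relative distance, assuming a head dimension of 128 and the base 10000 used above:
import torch

d = 128                                                                  # assumed head dimension
theta = 10000.0 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)     # theta_i = 10000^(-2i/d)
for dist in [1, 8, 64, 512, 4096]:
    angles = dist * theta
    # |sum_i exp(j * dist * theta_i)| / (d/2): tends to shrink as the distance grows
    magnitude = torch.sqrt(angles.cos().sum() ** 2 + angles.sin().sum() ** 2) / (d // 2)
    print(f"relative distance {dist:5d} -> mean magnitude {magnitude.item():.3f}")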

Grouped Query Attention

Grouped Query Attention (GQA) is an optimization technique that reduces memory and computational costs in multi-head attention by sharing key and value projections across multiple query heads.
In standard Multi-Head Attention (MHA), each attention head has its own independent query (Q), key (K), and value (V) projection matrices. For a model with h heads and hidden dimension d, this results in:
h separate Q projections
h separate K projections
h separate V projections
This creates significant memory overhead, especially during inference when the KV cache must store keys and values for all previous tokens.
How GQA Works
GQA introduces a middle ground between MHA and Multi-Query Attention (MQA):
Multi-Query Attention (MQA): All query heads share a single set of keys and values (extreme sharing)
Grouped Query Attention (GQA): Query heads are divided into groups, and each group shares one set of keys and values
Multi-Head Attention (MHA): Each query head has its own keys and values (no sharing)
For example, if you have 32 query heads and use GQA with 8 groups:
You have 32 query projections (one per head)
You have only 8 key projections and 8 value projections (one per group)
Each group of 4 query heads shares the same K and V
Mathematical Formulation
Given input x and G groups where each group contains h/G query heads:
For group g:
Compute shared keys and values: \(K_g = x W_g^K, \quad V_g = x W_g^V\)
For each query head \(i\) in group \(g\): \(Q_i = x W_i^Q\)
Compute attention: \(\text{Attention}_i(Q_i, K_g, V_g) = \text{softmax}\!\left(\frac{Q_i K_g^T}{\sqrt{d_k}}\right) V_g\)
Benefits of GQA
Reduced KV cache size: During autoregressive inference, the memory required to store past keys and values is reduced by a factor of h/G, where G is the number of groups
Faster inference: Smaller KV cache means less memory bandwidth consumption, leading to faster generation speeds
Better quality than MQA: GQA maintains better model quality compared to extreme sharing in MQA, while still providing significant efficiency gains
Scalability: Enables deployment of larger models on memory-constrained hardware
GQA in Llama 4
Llama 4 employs GQA to balance computational efficiency with model expressiveness. This design choice is particularly important for the MoE architecture, where efficient attention mechanisms help maintain reasonable inference costs despite the large total parameter count. The grouped structure allows Llama 4 to:
Support longer context windows without proportional memory increase
Achieve faster inference speeds for both single-GPU and distributed deployments
Maintain high model quality comparable to full MHA
Here is a simplified code implementation:
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.num_queries_per_kv = num_heads // num_kv_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape

        # Project to Q, K, V
        q = self.q_proj(x).view(batch, seq_len, self.num_heads, self.head_dim)
        k = self.k_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim)
        v = self.v_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim)

        # Repeat K and V so each group of query heads shares the same K/V head
        k = k.repeat_interleave(self.num_queries_per_kv, dim=2)
        v = v.repeat_interleave(self.num_queries_per_kv, dim=2)

        # Move the head dimension forward: (batch, num_heads, seq, head_dim)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))

        # Compute (unmasked) scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn = torch.softmax(scores, dim=-1)
        output = torch.matmul(attn, v)

        # Concatenate heads and project back to the hidden size
        output = output.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.o_proj(output)
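For instance, instantiating the module with a hypothetical 32-query-head / 8-KV-head configuration (the same grouping used in the example above, not Llama 4's published configuration) shows the memory saving directly:
import torch

# Hypothetical sizes for illustration only
attn = GroupedQueryAttention(hidden_size=4096, num_heads=32, num_kv_heads=8)
x = torch.randn(2, 16, 4096)            # (batch, seq, hidden)
print(attn(x).shape)                     # torch.Size([2, 16, 4096])

# K/V projections emit 8 * 128 = 1024 channels instead of 32 * 128 = 4096,
# so the KV cache is 4x smaller than with full multi-head attention.
print(attn.k_proj.out_features, attn.q_proj.out_features)   # 1024 4096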

RMSNorm

Layer Normalization (LayerNorm) stabilizes training by normalizing activations. It centers the vector (subtracts the mean) and scales it (divides by the standard deviation). RMSNorm is simpler: it skips mean subtraction and only normalizes by the root mean square (RMS) magnitude.
This preserves directional information while controlling scale—often sufficient to keep deep Transformers stable.
Given an input activation vector \(a \in \mathbb{R}^n\), RMSNorm:
computes a single scalar RMS(a) measuring the vector's overall magnitude (average energy across dimensions),
rescales the vector by this value to control its magnitude, and
applies a learnable per-dimension scale parameter \(g\) (sometimes called the weight).
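In formula form, following the definition in the RMSNorm paper:
\[
\mathrm{RMS}(a) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^2},
\qquad
\bar{a}_i = \frac{a_i}{\mathrm{RMS}(a)}\, g_i
\]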
Key takeaway:
Direction stays the same—only the length is normalized (up to learned rescaling).

Backpropagation intuition (why gradients remain stable)

The normalization term is a single scalar computed from the entire vector, so each dimension's gradient is not fully independent. Changing one coordinate changes RMS(a), affecting all coordinates.
However, RMSNorm has a simpler dependency structure than LayerNorm because it avoids mean-centering. In practice, this often results in:
fewer coupled terms in the Jacobian,
more predictable gradient flow,
improved numerical stability in very deep networks.
The combination of RMSNorm's computational efficiency and gradient stability makes it particularly well-suited for the large-scale Llama 4 architecture, where even small improvements in training dynamics and inference speed compound significantly across billions of parameters.
Here is the code snippet:
import torch
import torch.nn as nn

class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x

MoE Architecture

Mixture of Experts (MoE) is a neural network architecture that improves efficiency by conditionally activating only a subset of model parameters for each input. Instead of routing every token through all parameters, MoE dynamically selects which "expert" networks to use.

Core Concept

A standard Transformer processes each token through the same feedforward network (FFN). MoE replaces this single FFN with:
Multiple expert networks: A collection of specialized FFNs (e.g., 128 experts)
A gating/router network: Decides which experts process each token
Sparse activation: Only a small subset of experts (e.g., top-1 or top-2) are activated per token
How MoE Works
For each input token x:
1. Router computes scores: A lightweight network calculates a score for each expert: \(s_i = \text{Router}(x) \cdot e_i\)
2. Top-k selection: Select the top-k experts with the highest scores (typically k=1 or k=2)
3. Weighted combination: The token is processed by the selected experts, and their outputs are combined using softmax-normalized router scores:
\[
y = \sum_{i \in \text{Top-k}} \frac{\exp(s_i)}{\sum_{j \in \text{Top-k}} \exp(s_j)} \cdot \text{Expert}_i(x)
\]
Benefits of MoE
Increased model capacity without proportional compute cost: You can have 400B total parameters but only activate 17B per token
Faster inference: Processing fewer parameters per token reduces latency and memory bandwidth
Specialization: Different experts can learn to handle different types of inputs (e.g., code vs. natural language, different domains)
Better scaling: MoE enables training much larger models that would be impractical with dense architectures
Challenges and Solutions
Load balancing: Some experts might be used much more than others, leading to inefficient utilization. Solution: add an auxiliary loss term that encourages balanced expert usage (a sketch follows this list).
Training instability: Router decisions can be noisy early in training. Solution: use techniques like router z-loss, dropout on router logits, or expert capacity limits.
Communication overhead in distributed training: Experts may be on different GPUs, requiring inter-GPU communication. Solution: expert parallelism strategies, local expert placement, or hybrid dense-MoE layers.
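As a sketch of the load-balancing idea, here is one common formulation of the auxiliary loss (the Switch-Transformer-style product of dispatch fraction and mean router probability; this is an illustrative assumption, not Llama 4's published objective). It consumes the router logits and top-1 expert assignments, which correspond to the `router_logits` and `selected_experts` tensors in the implementation further below:
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        selected_experts: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts); selected_experts: (num_tokens, top_k)
    probs = F.softmax(router_logits, dim=-1)
    # Fraction of tokens dispatched to each expert (based on the top-1 assignment)
    dispatch = F.one_hot(selected_experts[:, 0], num_experts).float().mean(dim=0)
    # Mean router probability assigned to each expert
    importance = probs.mean(dim=0)
    # Minimized when both distributions are uniform, i.e. experts are used evenly
    return num_experts * torch.sum(dispatch * importance)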
MoE Architecture in Llama 4
As described above, Llama 4 Maverick alternates dense and MoE layers; each MoE layer contains 128 routed experts plus a shared expert, and every token is processed by the shared expert and exactly one routed expert. Here's a simplified MoE implementation that mirrors this structure, with a configurable top-k router and an always-active shared expert:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, expert_size: int, top_k: int = 1):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Router network
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

        # Expert networks (simplified here as small two-layer MLPs)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, expert_size),
                nn.ReLU(),
                nn.Linear(expert_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

        # Shared expert (always active, as in Llama 4's MoE layers)
        self.shared_expert = nn.Sequential(
            nn.Linear(hidden_size, expert_size),
            nn.ReLU(),
            nn.Linear(expert_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        batch_size, seq_len, hidden_size = x.shape
        x_flat = x.view(-1, hidden_size)                      # (batch*seq, hidden)

        # Compute router scores
        router_logits = self.router(x_flat)                   # (batch*seq, num_experts)

        # Select top-k experts and normalize their scores
        router_weights, selected_experts = torch.topk(router_logits, self.top_k, dim=-1)
        router_weights = F.softmax(router_weights, dim=-1)

        # Initialize output
        output = torch.zeros_like(x_flat)

        # Process each token through its selected expert(s)
        for i in range(self.top_k):
            expert_idx = selected_experts[:, i]
            expert_weights = router_weights[:, i].unsqueeze(-1)
            for expert_id in range(self.num_experts):
                mask = (expert_idx == expert_id)
                if mask.any():
                    expert_output = self.experts[expert_id](x_flat[mask])
                    output[mask] += expert_weights[mask] * expert_output

        # Add the shared expert output (always active)
        output += self.shared_expert(x_flat)
        return output.view(batch_size, seq_len, hidden_size)
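A quick usage sketch with small, made-up sizes (in the real model the experts are full SwiGLU FFNs and routing is trained jointly with the balancing objectives discussed above):
import torch

# Toy sizes for illustration only
moe = MixtureOfExperts(hidden_size=64, num_experts=8, expert_size=256, top_k=1)
x = torch.randn(2, 10, 64)      # (batch, seq, hidden)
print(moe(x).shape)              # torch.Size([2, 10, 64])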
This MoE architecture is a key innovation that enables Llama 4 to achieve state-of-the-art performance while maintaining practical inference costs.

MetaCLIP

MetaCLIP is a data-centric approach to training vision-language models that improves upon CLIP by curating a higher-quality dataset through metadata-based filtering. Instead of using raw image-text pairs from the web, MetaCLIP applies algorithmic curation based on metadata signals (e.g., image captions, alt-text quality) to balance the dataset across concepts.
Original CLIP models showed impressive zero-shot capabilities, but their performance was constrained by several issues:
Noisy web-scale image–text pairs
Weak alignment between images and captions
Over-representation of shallow or templated text
Unclear scaling behavior with respect to data size
MetaCLIP addresses the following question:
How far can CLIP go if the data is carefully curated and scaled in a principled way?
MetaCLIP’s main contribution lies in how the training data is constructed.
Key principles include:
Removing image–text pairs with weak semantic alignment
Filtering out extremely short, generic, or non-descriptive captions
Prioritizing natural language descriptions over keyword-style tags
Maintaining diversity while avoiding dominance by frequent categories
The study shows that smaller but cleaner datasets can outperform much larger noisy datasets, especially in zero-shot settings.
This results in better alignment between visual and textual representations, leading to improved zero-shot classification and retrieval performance while using fewer training samples than traditional CLIP.
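As a rough sketch of the balancing idea only (the actual MetaCLIP pipeline matches captions against a large metadata vocabulary and uses per-entry sampling caps; the function and names below are an illustrative simplification, not the published algorithm):
import random
from collections import defaultdict

def balance_pairs(pairs, max_per_concept=20_000, seed=0):
    # pairs: list of (image_id, caption, matched_concepts) tuples; matching captions
    # against the metadata vocabulary is assumed to have happened upstream.
    random.seed(seed)
    by_concept = defaultdict(list)
    for idx, (_, _, concepts) in enumerate(pairs):
        for concept in concepts:
            by_concept[concept].append(idx)

    kept = set()
    for concept, indices in by_concept.items():
        # Subsample head (frequent) concepts; keep tail concepts in full,
        # which flattens the concept distribution across the dataset.
        if len(indices) > max_per_concept:
            indices = random.sample(indices, max_per_concept)
        kept.update(indices)

    return [pairs[i] for i in sorted(kept)]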

SwiGLU

SwiGLU (Swish-Gated Linear Unit) is a feed-forward network (FFN) activation variant used in modern Transformer architectures such as LLaMA. It belongs to the family of Gated Linear Units (GLU) and was introduced to improve the expressiveness and training behavior of Transformer FFNs without increasing computational cost.
SwiGLU replaces the sigmoid gate in GLU with the Swish (SiLU) activation:
\[
\text{SwiGLU}(x) = \text{Swish}(xW) \odot (xV)
\]
where the Swish (SiLU) activation function is defined as
\[
\text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
\]
or equivalently,
\[
\text{SiLU}(x) = x \cdot \text{sigmoid}(x)
\]
Derivative of Swish Activation
Given:
\[
\text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
\]
The derivative is computed using the product rule:
\[
\frac{d}{dx}\text{Swish}(x) = \sigma(x) + x \cdot \sigma'(x)
\]
Since the derivative of the sigmoid function is
\[
\sigma'(x) = \sigma(x)(1 - \sigma(x)),
\]
we can substitute this to get:
\[
\frac{d}{dx}\text{Swish}(x) = \sigma(x) + x \cdot \sigma(x)(1 - \sigma(x))
\]
Simplifying:
\[
\frac{d}{dx}\text{Swish}(x) = \sigma(x)\big(1 + x(1 - \sigma(x))\big)
\]
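A quick autograd check confirms this closed form matches PyTorch's built-in SiLU gradient:
import torch
import torch.nn.functional as F

# Numerical sanity check of the derivative above using autograd
x = torch.linspace(-4.0, 4.0, steps=9, requires_grad=True)
F.silu(x).sum().backward()

xd = x.detach()
analytic = torch.sigmoid(xd) * (1 + xd * (1 - torch.sigmoid(xd)))
print(torch.allclose(x.grad, analytic, atol=1e-6))   # True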
In Transformer FFN form:
\[
\text{FFN}_{\text{SwiGLU}}(x) = W_2 \big( \text{Swish}(xW_1) \odot (xW_3) \big)
\]
Important characteristics:
Two parallel linear projections
One branch passes through Swish
Element-wise gating before output projection
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)   # gate branch (passed through Swish)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)   # output projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)   # value branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN_SwiGLU(x) = W2(Swish(x W1) * (x W3))
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
