• Vision Transformer (ViT) performs image recognition with a pure Transformer architecture, without CNNs, by processing images as patch sequences.
• Images are divided into fixed-size patches; each patch is linearly embedded, and the resulting sequence is processed by a standard Transformer encoder.
• Visual Transformer (VT) operates in a semantic token space and selectively attends to different parts of the image based on context.
• DETR transforms object detection into a set prediction problem (bounding boxes and labels) solved in a single forward pass.
• DETR consists of a CNN backbone, a Transformer encoder-decoder, and prediction heads, using learnable object queries.
• A set-based loss with bipartite matching enforces one-to-one assignment between predictions and ground-truth objects, eliminating duplicate detections.
• Validates that self-attention mechanisms can effectively function as independent layers in vision models, without the spatial inductive biases of convolutions.
Background
• CNN → Transformer. A key question emerges: can attention serve as a standalone primitive for vision models, rather than merely augmenting convolutions? This work verifies that self-attention can function as an effective standalone layer.
• The paper demonstrates that fully attentional vision models work well without spatial convolutions.
Vision Transformer (ViT)
This paper investigates whether CNN-specific inductive biases (locality and translation equivariance) are strictly necessary for image recognition, and explores whether a pure Transformer architecture can be applied directly to images at scale.
The authors propose Vision Transformer (ViT), a model that treats an image as a sequence of patches, analogous to tokens in NLP. Instead of using convolutions, the image is divided into fixed-size patches (e.g., 16×16), each patch is flattened and linearly projected into an embedding space, and the resulting sequence is processed by a standard Transformer encoder.
• Understanding images with a Transformer alone, without the inductive biases of CNNs.
import torch
from torch import nn
from torch.nn import Module, ModuleList

from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# helpers

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class FeedForward(Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class Attention(Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.norm = nn.LayerNorm(dim)
        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        x = self.norm(x)

        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        attn = self.attend(dots)
        attn = self.dropout(attn)

        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

class Transformer(Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.layers = ModuleList([])
        for _ in range(depth):
            self.layers.append(ModuleList([
                Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout),
                FeedForward(dim, mlp_dim, dropout = dropout)
            ]))

    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return self.norm(x)

class ViT(Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        num_cls_tokens = 1 if pool == 'cls' else 0

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
            nn.LayerNorm(dim),
        )

        self.cls_token = nn.Parameter(torch.randn(num_cls_tokens, dim))
        self.pos_embedding = nn.Parameter(torch.randn(num_patches + num_cls_tokens, dim))

        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()

        self.mlp_head = nn.Linear(dim, num_classes)

    def forward(self, img):
        batch = img.shape[0]

        x = self.to_patch_embedding(img)

        cls_tokens = repeat(self.cls_token, '... d -> b ... d', b = batch)
        x = torch.cat((cls_tokens, x), dim = 1)

        seq = x.shape[1]
        x = x + self.pos_embedding[:seq]
        x = self.dropout(x)

        x = self.transformer(x)

        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]

        x = self.to_latent(x)
        return self.mlp_head(x)
Code Explanation
Imports and Setup
import torch
from torch import nn
from torch.nn import Module, ModuleList
Import PyTorch core modules for building neural networks.
from einops import rearrange, repeat
from einops.layers.torch import Rearrange
Import einops utilities for elegant tensor manipulation operations.
Helper Function
def pair(t):
    return t if isinstance(t, tuple) else (t, t)
Converts a single value into a tuple pair, or returns the tuple as-is. Used for handling image/patch sizes that can be specified as either a single int or (height, width) tuple.
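As a quick illustrative check (the values below are arbitrary):
pair(224)         # (224, 224)
pair((224, 192))  # (224, 192)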
FeedForward Class
class FeedForward(Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
Defines the feedforward network that follows attention in each transformer block. Takes input dimension, hidden dimension, and dropout rate.
self.net = nn.Sequential(
    nn.LayerNorm(dim),
    nn.Linear(dim, hidden_dim),
    nn.GELU(),
    nn.Dropout(dropout),
    nn.Linear(hidden_dim, dim),
    nn.Dropout(dropout)
)
Creates a two-layer MLP with GELU activation: normalize → expand to hidden_dim → activate → dropout → project back to dim → dropout.
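A quick shape check (sizes chosen only for illustration) showing that the MLP preserves the token dimension:
ff = FeedForward(dim=64, hidden_dim=256)
tokens = torch.randn(2, 10, 64)   # (batch, seq_len, dim)
print(ff(tokens).shape)           # torch.Size([2, 10, 64])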
Attention Class
class Attention(Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
Multi-head self-attention mechanism. Parameters: embedding dimension, number of attention heads, dimension per head, and dropout rate.
inner_dim = dim_head * heads
project_out = not (heads == 1 and dim_head == dim)
Calculate total dimension across all heads. Determine if output projection is needed (skip if single head with matching dimensions).
self.heads = heads
self.scale = dim_head ** -0.5
Store number of heads and compute scaling factor (1/√d_k) for dot-product attention.
self.norm = nn.LayerNorm(dim)
self.attend = nn.Softmax(dim = -1)
self.dropout = nn.Dropout(dropout)
Initialize layer normalization, softmax for attention weights, and dropout.
self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
Single linear layer that projects input to Query, Key, and Value matrices simultaneously (3x the inner dimension).
self.to_out = nn.Sequential(
    nn.Linear(inner_dim, dim),
    nn.Dropout(dropout)
) if project_out else nn.Identity()
Output projection layer to map concatenated heads back to original dimension, or identity if projection not needed.
def forward(self, x):
    x = self.norm(x)
Apply layer normalization to input.
qkv = self.to_qkv(x).chunk(3, dim = -1)
q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
Generate Q, K, V by splitting the projection output into 3 chunks, then reshape each from (batch, seq_len, heads*dim_head) to (batch, heads, seq_len, dim_head).
dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
Compute scaled dot-product attention scores: QK^T / √d_k.
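In standard notation, the per-head computation is Attention(Q, K, V) = softmax(QK^T / √d_k) · V with d_k = dim_head; the softmax runs over the key axis, i.e., the last dimension of dots.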
attn = self.attend(dots)
attn = self.dropout(attn)
Apply softmax to get attention weights, then apply dropout.
out = torch.matmul(attn, v)
out = rearrange(out, 'b h n d -> b n (h d)')
return self.to_out(out)
Multiply attention weights by values, reshape from (batch, heads, seq_len, dim_head) back to (batch, seq_len, heads*dim_head), and apply output projection.
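A shape trace for an arbitrary configuration (all sizes here are illustrative, not from the paper):
attn = Attention(dim=128, heads=8, dim_head=64)
x = torch.randn(2, 65, 128)   # (batch, tokens, dim)
# to_qkv: (2, 65, 128) -> (2, 65, 3 * 512); q, k, v each: (2, 8, 65, 64)
# dots and attn: (2, 8, 65, 65); out: (2, 8, 65, 64) -> (2, 65, 512) -> to_out -> (2, 65, 128)
print(attn(x).shape)          # torch.Size([2, 65, 128])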
Transformer Class
class Transformer(Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
Stacks multiple transformer blocks. Parameters: embedding dim, number of layers (depth), attention heads, dim per head, MLP hidden dim, dropout.
self.norm = nn.LayerNorm(dim)
self.layers = ModuleList([])
Final layer norm and list to store transformer blocks.
for _ in range(depth):
    self.layers.append(ModuleList([
        Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout),
        FeedForward(dim, mlp_dim, dropout = dropout)
    ]))
Create 'depth' number of transformer blocks, each containing an attention layer and feedforward layer.
def forward(self, x):
    for attn, ff in self.layers:
        x = attn(x) + x
        x = ff(x) + x
    return self.norm(x)
Process input through all transformer blocks with residual connections (x = sublayer(x) + x), then apply final normalization.
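In equation form, each block applies pre-norm residual updates: x = x + Attention(x), then x = x + FeedForward(x), where both sublayers normalize their input internally. A quick usage sketch with illustrative sizes:
enc = Transformer(dim=128, depth=6, heads=8, dim_head=64, mlp_dim=256)
print(enc(torch.randn(2, 65, 128)).shape)   # torch.Size([2, 65, 128])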
ViT Class
class ViT(Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
Main Vision Transformer class. Key parameters: image size, patch size, number of output classes, embedding dimension, transformer depth, attention heads, MLP dimension, pooling type ('cls' or 'mean'), input channels, dimension per head, dropout rates.
image_height, image_width = pair(image_size)
patch_height, patch_width = pair(patch_size)
Convert image and patch sizes to (height, width) tuples.
assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
Verify that image dimensions are evenly divisible by patch dimensions.
num_patches = (image_height // patch_height) * (image_width // patch_width)
patch_dim = channels * patch_height * patch_width
Calculate total number of patches and the flattened dimension of each patch.
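A worked example, assuming a hypothetical 224×224 RGB input with 16×16 patches:
# num_patches = (224 // 16) * (224 // 16) = 14 * 14 = 196
# patch_dim   = 3 * 16 * 16 = 768   (length of each flattened patch before projection to dim)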
assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
num_cls_tokens = 1 if pool == 'cls' else 0
Validate pooling type and determine if a CLS token should be added (1 if using cls pooling, 0 if using mean pooling).
self.to_patch_embedding = nn.Sequential(
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
    nn.LayerNorm(patch_dim),
    nn.Linear(patch_dim, dim),
    nn.LayerNorm(dim),
)
Patch embedding pipeline: rearrange image into patches → normalize → linear projection to embedding dimension → normalize again.
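Continuing the same hypothetical 224×224 input, with dim set arbitrarily to 1024, the shapes evolve as:
# (B, 3, 224, 224) --Rearrange--> (B, 196, 768) --LayerNorm--> (B, 196, 768)
#                  --Linear-->    (B, 196, 1024) --LayerNorm--> (B, 196, 1024)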
self.cls_token = nn.Parameter(torch.randn(num_cls_tokens, dim))
self.pos_embedding = nn.Parameter(torch.randn(num_patches + num_cls_tokens, dim))
Learnable CLS token and positional embeddings for all tokens (patches + optional CLS token).
self.dropout = nn.Dropout(emb_dropout)
self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)
Dropout for embeddings and instantiate the transformer encoder.
self.pool = pool
self.to_latent = nn.Identity()
self.mlp_head = nn.Linear(dim, num_classes)
Store pooling type, identity layer for latent representation, and final classification head.
def forward(self, img):
    batch = img.shape[0]
    x = self.to_patch_embedding(img)
Get batch size and convert image to patch embeddings.
cls_tokens = repeat(self.cls_token, '... d -> b ... d', b = batch)
x = torch.cat((cls_tokens, x), dim = 1)
Repeat CLS token for each sample in batch and prepend to patch embeddings.
seq = x.shape[1]
x = x + self.pos_embedding[:seq]
x = self.dropout(x)
Get sequence length, add positional embeddings to tokens, and apply dropout.
x = self.transformer(x)
Process through transformer encoder.
x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
Pool the transformer output: either take mean across all tokens or extract the CLS token (first token).
x = self.to_latent(x)
return self.mlp_head(x)
Pass through latent layer (identity) and final classification head to produce class logits.
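Putting it together, a minimal usage sketch; the names and hyperparameters below are chosen for illustration (roughly ViT-Base-sized), not taken from the original source:
model = ViT(image_size=224, patch_size=16, num_classes=1000,
            dim=768, depth=12, heads=12, mlp_dim=3072)
imgs = torch.randn(4, 3, 224, 224)   # dummy batch of 4 RGB images
logits = model(imgs)                 # shape (4, 1000): one score per class
print(logits.shape)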
• ViT results: the attention visualization figure shows how a Vision Transformer (ViT) attends to an image during classification. The left shows the input image, and the right shows attention maps indicating which image patches the class token focuses on. Brighter regions are more important for the model's prediction, typically highlighting the main object while suppressing the background.
Visual Transformer (VT)
Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
1) Not all pixels are created equal.
2) Not all images have all concepts.
3) Convolutions struggle to relate spatially distant concepts.
In the reference implementation, the tokenization step looks like this:
# Tokenization
wa = rearrange(self.token_wA, 'b h w -> b w h')  # transpose
A = torch.einsum('bij,bjk->bik', x, wa)
A = rearrange(A, 'b h w -> b w h')  # transpose
A = A.softmax(dim=-1)
VV = torch.einsum('bij,bjk->bik', x, self.token_wV)
T = torch.einsum('bij,bjk->bik', A, VV)
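The snippet does not show how x, token_wA, and token_wV are shaped, so the following shape sketch is an assumption based on common VT implementations: x holds (B, N, C) pixel/patch features, token_wA is a learnable (B, L, C) token-attention weight, and token_wV is a learnable (B, C, C) value projection.
# wa: (B, C, L);  A = x @ wa: (B, N, L) -> transpose -> (B, L, N), softmax over the N spatial positions
# VV = x @ token_wV: (B, N, C)
# T  = A @ VV: (B, L, C)  -> L visual tokens, each a spatially weighted average of the features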
DETR
DETR treats object detection as predicting a set of objects (bounding boxes and labels) in a single forward pass. The core ideas are:
• A Transformer encoder–decoder architecture to model global relationships in the image.
• A set-based loss with bipartite (Hungarian) matching that enforces one-to-one assignment between predictions and ground-truth objects, eliminating duplicate detections by design (a matching sketch follows this list).
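A minimal sketch of the matching step, using scipy's linear_sum_assignment (an implementation of the Hungarian algorithm). This is not the full DETR loss: the actual matcher also adds a generalized IoU cost, and the final loss is then computed on the matched pairs.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (N, num_classes + 1), pred_boxes: (N, 4) normalized (cx, cy, w, h)
    # gt_labels: LongTensor (M,), gt_boxes: (M, 4)
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                      # (N, M): higher prob of the true class -> lower cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)    # (N, M): pairwise L1 box distance
    cost = cost_class + cost_bbox                         # DETR additionally includes a GIoU term
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))  # one-to-one (prediction, ground-truth) pairs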
DETR consists of three main components:
1. CNN Backbone
A standard CNN (e.g., ResNet-50/101) extracts a low-resolution feature map from the input image.
2. Transformer Encoder–Decoder
• The encoder applies global self-attention over flattened image features with positional encodings, enabling global scene reasoning.
• The decoder takes a fixed number of learned object queries (e.g., N = 100) and attends to the encoded image features to produce object-level embeddings in parallel.
3. Prediction Heads
• A shared feed-forward network (FFN) predicts a class label (including a special “no object” class) and normalized bounding box coordinates for each query (a sketch of such heads follows this list).
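A minimal sketch of what such prediction heads could look like; the dimensions and class count below are illustrative, and in DETR the box head is a small FFN whose sigmoid output keeps coordinates normalized to [0, 1].
import torch.nn as nn

class PredictionHeads(nn.Module):
    def __init__(self, d_model=256, num_classes=91):
        super().__init__()
        self.class_embed = nn.Linear(d_model, num_classes + 1)   # +1 for the "no object" class
        self.bbox_embed = nn.Sequential(                          # small FFN -> (cx, cy, w, h)
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),                  # normalized box coordinates
        )

    def forward(self, hs):                                        # hs: (batch, num_queries, d_model)
        return self.class_embed(hs), self.bbox_embed(hs)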
Key findings:
• Transformer encoder is crucial for global reasoning and instance separation.
• Multiple decoder layers progressively reduce duplicate predictions.
• Object queries specialize in different spatial regions and box sizes.
• NMS is unnecessary and can even harm performance in later decoder layers.