• Vision Transformer (ViT) performs image recognition with a pure Transformer architecture, without CNNs, by processing images as patch sequences.
• Images are divided into fixed-size patches; each patch is linearly embedded, and the resulting sequence is processed by a standard Transformer encoder.
• Visual Transformer (VT) operates in a semantic token space and selectively attends to different parts of the image based on context.
• DETR transforms object detection into a set prediction problem (bounding boxes and labels) solved in a single forward pass.
• DETR consists of a CNN backbone, a Transformer encoder-decoder, and prediction heads, using learnable object queries.
• A set-based loss with bipartite matching enforces one-to-one assignment between predictions and ground-truth objects, eliminating duplicate detections.
• Validates that self-attention mechanisms can effectively function as independent layers in vision models, without the spatial inductive biases of convolutions.
Background
• CNN → Transformer. A key question emerges: can attention serve as a standalone primitive for vision models, rather than merely augmenting convolutions? This work verifies that self-attention can function as an effective standalone layer.
• The paper demonstrates that fully attentional vision models work well without spatial convolutions.
Vision Transformer (ViT)
This paper investigates whether CNN-specific inductive biases (locality and translation equivariance) are strictly necessary for image recognition, and explores whether a pure Transformer architecture can be applied directly to images at scale.
The authors propose Vision Transformer (ViT), a model that treats an image as a sequence of patches, analogous to tokens in NLP. Instead of using convolutions, the image is divided into fixed-size patches (e.g., 16×16), each patch is flattened and linearly projected into an embedding space, and the resulting sequence is processed by a standard Transformer encoder.
• Understanding images with a Transformer alone, without the inductive biases of CNNs.
import torch
from torch import nn
from torch.nn import Module, ModuleList

from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# helpers

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class FeedForward(Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class Attention(Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.norm = nn.LayerNorm(dim)
        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        x = self.norm(x)

        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        attn = self.attend(dots)
        attn = self.dropout(attn)

        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

class Transformer(Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.layers = ModuleList([])
        for _ in range(depth):
            self.layers.append(ModuleList([
                Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout),
                FeedForward(dim, mlp_dim, dropout = dropout)
            ]))

    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return self.norm(x)

class ViT(Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        num_cls_tokens = 1 if pool == 'cls' else 0

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
            nn.LayerNorm(dim),
        )

        self.cls_token = nn.Parameter(torch.randn(num_cls_tokens, dim))
        self.pos_embedding = nn.Parameter(torch.randn(num_patches + num_cls_tokens, dim))

        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()

        self.mlp_head = nn.Linear(dim, num_classes)

    def forward(self, img):
        batch = img.shape[0]

        x = self.to_patch_embedding(img)

        cls_tokens = repeat(self.cls_token, '... d -> b ... d', b = batch)
        x = torch.cat((cls_tokens, x), dim = 1)

        seq = x.shape[1]
        x = x + self.pos_embedding[:seq]
        x = self.dropout(x)

        x = self.transformer(x)

        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]

        x = self.to_latent(x)
        return self.mlp_head(x)
Code Explanation
Imports and Setup
import torch
from torch import nn
from torch.nn import Module, ModuleList
Import PyTorch core modules for building neural networks.
from einops import rearrange, repeat
from einops.layers.torch import Rearrange
Import einops utilities for elegant tensor manipulation operations.
Helper Function
def pair(t):
    return t if isinstance(t, tuple) else (t, t)
Converts a single value into a tuple pair, or returns the tuple as-is. Used for handling image/patch sizes that can be specified as either a single int or (height, width) tuple.
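As a quick illustrative check (the values below are arbitrary):
pair(224)         # (224, 224)
pair((224, 192))  # (224, 192)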
FeedForward Class
class FeedForward(Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
Defines the feedforward network that follows attention in each transformer block. Takes input dimension, hidden dimension, and dropout rate.
self.net = nn.Sequential(
    nn.LayerNorm(dim),
    nn.Linear(dim, hidden_dim),
    nn.GELU(),
    nn.Dropout(dropout),
    nn.Linear(hidden_dim, dim),
    nn.Dropout(dropout)
)
Creates a two-layer MLP with GELU activation: normalize → expand to hidden_dim → activate → dropout → project back to dim → dropout.
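A quick shape check (sizes chosen only for illustration) showing that the MLP preserves the token dimension:
ff = FeedForward(dim=64, hidden_dim=256)
tokens = torch.randn(2, 10, 64)   # (batch, seq_len, dim)
print(ff(tokens).shape)           # torch.Size([2, 10, 64])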
Attention Class
class Attention(Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
Multi-head self-attention mechanism. Parameters: embedding dimension, number of attention heads, dimension per head, and dropout rate.
inner_dim = dim_head * heads
project_out = not (heads == 1 and dim_head == dim)
Calculate total dimension across all heads. Determine if output projection is needed (skip if single head with matching dimensions).
self.heads = heads
self.scale = dim_head ** -0.5
Store number of heads and compute scaling factor (1/√d_k) for dot-product attention.
self.norm = nn.LayerNorm(dim)
self.attend = nn.Softmax(dim = -1)
self.dropout = nn.Dropout(dropout)
Initialize layer normalization, softmax for attention weights, and dropout.
self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
Single linear layer that projects input to Query, Key, and Value matrices simultaneously (3x the inner dimension).
self.to_out = nn.Sequential(
    nn.Linear(inner_dim, dim),
    nn.Dropout(dropout)
) if project_out else nn.Identity()
Output projection layer to map concatenated heads back to original dimension, or identity if projection not needed.
def forward(self, x):
    x = self.norm(x)
Apply layer normalization to input.
qkv = self.to_qkv(x).chunk(3, dim = -1)
q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
Generate Q, K, V by splitting the projection output into 3 chunks, then reshape each from (batch, seq_len, heads*dim_head) to (batch, heads, seq_len, dim_head).
dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
Compute scaled dot-product attention scores: QK^T / √d_k.
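In standard notation, the per-head computation is Attention(Q, K, V) = softmax(QK^T / √d_k) · V with d_k = dim_head; the softmax runs over the key axis, i.e., the last dimension of dots.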
attn = self.attend(dots)
attn = self.dropout(attn)
Apply softmax to get attention weights, then apply dropout.
out = torch.matmul(attn, v)
out = rearrange(out, 'b h n d -> b n (h d)')
return self.to_out(out)
Multiply attention weights by values, reshape from (batch, heads, seq_len, dim_head) back to (batch, seq_len, heads*dim_head), and apply output projection.
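A shape trace for an arbitrary configuration (all sizes here are illustrative, not from the paper):
attn = Attention(dim=128, heads=8, dim_head=64)
x = torch.randn(2, 65, 128)   # (batch, tokens, dim)
# to_qkv: (2, 65, 128) -> (2, 65, 3 * 512); q, k, v each: (2, 8, 65, 64)
# dots and attn: (2, 8, 65, 65); out: (2, 8, 65, 64) -> (2, 65, 512) -> to_out -> (2, 65, 128)
print(attn(x).shape)          # torch.Size([2, 65, 128])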
Transformer Class
class Transformer(Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
Stacks multiple transformer blocks. Parameters: embedding dim, number of layers (depth), attention heads, dim per head, MLP hidden dim, dropout.
self.norm = nn.LayerNorm(dim)
self.layers = ModuleList([])
Final layer norm and list to store transformer blocks.
for _ in range(depth):
    self.layers.append(ModuleList([
        Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout),
        FeedForward(dim, mlp_dim, dropout = dropout)
    ]))
Create 'depth' number of transformer blocks, each containing an attention layer and feedforward layer.
def forward(self, x):
    for attn, ff in self.layers:
        x = attn(x) + x
        x = ff(x) + x
    return self.norm(x)
Process input through all transformer blocks with residual connections (x = sublayer(x) + x), then apply final normalization.
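In equation form, each block applies pre-norm residual updates: x = x + Attention(x), then x = x + FeedForward(x), where both sublayers normalize their input internally. A quick usage sketch with illustrative sizes:
enc = Transformer(dim=128, depth=6, heads=8, dim_head=64, mlp_dim=256)
print(enc(torch.randn(2, 65, 128)).shape)   # torch.Size([2, 65, 128])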
ViT Class
class ViT(Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
Main Vision Transformer class. Key parameters: image size, patch size, number of output classes, embedding dimension, transformer depth, attention heads, MLP dimension, pooling type ('cls' or 'mean'), input channels, dimension per head, dropout rates.
image_height, image_width = pair(image_size)
patch_height, patch_width = pair(patch_size)
Convert image and patch sizes to (height, width) tuples.
assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
Verify that image dimensions are evenly divisible by patch dimensions.
num_patches = (image_height // patch_height) * (image_width // patch_width)
patch_dim = channels * patch_height * patch_width
Calculate total number of patches and the flattened dimension of each patch.
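A worked example, assuming a hypothetical 224×224 RGB input with 16×16 patches:
# num_patches = (224 // 16) * (224 // 16) = 14 * 14 = 196
# patch_dim   = 3 * 16 * 16 = 768   (length of each flattened patch before projection to dim)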
assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
num_cls_tokens = 1 if pool == 'cls' else 0
Validate pooling type and determine if a CLS token should be added (1 if using cls pooling, 0 if using mean pooling).
self.to_patch_embedding = nn.Sequential(
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
    nn.LayerNorm(patch_dim),
    nn.Linear(patch_dim, dim),
    nn.LayerNorm(dim),
)
Patch embedding pipeline: rearrange image into patches → normalize → linear projection to embedding dimension → normalize again.
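Continuing the same hypothetical 224×224 input, with dim set arbitrarily to 1024, the shapes evolve as:
# (B, 3, 224, 224) --Rearrange--> (B, 196, 768) --LayerNorm--> (B, 196, 768)
#                  --Linear-->    (B, 196, 1024) --LayerNorm--> (B, 196, 1024)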
self.cls_token = nn.Parameter(torch.randn(num_cls_tokens, dim))
self.pos_embedding = nn.Parameter(torch.randn(num_patches + num_cls_tokens, dim))
Learnable CLS token and positional embeddings for all tokens (patches + optional CLS token).
self.dropout = nn.Dropout(emb_dropout)
self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)
Dropout for embeddings and instantiate the transformer encoder.
self.pool = pool
self.to_latent = nn.Identity()
self.mlp_head = nn.Linear(dim, num_classes)
Store pooling type, identity layer for latent representation, and final classification head.
def forward(self, img):
    batch = img.shape[0]
    x = self.to_patch_embedding(img)
Get batch size and convert image to patch embeddings.
cls_tokens = repeat(self.cls_token, '... d -> b ... d', b = batch)
x = torch.cat((cls_tokens, x), dim = 1)
Repeat CLS token for each sample in batch and prepend to patch embeddings.
seq = x.shape[1]
x = x + self.pos_embedding[:seq]
x = self.dropout(x)
Get sequence length, add positional embeddings to tokens, and apply dropout.
x = self.transformer(x)
Process through transformer encoder.
x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
Pool the transformer output: either take mean across all tokens or extract the CLS token (first token).
x = self.to_latent(x)
return self.mlp_head(x)
Pass through latent layer (identity) and final classification head to produce class logits.
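Putting it together, a minimal usage sketch; the names and hyperparameters below are chosen for illustration (roughly ViT-Base-sized), not taken from the original source:
model = ViT(image_size=224, patch_size=16, num_classes=1000,
            dim=768, depth=12, heads=12, mlp_dim=3072)
imgs = torch.randn(4, 3, 224, 224)   # dummy batch of 4 RGB images
logits = model(imgs)                 # shape (4, 1000): one score per class
print(logits.shape)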
• ViT results: the attention visualization figure shows how a Vision Transformer (ViT) attends to an image during classification. The left shows the input image, and the right shows attention maps indicating which image patches the class token focuses on. Brighter regions are more important for the model's prediction, typically highlighting the main object while suppressing the background.
Visual Transformer (VT)
Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
1) Not all pixels are created equal.
2) Not all images have all concepts.
3) Convolutions struggle to relate spatially distant concepts.
In the reference implementation, the tokenization step looks like this:
# Tokenization
wa = rearrange(self.token_wA, 'b h w -> b w h')  # transpose
A = torch.einsum('bij,bjk->bik', x, wa)
A = rearrange(A, 'b h w -> b w h')  # transpose
A = A.softmax(dim=-1)
VV = torch.einsum('bij,bjk->bik', x, self.token_wV)
T = torch.einsum('bij,bjk->bik', A, VV)
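The snippet does not show how x, token_wA, and token_wV are shaped, so the following shape sketch is an assumption based on common VT implementations: x holds (B, N, C) pixel/patch features, token_wA is a learnable (B, L, C) token-attention weight, and token_wV is a learnable (B, C, C) value projection.
# wa: (B, C, L);  A = x @ wa: (B, N, L) -> transpose -> (B, L, N), softmax over the N spatial positions
# VV = x @ token_wV: (B, N, C)
# T  = A @ VV: (B, L, C)  -> L visual tokens, each a spatially weighted average of the features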
DETR
DETR treats object detection as predicting a set of objects (bounding boxes and labels) in a single forward pass. The core ideas are:
• A Transformer encoder–decoder architecture to model global relationships in the image.
• A set-based loss with bipartite (Hungarian) matching that enforces one-to-one assignment between predictions and ground-truth objects, eliminating duplicate detections by design (a matching sketch follows this list).
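A minimal sketch of the matching step, using scipy's linear_sum_assignment (an implementation of the Hungarian algorithm). This is not the full DETR loss: the actual matcher also adds a generalized IoU cost, and the final loss is then computed on the matched pairs.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (N, num_classes + 1), pred_boxes: (N, 4) normalized (cx, cy, w, h)
    # gt_labels: LongTensor (M,), gt_boxes: (M, 4)
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                      # (N, M): higher prob of the true class -> lower cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)    # (N, M): pairwise L1 box distance
    cost = cost_class + cost_bbox                         # DETR additionally includes a GIoU term
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))  # one-to-one (prediction, ground-truth) pairs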
DETR consists of three main components:
1. CNN Backbone
A standard CNN (e.g., ResNet-50/101) extracts a low-resolution feature map from the input image.
2. Transformer Encoder–Decoder
• The encoder applies global self-attention over flattened image features with positional encodings, enabling global scene reasoning.
• The decoder takes a fixed number of learned object queries (e.g., N = 100) and attends to the encoded image features to produce object-level embeddings in parallel.
3. Prediction Heads
• A shared feed-forward network (FFN) predicts a class label (including a special “no object” class) and normalized bounding box coordinates for each query (a sketch of such heads follows this list).
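A minimal sketch of what such prediction heads could look like; the dimensions and class count below are illustrative, and in DETR the box head is a small FFN whose sigmoid output keeps coordinates normalized to [0, 1].
import torch.nn as nn

class PredictionHeads(nn.Module):
    def __init__(self, d_model=256, num_classes=91):
        super().__init__()
        self.class_embed = nn.Linear(d_model, num_classes + 1)   # +1 for the "no object" class
        self.bbox_embed = nn.Sequential(                          # small FFN -> (cx, cy, w, h)
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),                  # normalized box coordinates
        )

    def forward(self, hs):                                        # hs: (batch, num_queries, d_model)
        return self.class_embed(hs), self.bbox_embed(hs)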
Key findings:
• Transformer encoder is crucial for global reasoning and instance separation.
• Multiple decoder layers progressively reduce duplicate predictions.
• Object queries specialize in different spatial regions and box sizes.
• NMS is unnecessary and can even harm performance in later decoder layers.