Changyu Lee

CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations - Paper Review

Introduction

In the era of multimodal learning, aligning visual and spatial information has become increasingly important for geospatial applications. The CSP (Contrastive Spatial Pre-Training) paper introduces a novel approach to learn geospatially aligned representations by pre-training location and image encoders in a self-supervised manner. This review explores the key contributions, methodology, and implementation details of this work.

Core Concept

CSP proposes a two-stage training pipeline:
1. Contrastive Spatial Pre-Training: a label-free, self-supervised phase that aligns the location encoder e(x) and the image encoder f(I) in a geospatially aligned representation space.
2. Supervised Fine-Tuning: task-specific fine-tuning with labeled data to optimize for downstream applications.
The fundamental goal of the pre-training phase is to learn representations in which geographically proximate images are embedded close to their corresponding location encodings in the shared latent space. A minimal sketch of the two stages follows.
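To make the division of labor concrete, here is a minimal, hedged sketch of the two stages. The toy encoders, shapes, and the species-classification head are illustrative stand-ins, not the CSP codebase:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the two encoders: e(x) over lon/lat and f(I) over
# pre-extracted CNN features. Names and dimensions are illustrative only.
loc_enc = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64))
img_proj = nn.Linear(2048, 64)

# Stage 1 (self-supervised): align e(x) and f(I) contrastively so that
# matched (location, image) pairs end up close in the shared space.
coords, img_feats = torch.rand(32, 2), torch.rand(32, 2048)
z_loc = F.normalize(loc_enc(coords), dim=-1)
z_img = F.normalize(img_proj(img_feats), dim=-1)
pretrain_loss = F.cross_entropy(z_img @ z_loc.t() / 0.07, torch.arange(32))

# Stage 2 (supervised): reuse the pre-trained encoders and train a task
# head on labeled data, e.g., species classification.
head = nn.Linear(64, 10)
labels = torch.randint(0, 10, (32,))
finetune_loss = F.cross_entropy(head(loc_enc(coords)), labels)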

Architecture Overview

Location Encoder

The LocationEncoder module serves as the core component for encoding geographic coordinates into meaningful embeddings. Key features include:
Spatial Encoder Integration: Uses a spatial encoder (spa_enc) to transform 2D location coordinates into high-dimensional embeddings
Classification Head: Includes a linear layer for multi-class prediction
Flexible Output: Can return either raw location embeddings or class predictions depending on the use case
The forward pass converts location coordinates (batch_size, 2) into spatial embeddings (batch_size, num_filts) through the spatial encoder, which can then be used for downstream tasks.
import torch
import torch.nn as nn

class LocationEncoder(nn.Module):
    def __init__(self, spa_enc, num_inputs, num_classes, num_filts, num_users=1):
        '''
        Args:
            spa_enc: the spatial encoder
            num_inputs: input embedding dimension
            num_classes: number of categories we want to classify
            num_filts: hidden embedding dimension
        '''
        super(LocationEncoder, self).__init__()
        self.spa_enc = spa_enc
        self.inc_bias = False
        self.num_filts = num_filts
        self.num_classes = num_classes
        self.num_users = num_users
        self.class_emb = nn.Linear(num_filts, num_classes, bias=self.inc_bias)
        self.user_emb = nn.Linear(num_filts, num_users, bias=self.inc_bias)

    def forward(self, x, class_of_interest=None, return_feats=False):
        '''
        Args:
            x: torch.FloatTensor(), input location features (batch_size, input_loc_dim = 2)
            class_of_interest: the class id we want to extract
            return_feats: whether to just return the location embedding
        '''
        # loc_feat: (batch_size, 1, input_loc_dim = 2)
        loc_feat = torch.unsqueeze(x, dim=1)
        # the spatial encoder consumes numpy arrays
        loc_feat = loc_feat.cpu().data.numpy()
        # loc_embed: torch.Tensor(), (batch_size, 1, spa_embed_dim = num_filts)
        loc_embed = self.spa_enc(loc_feat)
        # loc_emb: torch.Tensor(), (batch_size, spa_embed_dim = num_filts)
        loc_emb = loc_embed.squeeze(1)
        if return_feats:
            # loc_emb: (batch_size, num_filts)
            return loc_emb
        if class_of_interest is None:
            # class_pred: (batch_size, num_classes)
            class_pred = self.class_emb(loc_emb)
        else:
            # class_pred: shape (batch_size)
            class_pred = self.eval_single_class(loc_emb, class_of_interest)
        return torch.sigmoid(class_pred)

    def eval_single_class(self, x, class_of_interest):
        '''
        Args:
            x: (batch_size, num_filts)
        Return:
            shape (batch_size)
        '''
        # note: self.class_emb.weight has shape (num_classes, num_filts)
        if self.inc_bias:
            return torch.matmul(x, self.class_emb.weight[class_of_interest, :]) + self.class_emb.bias[class_of_interest]
        else:
            return torch.matmul(x, self.class_emb.weight[class_of_interest, :])
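For orientation, here is a hypothetical usage sketch of the class above. ToySpaEnc is a stand-in (not from the CSP codebase) for any Space2Vec-style spatial encoder that maps (batch, 1, 2) numpy coordinates to (batch, 1, num_filts) tensors:

import torch
import torch.nn as nn

class ToySpaEnc(nn.Module):
    # Stand-in spatial encoder: a real run would plug in a grid-cell
    # (Space2Vec) encoder; this toy version lifts 2-D coordinates linearly.
    def __init__(self, num_filts):
        super().__init__()
        self.proj = nn.Linear(2, num_filts)

    def forward(self, loc_feat):
        # loc_feat arrives as a numpy array of shape (batch, 1, 2)
        return self.proj(torch.from_numpy(loc_feat).float())

enc = LocationEncoder(ToySpaEnc(64), num_inputs=2, num_classes=10, num_filts=64)
coords = torch.rand(8, 2)                  # (batch_size, 2) location features
loc_emb = enc(coords, return_feats=True)   # (8, 64) raw location embeddings
class_probs = enc(coords)                  # (8, 10) sigmoid class scores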

Location-Image Encoder

The LocationImageEncoder extends the location encoder by incorporating image features and supporting various self-supervised loss functions. This architecture enables:
Joint Location-Image Representation: Combines location embeddings with CNN-extracted image features
Multiple Loss Functions: Supports L2 regression loss, image contrastive loss, and continuous softmax loss
Projection Layers: Includes optional decoder layers for different self-supervised learning objectives
Off-the-Shelf Image Encoders: uses InceptionV3 (Szegedy et al., "Rethinking the Inception Architecture for Computer Vision," CVPR 2016, pp. 2818-2826) and a Geo-SSL pre-trained ResNet50 as the image backbones
class LocationImageEncoder(nn.Module):
    def __init__(self, loc_enc, train_loss, unsuper_loss="none",
                 cnn_feat_dim=2048, spa_enc_type="gridcell"):
        '''
        Args:
            loc_enc: LocationEncoder() or FCNet()
        '''
        super(LocationImageEncoder, self).__init__()
        self.loc_enc = loc_enc
        if spa_enc_type in ["geo_net"]:
            self.spa_enc = loc_enc
        else:
            self.spa_enc = loc_enc.spa_enc
        self.inc_bias = loc_enc.inc_bias
        self.class_emb = loc_enc.class_emb
        self.user_emb = loc_enc.user_emb
        self.cnn_feat_dim = cnn_feat_dim
        self.loc_emb_dim = loc_enc.num_filts

        if unsuper_loss == "none":
            return
        elif unsuper_loss == "l2regress":
            # decoder that regresses location embeddings to CNN image features
            self.loc_dec = nn.Linear(in_features=self.loc_emb_dim,
                                     out_features=self.cnn_feat_dim, bias=True)
        elif "imgcontloss" in unsuper_loss or "contsoftmax" in unsuper_loss:
            # projection of CNN image features into the location embedding space
            self.img_dec = nn.Linear(in_features=self.cnn_feat_dim,
                                     out_features=self.loc_emb_dim, bias=True)
        else:
            raise Exception(f"Unknown unsuper_loss={unsuper_loss}")

    def forward(self, x, class_of_interest=None, return_feats=False):
        '''
        Args:
            x: torch.FloatTensor(), input location features (batch_size, input_loc_dim = 2)
            class_of_interest: the class id we want to extract
            return_feats: whether to just return the location embedding
        '''
        return self.loc_enc.forward(x, class_of_interest, return_feats)

    def eval_single_class(self, x, class_of_interest):
        '''
        Args:
            x: (batch_size, num_filts)
        Return:
            shape (batch_size)
        '''
        return self.loc_enc.eval_single_class(x, class_of_interest)

Contrastive Learning Strategy

The paper explores three distinct approaches for constructing positive and negative pairs during pre-training:

① In-Batch Negative Sampling

Uses other samples within the same batch as negative examples. This is computationally efficient but may suffer from false negatives when batch samples are geographically close.
Definition
Given a mini-batch of location-image pairs
\{(x_1, I_1), (x_2, I_2), \dots, (x_B, I_B)\},
each anchor pair uses all other samples in the batch as negatives.
Positive: (x_i, I_i)
Negatives: (x_j, I_j), \; j \neq i
This provides a set of contrastive negatives drawn entirely from the current batch.
Loss Formulation
\mathcal{L}_i = -\log\frac{\exp(\text{sim}(f(I_i), e(x_i))/\tau)}{\sum_{j=1}^{B}\exp(\text{sim}(f(I_i), e(x_j))/\tau)}
The denominator contains all batch locations, so the model must distinguish the correct location embedding from every other one in the batch.
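A minimal PyTorch sketch of this in-batch objective (the function name and temperature default are illustrative; CSP's actual implementation differs in details):

import torch
import torch.nn.functional as F

def in_batch_nce(img_emb, loc_emb, tau=0.07):
    """In-batch contrastive loss: img_emb and loc_emb are (B, d) embeddings
    where row i of each tensor comes from the same location-image pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    loc_emb = F.normalize(loc_emb, dim=-1)
    # sim[i, j] = cosine similarity between image i and location j
    sim = img_emb @ loc_emb.t() / tau                      # (B, B)
    target = torch.arange(sim.size(0), device=sim.device)
    # Diagonal entries are the positives; all other columns act as negatives.
    return F.cross_entropy(sim, target)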

② Random Negative Location Sampling

Randomly samples geographic locations as negatives, ensuring greater diversity in negative examples and reducing the likelihood of false negatives.
Definition
Instead of using only in-batch negatives, this method draws random geographic locations from the whole spatial distribution:
x_i^- \sim p(x)
The image is fixed, but its paired location is replaced by randomly sampled negatives.
Positive: (x_i, I_i)
Negatives: randomly sampled locations paired with the same image
This increases the diversity of negative examples and reduces false negatives.
Loss Formulation
\mathcal{L}_i = -\log \frac{\exp(\text{sim}(f(I_i), e(x_i))/\tau)}{\exp(\text{sim}(f(I_i), e(x_i))/\tau) + \sum_{k=1}^{K}\exp(\text{sim}(f(I_i), e(x_{i,k}^{-}))/\tau)}
where x_{i,k}^{-} are independently sampled negative locations.
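A hedged sketch of this variant, assuming loc_enc maps (N, 2) coordinates in [-1, 1] to (N, d) embeddings and that p(x) is a uniform spatial prior; names and defaults are illustrative:

import torch
import torch.nn.functional as F

def random_neg_nce(img_emb, loc_enc, pos_coords, num_neg=64, tau=0.07):
    """Contrastive loss with K randomly sampled negative locations per image.
    img_emb: (B, d) image embeddings; pos_coords: (B, 2) matched coordinates."""
    B, d = img_emb.shape
    # Draw K random locations per anchor from a uniform prior over [-1, 1]^2.
    neg_coords = torch.rand(B, num_neg, 2, device=pos_coords.device) * 2 - 1
    pos_emb = F.normalize(loc_enc(pos_coords), dim=-1)              # (B, d)
    neg_emb = F.normalize(loc_enc(neg_coords.view(-1, 2)), dim=-1)  # (B*K, d)
    neg_emb = neg_emb.view(B, num_neg, d)
    img_emb = F.normalize(img_emb, dim=-1)
    pos_logit = (img_emb * pos_emb).sum(-1, keepdim=True) / tau     # (B, 1)
    neg_logit = torch.einsum('bd,bkd->bk', img_emb, neg_emb) / tau  # (B, K)
    logits = torch.cat([pos_logit, neg_logit], dim=1)               # (B, 1+K)
    # The positive always sits in column 0 of the logits.
    target = torch.zeros(B, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)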

③ SimCSE-Style Location Positive Augmentation

Creates positive pairs by augmenting the same location with slight perturbations, inspired by SimCSE's approach in NLP. This helps the model learn location representations that are invariant to small spatial variations: the anchor is e(x_i) and the positive is e'(x_i), a second encoding of the same location.
Definition
Inspired by SimCSE in NLP, the same location is encoded twice with different dropout masks or encoder stochasticity:
e(x_i), \quad e'(x_i)
These two embeddings form a positive pair, enforcing invariance to small perturbations.
Positive: two different encodings of the same location
Negatives: other locations in the batch
Loss Formulation
\mathcal{L}_i = -\log \frac{\exp(\text{sim}(e(x_i), e'(x_i))/\tau)}{\sum_{j=1}^{B}\exp(\text{sim}(e(x_i), e(x_j))/\tau)}
The encoder is trained to produce consistent representations for the same location while separating it from all other locations in the batch.
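A minimal sketch of this objective, assuming loc_enc contains dropout and is in training mode so that two forward passes yield different views (function name is illustrative):

import torch
import torch.nn.functional as F

def simcse_loc_loss(loc_enc, coords, tau=0.07):
    """SimCSE-style loss: encode the same coordinates twice; dropout inside
    loc_enc produces the two views e(x_i) and e'(x_i)."""
    z1 = F.normalize(loc_enc(coords), dim=-1)  # e(x_i),  (B, d)
    z2 = F.normalize(loc_enc(coords), dim=-1)  # e'(x_i), (B, d), new dropout mask
    sim = z1 @ z2.t() / tau                    # (B, B) similarity matrix
    target = torch.arange(sim.size(0), device=sim.device)
    # Row i's positive is its own second view; other rows act as negatives.
    return F.cross_entropy(sim, target)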

Loss Functions

The paper evaluates three self-supervised loss functions:

NCE (Noise Contrastive Estimation) Loss

A standard contrastive loss that encourages positive pairs to have similar embeddings while pushing negative pairs apart.

MC (InfoNCE-style Multi-Class) Loss

A multi-class variant of the InfoNCE loss that treats each negative sample as a separate class. This approach achieved the best performance in the paper's experiments, demonstrating superior ability to learn discriminative geospatial representations.

L2/MSE Regression Loss

Attempts to directly regress location embeddings to image features using mean squared error. However, this approach showed poor performance compared to contrastive methods.
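For contrast with the two contrastive sketches above, the L2/MSE regression baseline is trivial to write down (an illustrative sketch; loc_dec mirrors the decoder layer in LocationImageEncoder):

import torch.nn.functional as F

def l2_regress_loss(loc_emb, img_feat, loc_dec):
    """Regress location embeddings onto frozen CNN image features.
    loc_emb: (B, num_filts); img_feat: (B, cnn_feat_dim); loc_dec: nn.Linear."""
    pred = loc_dec(loc_emb)            # (B, cnn_feat_dim)
    # No negative pairs here, so nothing forces embeddings of different
    # locations apart, one plausible reason this baseline underperforms.
    return F.mse_loss(pred, img_feat)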

Training Methodology

Image Encoder Pre-Training

Following insights from the LiT (Locked-image Tuning) paper, CSP adopts a two-phase approach:
1. Pre-train the image encoder f(I) first.
2. Lock the image encoder and use it to pre-train the location encoder e(x) (self-supervised fine-tuning).
During CSP pre-training, only the image projection layer W(·) is trained while the rest of the image encoder remains frozen. This strategy significantly improves performance by preventing the image encoder from overfitting to the specific geospatial task. A minimal sketch of this setup follows.
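An illustrative LiT-style setup: the stub backbone stands in for a pre-trained InceptionV3 or Geo-SSL ResNet50, and only the projection W and the location encoder receive gradients (all names and dimensions are assumptions for the sketch):

import torch
import torch.nn as nn

# Stub image backbone standing in for a pre-trained CNN.
img_encoder = nn.Sequential(nn.Conv2d(3, 16, 3), nn.AdaptiveAvgPool2d(1),
                            nn.Flatten(), nn.Linear(16, 2048))
for p in img_encoder.parameters():
    p.requires_grad = False                      # lock the image encoder

img_proj = nn.Linear(2048, 64)                   # trainable projection W(·)
loc_encoder = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64))

# Only the location encoder and the projection layer are optimized.
optimizer = torch.optim.Adam(
    list(loc_encoder.parameters()) + list(img_proj.parameters()), lr=1e-3)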

Feature Extraction Pipeline

The implementation shows that:
Location features are extracted dynamically during data loading
CNN image features appear to be pre-computed and stored in the preprocessed dataset
The spatial encoder transforms raw coordinates into embeddings suitable for contrastive learning (see the dataset sketch below)
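A hypothetical Dataset consistent with this pipeline; the class name, the normalization constants, and the storage layout are assumptions, not the CSP preprocessing code:

import torch
from torch.utils.data import Dataset

class GeoImageFeatureDataset(Dataset):
    """Illustrative dataset: CNN image features are pre-computed offline,
    while location features are derived from raw coordinates at load time."""
    def __init__(self, cnn_feats, coords):
        self.cnn_feats = cnn_feats        # (N, 2048) pre-extracted CNN features
        self.coords = coords              # (N, 2) raw lon/lat as a tensor

    def __getitem__(self, idx):
        # Normalize coordinates to [-1, 1] on the fly (dynamic loc features).
        loc_feat = self.coords[idx] / torch.tensor([180.0, 90.0])
        return loc_feat, self.cnn_feats[idx]

    def __len__(self):
        return len(self.coords)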

Key Insights and Contributions

Self-Supervised Geospatial Learning: Demonstrates that contrastive learning can effectively align visual and spatial modalities without requiring labeled data
MC Loss Superiority: Shows that InfoNCE-style multi-class loss outperforms other contrastive objectives for this task
Locked Image Encoder: Validates that freezing the image encoder during location encoder pre-training improves generalization
Flexible Architecture: The modular design allows easy integration of different spatial encoders and loss functions

Conclusion

CSP presents a well-designed framework for learning geospatially aligned visual representations through self-supervised contrastive learning. By carefully designing positive/negative pair construction strategies and selecting appropriate loss functions, the paper demonstrates significant improvements in geospatial-visual tasks. The locked image encoder strategy and the superiority of MC loss provide valuable insights for future multimodal geospatial learning research.
The clean separation between pre-training and fine-tuning phases, combined with the modular architecture, makes this approach both theoretically sound and practically applicable to various geospatial computer vision tasks.

Appendix

In the reference implementation, the model objects and the input location features are constructed on the fly while the data is being loaded.