Abstract
This work proposes a Vision-Language Model (VLM)-based approach for fine-grained street-level geolocation estimation, going beyond conventional city-level classification. I conduct experiments on classifying 15 sub-regions within Los Angeles (LA) using CLIP and DINOv2 as backbones.
To explicitly incorporate geographic information during training, I implement and compare five distance-aware loss functions: Cross Entropy, Soft Label Cross Entropy, Geo Label Smoothing, Expected Distance Loss, and Geo-Margin Ranking Loss.
Experimental results show that DINOv2 consistently outperforms CLIP by 3–5% accuracy across all settings. Moreover, distance-aware losses—especially Geo-Margin Ranking Loss and Geo Label Smoothing—significantly reduce mean geolocation error compared to standard classification loss. The best-performing model (FT DINOv2 + Ranking Loss) achieves 83.70% accuracy with a mean error of 3.64 km.
1. Introduction
Image-based geolocation has achieved strong performance at the city level, yet fine-grained localization within a single city—at the district or street level—remains a challenging problem. This difficulty primarily arises from the high visual similarity between neighboring regions and the absence of distinctive landmarks in many street-view scenes.
Early geolocation approaches based on convolutional neural networks (e.g., PlaNet) largely rely on visual pattern matching and often fail to capture high-level semantic cues, such as architectural styles, storefronts, or vegetation patterns, which are essential for distinguishing visually similar urban areas.
Recent large-scale pretrained models provide a new opportunity to address these limitations. In particular, Vision-Language Models (VLMs) such as CLIP learn rich visual representations through large-scale pretraining, enabling the extraction of semantically meaningful features beyond low-level appearance. In parallel, recent self-supervised vision models such as DINOv2 have demonstrated strong performance in capturing fine-grained structural and texture-based visual information. While DINOv2 is not a Vision-Language Model, its representation strength makes it a compelling vision encoder for intra-city geolocation.
In this work, I investigate how different pretrained visual encoders behave in the challenging setting of street-level intra-city geolocation, using CLIP and DINOv2 as representative backbones. Rather than relying solely on standard classification objectives, I further introduce distance-aware loss functions that explicitly incorporate geographic proximity during training, encouraging predictions that are not only correct but also spatially meaningful.
I conduct experiments on Los Angeles, a large and visually diverse metropolitan area that serves as a strong stress-test environment for street-level geolocation. Due to its scale and cultural heterogeneity, many street scenes across neighboring regions appear highly similar, making accurate discrimination particularly difficult. My goal is not only to improve classification accuracy, but also to reduce geographic error, thereby advancing practical street-level geolocation within a single city.
2. Related Work
Geolocation Benchmarks
Recent benchmarks such as WHERE ON EARTH? demonstrate that even state-of-the-art VLMs (e.g., GPT-4o, Gemini) perform well at country-level localization but still exhibit large errors at the street level. Text-based reasoning alone often leads to hallucinations when visual cues are weak.
Hierarchical Classification
PlaNet (Weyand et al.) formulates geolocation as a hierarchical classification problem over global S2 cells. My work follows a similar classification-based paradigm but replaces CNN backbones with large-scale pretrained encoders (CLIP, DINOv2) and introduces distance-aware loss functions that incorporate regression-like spatial constraints into classification training.
Prior Experiments
In preliminary experiments (Weeks 10–12), I confirmed the limitations of zero-shot CLIP, which achieved only ~10% accuracy. Prompt engineering alone was insufficient for fine-grained regional discrimination, motivating the need for supervised fine-tuning, which this work systematically explores.
DINOv2 / CLIP
Both CLIP and DINOv2 use ViT-based architectures, allowing a direct comparison of their pretrained representations under an identical classification setup.
3. Methodology
3.1 Model Architecture
Architecture:
1. Image Encoder (CLIP Vision or DINOv2) → Image Embedding
2. Projection Head → Projected Image Embedding (dimension matches the CLIP text space)
3. Classification → Dot product with fixed text prototypes
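A minimal sketch of this forward path, assuming a linear projection into the CLIP text-embedding space and frozen, L2-normalized text prototypes (class and variable names here are illustrative, not the exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeoClassifier(nn.Module):
    """Backbone embedding -> projection -> dot product with fixed text prototypes."""
    def __init__(self, backbone, backbone_dim, text_prototypes):
        super().__init__()
        self.backbone = backbone                        # e.g. CLIP vision tower or DINOv2
        proto_dim = text_prototypes.shape[1]
        self.proj = nn.Linear(backbone_dim, proto_dim)  # match the CLIP text dimension
        # Fixed (frozen) text prototypes, one per region, L2-normalized
        self.register_buffer("prototypes", F.normalize(text_prototypes, dim=1))

    def forward(self, images):
        feats = self.backbone(images)                   # (B, backbone_dim)
        z = F.normalize(self.proj(feats), dim=1)        # (B, proto_dim)
        return z @ self.prototypes.T                    # (B, num_regions) logits
```

Because the prototypes are registered as a buffer, only the backbone and projection head receive gradients during fine-tuning.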
I adopt a standard feature extractor + classifier head architecture:
- Backbones
  - CLIP (OpenAI): trained on image–text pairs; strong semantic alignment
  - DINOv2 (Meta): self-supervised; excels at capturing structural and texture-based visual features
- Classifier Head
  - Linear → ReLU → Dropout → Linear
  - Maps image embeddings to the 15 regional classes
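The classifier head above can be sketched as follows; the hidden width and dropout rate are illustrative assumptions, not values from the actual training runs:

```python
import torch.nn as nn

def make_classifier_head(embed_dim: int, num_classes: int = 15,
                         hidden_dim: int = 512, p_drop: float = 0.1) -> nn.Sequential:
    """Linear -> ReLU -> Dropout -> Linear, mapping image embeddings to region logits."""
    return nn.Sequential(
        nn.Linear(embed_dim, hidden_dim),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(hidden_dim, num_classes),
    )
```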
3.2 Distance-Aware Loss Functions
To encourage the model to predict geographically closer locations even when incorrect, I compare the following loss functions:
1. Cross Entropy (Baseline): Standard classification loss without geographic awareness.
2. Soft Label Cross Entropy: Applies a Gaussian/RBF kernel over inter-region distances, assigning higher target probability to geographically close classes.
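One way to build these soft targets is an exponential kernel over the precomputed inter-region distance matrix (the kernel form and temperature t here are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def soft_label_cross_entropy(logits, targets, dist_matrix, t=10.0):
    """Cross entropy against distance-softened targets: q_j ∝ exp(-d(y, j) / t)."""
    soft = torch.exp(-dist_matrix / t)             # (C, C) kernel over region distances
    soft = soft / soft.sum(dim=1, keepdim=True)    # row-normalize into distributions
    q = soft[targets]                              # (B, C) target distribution per sample
    log_p = F.log_softmax(logits, dim=1)
    return -(q * log_p).sum(dim=1).mean()
```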
3. Geo Label Smoothing: Replaces uniform label smoothing with a distance-aware geographic prior, so that nearby regions are treated as partially correct and predictions are encouraged to be spatially coherent:

Label = (1 − ε) · OneHot + ε · GeoDistribution
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BaseDistanceLoss(nn.Module):
    # Minimal sketch of the assumed base class: precomputes the pairwise
    # region-center distance matrix used by all distance-aware losses.
    def __init__(self, region_centers, region_ids, device='cuda'):
        super().__init__()
        self.num_classes = len(region_ids)
        centers = torch.as_tensor(region_centers, dtype=torch.float32, device=device)
        self.dist_matrix = torch.cdist(centers, centers)  # (C, C) pairwise distances

class GeoLabelSmoothingLoss(BaseDistanceLoss):
    """
    3) Geo Label Smoothing
    q = (1 - epsilon) * onehot(y) + epsilon * geo_prior
    """
    def __init__(self, region_centers, region_ids, epsilon=0.1, t=10.0, device='cuda'):
        super().__init__(region_centers, region_ids, device)
        self.epsilon = epsilon
        # Geo prior (soft-label distribution over regions)
        geo_prior = torch.exp(-self.dist_matrix / t)
        geo_prior = geo_prior / geo_prior.sum(dim=1, keepdim=True)
        # Mix with the one-hot identity matrix
        identity = torch.eye(self.num_classes, device=device)
        self.smoothed_labels = (1 - epsilon) * identity + epsilon * geo_prior

    def forward(self, logits, targets):
        target_probs = self.smoothed_labels[targets]
        log_probs = F.log_softmax(logits, dim=1)
        loss = -(target_probs * log_probs).sum(dim=1).mean()
        return loss
```
4. Expected Distance Loss: Adds an auxiliary term that minimizes the expected geographic distance between the predicted probability distribution and the ground-truth location.
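A sketch of one possible implementation, combining cross entropy with the expected-distance term (the weighting coefficients are assumptions, not tuned values from this work):

```python
import torch
import torch.nn.functional as F

def expected_distance_loss(logits, targets, dist_matrix, ce_weight=1.0, dist_weight=0.1):
    """Cross entropy plus the expected distance E_p[d(y, j)] under the predicted distribution."""
    probs = F.softmax(logits, dim=1)                 # (B, C) predicted distribution
    d_to_true = dist_matrix[targets]                 # (B, C) distance from true region to each class
    expected_dist = (probs * d_to_true).sum(dim=1)   # (B,) expected error in distance units
    ce = F.cross_entropy(logits, targets)
    return ce_weight * ce + dist_weight * expected_dist.mean()
```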
5. Geo-Margin Ranking Loss: Penalizes distant incorrect predictions more heavily by setting the logit margin proportional to the physical distance between regions, enforcing logit(y) > logit(j) + margin(dist(y, j)).
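This constraint can be sketched as a hinge penalty over all incorrect classes, with the margin scaled by inter-region distance (the scale alpha is an illustrative assumption):

```python
import torch

def geo_margin_ranking_loss(logits, targets, dist_matrix, alpha=0.1):
    """Hinge loss enforcing logit(y) >= logit(j) + alpha * dist(y, j) for all j != y."""
    true_logit = logits.gather(1, targets.unsqueeze(1))       # (B, 1)
    margins = alpha * dist_matrix[targets]                    # (B, C); zero at the true class
    violation = (logits + margins - true_logit).clamp(min=0)  # hinge per competing class
    # Mask out the true class so it contributes no penalty
    mask = torch.ones_like(violation).scatter(1, targets.unsqueeze(1), 0.0)
    return (violation * mask).mean()
```

With well-separated logits the loss is exactly zero; distant wrong classes must be beaten by a larger margin than nearby ones.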
Training Setup
- Collected roughly 20,000 Street View images from 9 selected LA regions. An example region configuration:
"regions": [
{
"id": "downtown_la",
"name": "Downtown Los Angeles",
"description": "LA's financial and cultural center - skyline, downtown streets",
"bbox": {
"corners": [
{
"name": "southeast",
"latitude": 34.035,
"longitude": -118.24
},
{
"name": "southwest",
"latitude": 34.035,
"longitude": -118.265
},
{
"name": "northwest",
"latitude": 34.055,
"longitude": -118.265
},
{
"name": "northeast",
"latitude": 34.055,
"longitude": -118.24
}
]
},
"sampling": {
"num_samples": 1000,
"method": "grid",
"headings": [
0,
90,
180,
270
]
},
"output": {
"directory": "streetview_downtown_la",
"prefix": "dtla"
},
"validation": {
"enabled": true,
"radius": 50,
"max_attempts": 1500
}
},
JSON
복사
4. Experiments
4.1 Experimental Setup
Experiments are conducted on 15 regions within Los Angeles, using zero-shot CLIP as a baseline and comparing multiple fine-tuned models.
Evaluation Metrics
- Accuracy
- Mean Error Distance (km)
- Accuracy within radius: Acc@1km, Acc@5km, Acc@10km, Acc@25km
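These metrics can be computed from per-sample results as follows (a sketch; function and argument names are illustrative, assuming per-sample error distances in km and correctness flags are already available):

```python
import numpy as np

def geo_metrics(errors_km, correct, radii=(1, 5, 10, 25)):
    """Mean error, accuracy, and accuracy-within-radius from per-sample results."""
    errors_km = np.asarray(errors_km, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    metrics = {
        "accuracy": correct.mean() * 100,          # top-1 region accuracy (%)
        "mean_error_km": errors_km.mean(),         # mean geolocation error (km)
    }
    for r in radii:
        metrics[f"acc@{r}km"] = (errors_km <= r).mean() * 100
    return metrics
```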
4.2 Quantitative Results
| Model | Backbone | Loss | Accuracy (%) | Mean Error (km) | Acc@1km | Acc@5km | Acc@25km |
|---|---|---|---|---|---|---|---|
| Zero-shot CLIP | CLIP | N/A | 9.78 | 21.01 | 9.78 | 16.27 | 72.71 |
| FT CLIP (CE) | CLIP | CE | 79.27 | 4.47 | 79.27 | 81.51 | 92.88 |
| FT CLIP (Soft) | CLIP | Soft Label | 74.99 | 4.60 | 74.99 | 78.71 | 93.37 |
| FT CLIP (Ranking) | CLIP | Ranking | 79.99 | 4.47 | 79.99 | 81.91 | 92.86 |
| FT CLIP (Geo Smooth) | CLIP | Geo Smooth | 79.55 | 4.38 | 79.55 | 81.93 | 93.21 |
| FT DINOv2 (CE) | DINOv2 | CE | 83.68 | 3.68 | 83.68 | 85.22 | 94.21 |
| FT DINOv2 (Soft) | DINOv2 | Soft Label | 78.36 | 3.90 | 78.36 | 81.81 | 94.51 |
| FT DINOv2 (Expected) | DINOv2 | Exp. Dist. | 80.53 | 4.19 | 80.53 | 82.61 | 93.42 |
| FT DINOv2 (Ranking) | DINOv2 | Ranking | 83.70 | 3.64 | 83.70 | 85.24 | 94.33 |
| FT DINOv2 (Geo Smooth) | DINOv2 | Geo Smooth | 83.03 | 3.58 | 83.03 | 84.87 | 94.72 |
5. Discussion
- Fine-tuning is essential. The zero-shot CLIP image encoder achieves only 9.78% accuracy, whereas all fine-tuned models exceed 75%, indicating that pretrained representations alone are insufficient for intra-city geolocation.
- DINOv2 is consistently superior. Under identical loss settings, DINOv2 outperforms CLIP by 3–5% accuracy and reduces mean error by approximately 0.8 km, suggesting that self-supervised visual features are more effective than text-aligned features for fine-grained localization.
- Distance-aware losses improve spatial precision. Geo-Margin Ranking Loss and Geo Label Smoothing yield the best results. Notably, FT DINOv2 (Geo Smooth) achieves the lowest mean error (3.58 km), indicating a strong tendency to predict geographically nearby regions even when misclassified. In contrast, Expected Distance Loss underperforms, likely due to interference between the auxiliary regression objective and the primary classification objective.
6. Conclusion
This study demonstrates that combining a strong visual backbone (DINOv2) with geographically constrained loss functions is the most effective strategy for street-level geolocation. The results highlight the importance of explicitly modeling spatial relationships during training and provide a practical framework for fine-grained geolocation using Vision-Language Models.