Changyu Lee

[Geolocation] - Week 6 - Evaluating Geographic Reasoning Capabilities of CLIP

Published at: 2025/10/10
Last edited time: 2025/11/19 02:58
Created: 2025/10/10 15:56
Section: Research
Status: Done
Series: Geolocation
Tags: Research
AI summary:
Keywords: GeoLocation, CLIP
Language: ENG
Week: 6

Geolocation - Week 6

This Week's Objectives:
Experimental Data Construction → LA Regional Landmarks
Experiment 1: Geolocation Accuracy Assessment
Experiment 2: Address vs GPS Coordinate Comparison

Abstract

This week, I investigated the geographic reasoning capability of pretrained vision-language models (VLMs) by assessing their ability to classify visual scenes at varying geographic granularities: continent, country, region, city, and street. Using a dataset of 29 tourist attractions in Los Angeles, I constructed a structured metadata pipeline through GPT-based hierarchical labeling, Google Maps cross-validation, and GPS annotation. Two CLIP variants, openai/clip-vit-base-patch32 and laion/CLIP-ViT-H-14-laion2B-s32B-b79K, were compared to evaluate geospatial sensitivity. Results show moderate performance at street-level classification (Top-1 = 0.44, Top-5 = 0.72), revealing that CLIP captures the general "Los Angeles street" context but struggles with fine-grained distinctions among visually similar urban scenes. Cosine similarity distributions between correct and incorrect predictions were nearly overlapping (mean Δ ≈ 0.002), indicating weak discriminability at high resolution. Street name variant analysis further showed small embedding gaps (~0.02-0.05) between same and different street names, suggesting linguistic dilution of location-specific semantics. Additionally, experiments comparing textual addresses and GPS coordinates demonstrated that CLIP models align better with natural language address expressions (mean similarity ≈ 0.22) than with numerical coordinate strings (≈ 0.08-0.13). These findings highlight CLIP's limitations in fine-grained geolocation and underscore the need for geospatially aware model architectures such as GeoCLIP. Future work includes exploring coordinate ambiguity effects and evaluating negation-aware location reasoning for robust geographic verification.
CLIP shows limited ability to distinguish fine-grained urban street locations, often confusing visually similar scenes.
Natural language address expressions align much better with CLIP’s vision–language space than numerical GPS coordinates.
These findings highlight the need for geospatially aware architectures such as GeoCLIP for accurate geographic reasoning.
Overview of this week’s work

Experiment Setup

Data Processing Pipeline:

1. GPT-based labeling for Continent / Country / Region / City / Street hierarchies
2. Cross-validation with Google Maps information
3. GPS coordinate input from Google Maps to complete metadata (an illustrative record is sketched below)
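For reference, a metadata record under this pipeline might look as follows; the field names and example values are hypothetical, not the project's actual schema.

```python
# Illustrative metadata record produced by the pipeline above.
# Field names and example values are hypothetical, not the actual schema.
example_record = {
    "landmark": "Griffith Observatory",
    "continent": "North America",
    "country": "United States",
    "region": "California",
    "city": "Los Angeles",
    "street": "E Observatory Rd",                               # GPT-labeled, cross-checked on Google Maps
    "address": "2800 E Observatory Rd, Los Angeles, CA 90027",
    "lat": 34.1184,                                             # GPS coordinates copied from Google Maps
    "lon": -118.3004,
}
```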

Comparison Models

Comparison between OpenAI's original CLIP model and a model trained on a larger dataset (LAION-2B, ~2B image-text pairs, reportedly ~5x more than OpenAI's training set):
openai/clip-vit-base-patch32
laion/CLIP-ViT-H-14-laion2B-s32B-b79K
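A minimal loading sketch, assuming the Hugging Face transformers implementations of both checkpoints; everything besides the checkpoint IDs is illustrative.

```python
# Minimal sketch: load the two CLIP checkpoints compared this week.
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_IDS = [
    "openai/clip-vit-base-patch32",
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
]

def load_clip(model_id, device="cuda" if torch.cuda.is_available() else "cpu"):
    """Return an eval-mode CLIP model and its paired processor."""
    model = CLIPModel.from_pretrained(model_id).to(device).eval()
    processor = CLIPProcessor.from_pretrained(model_id)
    return model, processor

clip_models = {mid: load_clip(mid) for mid in MODEL_IDS}
```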

Experiment 1: Geographic Hierarchy Classification

Objective: Following the methodology of Weyand et al., "PlaNet: Photo Geolocation with Convolutional Neural Networks" (ECCV 2016), assess the accuracy of existing pretrained open-source CLIP models across geographic hierarchies: Continent / Country / Region / City / Street.

Top-1 / Top-5 Accuracy

Metric   | Value
acc_top1 | 0.44
acc_top5 | 0.72
Interpretation:
For 44% of images, the target street was the highest-similarity match (rank 1).
For 72% of images, the correct street appeared within the top-5 candidates.
Analysis: While the separation is far from perfect, CLIP demonstrates partial ability to associate images with the correct or visually related street names. The model appears to capture the overall "LA street atmosphere" but struggles with fine-grained distinctions.
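A minimal sketch of how these Top-1/Top-5 numbers can be computed, assuming the transformers CLIP API, one prompt per candidate street, and PIL images; the variable names are illustrative, not the exact evaluation script.

```python
# Sketch: street-level Top-1 / Top-5 accuracy from image-text cosine similarity.
import torch

@torch.no_grad()
def street_topk_accuracy(model, processor, images, true_idx, streets, device="cpu", k=5):
    """images: list of PIL images; true_idx: index of each image's street in `streets`."""
    prompts = [f"A street photo taken on {s} in Los Angeles" for s in streets]
    txt = processor(text=prompts, return_tensors="pt", padding=True).to(device)
    text_emb = model.get_text_features(**txt)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    img = processor(images=images, return_tensors="pt").to(device)
    img_emb = model.get_image_features(**img)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    sims = img_emb @ text_emb.T                    # (n_images, n_streets) cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)  # best-matching street first
    labels = torch.tensor(true_idx, device=sims.device)
    acc_top1 = (ranks[:, 0] == labels).float().mean().item()
    acc_topk = (ranks[:, :k] == labels[:, None]).any(dim=-1).float().mean().item()
    return acc_top1, acc_topk, sims
```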

Cosine Similarity Distribution Analysis (Correct vs Incorrect)

Category   | mean  | median (p50) | p90    | max
correct    | 0.23  | 0.2388       | 0.2632 | 0.2686
wrong-best | 0.235 | 0.2439       | 0.2615 | 0.2720
Key Findings:
Average similarity difference ≈ 0.002 → Nearly overlapping distributions
CLIP struggles to distinguish between the "correct street" and "visually similar other streets"
Even at p90 threshold, incorrect answers (0.261) are only marginally lower than correct ones (0.263)
This explains why Top-5 performance significantly outperforms Top-1: The correct answer is typically among the top candidates but often fails to achieve the highest ranking.
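A sketch of how these correct vs. wrong-best statistics can be derived, assuming the similarity matrix returned by the previous sketch and a tensor of true street indices.

```python
# Sketch: compare similarity to the true street against the best-scoring wrong street.
import torch

def correct_vs_wrong_best(sims, labels):
    """sims: (n_images, n_streets) cosine similarities; labels: tensor of true street indices."""
    idx = torch.arange(len(labels), device=labels.device)
    correct = sims[idx, labels]                    # similarity to the ground-truth street
    masked = sims.clone()
    masked[idx, labels] = float("-inf")            # exclude the true street
    wrong_best = masked.max(dim=-1).values         # highest-scoring incorrect street

    def stats(x):
        return {"mean": x.mean().item(),
                "p50": x.median().item(),
                "p90": torch.quantile(x, 0.90).item(),
                "max": x.max().item()}

    return {"correct": stats(correct), "wrong_best": stats(wrong_best)}
```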

Similarity Value Scale Analysis

Maximum similarity: ~0.27
Minimum similarity: 0.17-0.18
Overall scale: Relatively low compared to typical CLIP performance
Reasoning:
CLIP typically reaches cosine similarities of 0.3-0.4+ for strong image-text matches.
The current lower values are attributed to:
Simple prompt structure: "A street photo taken on {street} in Los Angeles"
High visual similarity across urban street scenes (roads, signs, sky, etc.)
Conclusion: the results are better described as "difficult to distinguish" than as "inaccurate."

Street Name Variant Analysis

Objective: Analyze how consistently CLIP recognizes various representations of the same street name (abbreviations, full names, suffix removal, etc.) within the embedding space.
Methodology:
Calculate average and maximum cosine similarity between variants of the same street (same_mean, same_max)
Compare with similarities across different streets (cross_mean, cross_max)
Results:
same_mean slightly higher than cross_mean (difference ~0.02-0.05)
CLIP maintains some consistency for visual variants within the same street (lighting, angles, pedestrian density)
Model tends to recognize different visual representations of the same street as "same location category"
Limitations: The small difference suggests CLIP insufficiently distinguishes fine details between different urban streets, leading to confusion in environments with similar landmark structures or background textures.
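A sketch of this variant analysis using text-only CLIP embeddings of street-name variants; the variant lists shown in the comment are illustrative.

```python
# Sketch: same-street vs. cross-street cosine similarity of street-name variants.
import torch

@torch.no_grad()
def variant_similarity(model, processor, variants_by_street, device="cpu"):
    # e.g. {"Wilshire Blvd": ["Wilshire Blvd", "Wilshire Boulevard", "Wilshire"], ...}
    names, groups = [], []
    for street, variants in variants_by_street.items():
        names.extend(variants)
        groups.extend([street] * len(variants))

    txt = processor(text=names, return_tensors="pt", padding=True).to(device)
    emb = model.get_text_features(**txt)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    sims = emb @ emb.T

    same, cross = [], []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            (same if groups[i] == groups[j] else cross).append(sims[i, j].item())

    same_t, cross_t = torch.tensor(same), torch.tensor(cross)
    return {"same_mean": same_t.mean().item(), "same_max": same_t.max().item(),
            "cross_mean": cross_t.mean().item(), "cross_max": cross_t.max().item()}
```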

Embedding Vector Analysis

Finding: The diagonal (y = x) represents cases where "same street variants" and "different streets" have equal similarity → CLIP fails to distinguish between street names.
Problem Identification: The model encodes Wilshire Blvd, Hollywood Blvd, Ocean Front Walk, etc., with nearly identical sentence vectors, prioritizing 'Los Angeles street' context over linguistic differences between street names.
→ The linguistic distinction between specific street names is weak; instead, the model strongly reflects the broader contextual concept of “Los Angeles street.”

Model Comparison Results

Comparison over the 29 data points:
CLIP: 0.467 street-level accuracy
Fine-tuned model (a randomly chosen open-source checkpoint): 0.433 accuracy
Conclusion: As expected, performance varies with the composition of the training data.

Experiment 2: Address vs GPS Coordinate Effectiveness

Objective

Evaluate whether human-readable address formats are more effective than numerical coordinate representations when models interpret visual location cues.

Experimental Process

1. Image Embedding Generation
   Input images are encoded by the CLIP vision encoder → img_emb
2. Baseline Classification (Label Text)
   Generate prompts from the metadata address field
   Example: "A photo taken at Disneyland Dr in Los Angeles"
   The text encoder produces label_emb
   Calculate Top-1 accuracy via image-text cosine similarity → acc_baseline_label_text
3. Address Hybrid
   Convert the actual address strings directly into prompts
   Example: "A photo taken at 1313 Disneyland Dr, Los Angeles"
   Calculate address similarity (sim_addr)
   Add it to the existing label score (S_label) with a small weight (α = 0.05)
   Measure accuracy with the new hybrid score → acc_hybrid_address
4. GPS Hybrid
   Convert latitude/longitude into text
   Example: "A photo taken at latitude 34.1 and longitude 118.3"
   Apply the same correction with the GPS similarity (sim_gps) → acc_hybrid_gps (a code sketch of steps 1-4 follows this list)
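A code sketch of steps 1-4, assuming the transformers CLIP API and one address / coordinate pair per image; the prompt templates follow the examples above, while all other names are illustrative.

```python
# Sketch: baseline label-text classification plus address / GPS hybrid scoring.
import torch

@torch.no_grad()
def address_vs_gps(model, processor, images, streets, addresses, gps, true_idx, device="cpu", alpha=0.05):
    img = processor(images=images, return_tensors="pt").to(device)
    img_emb = model.get_image_features(**img)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    def embed(texts):
        t = processor(text=texts, return_tensors="pt", padding=True).to(device)
        e = model.get_text_features(**t)
        return e / e.norm(dim=-1, keepdim=True)

    # Step 2: label prompts -> class score matrix S_label
    label_emb = embed([f"A photo taken at {s} in Los Angeles" for s in streets])
    S_label = img_emb @ label_emb.T

    # Steps 3-4: per-image similarity to that image's address / GPS prompt
    addr_emb = embed([f"A photo taken at {a}" for a in addresses])
    gps_emb = embed([f"A photo taken at latitude {lat} and longitude {lon}" for lat, lon in gps])
    sim_addr = (img_emb * addr_emb).sum(-1)
    sim_gps = (img_emb * gps_emb).sum(-1)

    labels = torch.tensor(true_idx, device=img_emb.device)
    acc = lambda S: (S.argmax(-1) == labels).float().mean().item()
    return {
        "acc_baseline_label_text": acc(S_label),
        "acc_hybrid_address": acc(S_label + alpha * sim_addr[:, None]),
        "acc_hybrid_gps": acc(S_label + alpha * sim_gps[:, None]),
        "mean_sim_address_text": sim_addr.mean().item(),
        "mean_sim_gps_text": sim_gps.mean().item(),
    }
```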

Result

model                                 | level   | acc_baseline_label_text | acc_hybrid_address | acc_hybrid_gps | mean_sim_address_text | mean_sim_gps_text
openai/clip-vit-base-patch32          | address | 0.400000                | 0.400000           | 0.400000       | 0.225313              | 0.076969
laion/CLIP-ViT-H-14-laion2B-s32B-b79K | address | 0.466667                | 0.466667           | 0.466667       | 0.228383              | 0.137521

Accuracy Analysis

All models show identical accuracy across the baseline, address hybrid, and GPS hybrid conditions because the hybrid score (S_label + α · sim_x[:, None]) adds the same per-image constant to every class score; the argmax over classes, and therefore Top-1 accuracy, cannot change.
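A tiny numeric check of this argument (illustrative numbers only).

```python
# Adding the same per-image constant to every class score cannot change the argmax.
import torch

S_label = torch.tensor([[0.21, 0.25, 0.23]])     # one image, three candidate classes
sim_addr = torch.tensor([0.22])                  # that image's address similarity
alpha = 0.05

S_hybrid = S_label + alpha * sim_addr[:, None]   # every class score shifted by 0.011
assert torch.equal(S_label.argmax(-1), S_hybrid.argmax(-1))  # ranking, hence Top-1, unchanged
```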
Performance Comparison:
OpenAI CLIP Base: Top-1 accuracy ≈ 0.40
LAION ViT-H: Top-1 accuracy ≈ 0.47
LAION model, trained on broader data, shows slightly better performance in location-related expression discrimination.

Average Visual-Text Similarity Analysis

Consistent pattern across both models: mean_sim_address_text > mean_sim_gps_text
OpenAI CLIP: 0.225 vs 0.077
LAION CLIP: 0.228 vs 0.138
Key Insight: Address expressions create meaningful matches in CLIP's natural language-visual representation space, while GPS coordinates (numerical sequences) have minimal visual semantic meaning, resulting in very low similarity scores.
Conclusion: CLIP responds to natural language expressions like "1313 Disneyland Dr in Anaheim" but virtually cannot interpret numerical coordinates like "lat 33.81, lon 117.91".

Discussion

Can embedding accuracy be determined down to the street level?
Current results suggest significant limitations in fine-grained geographic discrimination.
Is architectural modification and retraining necessary?
The low similarity scores and poor discrimination suggest that standard CLIP architecture may be insufficient for precise geolocation tasks.
What happens when a harder dataset is used?
In yesterday's discussion, Professor Abdullah said he would provide much harder data.

Constraints

When a map region is itself the landmark (e.g., Beverly Hills), any coordinate within that region was counted as correct, which may have inflated the accuracy metrics.

Future Work

1. GeoCLIP Implementation
   Planned: Implement the GeoCLIP architecture for improved geographic understanding.
2. Experiment 3: Similar Coordinate Problem Analysis
   Objective: Investigate hallucination phenomena when similar coordinates exist, particularly for landmarks with similar visual characteristics.
3. Experiment 4: Negation Handling Assessment
   Background: CLIP struggles to handle negation effectively, which causes classification problems for negated assertions.
   Connection to GeoShield: This experiment may connect to GeoShield components for robust geographic verification.
   Reference: Junsung Park et al., "Know 'No' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP" (ICCV 2025)
   Example Test Cases (a probe sketch follows this list):
   "This is Disneyland in the US, not Shanghai."
   "This is Disneyland in the US, not Tokyo."

Appendix

1. Metadata
2. Dataset
   Images