Geolocation - Week 6
This Week's Objectives:
• Experimental Data Construction → LA Regional Landmarks
• Experiment 1: Geolocation Accuracy Assessment
• Experiment 2: Address vs GPS Coordinate Comparison
Abstract
This week, I investigated the geographic reasoning capability of pretrained vision-language models (VLMs) by assessing their ability to classify visual scenes at varying geographic granularities: continent, country, region, city, and street. Using a dataset of 29 tourist attractions in Los Angeles, I constructed a structured metadata pipeline through GPT-based hierarchical labeling, Google Maps cross-validation, and GPS annotation. Two CLIP variants, openai/clip-vit-base-patch32 and laion/CLIP-ViT-H-14-laion2B-s32B-b79K, were compared to evaluate geospatial sensitivity. Results show moderate performance at street-level classification (Top-1 = 0.44, Top-5 = 0.72), revealing that CLIP captures the general "Los Angeles street" context but struggles with fine-grained distinctions among visually similar urban scenes. Cosine similarity distributions for correct and incorrect predictions were nearly overlapping (mean Δ ≈ 0.002), indicating weak discriminability at high resolution. Street-name variant analysis further showed small embedding gaps (~0.02-0.05) between same and different street names, suggesting linguistic dilution of location-specific semantics. Additionally, experiments comparing textual addresses and GPS coordinates demonstrated that CLIP models align better with natural-language address expressions (mean similarity ≈ 0.22) than with numerical coordinate strings (≈ 0.08-0.13). These findings highlight CLIP's limitations in fine-grained geolocation and underscore the need for geospatially aware model architectures such as GeoCLIP. Future work includes exploring coordinate-ambiguity effects and evaluating negation-aware location reasoning for robust geographic verification.
• CLIP shows limited ability to distinguish fine-grained urban street locations, often confusing visually similar scenes.
• Natural language address expressions align much better with CLIP's vision-language space than numerical GPS coordinates.
• These findings highlight the need for geospatially aware architectures such as GeoCLIP for accurate geographic reasoning.
Overview of this week’s work
Experiment Setup
Data Processing Pipeline:
1. GPT-based labeling for Continent / Country / Region / City / Street hierarchies
2. Cross-validation with Google Maps information
3. GPS coordinate input from Google Maps to complete the metadata (an example record is sketched below)
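For concreteness, a minimal sketch of one completed metadata record, assuming a flat JSON-style schema; the field names and example values are illustrative, not the exact schema used this week:

```python
# Hypothetical example of one record after the three pipeline steps.
# Field names and values are illustrative assumptions, not the exact schema.
record = {
    "name": "Griffith Observatory",
    "continent": "North America",   # step 1: GPT-based hierarchical labels
    "country": "United States",
    "region": "California",
    "city": "Los Angeles",
    "street": "E Observatory Rd",   # step 2: cross-checked against Google Maps
    "address": "2800 E Observatory Rd, Los Angeles, CA 90027",
    "lat": 34.1184,                 # step 3: GPS coordinates from Google Maps
    "lon": -118.3004,
}
```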
Comparison Models
Comparison between OpenAI's original CLIP model and a model trained on a larger dataset (2B image-text pairs, reportedly ~5x more than OpenAI's training set); a loading sketch follows this list:
• openai/clip-vit-base-patch32
• laion/CLIP-ViT-H-14-laion2B-s32B-b79K
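Both checkpoints are published on the Hugging Face Hub; a minimal loading sketch, assuming the transformers CLIP interface was used (variable names are mine):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# The two checkpoints compared this week (IDs as listed above).
MODEL_IDS = [
    "openai/clip-vit-base-patch32",
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
]

device = "cuda" if torch.cuda.is_available() else "cpu"

models = {}
for model_id in MODEL_IDS:
    # Pair each model with its matching processor (tokenizer + image transform).
    models[model_id] = (
        CLIPModel.from_pretrained(model_id).to(device).eval(),
        CLIPProcessor.from_pretrained(model_id),
    )
```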
Experiment 1: Geographic Hierarchy Classification
• Objective: Following the methodology of Weyand et al., "PlaNet: Photo Geolocation with Convolutional Neural Networks" (ECCV 2016), assess the accuracy of existing pretrained open-source CLIP models across geographic hierarchies: Continent / Country / Region / City / Street.
Top-1 / Top-5 Accuracy
Metric | Value |
acc_top1 | 0.44 |
acc_top5 | 0.72 |
Interpretation:
• 44% of images had the target street ranked first (highest similarity match)
• 72% had the correct answer within the top-5 candidates
• Analysis: While not perfectly distinguished, CLIP demonstrates partial ability to detect visually related street names. The model appears to capture the overall "LA street atmosphere" but struggles with fine-grained distinctions. (A computation sketch follows this list.)
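A minimal sketch of how these Top-1 / Top-5 numbers can be computed from image-text cosine similarities; the prompt template is the one discussed in the next subsection, and the function and variable names are my own assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def street_topk_accuracy(model, processor, images, true_idx, streets, device, k=5):
    """Top-1 / Top-k street classification via image-text cosine similarity.
    images: list of PIL images; true_idx: ground-truth street index per image;
    streets: candidate street names (the label set)."""
    prompts = [f"A street photo taken on {s} in Los Angeles" for s in streets]

    txt = processor(text=prompts, return_tensors="pt", padding=True).to(device)
    img = processor(images=images, return_tensors="pt").to(device)

    txt_emb = F.normalize(model.get_text_features(**txt), dim=-1)   # [C, D]
    img_emb = F.normalize(model.get_image_features(**img), dim=-1)  # [N, D]

    sims = img_emb @ txt_emb.T                       # cosine similarities, [N, C]
    ranks = sims.argsort(dim=-1, descending=True)    # per-image ranking of streets

    target = torch.tensor(true_idx, device=device)
    top1 = (ranks[:, 0] == target).float().mean().item()
    topk = (ranks[:, :k] == target[:, None]).any(dim=-1).float().mean().item()
    return top1, topk
```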
Cosine Similarity Distribution Analysis (Correct vs Incorrect)
Category | mean | median(p50) | p90 | max |
Correct | 0.23 | 0.2388 | 0.2632 | 0.2686 |
Incorrect | 0.235 | 0.2439 | 0.2615 | 0.2720 |
Key Findings:
• Average similarity difference ≈ 0.002 → nearly overlapping distributions
• CLIP struggles to distinguish between the "correct street" and "visually similar other streets"
• Even at the p90 threshold, incorrect answers (0.261) are only marginally lower than correct ones (0.263)
This explains why Top-5 performance significantly outperforms Top-1: the correct answer is typically among the top candidates but often fails to achieve the highest ranking. (A sketch of the statistic computation follows.)
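One way the correct-vs-incorrect statistics above could be derived from the same [N, C] similarity matrix; note this pools all incorrect image-street pairs, whereas the actual analysis may use only the top-ranked incorrect candidate:

```python
import numpy as np

def similarity_stats(sims, target):
    """Split cosine similarities into correct vs incorrect image-street pairs
    and summarize each side as mean / p50 / p90 / max.
    sims: [N, C] numpy array; target: length-N array of correct street indices."""
    n = sims.shape[0]
    mask = np.zeros_like(sims, dtype=bool)
    mask[np.arange(n), target] = True   # True only at each image's correct street

    def summarize(values):
        return {
            "mean": values.mean(),
            "p50": np.percentile(values, 50),
            "p90": np.percentile(values, 90),
            "max": values.max(),
        }

    return {"correct": summarize(sims[mask]), "incorrect": summarize(sims[~mask])}
```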
Similarity Value Scale Analysis
• Maximum similarity: ~0.27
• Minimum similarity: 0.17-0.18
• Overall scale: relatively low compared to typical CLIP performance
Reasoning:
• CLIP typically achieves 0.3-0.4+ cosine similarity for strong matches
• Current low values attributed to:
◦ Simple prompt structure: "A street photo taken on {street} in Los Angeles"
◦ Visual similarity across urban street scenes (roads, signs, sky, etc.)
Conclusion: "Difficult to distinguish rather than inaccurate"
Street Name Variant Analysis
Objective: Analyze how consistently CLIP recognizes various representations of the same street name (abbreviations, full names, suffix removal, etc.) within the embedding space.
Methodology:
• Calculate average and maximum cosine similarity between variants of the same street (same_mean, same_max)
• Compare with similarities across different streets (cross_mean, cross_max); a computation sketch follows this subsection
Results:
• same_mean slightly higher than cross_mean (difference ~0.02-0.05)
• CLIP maintains some consistency for visual variants within the same street (lighting, angles, pedestrian density)
• The model tends to recognize different visual representations of the same street as the "same location category"
Limitations: The small difference suggests CLIP insufficiently distinguishes fine details between different urban streets, leading to confusion in environments with similar landmark structures or background textures.
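A sketch of the same_mean / same_max vs cross_mean / cross_max computation, written generically over L2-normalized embeddings grouped by street; it applies equally to embeddings of textual name variants or of image variants, and the grouping and names are assumptions:

```python
import itertools
import numpy as np

def variant_consistency(emb_by_street):
    """emb_by_street: dict mapping street name -> [V, D] array of L2-normalized
    embeddings of that street's variants. Returns within-street (same_*) and
    across-street (cross_*) cosine-similarity statistics."""
    same, cross = [], []
    streets = list(emb_by_street)

    for s in streets:
        e = emb_by_street[s]
        for i, j in itertools.combinations(range(len(e)), 2):
            same.append(float(e[i] @ e[j]))        # variants of the same street

    for a, b in itertools.combinations(streets, 2):
        sims = emb_by_street[a] @ emb_by_street[b].T
        cross.extend(np.asarray(sims).ravel().tolist())  # different streets

    return {
        "same_mean": float(np.mean(same)), "same_max": float(np.max(same)),
        "cross_mean": float(np.mean(cross)), "cross_max": float(np.max(cross)),
    }
```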
Embedding Vector Analysis
Finding: The diagonal (y = x) represents cases where "same street variants" and "different streets" have equal similarity → CLIP fails to distinguish between street names.
Problem Identification: The model encodes Wilshire Blvd, Hollywood Blvd, Ocean Front Walk, etc., with nearly identical sentence vectors, prioritizing 'Los Angeles street' context over linguistic differences between street names.
→ The linguistic distinction between specific street names is weak; instead, the model strongly reflects the broader contextual concept of “Los Angeles street.”
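One quick way to probe this claim is to compare how similar the text embeddings of different street prompts are with and without the shared template; if the templated value is much higher than the bare-name value, the "Los Angeles street" context is dominating the street-name token. The function name and template below are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prompt_confusability(model, processor, streets, device,
                         template="A street photo taken on {} in Los Angeles"):
    """Mean pairwise cosine similarity between text embeddings of *different*
    streets, with and without the shared prompt template."""
    def mean_offdiag(prompts):
        tok = processor(text=prompts, return_tensors="pt", padding=True).to(device)
        emb = F.normalize(model.get_text_features(**tok), dim=-1)
        sims = emb @ emb.T
        n = len(prompts)
        return ((sims.sum() - sims.trace()) / (n * (n - 1))).item()

    return {
        "templated": mean_offdiag([template.format(s) for s in streets]),
        "bare_name": mean_offdiag(list(streets)),
    }
```

For example, prompt_confusability(model, processor, ["Wilshire Blvd", "Hollywood Blvd", "Ocean Front Walk"], device) would report how interchangeable those three prompts look to the text encoder.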
Model Comparison Results
Comparison on the 29 data points:
• CLIP: 0.467 street-level accuracy
• Fine-tuned model (randomly chosen open-source checkpoint): 0.433 accuracy
Conclusion: As expected, performance varies with training data composition, though the gap on this small sample (one image out of 29) is modest.
Experiment 2: Address vs GPS Coordinate Effectiveness
Objective
Evaluate whether human-readable address formats are more effective than numerical coordinate representations when models interpret visual location cues.
Experimental Process
1. Image Embedding Generation
• Input images are encoded through the CLIP vision encoder → img_emb
2. Baseline Classification (Label Text)
• Generate prompts from the metadata address field
• Example: "A photo taken at Disneyland Dr in Los Angeles"
• The text encoder creates label_emb
• Calculate Top-1 accuracy via image-text cosine similarity → acc_baseline_label_text
3. Address Hybrid
• Convert actual address strings directly to prompts
• Example: "A photo taken at 1313 Disneyland Dr, Los Angeles"
• Calculate address similarity (sim_addr)
• Add it to the existing label score (S_label) with a small weight (α = 0.05), as sketched after this list
• Measure accuracy with the new hybrid score → acc_hybrid_address
4. GPS Hybrid
• Convert latitude/longitude to text format
• Example: "A photo taken at latitude 34.1 and longitude 118.3"
• Apply the same correction method with GPS similarity (sim_gps) → acc_hybrid_gps
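A minimal sketch of steps 2-4, assuming the prompt templates shown above and per-image scalar similarities; variable and field names are my own:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def hybrid_scores(model, processor, img_emb, metadata, device, alpha=0.05):
    """img_emb: [N, D] L2-normalized CLIP image embeddings.
    metadata: one dict per image with 'street', 'address', 'lat', 'lon'
    (field names are assumptions). Returns the three [N, N] score matrices."""
    def text_emb(prompts):
        tok = processor(text=prompts, return_tensors="pt", padding=True).to(device)
        return F.normalize(model.get_text_features(**tok), dim=-1)

    # Step 2: baseline label prompts built from the metadata address field
    label_emb = text_emb([f"A photo taken at {m['street']} in Los Angeles" for m in metadata])
    S_label = img_emb @ label_emb.T                 # image x label scores

    # Step 3: per-image similarity to its own full address string
    addr_emb = text_emb([f"A photo taken at {m['address']}" for m in metadata])
    sim_addr = (img_emb * addr_emb).sum(dim=-1)     # [N]

    # Step 4: per-image similarity to a textualized GPS coordinate
    gps_emb = text_emb([f"A photo taken at latitude {m['lat']} and longitude {m['lon']}"
                        for m in metadata])
    sim_gps = (img_emb * gps_emb).sum(dim=-1)       # [N]

    return {
        "baseline": S_label,
        "hybrid_address": S_label + alpha * sim_addr[:, None],
        "hybrid_gps": S_label + alpha * sim_gps[:, None],
    }
```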
Result
model | level | acc_baseline_label_text | acc_hybrid_address | acc_hybrid_gps | mean_sim_address_text | mean_sim_gps_text |
openai/clip-vit-base-patch32 | address | 0.400000 | 0.400000 | 0.400000 | 0.225313 | 0.076969 |
laion/CLIP-ViT-H-14-laion2B-s32B-b79K | address | 0.466667 | 0.466667 | 0.466667 | 0.228383 | 0.137521 |
Accuracy Analysis
All models show identical accuracy across the baseline, address-hybrid, and GPS-hybrid conditions because the hybrid calculation, S_hybrid = S_label + α · sim_x[:, None], adds the same scalar to every class score of a given image; the per-image argmax therefore never changes, and Top-1 accuracy stays identical.
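A tiny numerical check of this point (the 29 x 29 shape mirrors the dataset; values are random):

```python
import numpy as np

rng = np.random.default_rng(0)
S_label = rng.normal(size=(29, 29))         # per-image class scores
sim_x = rng.normal(size=29)                 # one address/GPS similarity per image

S_hybrid = S_label + 0.05 * sim_x[:, None]  # same offset added to all classes of an image

# The per-image argmax never moves, so Top-1 accuracy cannot change.
assert (S_label.argmax(axis=1) == S_hybrid.argmax(axis=1)).all()
```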
Performance Comparison:
• OpenAI CLIP Base: Top-1 accuracy ≈ 0.40
• LAION ViT-H: Top-1 accuracy ≈ 0.47
The LAION model, trained on broader data, shows slightly better discrimination of location-related expressions.
Average Visual-Text Similarity Analysis
Consistent pattern across both models: mean_sim_address_text > mean_sim_gps_text
• OpenAI CLIP: 0.225 vs 0.077
• LAION CLIP: 0.228 vs 0.138
Key Insight: Address expressions create meaningful matches in CLIP's natural language-visual representation space, while GPS coordinates (numerical sequences) have minimal visual semantic meaning, resulting in very low similarity scores.
Conclusion: CLIP responds to natural language expressions like "1313 Disneyland Dr in Anaheim" but virtually cannot interpret numerical coordinates like "lat 33.81, lon 117.91".
Discussion
• Can embedding accuracy be determined down to the street level?
Current results suggest significant limitations in fine-grained geographic discrimination.
• Is architectural modification and retraining necessary?
The low similarity scores and poor discrimination suggest that the standard CLIP architecture may be insufficient for precise geolocation tasks.
• What happens when a harder dataset is provided?
◦ In yesterday's discussion, Professor Abdullah said he will provide a much harder dataset.
Constraints
• When map regions themselves are landmarks (e.g., Beverly Hills), any coordinate within the entire area was considered correct, which may have inflated accuracy metrics.
Future Work
1. GeoCLIP Implementation
Planned: Implement the GeoCLIP architecture for improved geographic understanding (a usage sketch follows this list).
2. Experiment 3: Similar Coordinate Problem Analysis
Objective: Investigate hallucination phenomena when similar coordinates exist, particularly for landmarks with similar visual characteristics.
3. Experiment 4: Negation Handling Assessment
Background: "CLIP struggles to handle negation effectively"; classification problems arise with negative assertions.
Connection to GeoShield: This experiment may connect to GeoShield components for robust geographic verification.
Reference: Junsung Park et al., "Know 'No' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP" (ICCV 2025)
Example Test Cases (a scoring sketch follows this list):
• "This is Disneyland in the US, not Shanghai."
• "This is Disneyland in the US, not Tokyo."
Appendix
1. Metadata
2. Dataset
Images