Changyu Lee

[Geolocation] - Week 6 - Evaluating Geographic Reasoning Capabilities of CLIP

Published at: 2025/10/10
Last edited time: 2025/11/19 02:58
Created: 2025/10/10 15:56
Section: Research
Status: Done
Series: Geolocation
Tags: Research
AI summary:
Keywords: GeoLocation, CLIP
Language: ENG
Week: 6

Geolocation - Week 6

This Week's Objectives:
Experimental Data Construction → LA Regional Landmarks
Experiment 1: Geolocation Accuracy Assessment
Experiment 2: Address vs GPS Coordinate Comparison

Abstract

This week, I investigated the geographic reasoning capability of pretrained vision-language models (VLMs) by assessing their ability to classify visual scenes at varying geographic granularities: continent, country, region, city, and street. Using a dataset of 29 tourist attractions in Los Angeles, I constructed a structured metadata pipeline through GPT-based hierarchical labeling, Google Maps cross-validation, and GPS annotation. Two CLIP variants, openai/clip-vit-base-patch32 and laion/CLIP-ViT-H-14-laion2B-s32B-b79K, were compared to evaluate geospatial sensitivity. Results show moderate performance at street-level classification (Top-1 = 0.44, Top-5 = 0.72), revealing that CLIP captures the general "Los Angeles street" context but struggles with fine-grained distinctions among visually similar urban scenes. Cosine similarity distributions between correct and incorrect predictions were nearly overlapping (mean Δ ≈ 0.002), indicating weak discriminability at high resolution. Street name variant analysis further showed small embedding gaps (~0.02-0.05) between same and different street names, suggesting linguistic dilution of location-specific semantics. Additionally, experiments comparing textual addresses and GPS coordinates demonstrated that CLIP models align better with natural language address expressions (mean similarity ≈ 0.22) than with numerical coordinate strings (≈ 0.08-0.13). These findings highlight CLIP's limitations in fine-grained geolocation and underscore the need for geospatially aware model architectures such as GeoCLIP. Future work includes exploring coordinate ambiguity effects and evaluating negation-aware location reasoning for robust geographic verification.
CLIP shows limited ability to distinguish fine-grained urban street locations, often confusing visually similar scenes.
Natural language address expressions align much better with CLIP’s vision–language space than numerical GPS coordinates.
These findings highlight the need for geospatially aware architectures such as GeoCLIP for accurate geographic reasoning.
Overview of this week’s work

Experiment Setup

Data Processing Pipeline:

1. GPT-based labeling for Continent / Country / Region / City / Street hierarchies
2. Cross-validation with Google Maps information
3. GPS coordinate input from Google Maps to complete metadata (an illustrative record is sketched below)
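For reference, a metadata record under this pipeline might look as follows; the field names and example values are hypothetical, not the project's actual schema.

```python
# Illustrative metadata record produced by the pipeline above.
# Field names and example values are hypothetical, not the actual schema.
example_record = {
    "landmark": "Griffith Observatory",
    "continent": "North America",
    "country": "United States",
    "region": "California",
    "city": "Los Angeles",
    "street": "E Observatory Rd",                               # GPT-labeled, cross-checked on Google Maps
    "address": "2800 E Observatory Rd, Los Angeles, CA 90027",
    "lat": 34.1184,                                             # GPS coordinates copied from Google Maps
    "lon": -118.3004,
}
```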

Comparison Models

Comparison between OpenAI's original CLIP model and a model trained on a larger dataset (LAION-2B, ~2B image-text pairs, reportedly ~5x more than OpenAI's training set):
openai/clip-vit-base-patch32
laion/CLIP-ViT-H-14-laion2B-s32B-b79K
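A minimal loading sketch, assuming the Hugging Face transformers implementations of both checkpoints; everything besides the checkpoint IDs is illustrative.

```python
# Minimal sketch: load the two CLIP checkpoints compared this week.
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_IDS = [
    "openai/clip-vit-base-patch32",
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
]

def load_clip(model_id, device="cuda" if torch.cuda.is_available() else "cpu"):
    """Return an eval-mode CLIP model and its paired processor."""
    model = CLIPModel.from_pretrained(model_id).to(device).eval()
    processor = CLIPProcessor.from_pretrained(model_id)
    return model, processor

clip_models = {mid: load_clip(mid) for mid in MODEL_IDS}
```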

Experiment 1: Geographic Hierarchy Classification

Objective: Following the methodology of Weyand et al., "PlaNet: Photo Geolocation with Convolutional Neural Networks" (ECCV 2016), assess the accuracy of existing pretrained open-source CLIP models across geographic hierarchies: Continent / Country / Region / City / Street.

Top-1 / Top-5 Accuracy

Metric   | Value
acc_top1 | 0.44
acc_top5 | 0.72
Interpretation:
For 44% of images, the target street was the highest-similarity match (rank 1).
For 72% of images, the correct street appeared within the top-5 candidates.
Analysis: While the separation is far from perfect, CLIP demonstrates partial ability to associate images with the correct or visually related street names. The model appears to capture the overall "LA street atmosphere" but struggles with fine-grained distinctions.
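A minimal sketch of how these Top-1/Top-5 numbers can be computed, assuming the transformers CLIP API, one prompt per candidate street, and PIL images; the variable names are illustrative, not the exact evaluation script.

```python
# Sketch: street-level Top-1 / Top-5 accuracy from image-text cosine similarity.
import torch

@torch.no_grad()
def street_topk_accuracy(model, processor, images, true_idx, streets, device="cpu", k=5):
    """images: list of PIL images; true_idx: index of each image's street in `streets`."""
    prompts = [f"A street photo taken on {s} in Los Angeles" for s in streets]
    txt = processor(text=prompts, return_tensors="pt", padding=True).to(device)
    text_emb = model.get_text_features(**txt)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    img = processor(images=images, return_tensors="pt").to(device)
    img_emb = model.get_image_features(**img)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    sims = img_emb @ text_emb.T                    # (n_images, n_streets) cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)  # best-matching street first
    labels = torch.tensor(true_idx, device=sims.device)
    acc_top1 = (ranks[:, 0] == labels).float().mean().item()
    acc_topk = (ranks[:, :k] == labels[:, None]).any(dim=-1).float().mean().item()
    return acc_top1, acc_topk, sims
```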

Cosine Similarity Distribution Analysis (Correct vs Incorrect)

Category   | mean  | median (p50) | p90    | max
correct    | 0.23  | 0.2388       | 0.2632 | 0.2686
wrong-best | 0.235 | 0.2439       | 0.2615 | 0.2720
Key Findings:
Average similarity difference ≈ 0.002 → Nearly overlapping distributions
CLIP struggles to distinguish between the "correct street" and "visually similar other streets"
Even at p90 threshold, incorrect answers (0.261) are only marginally lower than correct ones (0.263)
This explains why Top-5 performance significantly outperforms Top-1: The correct answer is typically among the top candidates but often fails to achieve the highest ranking.
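A sketch of how these correct vs. wrong-best statistics can be derived, assuming the similarity matrix returned by the previous sketch and a tensor of true street indices.

```python
# Sketch: compare similarity to the true street against the best-scoring wrong street.
import torch

def correct_vs_wrong_best(sims, labels):
    """sims: (n_images, n_streets) cosine similarities; labels: tensor of true street indices."""
    idx = torch.arange(len(labels), device=labels.device)
    correct = sims[idx, labels]                    # similarity to the ground-truth street
    masked = sims.clone()
    masked[idx, labels] = float("-inf")            # exclude the true street
    wrong_best = masked.max(dim=-1).values         # highest-scoring incorrect street

    def stats(x):
        return {"mean": x.mean().item(),
                "p50": x.median().item(),
                "p90": torch.quantile(x, 0.90).item(),
                "max": x.max().item()}

    return {"correct": stats(correct), "wrong_best": stats(wrong_best)}
```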

Similarity Value Scale Analysis

Maximum similarity: ~0.27
Minimum similarity: 0.17-0.18
Overall scale: Relatively low compared to typical CLIP performance
Reasoning:
CLIP typically reaches cosine similarities of 0.3-0.4+ for strong image-text matches.
The current lower values are attributed to:
Simple prompt structure: "A street photo taken on {street} in Los Angeles"
High visual similarity across urban street scenes (roads, signs, sky, etc.)
Conclusion: the results are better described as "difficult to distinguish" than as "inaccurate."

Street Name Variant Analysis

Objective: Analyze how consistently CLIP recognizes various representations of the same street name (abbreviations, full names, suffix removal, etc.) within the embedding space.
Methodology:
Calculate average and maximum cosine similarity between variants of the same street (same_mean, same_max)
Compare with similarities across different streets (cross_mean, cross_max)
Results:
same_mean slightly higher than cross_mean (difference ~0.02-0.05)
CLIP maintains some consistency for visual variants within the same street (lighting, angles, pedestrian density)
Model tends to recognize different visual representations of the same street as "same location category"
Limitations: The small difference suggests CLIP insufficiently distinguishes fine details between different urban streets, leading to confusion in environments with similar landmark structures or background textures.
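A sketch of this variant analysis using text-only CLIP embeddings of street-name variants; the variant lists shown in the comment are illustrative.

```python
# Sketch: same-street vs. cross-street cosine similarity of street-name variants.
import torch

@torch.no_grad()
def variant_similarity(model, processor, variants_by_street, device="cpu"):
    # e.g. {"Wilshire Blvd": ["Wilshire Blvd", "Wilshire Boulevard", "Wilshire"], ...}
    names, groups = [], []
    for street, variants in variants_by_street.items():
        names.extend(variants)
        groups.extend([street] * len(variants))

    txt = processor(text=names, return_tensors="pt", padding=True).to(device)
    emb = model.get_text_features(**txt)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    sims = emb @ emb.T

    same, cross = [], []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            (same if groups[i] == groups[j] else cross).append(sims[i, j].item())

    same_t, cross_t = torch.tensor(same), torch.tensor(cross)
    return {"same_mean": same_t.mean().item(), "same_max": same_t.max().item(),
            "cross_mean": cross_t.mean().item(), "cross_max": cross_t.max().item()}
```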

Embedding Vector Analysis

Finding: The diagonal (y = x) represents cases where "same street variants" and "different streets" have equal similarity → CLIP fails to distinguish between street names.
Problem Identification: The model encodes Wilshire Blvd, Hollywood Blvd, Ocean Front Walk, etc., with nearly identical sentence vectors, prioritizing 'Los Angeles street' context over linguistic differences between street names.
→ The linguistic distinction between specific street names is weak; instead, the model strongly reflects the broader contextual concept of “Los Angeles street.”

Model Comparison Results

Comparison over the 29 data points:
CLIP: 0.467 street-level accuracy
Fine-tuned model (a randomly chosen open-source checkpoint): 0.433 accuracy
Conclusion: As expected, performance varies with the composition of the training data.

Experiment 2: Address vs GPS Coordinate Effectiveness

Objective

Evaluate whether human-readable address formats are more effective than numerical coordinate representations when models interpret visual location cues.

Experimental Process

1. Image Embedding Generation
   Input images are encoded by the CLIP vision encoder → img_emb
2. Baseline Classification (Label Text)
   Generate prompts from the metadata address field
   Example: "A photo taken at Disneyland Dr in Los Angeles"
   The text encoder produces label_emb
   Calculate Top-1 accuracy via image-text cosine similarity → acc_baseline_label_text
3. Address Hybrid
   Convert the actual address strings directly into prompts
   Example: "A photo taken at 1313 Disneyland Dr, Los Angeles"
   Calculate address similarity (sim_addr)
   Add it to the existing label score (S_label) with a small weight (α = 0.05)
   Measure accuracy with the new hybrid score → acc_hybrid_address
4. GPS Hybrid
   Convert latitude/longitude into text
   Example: "A photo taken at latitude 34.1 and longitude 118.3"
   Apply the same correction with the GPS similarity (sim_gps) → acc_hybrid_gps (a code sketch of steps 1-4 follows this list)
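A code sketch of steps 1-4, assuming the transformers CLIP API and one address / coordinate pair per image; the prompt templates follow the examples above, while all other names are illustrative.

```python
# Sketch: baseline label-text classification plus address / GPS hybrid scoring.
import torch

@torch.no_grad()
def address_vs_gps(model, processor, images, streets, addresses, gps, true_idx, device="cpu", alpha=0.05):
    img = processor(images=images, return_tensors="pt").to(device)
    img_emb = model.get_image_features(**img)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    def embed(texts):
        t = processor(text=texts, return_tensors="pt", padding=True).to(device)
        e = model.get_text_features(**t)
        return e / e.norm(dim=-1, keepdim=True)

    # Step 2: label prompts -> class score matrix S_label
    label_emb = embed([f"A photo taken at {s} in Los Angeles" for s in streets])
    S_label = img_emb @ label_emb.T

    # Steps 3-4: per-image similarity to that image's address / GPS prompt
    addr_emb = embed([f"A photo taken at {a}" for a in addresses])
    gps_emb = embed([f"A photo taken at latitude {lat} and longitude {lon}" for lat, lon in gps])
    sim_addr = (img_emb * addr_emb).sum(-1)
    sim_gps = (img_emb * gps_emb).sum(-1)

    labels = torch.tensor(true_idx, device=img_emb.device)
    acc = lambda S: (S.argmax(-1) == labels).float().mean().item()
    return {
        "acc_baseline_label_text": acc(S_label),
        "acc_hybrid_address": acc(S_label + alpha * sim_addr[:, None]),
        "acc_hybrid_gps": acc(S_label + alpha * sim_gps[:, None]),
        "mean_sim_address_text": sim_addr.mean().item(),
        "mean_sim_gps_text": sim_gps.mean().item(),
    }
```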

Result

model                                 | level   | acc_baseline_label_text | acc_hybrid_address | acc_hybrid_gps | mean_sim_address_text | mean_sim_gps_text
openai/clip-vit-base-patch32          | address | 0.400000                | 0.400000           | 0.400000       | 0.225313              | 0.076969
laion/CLIP-ViT-H-14-laion2B-s32B-b79K | address | 0.466667                | 0.466667           | 0.466667       | 0.228383              | 0.137521

Accuracy Analysis

All models show identical accuracy across the baseline, address hybrid, and GPS hybrid conditions because the hybrid score (S_label + α · sim_x[:, None]) adds the same per-image constant to every class score; the argmax over classes, and therefore Top-1 accuracy, cannot change.
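A tiny numeric check of this argument (illustrative numbers only).

```python
# Adding the same per-image constant to every class score cannot change the argmax.
import torch

S_label = torch.tensor([[0.21, 0.25, 0.23]])     # one image, three candidate classes
sim_addr = torch.tensor([0.22])                  # that image's address similarity
alpha = 0.05

S_hybrid = S_label + alpha * sim_addr[:, None]   # every class score shifted by 0.011
assert torch.equal(S_label.argmax(-1), S_hybrid.argmax(-1))  # ranking, hence Top-1, unchanged
```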
Performance Comparison:
OpenAI CLIP Base: Top-1 accuracy ≈ 0.40
LAION ViT-H: Top-1 accuracy ≈ 0.47
LAION model, trained on broader data, shows slightly better performance in location-related expression discrimination.

Average Visual-Text Similarity Analysis

Consistent pattern across both models: mean_sim_address_text > mean_sim_gps_text
OpenAI CLIP: 0.225 vs 0.077
LAION CLIP: 0.228 vs 0.138
Key Insight: Address expressions create meaningful matches in CLIP's natural language-visual representation space, while GPS coordinates (numerical sequences) have minimal visual semantic meaning, resulting in very low similarity scores.
Conclusion: CLIP responds to natural language expressions like "1313 Disneyland Dr in Anaheim" but virtually cannot interpret numerical coordinates like "lat 33.81, lon 117.91".

Discussion

Can embedding accuracy be determined down to the street level?
Current results suggest significant limitations in fine-grained geographic discrimination.
Is architectural modification and retraining necessary?
The low similarity scores and poor discrimination suggest that standard CLIP architecture may be insufficient for precise geolocation tasks.
What happens when a harder dataset is used?
In yesterday's discussion, Professor Abdullah said he would provide much harder data.

Constraints

When a map region is itself the landmark (e.g., Beverly Hills), any coordinate within that region was counted as correct, which may have inflated the accuracy metrics.

Future Work

1. GeoCLIP Implementation
   Planned: Implement the GeoCLIP architecture for improved geographic understanding.
2. Experiment 3: Similar Coordinate Problem Analysis
   Objective: Investigate hallucination phenomena when similar coordinates exist, particularly for landmarks with similar visual characteristics.
3. Experiment 4: Negation Handling Assessment
   Background: CLIP struggles to handle negation effectively, which causes classification problems for negated assertions.
   Connection to GeoShield: This experiment may connect to GeoShield components for robust geographic verification.
   Reference: Junsung Park et al., "Know 'No' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP" (ICCV 2025)
   Example Test Cases (a probe sketch follows this list):
   "This is Disneyland in the US, not Shanghai."
   "This is Disneyland in the US, not Tokyo."

Appendix

1. Metadata
2. Dataset
   Images