Geolocation Performance Comparison Using VLMs
Review of the latest research
Evaluate Vision-Language Models (VLMs) on city-level image classification to determine how accurately they identify which city an image belongs to.
Dataset
• Cities: Los Angeles (28 images), San Francisco (19 images), New York City (20 images)
• Total: 67 images
• Each image shows a famous landmark (e.g., Dodger Stadium, Disneyland).
• Metadata includes city name, latitude/longitude, and location granularity (continent / city / street level).
Models Tested
• Qwen3-VL-8B — latest open-source VLM from China
• LLaVA-1.6 — conversational multimodal model
• CLIP (ViT-L/14 and ViT-B/32) — contrastive vision-language models
Prompting and Methodology
• VLMs (Qwen, LLaVA): asked the natural-language question "Which city is shown in this image? Choose from Los Angeles, San Francisco, or New York City." The model outputs a text-based response.
• CLIP models: non-generative; compute cosine similarity between image and text embeddings. Multiple prompt templates (e.g., "a photo of the skyline of {city}", "a landmark in {city}") combined with aliases ("New York City", "NYC") produce 42 text prompts in total. Aggregated similarity per class determines the city with the highest score (sketched below).
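The CLIP scoring step can be summarized in a short sketch. This is a minimal illustration assuming the Hugging Face transformers CLIP API; the template and alias lists are shortened examples rather than the exact 42 prompts used in the experiment, and the image path is a placeholder.

```python
# Minimal sketch of the CLIP-based city classification described above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-large-patch14"  # swap for "openai/clip-vit-base-patch32" to get B/32
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

CITY_ALIASES = {
    "Los Angeles": ["Los Angeles", "LA"],
    "San Francisco": ["San Francisco", "SF"],
    "New York City": ["New York City", "NYC"],
}
TEMPLATES = ["a photo of the skyline of {}", "a landmark in {}", "a street in {}"]

def city_scores(image_path: str) -> dict[str, float]:
    """Score one image against all (template x alias) prompts and average per city."""
    prompts, owners = [], []
    for city, aliases in CITY_ALIASES.items():
        for alias in aliases:
            for template in TEMPLATES:
                prompts.append(template.format(alias))
                owners.append(city)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image[0]  # one similarity score per prompt

    # Aggregate similarity per class; the city with the highest mean wins.
    return {
        city: sims[[i for i, owner in enumerate(owners) if owner == city]].mean().item()
        for city in CITY_ALIASES
    }

# scores = city_scores("images/1.jpg")          # placeholder path
# predicted_city = max(scores, key=scores.get)  # argmax over the aggregated scores
```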
Results
| Model | Accuracy Summary | Notes |
| --- | --- | --- |
| CLIP (L/14) | Highest accuracy; perfect prediction (1.0) for New York City | Smaller patch → better detail |
| Qwen3-VL | Relatively high accuracy | Performs well with classification-type prompts |
| LLaVA-1.6 | Very low accuracy | Generates free-form sentences instead of class labels |
Reason for LLaVA's poor accuracy:
As a conversational model, LLaVA generates descriptive sentences like "This image is in front of the Taeyong Museum" even when instructed to output a city label. This lack of strict output formatting reduced accuracy.
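The experiment itself did not include an output-normalization step, but one simple way to make such free-form answers scoreable against class labels is keyword-based post-processing, as in the sketch below; the keyword lists and the "UNPARSED" fallback are illustrative choices, not part of the original setup.

```python
# Hedged sketch: map a conversational VLM answer back onto the three city labels.
CITY_KEYWORDS = {
    "Los Angeles": ["los angeles", "l.a."],
    "San Francisco": ["san francisco", "golden gate"],
    "New York City": ["new york", "nyc", "manhattan"],
}

def extract_city(answer: str) -> str:
    """Return the first city whose keyword appears in the free-form answer."""
    text = answer.lower()
    for city, keywords in CITY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return city
    return "UNPARSED"  # scored as incorrect

print(extract_city("This image shows a stadium in downtown Los Angeles."))  # -> Los Angeles
```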
Discussion & Feedback
• Feedback
  ◦ Use English prompts for all models (the LLaVA prompts were in Korean).
  ◦ The dataset is too small for statistical validity; accuracy values are indicative only.
  ◦ Landmarks are easily recognizable; include ordinary street images for a more realistic evaluation.
Enhancements to Existing Experiments and Additional Experiments
Retry the last experiment
• Use English prompts for all models (the LLaVA prompts were in Korean).
Add "I don't know" as class
Images Qwen-VL said “IDK” #8, 11, 69
• Why did CLIP's accuracy drop from 0.94 to 0.23? Two abstention rules triggered the IDK predictions (sketched in code after the examples below):
1. Low absolute maximum similarity
  • Rule: top1 < 0.22 → IDK
  • Examples:
    ◦ L/14, 1.jpg: LA 0.199 → below 0.22, classified as IDK
    ◦ L/14, 31.jpg: SF 0.205 → below 0.22, classified as IDK
    ◦ L/14, 11.jpg: LA 0.110 → clearly IDK
    ◦ L/14, 64.jpg: NYC 0.206 → below 0.22, classified as IDK
    ◦ Similar pattern in B/32: 58.jpg NYC 0.225 vs. SF 0.221 → also triggers rule 2 below, but the absolute values themselves are low
2. Small margin between the 1st- and 2nd-place scores
  • Rule: (top1 - top2) < 0.02 → IDK
  • Examples:
    ◦ B/32, 20.jpg: LA 0.268 vs. SF 0.260 → difference 0.008 → IDK
    ◦ B/32, 32.jpg: SF 0.254 vs. NYC 0.236 → difference 0.018 → IDK
    ◦ B/32, 37.jpg: SF 0.263 vs. LA 0.254 → difference 0.009 → IDK
    ◦ L/14, 22.jpg: SF 0.197 vs. LA 0.197 → exact tie → IDK
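The two rules above can be written as a single decision function. This is a minimal sketch applied to the per-city cosine similarities (the raw values quoted in the examples, not CLIP's temperature-scaled logits); the thresholds are the ones stated above, and the example scores are illustrative.

```python
# IDK decision rule sketch: abstain when the top similarity is too low (rule 1)
# or when the top-1 vs. top-2 margin is too small (rule 2).
ABS_THRESHOLD = 0.22     # rule 1: top1 < 0.22 -> IDK
MARGIN_THRESHOLD = 0.02  # rule 2: (top1 - top2) < 0.02 -> IDK

def decide(city_scores: dict[str, float]) -> str:
    """Return a city label, or "IDK" when either rule triggers."""
    ranked = sorted(city_scores.items(), key=lambda kv: kv[1], reverse=True)
    (top1_city, top1), (_, top2) = ranked[0], ranked[1]
    if top1 < ABS_THRESHOLD:
        return "IDK"  # rule 1: low absolute maximum similarity
    if top1 - top2 < MARGIN_THRESHOLD:
        return "IDK"  # rule 2: small margin between 1st and 2nd place
    return top1_city

# Illustrative scores (the third value is made up): margin 0.008 < 0.02 -> IDK via rule 2.
print(decide({"Los Angeles": 0.268, "San Francisco": 0.260, "New York City": 0.230}))
```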
Additionally, we can identify content-based and model-based reasons for why similarities are low or margins are small.
◦ Weak city-specific cues in the scene: indoor, suburban, or nighttime scenes, or images dominated by sky and roads, yield "somewhat similar" scores for all three cities → low absolute values and small inter-city margins.
◦ LA vs. SF West Coast confusion: similar elements such as coastlines, hills, low-rise housing, and palm trees cause these two scores to compete frequently (small margins). In contrast, NYC only shows clearly elevated scores in distinctive shots featuring high-rises, brick buildings, SoHo/Manhattan skylines, etc. (e.g., 50, 55, 57, 62.jpg).
◦ Text prototypes focused on skylines/landmarks: alleyways, residential streets, intersections, beaches, and other atypical shots tend to have lower absolute values overall.
◦ ViT-B/32 vs. ViT-L/14: B/32 shows slightly higher average similarity and a broader distribution, leading to relatively more margin-competition cases. L/14 produces lower absolute values in more shots, causing more IDK classifications under rule 1.
Where On Earth Review
To address the two feedback points below, this section reviews the following paper.
• The dataset is too small for statistical validity; accuracy values are indicative only.
• Landmarks are easily recognizable; include ordinary street images for a more realistic evaluation.
• Title: WHERE ON EARTH? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
• Authors: research teams from UC Santa Cruz, Columbia University, UC Berkeley, and Amazon Research
• Status: under review (submitted October 2025)
Research Objective
Develop a comprehensive benchmark to evaluate Vision-Language Models' (VLMs) image-based geolocation capabilities. Unlike existing research that only measures final answer accuracy, this work also assesses the faithfulness of reasoning processes.
EarthWhere Benchmark Structure
1. WhereCountry (Country-level Identification)
• 500 panoramic 360° images
• Multiple-choice questions (4 options)
• Distractors include adjacent or culturally similar countries
• Difficulty: medium (average accuracy ~40%)
2. WhereStreet (Street-level Identification)
• 310 high-quality images (188 from Bilibili, 122 from YouTube)
• 861 human-verified visual clues
• Answer types:
  ◦ Coordinate-based (46.1%)
  ◦ Street-level (32.6%)
  ◦ City/County/State (21.3%)
• Difficulty: high (requires multi-step reasoning + web search)
Dataset Characteristics
Geographic Distribution
• Asia: 58.7%
• Europe: 17.4%
• North America: 5.5%
• Other regions: 18.4%
Evaluation Metrics
1. Final Answer Evaluation
• WhereCountry: MCQA accuracy
• WhereStreet (coordinates): Acc@k km (1 km, 5 km, 20 km, 200 km)
• WhereStreet (text): Hierarchical Path Score (HPS); both metrics are sketched after this list
  ◦ Matches: Country → Province/State → City → County → Town → Street
  ◦ Score = longest matching prefix / total hierarchy levels
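Both WhereStreet metrics can be written down compactly. The sketch below assumes a fixed six-level hierarchy for HPS and uses the haversine distance for Acc@k km; the paper's exact normalization may differ.

```python
# Sketch of the two WhereStreet metrics summarized above (assumed formulations).
from math import asin, cos, radians, sin, sqrt

HIERARCHY_LEVELS = 6  # Country -> Province/State -> City -> County -> Town -> Street

def hierarchical_path_score(pred_path: list[str], true_path: list[str]) -> float:
    """HPS: longest matching prefix of the two location paths / total hierarchy levels."""
    matched = 0
    for p, t in zip(pred_path, true_path):
        if p.strip().lower() != t.strip().lower():
            break  # the prefix match stops at the first mismatch
        matched += 1
    return matched / HIERARCHY_LEVELS

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance between two coordinates, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def acc_at_k(pred_coords, true_coords, k_km: float) -> float:
    """Acc@k km: fraction of predicted (lat, lon) pairs within k km of the ground truth."""
    hits = sum(haversine_km(*p, *t) <= k_km for p, t in zip(pred_coords, true_coords))
    return hits / len(true_coords)

# Example: prediction correct down to the city level -> 3 of 6 levels -> HPS = 0.5
print(hierarchical_path_score(
    ["USA", "New York", "New York City", "Kings County"],
    ["USA", "New York", "New York City", "New York County", "Manhattan", "Broadway"],
))
```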
2. Reasoning Process Evaluation
• Thinking Score: proportion of key visual clues used by the model
• Reweighted Thinking Score: weighted by each clue's Shapley-value contribution (sketched below)
  ◦ Shows 13.7% higher correlation on average than the vanilla score
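A compact way to read the two reasoning scores, under assumptions: how "clue usage" is detected and how the Shapley weights are normalized follow my paraphrase of the summary above, not the paper's code.

```python
# Sketch of the vanilla vs. Shapley-reweighted Thinking Score (assumed formulas).
def thinking_score(used_clues: set[str], all_clues: set[str]) -> float:
    """Vanilla: fraction of the human-verified clues the model's reasoning actually used."""
    return len(used_clues & all_clues) / len(all_clues)

def reweighted_thinking_score(used_clues: set[str], clue_weights: dict[str, float]) -> float:
    """Reweighted: each clue counts in proportion to its Shapley-value contribution."""
    return sum(w for clue, w in clue_weights.items() if clue in used_clues) / sum(clue_weights.values())

# Hypothetical clue weights for one image.
weights = {"road sign script": 0.5, "driving side": 0.3, "vegetation": 0.2}
print(thinking_score({"driving side"}, set(weights)))        # 1/3 of the clues were used
print(reweighted_thinking_score({"driving side"}, weights))  # but only 0.3 of the weighted credit
```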
Evaluated Models (13 Total)
Open-source Models
• Qwen3-VL-235B, Qwen-2.5-7B
• GLM-4.5V (best open-source)
• DeepEyes-7B
• Skywork-R1V3
Closed-source Models
• Gemini-2.5-Pro/Flash
• GPT-4o, GPT-5, o3, o4-mini
• Claude-4-Opus/Sonnet
Main Experimental Results
Overall Performance Ranking
| Rank | Model | Overall Accuracy | Notes |
| --- | --- | --- | --- |
| 1 | Gemini-2.5-Pro | 56.32% | Best performance |
| 2 | o3 (high reasoning) | ~54% | Reasoning-focused |
| 3 | GPT-5 | ~52% | High-performance reasoning |
| ... | ... | ... | ... |
| Best open | GLM-4.5V | 34.71% | #1 open-source |
| Avg. others | Other open-source | ~19.57% | Near random chance |
WhereCountry Results
• Gemini-2.5-Pro: 68.4% (without web search)
• GLM-4.5V: 43.8% (best open-source)
• Average open-source: 19.57% (random-guessing level)
WhereStreet Results (Acc@1km)
| Model | Bilibili | YouTube | Difference |
| --- | --- | --- | --- |
| Gemini-2.5-Pro (web) | 6.38% | 65.63% | +59.25%p |
| o3 (high, web) | 2.13% | 57.29% | +55.16%p |
| GPT-5 (high, web) | 4.26% | 50.53% | +46.27%p |
| GLM-4.5V | 2.13% | 18.95% | +16.82%p |
Findings
1. Closed-source Model Dominance
• Gemini-2.5-Pro leads at 56.32%
• 21.61%p gap to the best open-source model (GLM-4.5V)
• Most open-source models are at random level (≤20%)
2. Paradoxical Effect of Web Search
On WhereCountry:
• GPT-4o: 13.2% (with web search)
• Average: 1.72% decrease
• Reason: limited visual clues provide insufficient information for retrieval
On WhereStreet:
• Average: +6.5% improvement
• Reason: detailed visual clues enable effective searching
3. Limitations of Reasoning Depth
| Reasoning Level | WhereCountry | WhereStreet |
| --- | --- | --- |
| Low → Medium | +1.03% | +14.0% (Bilibili), +5.9% (YouTube) |
| Medium → High | -1.03% | +0.8% (minimal gain) |
• Moderate reasoning helps, but excessive reasoning can hurt
• o3 (high): +1.3% (minimal improvement)
• GPT-5 (high): 2.51% decrease
4. Severe Regional Bias
• YouTube (Europe/US) vs. Bilibili (China):
  ◦ Answer accuracy: +42.7% difference
  ◦ Thinking Score: +0.238 difference
Root causes:
• Imbalanced training-data distribution
• Over-representation of Western content
• Poor recognition of non-English visual clues
5. Validity of Thinking Score
• The reweighted approach shows 13.70% higher correlation on average
• More human-verified clues → higher accuracy
  ◦ 1 clue: ~17.5%
  ◦ 2 clues: ~22.5%
  ◦ 3 clues: ~27.5%
Failure Case Analysis
Case 1: Failure to Utilize Visual Clues
• GPT-4o overlooked tree types and fencing styles in the background
• Web search actually decreased accuracy
• Without web search, a better focus on details led to the correct answer
Case 2: Overthinking
• GLM-4.5V correctly inferred the territory and coastline structure
• But it rejected the correct assumption with self-contradictory reasoning
• Too many "aha moments" led to indecision
Case 3: Incomplete Search
• Gemini-2.5-Pro correctly identified the key visual elements
• It made reasonable assumptions
• But limited search capabilities led to early termination:
  ◦ Suboptimal search queries
  ◦ Limited search iterations
  ◦ Restricted retrieval context length
Limitations & Future Work
Current Limitations
1. Performance gap: the best model achieves only 56.32%
2. Regional bias: Western-centric training data
3. Limited visual clues: some images lack sufficient information
4. Inconsistent tool usage: web search shows inconsistent effectiveness
Future Research Directions
Build regionally balanced training datasets
Develop more sophisticated visual clue extraction methods
Improve tool usage strategies (when/how to search)
Enhance multimodal reasoning capabilities
Develop bias mitigation techniques
Key Takeaway
The benchmark reveals that state-of-the-art VLMs still struggle with:
• Fine-grained spatial reasoning
• Effective integration of external tools
• Consistent performance across different geographic regions
• Balancing reasoning depth with efficiency
What I want to discuss
I am considering reproducing PlaNet with VLMs and collecting more data.
• Problem: how to perform sub-city (street-level) location classification?
• Current approach: hierarchical geographic classification (as used in PlaNet)
The core idea is to leverage PlaNet's hierarchical spatial partitioning strategy, but replace the CNN-based image classification with VLM-based reasoning. This approach involves:
1. Dividing geographic space into a hierarchical grid structure (similar to S2 cells or a quadtree); see the sketch after this list
2. Fine-tuning or prompting VLMs to classify images into these geographic partitions
3. Evaluating performance at multiple granularity levels (country → city → neighborhood)
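A minimal sketch of step 1 under assumptions: it maps an image's ground-truth coordinates to hierarchical S2 cell tokens that can serve as class labels, using the s2sphere package. PlaNet itself uses an adaptive partition driven by photo density, so the fixed levels here are illustrative, and the example coordinates are only approximate.

```python
# Sketch: hierarchical S2-cell labels from lat/lng metadata (assumes `pip install s2sphere`).
import s2sphere

# Coarse -> fine levels (illustrative): region, metro area, neighborhood, street scale.
LEVELS = [6, 10, 14, 18]

def s2_labels(lat: float, lng: float) -> dict[int, str]:
    """One S2 cell token per level; these tokens become the classification targets."""
    leaf = s2sphere.CellId.from_lat_lng(s2sphere.LatLng.from_degrees(lat, lng))
    return {level: leaf.parent(level).to_token() for level in LEVELS}

# Example with approximate coordinates of Dodger Stadium (one landmark in the small dataset above).
print(s2_labels(34.0739, -118.2400))
```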
• But for classification, what features and keywords should be assigned to each quad?
  ◦ There appear to be several methods for comparing similarity scores (e.g., template-based or keyword-based similarity calculation).
  ◦ For example, for every quad, generate a caption that describes its spatial features and use it to compute similarity (sketched after this list).
  ◦ Also, should these features be explicit or implicit?
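As one concrete option for the caption-based route above, the sketch below assigns an image to the quad whose caption has the highest CLIP similarity. The quad IDs and captions are hypothetical placeholders; in practice they could be generated by a VLM or aggregated from map metadata.

```python
# Sketch: per-quad caption embeddings vs. the query image (Hugging Face CLIP assumed).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

QUAD_CAPTIONS = {  # hypothetical quad id -> caption describing its spatial/visual features
    "sf_mission": "a residential street with murals, low-rise Victorian houses, and palm trees",
    "sf_financial": "a downtown district with glass high-rise office towers and cable car tracks",
    "nyc_soho": "a street with cast-iron facades, brick buildings, and fire escapes",
}

def rank_quads(image_path: str) -> list[tuple[str, float]]:
    """Rank candidate quads by image-caption similarity (highest first)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=list(QUAD_CAPTIONS.values()), images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image[0]  # one score per quad caption
    return sorted(zip(QUAD_CAPTIONS, sims.tolist()), key=lambda kv: kv[1], reverse=True)
```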
Future Works
Reproduce PlaNet, replacing CNNs with VLMs
Search for and test sub-city (street-level) classification methods
Test on non-famous places