Changyu Lee

[Geolocation] Research Topic Suggestion: Fine-Grained Street-Level Geolocation Using Vision–Language Models

Published at
2025/11/13
Last edited time
2025/11/21 14:26
Created
2025/11/13 09:37
Section
Research
Status
Done
Series
Geolocation
Tags
Research
AI summary
Week 10 focused on evaluating Vision-Language Models (VLMs) for city-level geolocation, achieving notable results with CLIP models. Week 11 involved collecting a new dataset for expanded testing and renewing the VLM suite. The research aims to adapt VLMs for fine-grained street-level geolocation, addressing challenges in distinguishing neighborhoods within cities. Key research questions include the performance of off-the-shelf VLMs and the potential benefits of fine-tuning. The study area is Los Angeles, chosen for its intra-city diversity and macro-level consistency, with a methodology that includes dataset construction and proposed VLM-based approaches for improved geolocation accuracy.
Keywords
GeoLocation
Week 11
Language
ENG
Week
11

Week 10 Brief Review

Week 10 focused on evaluating Vision-Language Models (VLMs) for city-level geolocation tasks. Key accomplishments included:
Initial Experiment:
Tested VLMs (Qwen3-VL, LLaVA-1.6) and CLIP models on a small dataset of 67 landmark images from Los Angeles, San Francisco, and New York City
CLIP-L/14 achieved the highest accuracy, while LLaVA-1.6 struggled due to its conversational output format
Improvements & Expanded Testing:
Retested all models with English prompts (previously LLaVA used Korean prompts)
Added "I don't know" (IDK) as a classification option for uncertain predictions
CLIP models showed a significant accuracy drop when the IDK option was introduced, revealing confidence-threshold issues (see the zero-shot sketch below)
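As a concrete reference for how the IDK option can be wired into zero-shot CLIP classification, here is a minimal sketch assuming the Hugging Face openai/clip-vit-large-patch14 checkpoint; the prompt templates and the IDK caption below are placeholders, not the exact ones used in the Week 10 runs.

```python
# Zero-shot city classification with an explicit "I don't know" option.
# The IDK caption competes in the same embedding space as the city prompts;
# since CLIP has no calibrated confidence threshold, an "unidentifiable
# location" caption can absorb probability mass even for recognizable scenes,
# which is one plausible reading of the observed accuracy drop.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

labels = ["Los Angeles", "San Francisco", "New York City", "I don't know"]
prompts = [f"a street-level photo taken in {city}" for city in labels[:3]]
prompts.append("a photo of an unidentifiable location")  # placeholder IDK caption

image = Image.open("sample_street_view.jpg")
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
probs = logits.softmax(dim=-1)[0]
print(labels[int(probs.argmax())], probs.tolist())
```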
Literature Review:
Analyzed the "WHERE ON EARTH?" benchmark paper
Key findings: Closed-source models (Gemini-2.5-Pro at 56.32%) significantly outperform open-source models (GLM-4.5V at 34.71%)
Web search showed paradoxical effects: helpful for street-level but sometimes harmful for country-level tasks
Identified severe regional bias toward Western content in VLM training data
Research Direction:
Decided to focus on reproducing PlaNet with VLMs instead of CNNs
Identified challenge: how to perform sub-city (street-level) classification using hierarchical geographic partitioning (a partitioning sketch follows this list)
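The sub-city partitioning challenge above could be prototyped with a simple quadtree over latitude/longitude before committing to PlaNet's S2 cells. The sketch below is illustrative only: the depth levels and the approximate Koreatown coordinates are assumptions, not values from any experiment.

```python
# PlaNet-style hierarchical partitioning: map an image's (lat, lon) to a cell
# label at several levels, so a classifier can be trained coarse-to-fine.

def cell_label(lat: float, lon: float, level: int) -> str:
    """Quadtree-style cell id: at each level, split the current lat/lon box
    into four quadrants and record which quadrant the point falls in."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    digits = []
    for _ in range(level):
        lat_mid = (lat_lo + lat_hi) / 2
        lon_mid = (lon_lo + lon_hi) / 2
        digits.append(str((2 if lat >= lat_mid else 0) + (1 if lon >= lon_mid else 0)))
        lat_lo, lat_hi = (lat_mid, lat_hi) if lat >= lat_mid else (lat_lo, lat_mid)
        lon_lo, lon_hi = (lon_mid, lon_hi) if lon >= lon_mid else (lon_lo, lon_mid)
    return "".join(digits)

# Example: an approximate Koreatown, LA coordinate at increasing granularity.
for level in (8, 12, 16):
    print(level, cell_label(34.058, -118.301, level))
```

Coarser levels correspond to city-scale cells and deeper levels to neighborhood- or street-scale cells, which is exactly where the VLM-based classifier is expected to struggle.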

Week 11 Progress Summary

Completed Tasks:
Collected a new dataset for expanded testing: 60 images from Google Street View covering LA, SF, and NYC (a hypothetical collection sketch follows this list)
Models tested: Qwen3-VL and CLIP variants (CLIP-L/14, CLIP-B/32)
Failed to add GLM-4.5V (too large to run; 225B parameters)
Renewing the VLM suite with additional models:
Pending: GLM-4.5V, Claude 4 Opus/Sonnet, Gemini 2.5 Pro/Flash, GPT-5/o3
Retesting on the new dataset with the expanded model suite (left: new dataset, right: old dataset)
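For reproducibility, the collection step could be scripted against the Google Street View Static API. This is a hypothetical sketch only: the sample coordinates, headings, image size, and API key are placeholders, not the actual Week 11 collection procedure (which may have been manual).

```python
# Download street-level views around a few seed coordinates per city via the
# Street View Static API (https://maps.googleapis.com/maps/api/streetview).
import requests

API_KEY = "YOUR_STREET_VIEW_API_KEY"  # placeholder
SEEDS = {
    "LA":  (34.0522, -118.2437),
    "SF":  (37.7749, -122.4194),
    "NYC": (40.7128, -74.0060),
}

for city, (lat, lon) in SEEDS.items():
    for heading in (0, 90, 180, 270):  # four views per seed point
        params = {
            "size": "640x640",
            "location": f"{lat},{lon}",
            "heading": heading,
            "fov": 90,
            "pitch": 0,
            "key": API_KEY,
        }
        resp = requests.get("https://maps.googleapis.com/maps/api/streetview",
                            params=params, timeout=30)
        resp.raise_for_status()
        with open(f"{city}_{heading}.jpg", "wb") as f:
            f.write(resp.content)
```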

Next Week

Testing street-level classification within a single city (Week 12-2)

Fine-Grained Street-Level Geolocation Using Vision–Language Models

1. Problem Definition & Background

1.1 Motivation

Recent advances in image geolocation show that modern models can reliably determine city-level locations from a single image. However, experimental evidence and prior work consistently demonstrate that intra-city street-level localization remains extremely challenging. Even state-of-the-art vision–language models (VLMs), which achieve strong performance at high-level geolocation (e.g., LA vs. SF vs. NYC), exhibit substantial confusion when distinguishing neighborhoods or street segments within the same city.
This discrepancy between macro-level success and micro-level failure motivates the central research question:
How can we adapt VLMs—models with rich semantic reasoning and multimodal understanding—so they can perform fine-grained street-level geolocation, not merely city classification?

1.2 Why Vision–Language Models (VLMs)?

Unlike CNNs or purely visual models that rely primarily on texture- or shape-based cues, VLMs are trained on web-scale image–text corpora and encode:
architectural styles, cultural aesthetics
signage language and business types
socio-cultural patterns and urban functions
world knowledge about cities, regions, and typical urban layouts
semantic concepts such as “Korean district,” “beachfront tourist area,” “office district,” etc.
These capabilities enable semantic, human-like reasoning beyond raw visual appearance—forming the key hypothesis that VLMs can be adapted for street-level geolocation in a way CNNs fundamentally cannot.
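To make the semantic-cue argument concrete, the sketch below shows one way such descriptions could be elicited through an OpenAI-compatible chat API. The model name, prompt wording, and helper function are illustrative assumptions, not part of the experiments described here.

```python
# Ask a VLM for geolocation-relevant semantic attributes (signage language,
# business types, architectural style) instead of a direct city guess.
import base64
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving a VLM

def describe_street_scene(image_path: str, model: str = "gpt-4o") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe this street scene for geolocation: signage language, "
                    "business types, architectural style, and likely urban function "
                    "(e.g., office district, beachfront tourist area)."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The returned description can then be used as an additional text feature alongside the raw image, which is the intuition tested later in H3.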

2. Research Questions and Hypotheses

2.1 Research Questions

RQ1. How well do off-the-shelf VLMs actually perform on intra-city street-level geolocation?
RQ2. Can we fine-tune the VLM’s vision encoder to meaningfully improve street-level discrimination?
RQ3. Does leveraging the VLM’s semantic and linguistic reasoning provide measurable advantages over CNN-based geolocation at fine-grained scales?

2.2 Hypotheses

H1 (Gap Hypothesis).
Off-the-shelf VLMs perform strongly at city-level classification but struggle with intra-city (street/neighborhood) localization.
H2 (Geo-Aware Adaptation Hypothesis).
Geo-aware metric learning or contrastive tuning of the VLM’s vision encoder improves neighborhood- or cell-level discrimination, surpassing CNN/CLIP baselines (a loss sketch follows after this list).
H3 (Semantic Reasoning Hypothesis).
Using VLM-generated semantic descriptions (e.g., signage language, building type, cultural patterns) provides additional discriminative cues that enhance fine-grained geolocation beyond vision-only models.
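As a concrete reading of H2, here is a minimal sketch of a geo-aware supervised-contrastive objective over the VLM's image embeddings, assuming neighborhood/cell labels per image; the batch construction and temperature are placeholder choices, not tuned values.

```python
# Geo-aware contrastive tuning: pull together embeddings of images from the
# same neighborhood/cell, push apart embeddings from different cells.
import torch
import torch.nn.functional as F

def geo_supcon_loss(embeddings: torch.Tensor,
                    cell_ids: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """embeddings: (B, D) image features from the vision encoder.
    cell_ids:    (B,)   integer neighborhood/cell labels."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                          # (B, B) scaled cosine sims
    pos_mask = (cell_ids[:, None] == cell_ids[None, :]).float()
    pos_mask.fill_diagonal_(0)                             # self-pairs are not positives
    sim = sim - torch.eye(len(z), device=z.device) * 1e9   # drop self from denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    n_pos = pos_mask.sum(dim=1).clamp(min=1)               # anchors w/o positives contribute 0
    return -(pos_mask * log_prob).sum(dim=1).div(n_pos).mean()
```

In practice only the last few blocks of the vision encoder would be unfrozen and tuned with this loss, then compared against frozen CLIP/CNN baselines on cell-level classification or retrieval, which is the comparison H2 calls for.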

3. Why Los Angeles? (Study Area Justification)

Los Angeles is an ideal testbed because it exhibits:
Shared macro-level appearance
→ Most street images across LA exhibit a consistent “LA-ness” (climate, building colors, street layout).
High intra-city diversity
→ Distinct neighborhoods (Koreatown, Santa Monica, Venice, Beverly Hills, Downtown, Hollywood) reflect different cultural, architectural, and commercial patterns.
Experimental evidence
→ Preliminary tests show that VLMs almost always identify images from the LA test set as LA, yet become highly confused when distinguishing LA neighborhoods.
Thus, LA provides the ideal “stress test” environment where macro patterns are informative yet micro patterns are subtle and challenging, revealing the true capabilities and limitations of VLMs.