Week 10 Brief Review
Week 10 focused on evaluating Vision-Language Models (VLMs) for city-level geolocation tasks. Key accomplishments included:
Initial Experiment:
• Tested VLMs (Qwen3-VL, LLaVA-1.6) and CLIP models on a small dataset of 67 landmark images from Los Angeles, San Francisco, and New York City
• CLIP-L/14 achieved the highest accuracy, while LLaVA-1.6 struggled due to its conversational output format (see the zero-shot sketch below)
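For reference, the sketch below shows the standard zero-shot recipe presumably behind the CLIP numbers: embed the image and one text prompt per candidate city, then pick the most similar prompt. The checkpoint name, prompt template, and image path are assumptions, not the experiment's exact settings.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the experiment's exact CLIP-L/14 weights may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

cities = ["Los Angeles", "San Francisco", "New York City"]
prompts = [f"a photo taken in {city}" for city in cities]

image = Image.open("landmark.jpg")  # placeholder path, not a real dataset file
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity scores

probs = logits.softmax(dim=-1).squeeze(0)
print(cities[probs.argmax().item()], probs.tolist())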
Improvements & Expanded Testing:
•
Retested all models with English prompts (previously LLaVA used Korean prompts)
•
Added "I don't know" (IDK) as a classification option for uncertain predictions
•
CLIP models showed significant accuracy drop when IDK option was introduced, revealing confidence threshold issues
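The IDK-related drop suggests CLIP's similarity scores are not well calibrated. One simple alternative to adding "I don't know" as a literal text prompt is to keep only the real city prompts and abstain whenever the top softmax probability falls below a threshold. The sketch below shows that rule; the threshold value is arbitrary, and this is not necessarily how the Week 10 experiment implemented the option.

import torch

def predict_with_idk(probs: torch.Tensor, cities: list[str], tau: float = 0.5) -> str:
    """Map a softmax distribution over city prompts to a label, abstaining below tau."""
    conf, idx = probs.max(dim=-1)
    return cities[idx.item()] if conf.item() >= tau else "I don't know"

# e.g., with `probs` and `cities` from the previous sketch:
# print(predict_with_idk(probs, cities, tau=0.5))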
Literature Review:
• Analyzed the "WHERE ON EARTH?" benchmark paper
• Key findings: Closed-source models (Gemini-2.5-Pro at 56.32%) significantly outperform open-source models (GLM-4.5V at 34.71%)
• Web search showed paradoxical effects: helpful for street-level but sometimes harmful for country-level tasks
• Identified severe regional bias toward Western content in VLM training data
Research Direction:
• Decided to focus on reproducing PlaNet using VLMs instead of CNNs
• Identified challenge: how to perform sub-city (street-level) classification using hierarchical geographic partitioning (a partitioning sketch follows this list)
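PlaNet turns geolocation into classification by adaptively partitioning the globe into cells (S2 cells in the original paper), splitting any cell that contains too many training photos. The sketch below reproduces that idea with plain lat/lng boxes instead of S2 cells, purely as an illustration; the LA bounding box, photo-count threshold, and coordinate list are assumptions.

from dataclasses import dataclass

@dataclass
class Cell:
    lat_min: float
    lat_max: float
    lng_min: float
    lng_max: float

    def contains(self, lat: float, lng: float) -> bool:
        return self.lat_min <= lat < self.lat_max and self.lng_min <= lng < self.lng_max

    def split(self) -> list["Cell"]:
        # Quadtree-style split into four equal children.
        lat_mid = (self.lat_min + self.lat_max) / 2
        lng_mid = (self.lng_min + self.lng_max) / 2
        return [
            Cell(self.lat_min, lat_mid, self.lng_min, lng_mid),
            Cell(self.lat_min, lat_mid, lng_mid, self.lng_max),
            Cell(lat_mid, self.lat_max, self.lng_min, lng_mid),
            Cell(lat_mid, self.lat_max, lng_mid, self.lng_max),
        ]

def partition(cell: Cell, coords: list[tuple[float, float]],
              max_photos: int = 50, min_span: float = 1e-3) -> list[Cell]:
    """Recursively split `cell` until no leaf holds more than `max_photos` coordinates."""
    inside = [(lat, lng) for lat, lng in coords if cell.contains(lat, lng)]
    if len(inside) <= max_photos or (cell.lat_max - cell.lat_min) < min_span:
        return [cell]
    leaves: list[Cell] = []
    for child in cell.split():
        leaves.extend(partition(child, inside, max_photos, min_span))
    return leaves

# Example: partition a rough LA bounding box (approximate extent, illustration only).
la_box = Cell(33.70, 34.35, -118.70, -118.10)
# coords = [(lat, lng), ...]   # hypothetical Street View photo locations
# classes = partition(la_box, coords, max_photos=50)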
Week 11 Progress Summary
Completed Tasks:
• Collected 60 images from Google Street View (LA, SF, NYC)
• Failed to add GLM-4.5V (too big, 225B)
  ◦ Pending: GLM-4.5V, Claude-4-Opus/Sonnet, Gemini 2.5 Pro/Flash, GPT-5/GPT-o3
Next Week:
• Testing street-level classification in one city (week12-2)
Fine-Grained Street-Level Geolocation Using Vision–Language Models
1. Problem Definition & Background
1.1 Motivation
Recent advances in image geolocation show that modern models can reliably determine city-level locations from a single image. However, experimental evidence and prior work consistently demonstrate that intra-city street-level localization remains extremely challenging. Even state-of-the-art vision–language models (VLMs), which achieve strong performance at high-level geolocation (e.g., LA vs. SF vs. NYC), exhibit substantial confusion when distinguishing neighborhoods or street segments within the same city.
This discrepancy between macro-level success and micro-level failure motivates the central research question:
How can we adapt VLMs—models with rich semantic reasoning and multimodal understanding—so they can perform fine-grained street-level geolocation, not merely city classification?
1.2 Why Vision–Language Models (VLMs)?
Unlike CNNs or purely visual models that rely primarily on texture- or shape-based cues, VLMs are trained on web-scale image–text corpora and encode:
• architectural styles, cultural aesthetics
• signage language and business types
• socio-cultural patterns and urban functions
• world knowledge about cities, regions, and typical urban layouts
• semantic concepts such as “Korean district,” “beachfront tourist area,” “office district,” etc.
These capabilities enable semantic, human-like reasoning beyond raw visual appearance—forming the key hypothesis that VLMs can be adapted for street-level geolocation in a way CNNs fundamentally cannot.
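As a concrete illustration of how such semantic cues could be elicited, the sketch below shows one possible prompt. `query_vlm` is a hypothetical stand-in for whichever model interface (e.g., Qwen3-VL or LLaVA) is ultimately used; only the prompt design is the point here.

CUE_PROMPT = (
    "Describe the location cues visible in this street-level photo: "
    "signage language, business types, architectural style, vegetation, "
    "and the cultural or functional character of the area "
    "(e.g., 'Korean district', 'beachfront tourist area', 'office district'). "
    "Then state which Los Angeles neighborhood it was most likely taken in."
)

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for the chosen model's inference call (e.g., Qwen3-VL or LLaVA)."""
    raise NotImplementedError("replace with the actual VLM chat/inference API")

def describe_image(image_path: str) -> str:
    # Returns the VLM's free-text description of geolocation cues for one image.
    return query_vlm(image_path, CUE_PROMPT)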
2. Research Questions and Hypotheses
2.1 Research Questions
1. RQ1. How well do off-the-shelf VLMs actually perform on intra-city street-level geolocation?
2. RQ2. Can we fine-tune the VLM’s vision encoder to meaningfully improve street-level discrimination?
3. RQ3. Does leveraging the VLM’s semantic and linguistic reasoning provide measurable advantages over CNN-based geolocation at fine-grained scales?
2.2 Hypotheses
• H1 (Gap Hypothesis). Off-the-shelf VLMs perform strongly at city-level classification but struggle with intra-city (street/neighborhood) localization.
• H2 (Geo-Aware Adaptation Hypothesis). Geo-aware metric learning or contrastive tuning of the VLM’s vision encoder improves neighborhood- or cell-level discrimination, surpassing CNN/CLIP baselines (see the loss sketch after this list).
• H3 (Semantic Reasoning Hypothesis). Using VLM-generated semantic descriptions (e.g., signage language, building type, cultural patterns) provides additional discriminative cues that enhance fine-grained geolocation beyond vision-only models.
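For H2, one concrete form of geo-aware contrastive tuning is a supervised-contrastive objective in which images from the same geographic cell act as positives. The sketch below is a minimal version of such a loss, assuming precomputed vision-encoder embeddings and integer cell labels; it is illustrative, not the project's committed training recipe.

import torch
import torch.nn.functional as F

def geo_contrastive_loss(embeddings: torch.Tensor,
                         cell_ids: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """embeddings: [B, D] vision-encoder features; cell_ids: [B] integer geographic-cell labels."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                    # [B, B] scaled cosine similarities
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-comparisons

    pos_mask = (cell_ids.unsqueeze(0) == cell_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-probability over each anchor's positives; skip anchors with none.
    pos_log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob))
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    loss = -pos_log_prob.sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()

# Example with random tensors (8 images, 3 cells):
# feats = torch.randn(8, 768); cells = torch.randint(0, 3, (8,))
# print(geo_contrastive_loss(feats, cells))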
3. Why Los Angeles? (Study Area Justification)
Los Angeles is an ideal testbed because it exhibits:
• Shared macro-level appearance → Most street images across LA exhibit a consistent “LA-ness” (climate, building colors, street layout).
• High intra-city diversity → Distinct neighborhoods (Koreatown, Santa Monica, Venice, Beverly Hills, Downtown, Hollywood) reflect different cultural, architectural, and commercial patterns.
• Experimental evidence → Preliminary tests show that VLMs almost always identify images from the LA test set as LA, yet become highly confused when distinguishing LA neighborhoods.
Thus, LA provides the ideal “stress test” environment where macro patterns are informative yet micro patterns are subtle and challenging, revealing the true capabilities and limitations of VLMs.