Changyu Lee

[Geolocation] First Experiment for Fine-tuning CLIP

Published at
2026/01/26
Last edited time
2026/01/27 03:49
Created
2026/01/04 20:19
Section
Research
Status
Done
Series
Geolocation
Tags
Research
AI summary
Fine-tuning the CLIP ViT-B/32 model improved accuracy for LA County sub-city level geolocation from 0.14 to 0.41 after 8 epochs. The dataset included 28,551 images, with performance metrics showing 27.86% accuracy within 10 km on 5,000 test samples. The model uses cosine similarity for predictions, with Venice Beach achieving the highest recall and Long Beach the highest precision. The approach performs zero-shot classification and uses region centroids for GPS estimation. However, under the Acc@Xkm metric, the fine-tuned model performs worse than the un-fine-tuned model.
Keywords
GeoLocation
CLIP
Fine-Tuning
Language
ENG
Week
26-1
Fine-tuned CLIP ViT-B/32 model improved accuracy from 0.14 to 0.41 on the LA County sub-city level geolocation task after 8 epochs of training
Dataset consists of 28,551 images across 9 regions for training, with 15 regions tested during inference
Model uses CLIP's vision-language matching: compares image embeddings against region text embeddings (e.g., "A street view photo from Downtown Los Angeles, LA's financial and cultural center - skyline")
Prediction method: calculates cosine similarity between normalized image and text embeddings, applies softmax for probability distribution, returns top-k region centroids as GPS coordinates
Performance metrics on 5000 sampled images: ACC@1km: 9.78%, ACC@5km: 15.32%, ACC@10km: 27.86%
Venice Beach achieved highest recall (0.26) among regions, while Long Beach showed highest precision (0.87) in initial CLIP evaluation
Zero-shot classification approach maps predicted regions to their bounding box centroids for GPS coordinate estimation

Comparison of Fine-tuned and Un-fine-tuned CLIP at the Sub-city Level

Overview

Data Collection
Total of 28,551 images → sampled approximately 6,800 images
Examples

Finetuning CLIP

Target: CLIP ViT-B/32
Training was stopped at 8 epochs (best evaluation checkpoint)
After fine-tuning, accuracy on the test dataset improved from 0.14 to 0.41 (a minimal training sketch follows below)
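The training script itself is not shown here, so the following is a minimal, hypothetical sketch of one possible setup: the HuggingFace openai/clip-vit-base-patch32 checkpoint fine-tuned with a cross-entropy loss over the same region prompts used at inference. The names region_info (region name → description) and train_loader (batches of preprocessed images with region labels) are placeholders, and the hyperparameters are illustrative only, not the values used in this experiment.
```python
# Hypothetical fine-tuning sketch; not the actual training script used in this experiment.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Same prompt template as at inference time; region_info is a placeholder dict.
prompts = [f"A street view photo from {name}, {desc}" for name, desc in region_info.items()]
text_inputs = processor(text=prompts, return_tensors="pt", padding=True).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.01)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(8):  # training was stopped at 8 epochs (best evaluation)
    for pixel_values, labels in train_loader:  # labels: region index per image
        pixel_values, labels = pixel_values.to(device), labels.to(device)

        image_feat = model.get_image_features(pixel_values=pixel_values)
        text_feat = model.get_text_features(**text_inputs)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        # Image-to-region logits, scaled by CLIP's learned temperature.
        logits = model.logit_scale.exp() * image_feat @ text_feat.T
        loss = criterion(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```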
After Fine-tuning (because data collection was stopped)
Before Fine-tuning
````markdown
# LA County Geolocation Experiment Report

Date: 2026-01-04 19:13:51

## Model Performance Comparison

| Model | Accuracy | Precision | Recall | F1 Score |
|:------|---------:|----------:|-------:|---------:|
| CLIP  |    0.147 |     0.565 |  0.147 |    0.203 |

## CLIP Detailed Results

- **Accuracy**: 0.147
- **Precision**: 0.565
- **Recall**: 0.147
- **F1 Score**: 0.203

### Per-Region Performance

```
                              precision    recall  f1-score   support

        Downtown Los Angeles       0.73      0.14      0.24       800
                   Hollywood       0.68      0.02      0.05       800
               Beverly Hills       0.44      0.17      0.24       800
                Santa Monica       0.40      0.14      0.20       800
                Venice Beach       0.53      0.26      0.35       800
                      Malibu       0.00      0.00      0.00         0
 Griffith Park & Observatory       0.13      0.30      0.18        30
                   Koreatown       0.21      0.30      0.24       599
                    Pasadena       0.60      0.03      0.06       800
                  Long Beach       0.87      0.15      0.25       800
                 Silver Lake       0.00      0.00      0.00         0
              West Hollywood       0.00      0.00      0.00         0
             Manhattan Beach       0.00      0.00      0.00         0
                 Culver City       0.00      0.00      0.00         0
               Santa Clarita       0.00      0.00      0.00         0

                    accuracy                           0.15      6229
                   macro avg       0.31      0.10      0.12      6229
                weighted avg       0.56      0.15      0.20      6229
```
````

Experiment

1-1. Preparation Stage: Region Text Embeddings

The constructor first loads region information and creates text embeddings.
```python
self.regions = self._load_regions()  # Read la_county_regions.json
self.region_names = list(self.regions.keys())
self._create_region_embeddings()
```
_create_region_embeddings():
```python
texts = [
    f"A street view photo from {name}, {info['description']}"
    for name, info in self.regions.items()
]
inputs = self.processor(text=texts, return_tensors="pt", padding=True).to(self.device)
with torch.no_grad():
    text_features = self.model.get_text_features(**inputs)
self.region_embeddings = text_features / text_features.norm(dim=-1, keepdim=True)
```
For each class, prompts like this are used:
"A street view photo from Downtown Los Angeles, LA's financial and cultural center - skyline, downtown streets"
Normalized CLIP text embeddings for these texts are stored in self.region_embeddings. (shape: [num_regions, dim])
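For reference, a minimal illustration of what an entry in la_county_regions.json is expected to contain, based only on the fields the code actually accesses (description here, centroid later in predict); the real file is not shown, so the layout below is an assumption.
```python
# Assumed structure of la_county_regions.json (only the fields used by the code).
import json

with open("la_county_regions.json") as f:
    regions = json.load(f)

# regions["Downtown Los Angeles"] might look like:
# {
#     "description": "LA's financial and cultural center - skyline, downtown streets",
#     "centroid": [34.05, -118.25]   # bbox centroid as (latitude, longitude)
# }
```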

1-2. Actual Prediction Flow

predict(self, image_path, top_k=5):
```python
image = Image.open(image_path).convert("RGB")
inputs = self.processor(images=image, return_tensors="pt").to(self.device)
with torch.no_grad():
    image_features = self.model.get_image_features(**inputs)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
```
1. Load the image and preprocess it with the CLIP processor
2. Extract image embeddings with get_image_features (including normalization)
```python
similarity = (image_features @ self.region_embeddings.T).squeeze()
probs = torch.softmax(similarity, dim=0)
```
3. Calculate the cosine similarity between the image embedding and the pre-computed region text embeddings (since both are ℓ2-normalized, the dot product equals the cosine similarity)
4. Apply softmax to the similarity scores → a probability distribution over the regions
```python
top_probs, top_indices = torch.topk(probs, min(top_k, len(self.region_names)))
top_pred_gps = []
top_pred_prob = []
for idx, prob in zip(top_indices, top_probs):
    region_name = self.region_names[idx.item()]
    centroid = self.regions[region_name]["centroid"]
    top_pred_gps.append(centroid)
    top_pred_prob.append(prob.item())
```
5. Extract the top_k region indices ordered by probability → look up the region names → retrieve each region's bbox centroid (latitude, longitude) from la_county_regions.json
Final return value:
```python
return top_pred_gps, top_pred_prob

# Example
# top_pred_gps  = [(34.05, -118.25), (34.07, -118.30), (34.02, -118.50), ...]
# top_pred_prob = [0.62, 0.21, 0.08, ...]
```
In summary, this is a CLIP zero-shot classification approach that uses region centroids as the predicted GPS coordinates.
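A short, hypothetical usage example of the flow above; the variable name geolocator and the image path are placeholders, not names from the original code.
```python
# Hypothetical call to the predictor described above.
top_pred_gps, top_pred_prob = geolocator.predict("samples/example_street_view.jpg", top_k=5)
for (lat, lon), prob in zip(top_pred_gps, top_pred_prob):
    print(f"({lat:.4f}, {lon:.4f})  p={prob:.3f}")
```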
Sampled 5,000 images from approximately 8,000 data points
Class Examples:
"A street view photo from Downtown Los Angeles, LA's financial and cultural center - skyline"
"A street view photo from Koreatown, Korean town - Korean restaurants, karaoke"

Result

Fine-Tuned CLIP
ACC@1km: 9.78% (worse than the un-fine-tuned CLIP)
ACC@5km: 15.32%
ACC@10km: 27.86%
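The evaluation script for these numbers is not included here; a minimal sketch of how an Acc@Xkm metric could be computed from the top-1 predicted centroid and the ground-truth GPS, assuming a haversine great-circle distance:
```python
# Minimal sketch of an Acc@Xkm metric; not the exact evaluation code used in the experiment.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def acc_at_km(preds, truths, threshold_km):
    # preds / truths: lists of (lat, lon); each prediction is the top-1 region centroid.
    hits = sum(
        haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
        for p, t in zip(preds, truths)
    )
    return hits / len(truths)
```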