Changyu Lee

[Geolocation] First Experiment for Fine-tuning CLIP

Published at
2026/01/26
Last edited time
2026/01/27 03:49
Created
2026/01/04 20:19
Section
Research
Status
Done
Series
Geolocation
Tags
Research
AI summary
Fine-tuning the CLIP ViT-B/32 model improved accuracy for LA County sub-city level geolocation from 0.14 to 0.41 after 8 epochs. The dataset included 28,551 images, with performance metrics showing 27.86% accuracy within 10 km on 5,000 test samples. The model uses cosine similarity for predictions, with Venice Beach achieving the highest recall and Long Beach the highest precision. The approach performs zero-shot classification and uses region centroids for GPS estimation. However, under the Acc@Xkm metric, the fine-tuned model performs worse than the un-fine-tuned model.
Keywords
GeoLocation
CLIP
Fine-Tuning
Language
ENG
Week
26-1
Fine-tuned CLIP ViT-B/32 model improved accuracy from 0.14 to 0.41 on the LA County sub-city level geolocation task after 8 epochs of training
Dataset consists of 28,551 images across 9 regions for training, with 15 regions tested during inference
Model uses CLIP's vision-language matching: compares image embeddings against region text embeddings (e.g., "A street view photo from Downtown Los Angeles, LA's financial and cultural center - skyline")
Prediction method: calculates cosine similarity between normalized image and text embeddings, applies softmax for probability distribution, returns top-k region centroids as GPS coordinates
Performance metrics on 5000 sampled images: ACC@1km: 9.78%, ACC@5km: 15.32%, ACC@10km: 27.86%
Venice Beach achieved highest recall (0.26) among regions, while Long Beach showed highest precision (0.87) in initial CLIP evaluation
Zero-shot classification approach maps predicted regions to their bounding box centroids for GPS coordinate estimation

Comparison of Fine-tuned and Un-fine-tuned CLIP at the Sub-city Level

Overview

Data Collection
Total of 28,551 images → sampled approximately 6,800 images
Examples

Finetuning CLIP

Target: CLIP ViT-B/32
Training was stopped at 8 epochs (best evaluation checkpoint)
After fine-tuning, accuracy on the test dataset improved from 0.14 to 0.41 (a minimal training sketch follows below)
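The training script itself is not shown here, so the following is a minimal, hypothetical sketch of one possible setup: the HuggingFace openai/clip-vit-base-patch32 checkpoint fine-tuned with a cross-entropy loss over the same region prompts used at inference. The names region_info (region name → description) and train_loader (batches of preprocessed images with region labels) are placeholders, and the hyperparameters are illustrative only, not the values used in this experiment.
```python
# Hypothetical fine-tuning sketch; not the actual training script used in this experiment.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Same prompt template as at inference time; region_info is a placeholder dict.
prompts = [f"A street view photo from {name}, {desc}" for name, desc in region_info.items()]
text_inputs = processor(text=prompts, return_tensors="pt", padding=True).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.01)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(8):  # training was stopped at 8 epochs (best evaluation)
    for pixel_values, labels in train_loader:  # labels: region index per image
        pixel_values, labels = pixel_values.to(device), labels.to(device)

        image_feat = model.get_image_features(pixel_values=pixel_values)
        text_feat = model.get_text_features(**text_inputs)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        # Image-to-region logits, scaled by CLIP's learned temperature.
        logits = model.logit_scale.exp() * image_feat @ text_feat.T
        loss = criterion(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```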
After Fine-tuning (because data collection was stopped)
Before Fine-tuning
````markdown
# LA County Geolocation Experiment Report

Date: 2026-01-04 19:13:51

## Model Performance Comparison

| Model | Accuracy | Precision | Recall | F1 Score |
|:------|---------:|----------:|-------:|---------:|
| CLIP  |    0.147 |     0.565 |  0.147 |    0.203 |

## CLIP Detailed Results

- **Accuracy**: 0.147
- **Precision**: 0.565
- **Recall**: 0.147
- **F1 Score**: 0.203

### Per-Region Performance

```
                              precision    recall  f1-score   support

        Downtown Los Angeles       0.73      0.14      0.24       800
                   Hollywood       0.68      0.02      0.05       800
               Beverly Hills       0.44      0.17      0.24       800
                Santa Monica       0.40      0.14      0.20       800
                Venice Beach       0.53      0.26      0.35       800
                      Malibu       0.00      0.00      0.00         0
 Griffith Park & Observatory       0.13      0.30      0.18        30
                   Koreatown       0.21      0.30      0.24       599
                    Pasadena       0.60      0.03      0.06       800
                  Long Beach       0.87      0.15      0.25       800
                 Silver Lake       0.00      0.00      0.00         0
              West Hollywood       0.00      0.00      0.00         0
             Manhattan Beach       0.00      0.00      0.00         0
                 Culver City       0.00      0.00      0.00         0
               Santa Clarita       0.00      0.00      0.00         0

                    accuracy                           0.15      6229
                   macro avg       0.31      0.10      0.12      6229
                weighted avg       0.56      0.15      0.20      6229
```
````

Experiment

1-1. Preparation Stage: Region Text Embeddings

The constructor first loads region information and creates text embeddings.
```python
self.regions = self._load_regions()  # Read la_county_regions.json
self.region_names = list(self.regions.keys())
self._create_region_embeddings()
```
_create_region_embeddings():
```python
texts = [
    f"A street view photo from {name}, {info['description']}"
    for name, info in self.regions.items()
]
inputs = self.processor(text=texts, return_tensors="pt", padding=True).to(self.device)
with torch.no_grad():
    text_features = self.model.get_text_features(**inputs)
self.region_embeddings = text_features / text_features.norm(dim=-1, keepdim=True)
```
For each class, prompts like this are used:
"A street view photo from Downtown Los Angeles, LA's financial and cultural center - skyline, downtown streets"
Normalized CLIP text embeddings for these texts are stored in self.region_embeddings. (shape: [num_regions, dim])
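For reference, a minimal illustration of what an entry in la_county_regions.json is expected to contain, based only on the fields the code actually accesses (description here, centroid later in predict); the real file is not shown, so the layout below is an assumption.
```python
# Assumed structure of la_county_regions.json (only the fields used by the code).
import json

with open("la_county_regions.json") as f:
    regions = json.load(f)

# regions["Downtown Los Angeles"] might look like:
# {
#     "description": "LA's financial and cultural center - skyline, downtown streets",
#     "centroid": [34.05, -118.25]   # bbox centroid as (latitude, longitude)
# }
```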

1-2. Actual Prediction Flow

predict(self, image_path, top_k=5):
```python
image = Image.open(image_path).convert("RGB")
inputs = self.processor(images=image, return_tensors="pt").to(self.device)
with torch.no_grad():
    image_features = self.model.get_image_features(**inputs)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
```
1. Load the image and preprocess it with the CLIP processor
2. Extract image embeddings with get_image_features (including normalization)
```python
similarity = (image_features @ self.region_embeddings.T).squeeze()
probs = torch.softmax(similarity, dim=0)
```
3. Calculate the cosine similarity between the image embedding and the pre-computed region text embeddings (since both are ℓ2-normalized, the dot product equals the cosine similarity)
4. Apply softmax to the similarity scores → a probability distribution over the regions
```python
top_probs, top_indices = torch.topk(probs, min(top_k, len(self.region_names)))
top_pred_gps = []
top_pred_prob = []
for idx, prob in zip(top_indices, top_probs):
    region_name = self.region_names[idx.item()]
    centroid = self.regions[region_name]["centroid"]
    top_pred_gps.append(centroid)
    top_pred_prob.append(prob.item())
```
5. Extract the top_k region indices ordered by probability → look up the region names → retrieve each region's bbox centroid (latitude, longitude) from la_county_regions.json
Final return value:
```python
return top_pred_gps, top_pred_prob

# Example
# top_pred_gps  = [(34.05, -118.25), (34.07, -118.30), (34.02, -118.50), ...]
# top_pred_prob = [0.62, 0.21, 0.08, ...]
```
In summary, this is a CLIP zero-shot classification approach that uses region centroids as the predicted GPS coordinates.
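A short, hypothetical usage example of the flow above; the variable name geolocator and the image path are placeholders, not names from the original code.
```python
# Hypothetical call to the predictor described above.
top_pred_gps, top_pred_prob = geolocator.predict("samples/example_street_view.jpg", top_k=5)
for (lat, lon), prob in zip(top_pred_gps, top_pred_prob):
    print(f"({lat:.4f}, {lon:.4f})  p={prob:.3f}")
```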
Sampled 5,000 images from approximately 8,000 data points
Class Examples:
"A street view photo from Downtown Los Angeles, LA's financial and cultural center - skyline"
"A street view photo from Koreatown, Korean town - Korean restaurants, karaoke"

Result

Fine-Tuned CLIP
ACC@1km: 9.78% (worse than the un-fine-tuned CLIP)
ACC@5km: 15.32%
ACC@10km: 27.86%
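The evaluation script for these numbers is not included here; a minimal sketch of how an Acc@Xkm metric could be computed from the top-1 predicted centroid and the ground-truth GPS, assuming a haversine great-circle distance:
```python
# Minimal sketch of an Acc@Xkm metric; not the exact evaluation code used in the experiment.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def acc_at_km(preds, truths, threshold_km):
    # preds / truths: lists of (lat, lon); each prediction is the top-1 region centroid.
    hits = sum(
        haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
        for p, t in zip(preds, truths)
    )
    return hits / len(truths)
```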