This post provides a comprehensive overview of fundamental machine learning concepts, focusing on classification problems and nearest neighbor methods. It covers the core ingredients of ML (data, model, prediction), introduces the Iris flower dataset as a classic example, and explores both basic Nearest Neighbor Classification and its K-NN extension, including practical considerations for hyperparameter tuning through cross-validation.
Overview of Machine Learning
What is Machine Learning (ML)?
•
A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty
•
Key ingredients in ML
1.
Data
collected from past observations (⇒ training data)
2.
Model
devised to capture the patterns in the data
for example, models are trained to learn the relationship between input features and output labels
3.
Prediction
Apply the model to forecast what is going to happen in the future
for instance, predicting the category of a new data point or estimating a continuous value
•
Different flavors of learning problems
1.
Supervised Learning
2.
Unsupervised Learning
3.
Decision Making (e.g. Reinforcement Learning)
Classification & Nearest Neighbor Classification
What is a Classification Problem?
•
Classification is a supervised learning task where the goal is to predict a categorical label for a given input
•
Given training data with known labels, we build a model that can assign the correct category to new, unseen data points
•
Examples include spam detection, image recognition, medical diagnosis, and sentiment analysis
•
E.g. Recognizing Flowers
◦
Types of Iris: setosa, versicolor, and virginica
The Iris flower recognition problem is a classic example in machine learning, introduced by statistician Ronald Fisher in 1936
The dataset consists of 150 samples of iris flowers, with 50 samples from each of three species:
▪
Iris setosa - typically has smaller petals and distinct characteristics
▪
Iris versicolor - medium-sized petals with intermediate features
▪
Iris virginica - generally has the largest petals
Each sample is described by four features measured in centimeters:
▪
Sepal length
▪
Sepal width
▪
Petal length
▪
Petal width
The goal is to build a classification model that can predict which species an iris belongs to based on these four measurements
This is a multi-class classification problem where C = 3 (three species) and the feature vector x ∈ ℝ⁴ (four measurements)
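The dataset ships with scikit-learn, so it is easy to follow along in code. A minimal loading sketch (assuming scikit-learn is installed; variable names are my own):

```python
# Load the Iris dataset bundled with scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # X: (150, 4) feature matrix, y: (150,) labels in {0, 1, 2}

print(iris.feature_names)       # sepal length/width and petal length/width, in cm
print(iris.target_names)        # ['setosa' 'versicolor' 'virginica']
print(X.shape, y.shape)         # (150, 4) (150,)
```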
Visualization of Classification
General Setup for Multi-class classification
•
N samples/instances
D^TRAIN = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
where xᵢ ∈ ℝᵈ is the feature vector and yᵢ ∈ {1, 2, ..., C} is the class label
•
Future Prediction: f(x);
given a new input x, predict its class label ŷ = f(x)
Binary Classification
•
Number of classes: C = 2
•
Conventional labels: {0,1} or {-1, 1}
Nearest Neighbor Classification (NNC)
Nearest Neighbor Classification (NNC) is one of the simplest and most intuitive classification algorithms in machine learning. The core idea is to classify a new data point based on the class of its closest neighbor in the training dataset.
•
How It Works
◦
Step 1: Calculate distances
For a new point x that we want to classify, we calculate the distance between x and every training point xᵢ in our dataset
◦
Step 2: Find the nearest neighbor
We identify which training point has the smallest distance to our query point. This is expressed as nn(x) = argmin_{i∈{1,...,N}} d(x, xᵢ)
where d(x, xᵢ) is the distance between x and training point xᵢ
▪
The predicted label for x is: ŷ = y_{nn(x)}
▪
Common distance metrics:
1.
Euclidean distance: d(x, x') = ||x - x'||₂ = √(Σⱼ(xⱼ - x'ⱼ)²)
2.
Manhattan distance: d(x, x') = ||x - x'||₁ = Σⱼ|xⱼ - x'ⱼ|
Expanding the nearest neighbor function with the L2 distance metric:
nn(x) = argmin_{i∈{1,...,N}} √(Σ_{j=1}^{d} (x_j - x_{i,j})²)
This equation finds the training sample x_i that minimizes the Euclidean distance to the query point x, where:
▪
x = (x₁, x₂, ..., x_d) is the query point (new data point to classify)
▪
x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) is the i-th training sample
▪
d is the dimensionality of the feature space
▪
N is the total number of training samples
◦
Step 3: Assign the label
The predicted class for the new point is simply the class label of its nearest neighbor: ŷ = y_{nn(x)}
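To make the three steps concrete, here is a minimal NumPy sketch of 1-NN with the Euclidean distance (the function name is illustrative, not from any library):

```python
import numpy as np

def nn_classify(X_train, y_train, x_query):
    """1-NN prediction: the label of the single closest training point."""
    # Step 1: L2 distances from the query point to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Step 2: nn(x) = argmin_i d(x, x_i).
    nn_index = np.argmin(dists)
    # Step 3: ŷ = y_{nn(x)}.
    return y_train[nn_index]
```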
•
Decision Boundary
The decision boundary is a fundamental concept in classification that represents the boundary or surface that separates different classes in the feature space.
◦
Definition
A decision boundary is the region of space where the classifier changes its prediction from one class to another. It's where the model is "uncertain" between classes.
◦
For Nearest Neighbor Classification
The decision boundary is formed by all points that are equidistant from their nearest neighbors of different classes
Since NNC assigns the label of the nearest training point, the boundary lies exactly halfway between training points of different classes
◦
Mathematical Representation
For two classes, the decision boundary is the set of points x where:
d(x, x_i) = d(x, x_j)
where x_i and x_j are the nearest neighbors from different classes
◦
Visual Example
In the visualization below, you can see how the decision boundary creates regions around each training point. Each region represents the area where that point is the nearest neighbor, and points falling in that region will be assigned that point's class label.
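A sketch of how such a figure can be produced, coloring each point of a dense 2-D grid by the prediction of a 1-NN classifier trained on two of the Iris features (one common way to draw this kind of plot, not necessarily how the original figure was made):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Use only the two petal features so the regions can be drawn in 2-D.
iris = load_iris()
X, y = iris.data[:, 2:4], iris.target
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Predict on a dense grid and color each cell by the predicted class.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)    # decision regions
plt.scatter(X[:, 0], X[:, 1], c=y)    # training points
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
```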
•
Accuracy
Accuracy is the proportion of correct predictions made by the classifier out of all predictions
It measures how often the model correctly classifies data points
◦
Formula
Accuracy = (1/N) Σ 𝟙(y_i = ŷ_i)
where 𝟙(y_i = ŷ_i) is an indicator function that equals 1 when the prediction matches the true label, and 0 otherwise
◦
Range
Accuracy ranges from 0 to 1 (or 0% to 100%)
Higher accuracy indicates better performance
•
Error Rate
Error rate is the proportion of incorrect predictions made by the classifier
It's the complement of accuracy
◦
Formula
Error Rate = (1/N) Σ 𝟙(y_i ≠ ŷ_i)
◦
Relationship
Error Rate + Accuracy = 1
If a model has 90% accuracy, it has a 10% error rate
Training vs Test Error
•
Training Error
The error rate calculated on the training dataset D^TRAIN
Measures how well the model fits the training data
For Nearest Neighbor Classification evaluated on its own training data, the training error is always 0%, because each training point is its own nearest neighbor (assuming no duplicate inputs with conflicting labels)
•
Test Error
◦
Why We Need a Separate Test Dataset
Training error alone cannot tell us how well our model will perform on new, unseen data.
▪
For NNC, the training error is always 0% (perfect accuracy)
•
Why? Because each training point is its own nearest neighbor! When we query a training point, it will always find itself as the closest match and return its own correct label. This gives us a false sense of perfect performance
▪
Overfitting Issue: A model can memorize the training data without learning generalizable patterns. Low training error doesn't guarantee good performance on new data. The model might just be fitting noise in the training set rather than true underlying patterns
▪
Test Dataset Solution: By evaluating on a completely separate test dataset that wasn't used during training, we get an honest assessment of generalization
•
Test error reveals how well the model performs on data it has never seen before. This is the true measure of a model's practical usefulness
The error rate calculated on a separate test dataset D^TEST that the model has never seen during training
Formula: Test Error = (1/N_test) Σ 𝟙(y_i ≠ ŷ_i) for all (x_i, y_i) in the test set
•
Example
Consider a classifier evaluated on 100 test samples:
◦
Correctly classified: 85 samples
◦
Incorrectly classified: 15 samples
◦
Accuracy = 85/100 = 0.85 or 85%
◦
Error Rate = 15/100 = 0.15 or 15%
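The same arithmetic in a short NumPy sketch (the toy arrays are constructed to match the 85/15 example above):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Mean of the indicator 1(y_i == ŷ_i) over all predictions.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

y_true = np.array([0] * 85 + [1] * 15)   # true labels
y_pred = np.zeros(100, dtype=int)        # predicts class 0 everywhere: 85 right, 15 wrong
print(accuracy(y_true, y_pred))          # 0.85
print(1 - accuracy(y_true, y_pred))      # 0.15 (error rate)
```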
K-Nearest Neighbor
A Variant of NNC
K-Nearest Neighbor (KNN) is an extension of the basic Nearest Neighbor Classification that considers multiple neighbors instead of just one to make more robust predictions.
•
Core Idea
Instead of using only the single nearest neighbor, KNN looks at the K closest training points and makes a decision based on the majority vote among these K neighbors
The parameter K is a positive integer (typically K = 3, 5, 7, etc.) that determines how many neighbors to consider
•
How It Works
◦
Step 1: Choose K
Select the number of neighbors K to consider (e.g., K = 5)
◦
Step 2: Find K nearest neighbors
For a new point x, calculate distances to all training points and identify the K closest ones
Let N_K(x) denote the set of indices of the K nearest neighbors
◦
Step 3: Majority voting
Count how many of the K nearest neighbors belong to each class
Assign the class label that appears most frequently among the K neighbors
ŷ = argmax_{c∈{1,...,C}} Σ_{i∈N_K(x)} 𝟙(y_i = c)
This formula finds the class c that has the most votes among the K nearest neighbors
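A minimal NumPy sketch of the three steps (the function name is illustrative; ties in the vote are broken here by whichever label is counted first, and real implementations vary on this):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=5):
    """KNN prediction: majority vote among the k closest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distances to all training points
    nearest = np.argsort(dists)[:k]                    # indices of the k closest, N_K(x)
    votes = Counter(y_train[i] for i in nearest)       # count labels among the neighbors
    return votes.most_common(1)[0][0]                  # label with the most votes
```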
•
Choosing K (K is called a hyperparameter)
◦
What is a Hyperparameter?
A hyperparameter is a parameter whose value is set before the learning process begins and controls the learning process itself, rather than being learned from the data.
◦
Key Characteristics
▪
Not learned from training data - must be specified in advance
Controls the behavior and performance of the learning algorithm
Different from model parameters (like weights), which are learned during training
◦
Impact on Model Performance
▪
Poor hyperparameter choices can lead to underfitting or overfitting
Optimal hyperparameters depend on the specific dataset and problem
Finding good hyperparameters is crucial for model performance
◦
Hyper-parameters in NNC
1.
The distance measure (e.g., the parameter p for the L_p norm)
2.
K (how many nearest neighbors do we consider?)
•
Small K (e.g., K=1):
More complex decision boundaries
More sensitive to noise
Can lead to overfitting
•
Large K:
Smoother decision boundaries
More stable predictions
Can lead to underfitting if K is too large
•
Optimal K: Typically chosen through cross-validation on the training data (see the code sketch after this list)
3.
Different ways of preprocessing
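Both of the first two hyperparameters map directly onto constructor arguments of scikit-learn's KNeighborsClassifier: n_neighbors is K, and p selects the L_p norm. A quick sketch of how K changes held-out accuracy; in practice, K should be chosen with cross-validation as described below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# A simple held-out split, just to illustrate the effect of K.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 15):
    clf = KNeighborsClassifier(n_neighbors=k, p=2)  # p=2 -> Euclidean, p=1 -> Manhattan
    clf.fit(X_tr, y_tr)
    print(f"K={k}: held-out accuracy = {clf.score(X_val, y_val):.3f}")
```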
•
Validation Dataset & S-Fold Cross-Validation
The validation dataset is a crucial component in machine learning that helps us select the best hyperparameters for our model without touching the test data.
◦
What is a Validation Dataset?
▪
A validation dataset is a portion of data held out from training that is used specifically for tuning hyperparameters and making model selection decisions
It acts as a "practice test" to evaluate different model configurations
▪
Three-Way Data Split (sketched in code after this subsection):
•
Training Set: Used to train the model (learn parameters)
•
Validation Set: Used to tune hyperparameters and select the best model
•
Test Set: Used only once at the end to evaluate final model performance
▪
Why We Need It:
We cannot use the test set for hyperparameter tuning because that would "leak" information about the test set into our model selection process
The validation set provides an unbiased estimate of model performance for different hyperparameter choices
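A minimal sketch of such a three-way split with scikit-learn (the 60/20/20 ratio is an illustrative convention, not something prescribed above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Carve off 40% of the data, then split that portion in half:
# roughly 60% train / 20% validation / 20% test overall.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```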
◦
S-Fold Cross-Validation
S-fold cross-validation (elsewhere usually called K-fold cross-validation; S is used here to avoid a clash with the K in KNN) is a more sophisticated approach that makes better use of limited data by creating multiple train/validation splits
▪
How It Works:
1.
Divide the data: Split the training data into S equal-sized "folds" (subsets)
2.
Iterate S times: For each iteration s = 1, 2, ..., S:
•
Use fold s as the validation set
•
Use the remaining S-1 folds as the training set
•
Train the model and evaluate performance on the validation fold
3.
Average results: Compute the average validation error across all S folds
4.
Select best hyperparameters: Choose the hyperparameter configuration with the lowest average validation error
5.
Final training: Retrain the model with the chosen hyperparameters on the entire training dataset (all folds combined)
▪
Mathematical Notation:
Average Validation Error = (1/S) Σ_{s=1}^{S} Error_s
where Error_s is the validation error on the s-th fold
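A from-scratch sketch of the procedure and the formula above (fit_predict is a hypothetical callback standing in for any classifier; it is not a library function):

```python
import numpy as np

def s_fold_cv_error(X, y, fit_predict, S=5, seed=0):
    """Average validation error over S folds.

    fit_predict(X_tr, y_tr, X_val) must return predicted labels for X_val
    (a hypothetical callback, not a library function).
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), S)  # S roughly equal folds
    errors = []
    for s in range(S):
        val_idx = folds[s]                                   # fold s -> validation set
        tr_idx = np.concatenate([folds[j] for j in range(S) if j != s])
        y_hat = fit_predict(X[tr_idx], y[tr_idx], X[val_idx])
        errors.append(np.mean(y_hat != y[val_idx]))          # Error_s on this fold
    return np.mean(errors)                                   # (1/S) Σ_s Error_s
```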
▪
Common Choices for S:
•
S = 5 or S = 10 are most common
•
Larger S means more training data in each split but higher computational cost
•
S = N (where N is the number of samples) is called "Leave-One-Out Cross-Validation" (LOOCV)
▪
Advantages:
•
Every data point is used for both training and validation
•
More reliable estimate of model performance than a single validation set
•
Reduces variance in performance estimation
•
Particularly useful when data is limited
▪
Example for KNN Hyperparameter Tuning:
Suppose we want to find the best value of K for KNN:
1.
Split training data into S = 5 folds
2.
For each candidate K value (e.g., K = 1, 3, 5, 7, 9):
•
Perform 5-fold cross-validation
•
Calculate average validation error across all 5 folds
3.
Select the K with the lowest average cross-validation error
4.
Retrain KNN with the chosen K on all training data
5.
Finally, evaluate on the test set to get unbiased performance estimate
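The same five steps with scikit-learn utilities, which handle the fold bookkeeping (a sketch; the K grid matches the example above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 1-2: 5-fold CV error for each candidate K, using the training data only.
candidate_ks = [1, 3, 5, 7, 9]
cv_errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_train, y_train, cv=5).mean()
             for k in candidate_ks]

# Steps 3-4: pick the K with the lowest average CV error, retrain on all training data.
best_k = candidate_ks[int(np.argmin(cv_errors))]
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)

# Step 5: a single, final test-set evaluation.
print("best K:", best_k, "| test accuracy:", final_model.score(X_test, y_test))
```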
•
Tuning via a validation dataset