Changyu Lee

Overview of ML & Classification

Published at
2025/12/22
Last edited time
2026/01/02 00:23
Created
2025/12/22 10:30
Section
ML
Status
Done
Series
From the Bottom
Tags
Lecture Summary
AI summary
The overview covers fundamental machine learning concepts, focusing on classification problems and nearest neighbor methods. It discusses key ingredients like data, models, and predictions, introduces the Iris flower dataset, and explains both basic Nearest Neighbor Classification and K-Nearest Neighbor (KNN) techniques, including hyperparameter tuning through cross-validation.
Keywords
ML
Language
ENG
Week
1
This post provides a comprehensive overview of fundamental machine learning concepts, focusing on classification problems and nearest neighbor methods. It covers the core ingredients of ML (data, model, prediction), introduces the Iris flower dataset as a classic example, and explores both basic Nearest Neighbor Classification and its K-NN extension, including practical considerations for hyperparameter tuning through cross-validation.

Overview of Machine Learning

What is Machine Learning (ML)?

A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty
Key ingredients in ML
1.
Data
collected from past observations (⇒ training data)
2.
Model
devised to capture the patterns in the data
for example, models are trained to learn the relationship between input features and output labels
3.
Prediction
Apply the model to forecast what is going to happen in the future
for instance, predicting the category of a new data point or estimating a continuous value
Different flavors of learning problems
1.
Supervised Learning
2.
Unsupervised Learning
3.
Decision Making (e.g. Reinforcement Learning)

Classification & Nearest Neighbor Classification

What is a Classification Problem?

Classification is a supervised learning task where the goal is to predict a categorical label for a given input
Given training data with known labels, we build a model that can assign the correct category to new, unseen data points
Examples include spam detection, image recognition, medical diagnosis, and sentiment analysis
E.g. Recognizing Flowers
Types of Iris: setosa, versicolor, and virginica
The Iris flower recognition problem is a classic example in machine learning, introduced by statistician Ronald Fisher in 1936
The dataset consists of 150 samples of iris flowers, with 50 samples from each of three species:
Iris setosa - typically has smaller petals and distinct characteristics
Iris versicolor - medium-sized petals with intermediate features
Iris virginica - generally has the largest petals
Each sample is described by four features measured in centimeters:
Sepal length
Sepal width
Petal length
Petal width
The goal is to build a classification model that can predict which species an iris belongs to based on these four measurements
This is a multi-class classification problem where C = 3 (three species) and the feature vector x ∈ ℝ⁴ (four measurements)
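As a quick illustration (not part of the original lecture), the dataset ships with scikit-learn, so you can inspect these shapes directly; this minimal sketch assumes scikit-learn is installed:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4): N = 150 samples, d = 4 features
print(iris.feature_names)    # sepal length/width, petal length/width (in cm)
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica'], so C = 3
```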
Visualization of Classification

General Setup for Multi-class classification

N samples/instances
D^TRAIN = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
where xᵢ ∈ ℝᵈ is the feature vector and yᵢ ∈ {1, 2, ..., C} is the class label
Future Prediction: f(x);
given a new input x, predict its class label ŷ = f(x)

Binary Classification

Number of classes: C = 2
Conventional labels: {0,1} or {-1, 1}

Nearest Neighbor Classification (NNC)

Nearest Neighbor Classification (NNC) is one of the simplest and most intuitive classification algorithms in machine learning. The core idea is to classify a new data point based on the class of its closest neighbor in the training dataset.
How It Works
Step 1: Calculate distances
For a new point x that we want to classify, we calculate the distance between x and every training point xᵢ in our dataset
Step 2: Find the nearest neighbor
We identify which training point has the smallest distance to our query point. This is expressed as nn(x) = argmin_{i∈{1,...,N}} d(x, xᵢ)
where d(x, xᵢ) is the distance between x and training point xᵢ
The predicted label for x is: ŷ = y_{nn(x)}
Common distance metrics:
1.
Euclidean distance: d(x, x') = ||x - x'||₂ = √(Σⱼ(xⱼ - x'ⱼ)²)
2.
Manhattan distance: d(x, x') = ||x - x'||₁ = Σⱼ|xⱼ - x'ⱼ|
Expanding the nearest neighbor function with the L2 distance metric:
NN(x) = \arg\min_{i \in \{1, ..., N\}} \sqrt{\sum_{j=1}^{d} (x_j - x_{i,j})^2}
This equation finds the training sample x_i that minimizes the Euclidean distance to the query point x, where:
x = (x₁, x₂, ..., x_d) is the query point (new data point to classify)
x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) is the i-th training sample
d is the dimensionality of the feature space
N is the total number of training samples
Step 3: Assign the label
The predicted class for the new point is simply the class label of its nearest neighbor: ŷ = y_{nn(x)}
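To make the three steps concrete, here is a minimal from-scratch sketch (my own illustration, not code from the lecture; the names nnc_predict, euclidean, and manhattan are made up for this example):

```python
import numpy as np

def euclidean(x, xp):
    return np.sqrt(np.sum((x - xp) ** 2))   # ||x - x'||_2

def manhattan(x, xp):
    return np.sum(np.abs(x - xp))            # ||x - x'||_1

def nnc_predict(X_train, y_train, x, dist=euclidean):
    distances = [dist(x, xi) for xi in X_train]   # Step 1: distance to every training point
    nn = int(np.argmin(distances))                # Step 2: nn(x) = argmin_i d(x, x_i)
    return y_train[nn]                            # Step 3: y_hat = y_{nn(x)}

# Toy usage: the query (4.0, 4.5) is closest to (5, 5), so it gets label 1
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(nnc_predict(X_train, y_train, np.array([4.0, 4.5])))  # -> 1
```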
Decision Boundary
The decision boundary is a fundamental concept in classification that represents the boundary or surface that separates different classes in the feature space.
Definition
A decision boundary is the region of space where the classifier changes its prediction from one class to another. It's where the model is "uncertain" between classes.
For Nearest Neighbor Classification
The decision boundary is formed by all points that are equidistant from their nearest neighbors of different classes
Since NNC assigns the label of the nearest training point, the boundary lies exactly halfway between training points of different classes
Mathematical Representation
For two classes, the decision boundary is the set of points x where:
d(x, x_i) = d(x, x_j)
where x_i and x_j are the nearest neighbors from different classes
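A tiny numerical check of this condition (my own sketch, not from the lecture): with one training point per class, the 1-NN prediction flips exactly at the perpendicular bisector between them:

```python
import numpy as np

a, b = np.array([0.0, 0.0]), np.array([2.0, 0.0])   # class 0 and class 1 training points

# The boundary is where d(x, a) = d(x, b); along the x-axis that is x = 1
for t in [0.4, 0.9, 1.1, 1.6]:
    x = np.array([t, 0.0])
    label = 0 if np.linalg.norm(x - a) < np.linalg.norm(x - b) else 1
    print(t, "->", label)   # prediction flips from 0 to 1 as t crosses 1.0
```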
Visual Example
For 1-NN, the decision boundary partitions the feature space into cells (a Voronoi diagram): each cell is the region where a particular training point is the nearest neighbor, and any query falling inside that cell is assigned that point's class label.
Accuracy
Accuracy is the proportion of correct predictions made by the classifier out of all predictions
It measures how often the model correctly classifies data points
Formula
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}(y_i = \hat{y}_i)
where 𝟙(y_i = ŷ_i) is an indicator function that equals 1 when the prediction matches the true label, and 0 otherwise
Range
Accuracy ranges from 0 to 1 (or 0% to 100%)
Higher accuracy indicates better performance
Error Rate
Error rate is the proportion of incorrect predictions made by the classifier
It's the complement of accuracy
Formula
\text{Error Rate} = 1 - \text{Accuracy} = \frac{\text{Number of Incorrect Predictions}}{\text{Total Number of Predictions}}
Relationship
Error Rate + Accuracy = 1
If a model has 90% accuracy, it has a 10% error rate
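Both quantities are one line of NumPy; a minimal sketch (the variable names are mine):

```python
import numpy as np

y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])   # one of six predictions is wrong

accuracy = np.mean(y_true == y_pred)    # (1/N) * sum of 1(y_i == y_hat_i)
error_rate = 1.0 - accuracy             # complement of accuracy
print(accuracy, error_rate)             # 0.833... and 0.166...
```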

Training vs Test Error

Training Error
The error rate calculated on the training dataset D^TRAIN
Measures how well the model fits the training data
For Nearest Neighbor Classification evaluated on its own training data, the training error is always 0% because each training point is its own nearest neighbor (assuming no two identical training points carry different labels)
Test Error
Why We Need a Separate Test Dataset
Training error alone cannot tell us how well our model will perform on new, unseen data.
For NNC, the training error is always 0% (perfect accuracy)
Why? Because each training point is its own nearest neighbor! When we query a training point, it will always find itself as the closest match and return its own correct label. This gives us a false sense of perfect performance.
Overfitting Issue: A model can memorize the training data without learning generalizable patterns. Low training error doesn't guarantee good performance on new data; the model might just be fitting noise in the training set rather than true underlying patterns.
Test Dataset Solution: By evaluating on a completely separate test dataset that wasn't used during training, we get an honest assessment of generalization
Test error reveals how well the model performs on data it has never seen before. This is the true measure of a model's practical usefulness.
The error rate calculated on a separate test dataset D^TEST that the model has never seen during training
Formula: \text{Test Error} = \frac{1}{N_{\text{test}}} \sum_{(x_i, y_i) \in D^{\text{TEST}}} \mathbb{1}(y_i \neq \hat{y}_i)
Example
Consider a classifier evaluated on 100 test samples:
Correctly classified: 85 samples
Incorrectly classified: 15 samples
Accuracy = 85/100 = 0.85 or 85%
Error Rate = 15/100 = 0.15 or 15%
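The 0%-training-error phenomenon is easy to reproduce; a hedged sketch using scikit-learn's KNeighborsClassifier with n_neighbors=1 as a stand-in for NNC (scikit-learn is my choice here, not prescribed by the lecture):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_error = np.mean(model.predict(X_tr) != y_tr)  # 0.0: each point is its own neighbor
test_error = np.mean(model.predict(X_te) != y_te)   # the honest generalization estimate
print(train_error, test_error)
```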

K-Nearest Neighbor

A variant of NNC:
K-Nearest Neighbor (KNN) is an extension of the basic Nearest Neighbor Classification that considers multiple neighbors instead of just one to make more robust predictions.
Core Idea
Instead of using only the single nearest neighbor, KNN looks at the K closest training points and makes a decision based on the majority vote among these K neighbors
The parameter K is a positive integer (typically K = 3, 5, 7, etc.) that determines how many neighbors to consider
How It Works
Step 1: Choose K
Select the number of neighbors K to consider (e.g., K = 5)
Step 2: Find K nearest neighbors
For a new point x, calculate distances to all training points and identify the K closest ones
Let N_K(x) denote the set of indices of the K nearest neighbors
Step 3: Majority voting
Count how many of the K nearest neighbors belong to each class
Assign the class label that appears most frequently among the K neighbors
\hat{y} = \arg\max_{c \in \{1,...,C\}} \sum_{i \in N_K(x)} \mathbb{1}(y_i = c)
This formula finds the class c that has the most votes among the K nearest neighbors
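A minimal sketch of the vote (my own illustration; knn_predict is a hypothetical name, and it assumes integer class labels so np.bincount can tally the votes):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    neighbors = np.argsort(dists)[:k]            # indices of the k closest: N_K(x)
    votes = np.bincount(y_train[neighbors])      # tally 1(y_i = c) for each class c
    return int(np.argmax(votes))                 # class with the most votes

# Toy usage: the 3 nearest neighbors of (4, 4) are mostly class 1
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4, 4]), k=3))  # -> 1
```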
Choosing K (K is called a hyperparameter)
What is a Hyperparameter?
A hyperparameter is a parameter whose value is set before the learning process begins and controls the learning process itself, rather than being learned from the data.
Key Characteristics
Not learned from training data - must be specified in advance
Controls the behavior and performance of the learning algorithm
Different from model parameters (like weights), which are learned during training
Impact on Model Performance
Poor hyperparameter choices can lead to underfitting or overfitting
Optimal hyperparameters depend on the specific dataset and problem
Finding good hyperparameters is crucial for model performance
Hyper-parameters in NNC
1.
The distance measure (e.g. the parameter p for L_p norm)
2.
K (how many nearest neighbors do we consider?)
Small K (e.g., K=1):
More complex decision boundaries
More sensitive to noise
Can lead to overfitting
Large K:
Smoother decision boundaries
More stable predictions
Can lead to underfitting if K is too large
Optimal K: Typically chosen through cross-validation on the training data
3.
Different ways of preprocessing
Validation Dataset & S-Fold Cross-Validation
The validation dataset is a crucial component in machine learning that helps us select the best hyperparameters for our model without touching the test data.
What is a Validation Dataset?
A validation dataset is a portion of data held out from training that is used specifically for tuning hyperparameters and making model selection decisions
It acts as a "practice test" to evaluate different model configurations
Three-Way Data Split:
Training Set: Used to train the model (learn parameters)
Validation Set: Used to tune hyperparameters and select the best model
Test Set: Used only once at the end to evaluate final model performance
Why We Need It:
We cannot use the test set for hyperparameter tuning because that would "leak" information about the test set into our model selection process
The validation set provides an unbiased estimate of model performance for different hyperparameter choices
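One common way to produce the three-way split (a sketch, assuming scikit-learn; the 60/20/20 proportions are just an example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set, then split the remainder into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of the 80% = 20% overall

print(len(X_train), len(X_val), len(X_test))  # 90 30 30, i.e. 60/20/20
```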
S-Fold Cross-Validation
S-fold cross-validation (also called K-fold cross-validation) is a more sophisticated approach that makes better use of limited data by creating multiple train/validation splits
How It Works:
1.
Divide the data: Split the training data into S equal-sized "folds" (subsets)
2.
Iterate S times: For each iteration s = 1, 2, ..., S:
Use fold s as the validation set
Use the remaining S-1 folds as the training set
Train the model and evaluate performance on the validation fold
3.
Average results: Compute the average validation error across all S folds
4.
Select best hyperparameters: Choose the hyperparameter configuration with the lowest average validation error
5.
Final training: Retrain the model with the chosen hyperparameters on the entire training dataset (all folds combined)
Mathematical Notation:
\text{CV Error} = \frac{1}{S} \sum_{s=1}^{S} \text{Error}_s
where Error_s is the validation error on the s-th fold
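The formula translates directly into a loop over folds; a sketch using scikit-learn's KFold (the library choice and K = 3 are mine, not from the lecture):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # S = 5 folds

fold_errors = []
for train_idx, val_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    fold_errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))  # Error_s

print(np.mean(fold_errors))  # CV Error = (1/S) * sum_s Error_s
```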
Common Choices for S:
S = 5 or S = 10 are most common
Larger S means more training data per fold but more computational cost
S = N (where N is the number of samples) is called "Leave-One-Out Cross-Validation" (LOOCV)
Advantages:
Every data point is used for both training and validation
More reliable estimate of model performance than a single validation set
Reduces variance in performance estimation
Particularly useful when data is limited
Example for KNN Hyperparameter Tuning:
Suppose we want to find the best value of K for KNN:
1.
Split training data into S = 5 folds
2.
For each candidate K value (e.g., K = 1, 3, 5, 7, 9):
Perform 5-fold cross-validation
Calculate average validation error across all 5 folds
3.
Select the K with the lowest average cross-validation error
4.
Retrain KNN with the chosen K on all training data
5.
Finally, evaluate on the test set to get unbiased performance estimate
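Putting steps 1-5 together, a hedged end-to-end sketch (again assuming scikit-learn; cross_val_score reports accuracy, so we pick the K with the highest mean score, which is equivalent to the lowest average error):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

best_k, best_score = None, -np.inf
for k in [1, 3, 5, 7, 9]:                                   # candidate K values
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5)        # 5-fold CV accuracy
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

# Retrain on all training data with the chosen K, then touch the test set once
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))
```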
Tuning via a validation dataset