This post provides a comprehensive overview of fundamental machine learning concepts, focusing on classification problems and nearest neighbor methods. It covers the core ingredients of ML (data, model, prediction), introduces the Iris flower dataset as a classic example, and explores both basic Nearest Neighbor Classification and its K-NN extension, including practical considerations for hyperparameter tuning through cross-validation.
Overview of Machine Learning
What is Machine Learning (ML)?
•
A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty
•
Key ingredients in ML
1.
Data
collected from past observations (⇒ training data)
2.
Model
devised to capture the patterns in the data
for example, models are trained to learn the relationship between input features and output labels
3.
Prediction
Apply the model to forecast what is going to happen in the future
for instance, predicting the category of a new data point or estimating a continuous value
•
Different flavors of learning problems
1.
Supervised Learning
2.
Unsupervised Learning
3.
Decision Making (e.g. Reinforcement Learning)
Classification & Nearest Neighbor Classification
What is a Classification Problem?
•
Classification is a supervised learning task where the goal is to predict a categorical label for a given input
•
Given training data with known labels, we build a model that can assign the correct category to new, unseen data points
•
Examples include spam detection, image recognition, medical diagnosis, and sentiment analysis
•
E.g. Recognizing Flowers
◦
Types of Iris: setosa, versicolor, and virginica
The Iris flower recognition problem is a classic example in machine learning, introduced by statistician Ronald Fisher in 1936
The dataset consists of 150 samples of iris flowers, with 50 samples from each of three species:
▪
Iris setosa - typically has smaller petals and distinct characteristics
▪
Iris versicolor - medium-sized petals with intermediate features
▪
Iris virginica - generally has the largest petals
Each sample is described by four features measured in centimeters:
▪
Sepal length
▪
Sepal width
▪
Petal length
▪
Petal width
The goal is to build a classification model that can predict which species an iris belongs to based on these four measurements
This is a multi-class classification problem where C = 3 (three species) and the feature vector x ∈ ℝ⁴ (four measurements)
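The dataset ships with scikit-learn, so it is easy to follow along in code. A minimal loading sketch (assuming scikit-learn is installed; variable names are my own):

```python
# Load the Iris dataset bundled with scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # X: (150, 4) feature matrix, y: (150,) labels in {0, 1, 2}

print(iris.feature_names)       # sepal length/width and petal length/width, in cm
print(iris.target_names)        # ['setosa' 'versicolor' 'virginica']
print(X.shape, y.shape)         # (150, 4) (150,)
```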
Visualization of Classification
General Setup for Multi-class classification
•
N samples/instances
D^TRAIN = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
where xᵢ ∈ ℝᵈ is the feature vector and yᵢ ∈ {1, 2, ..., C} is the class label
•
Future Prediction: f(x);
given a new input x, predict its class label ŷ = f(x)
Binary Classification
•
Number of classes: C = 2
•
Conventional labels: {0,1} or {-1, 1}
Nearest Neighbor Classification (NNC)
Nearest Neighbor Classification (NNC) is one of the simplest and most intuitive classification algorithms in machine learning. The core idea is to classify a new data point based on the class of its closest neighbor in the training dataset.
•
How It Works
◦
Step 1: Calculate distances
For a new point x that we want to classify, we calculate the distance between x and every training point xᵢ in our dataset
◦
Step 2: Find the nearest neighbor
We identify which training point has the smallest distance to our query point. This is expressed as nn(x) = argmin_{i∈{1,...,N}} d(x, xᵢ)
where d(x, xᵢ) is the distance between x and training point xᵢ
▪
The predicted label for x is: ŷ = y_{nn(x)}
▪
Common distance metrics:
1.
Euclidean distance: d(x, x') = ||x - x'||₂ = √(Σⱼ(xⱼ - x'ⱼ)²)
2.
Manhattan distance: d(x, x') = ||x - x'||₁ = Σⱼ|xⱼ - x'ⱼ|
Expanding the nearest neighbor function with the L2 distance metric:
nn(x) = argmin_{i∈{1,...,N}} √(Σ_{j=1}^{d} (x_j - x_{i,j})²)
This equation finds the training sample x_i that minimizes the Euclidean distance to the query point x, where:
▪
x = (x₁, x₂, ..., x_d) is the query point (new data point to classify)
▪
x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) is the i-th training sample
▪
d is the dimensionality of the feature space
▪
N is the total number of training samples
◦
Step 3: Assign the label
The predicted class for the new point is simply the class label of its nearest neighbor: ŷ = y_{nn(x)}
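To make the three steps concrete, here is a minimal NumPy sketch of 1-NN with the Euclidean distance (the function name is illustrative, not from any library):

```python
import numpy as np

def nn_classify(X_train, y_train, x_query):
    """1-NN prediction: the label of the single closest training point."""
    # Step 1: L2 distances from the query point to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Step 2: nn(x) = argmin_i d(x, x_i).
    nn_index = np.argmin(dists)
    # Step 3: ŷ = y_{nn(x)}.
    return y_train[nn_index]
```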
•
Decision Boundary
The decision boundary is a fundamental concept in classification that represents the boundary or surface that separates different classes in the feature space.
◦
Definition
A decision boundary is the region of space where the classifier changes its prediction from one class to another. It's where the model is "uncertain" between classes.
◦
For Nearest Neighbor Classification
The decision boundary is formed by all points that are equidistant from their nearest neighbors of different classes
Since NNC assigns the label of the nearest training point, the boundary lies exactly halfway between training points of different classes
◦
Mathematical Representation
For two classes, the decision boundary is the set of points x where:
d(x, x_i) = d(x, x_j)
where x_i and x_j are the nearest neighbors from different classes
◦
Visual Example
In the visualization below, you can see how the decision boundary creates regions around each training point. Each region represents the area where that point is the nearest neighbor, and points falling in that region will be assigned that point's class label.
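A sketch of how such a figure can be produced, coloring each point of a dense 2-D grid by the prediction of a 1-NN classifier trained on two of the Iris features (one common way to draw this kind of plot, not necessarily how the original figure was made):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Use only the two petal features so the regions can be drawn in 2-D.
iris = load_iris()
X, y = iris.data[:, 2:4], iris.target
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Predict on a dense grid and color each cell by the predicted class.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)    # decision regions
plt.scatter(X[:, 0], X[:, 1], c=y)    # training points
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
```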
•
Accuracy
Accuracy is the proportion of correct predictions made by the classifier out of all predictions
It measures how often the model correctly classifies data points
◦
Formula
Accuracy = (1/N) Σ 𝟙(y_i = ŷ_i)
where 𝟙(y_i = ŷ_i) is an indicator function that equals 1 when the prediction matches the true label, and 0 otherwise
◦
Range
Accuracy ranges from 0 to 1 (or 0% to 100%)
Higher accuracy indicates better performance
•
Error Rate
Error rate is the proportion of incorrect predictions made by the classifier
It's the complement of accuracy
◦
Formula
Error Rate = (1/N) Σ 𝟙(y_i ≠ ŷ_i)
◦
Relationship
Error Rate + Accuracy = 1
If a model has 90% accuracy, it has a 10% error rate
Training vs Test Error
•
Training Error
The error rate calculated on the training dataset D^TRAIN
Measures how well the model fits the training data
For Nearest Neighbor Classification evaluated on its own training data, the training error is always 0%, because each training point is its own nearest neighbor (assuming no duplicate inputs with conflicting labels)
•
Test Error
◦
Why We Need a Separate Test Dataset
Training error alone cannot tell us how well our model will perform on new, unseen data.
▪
For NNC, the training error is always 0% (perfect accuracy)
•
Why? Because each training point is its own nearest neighbor! When we query a training point, it will always find itself as the closest match and return its own correct label. This gives us a false sense of perfect performance
▪
Overfitting Issue: A model can memorize the training data without learning generalizable patterns. Low training error doesn't guarantee good performance on new data. The model might just be fitting noise in the training set rather than true underlying patterns
▪
Test Dataset Solution: By evaluating on a completely separate test dataset that wasn't used during training, we get an honest assessment of generalization
•
Test error reveals how well the model performs on data it has never seen before. This is the true measure of a model's practical usefulness
The error rate calculated on a separate test dataset D^TEST that the model has never seen during training
Formula: Test Error = (1/N_test) Σ 𝟙(y_i ≠ ŷ_i) for all (x_i, y_i) in the test set
•
Example
Consider a classifier evaluated on 100 test samples:
◦
Correctly classified: 85 samples
◦
Incorrectly classified: 15 samples
◦
Accuracy = 85/100 = 0.85 or 85%
◦
Error Rate = 15/100 = 0.15 or 15%
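The same arithmetic in a short NumPy sketch (the toy arrays are constructed to match the 85/15 example above):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Mean of the indicator 1(y_i == ŷ_i) over all predictions.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

y_true = np.array([0] * 85 + [1] * 15)   # true labels
y_pred = np.zeros(100, dtype=int)        # predicts class 0 everywhere: 85 right, 15 wrong
print(accuracy(y_true, y_pred))          # 0.85
print(1 - accuracy(y_true, y_pred))      # 0.15 (error rate)
```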
K-Nearest Neighbor
A Variant of NNC
K-Nearest Neighbor (KNN) is an extension of the basic Nearest Neighbor Classification that considers multiple neighbors instead of just one to make more robust predictions.
•
Core Idea
Instead of using only the single nearest neighbor, KNN looks at the K closest training points and makes a decision based on the majority vote among these K neighbors
The parameter K is a positive integer (typically K = 3, 5, 7, etc.) that determines how many neighbors to consider
•
How It Works
◦
Step 1: Choose K
Select the number of neighbors K to consider (e.g., K = 5)
◦
Step 2: Find K nearest neighbors
For a new point x, calculate distances to all training points and identify the K closest ones
Let N_K(x) denote the set of indices of the K nearest neighbors
◦
Step 3: Majority voting
Count how many of the K nearest neighbors belong to each class
Assign the class label that appears most frequently among the K neighbors
ŷ = argmax_{c∈{1,...,C}} Σ_{i∈N_K(x)} 𝟙(y_i = c)
This formula finds the class c that has the most votes among the K nearest neighbors
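A minimal NumPy sketch of the three steps (the function name is illustrative; ties in the vote are broken here by whichever label is counted first, and real implementations vary on this):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=5):
    """KNN prediction: majority vote among the k closest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distances to all training points
    nearest = np.argsort(dists)[:k]                    # indices of the k closest, N_K(x)
    votes = Counter(y_train[i] for i in nearest)       # count labels among the neighbors
    return votes.most_common(1)[0][0]                  # label with the most votes
```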
•
Choosing K (K is called a hyperparameter)
◦
What is a Hyperparameter?
A hyperparameter is a parameter whose value is set before the learning process begins and controls the learning process itself, rather than being learned from the data.
◦
Key Characteristics
▪
Not learned from training data - must be specified in advance
Controls the behavior and performance of the learning algorithm
Different from model parameters (like weights), which are learned during training
◦
Impact on Model Performance
▪
Poor hyperparameter choices can lead to underfitting or overfitting
Optimal hyperparameters depend on the specific dataset and problem
Finding good hyperparameters is crucial for model performance
◦
Hyper-parameters in NNC
1.
The distance measure (e.g., the parameter p for the L_p norm)
2.
K (how many nearest neighbors do we consider?)
•
Small K (e.g., K=1):
More complex decision boundaries
More sensitive to noise
Can lead to overfitting
•
Large K:
Smoother decision boundaries
More stable predictions
Can lead to underfitting if K is too large
•
Optimal K: Typically chosen through cross-validation on the training data (see the code sketch after this list)
3.
Different ways of preprocessing
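Both of the first two hyperparameters map directly onto constructor arguments of scikit-learn's KNeighborsClassifier: n_neighbors is K, and p selects the L_p norm. A quick sketch of how K changes held-out accuracy; in practice, K should be chosen with cross-validation as described below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# A simple held-out split, just to illustrate the effect of K.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 15):
    clf = KNeighborsClassifier(n_neighbors=k, p=2)  # p=2 -> Euclidean, p=1 -> Manhattan
    clf.fit(X_tr, y_tr)
    print(f"K={k}: held-out accuracy = {clf.score(X_val, y_val):.3f}")
```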
•
Validation Dataset & S-Fold Cross-Validation
The validation dataset is a crucial component in machine learning that helps us select the best hyperparameters for our model without touching the test data.
◦
What is a Validation Dataset?
▪
A validation dataset is a portion of data held out from training that is used specifically for tuning hyperparameters and making model selection decisions
It acts as a "practice test" to evaluate different model configurations
▪
Three-Way Data Split (sketched in code after this subsection):
•
Training Set: Used to train the model (learn parameters)
•
Validation Set: Used to tune hyperparameters and select the best model
•
Test Set: Used only once at the end to evaluate final model performance
▪
Why We Need It:
We cannot use the test set for hyperparameter tuning because that would "leak" information about the test set into our model selection process
The validation set provides an unbiased estimate of model performance for different hyperparameter choices
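A minimal sketch of such a three-way split with scikit-learn (the 60/20/20 ratio is an illustrative convention, not something prescribed above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Carve off 40% of the data, then split that portion in half:
# roughly 60% train / 20% validation / 20% test overall.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```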
◦
S-Fold Cross-Validation
S-fold cross-validation (elsewhere usually called K-fold cross-validation; S is used here to avoid a clash with the K in KNN) is a more sophisticated approach that makes better use of limited data by creating multiple train/validation splits
▪
How It Works:
1.
Divide the data: Split the training data into S equal-sized "folds" (subsets)
2.
Iterate S times: For each iteration s = 1, 2, ..., S:
•
Use fold s as the validation set
•
Use the remaining S-1 folds as the training set
•
Train the model and evaluate performance on the validation fold
3.
Average results: Compute the average validation error across all S folds
4.
Select best hyperparameters: Choose the hyperparameter configuration with the lowest average validation error
5.
Final training: Retrain the model with the chosen hyperparameters on the entire training dataset (all folds combined)
▪
Mathematical Notation:
Average Validation Error = (1/S) Σ_{s=1}^{S} Error_s
where Error_s is the validation error on the s-th fold
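A from-scratch sketch of the procedure and the formula above (fit_predict is a hypothetical callback standing in for any classifier; it is not a library function):

```python
import numpy as np

def s_fold_cv_error(X, y, fit_predict, S=5, seed=0):
    """Average validation error over S folds.

    fit_predict(X_tr, y_tr, X_val) must return predicted labels for X_val
    (a hypothetical callback, not a library function).
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), S)  # S roughly equal folds
    errors = []
    for s in range(S):
        val_idx = folds[s]                                   # fold s -> validation set
        tr_idx = np.concatenate([folds[j] for j in range(S) if j != s])
        y_hat = fit_predict(X[tr_idx], y[tr_idx], X[val_idx])
        errors.append(np.mean(y_hat != y[val_idx]))          # Error_s on this fold
    return np.mean(errors)                                   # (1/S) Σ_s Error_s
```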
▪
Common Choices for S:
•
S = 5 or S = 10 are most common
•
Larger S means more training data in each split but higher computational cost
•
S = N (where N is the number of samples) is called "Leave-One-Out Cross-Validation" (LOOCV)
▪
Advantages:
•
Every data point is used for both training and validation
•
More reliable estimate of model performance than a single validation set
•
Reduces variance in performance estimation
•
Particularly useful when data is limited
▪
Example for KNN Hyperparameter Tuning:
Suppose we want to find the best value of K for KNN:
1.
Split training data into S = 5 folds
2.
For each candidate K value (e.g., K = 1, 3, 5, 7, 9):
•
Perform 5-fold cross-validation
•
Calculate average validation error across all 5 folds
3.
Select the K with the lowest average cross-validation error
4.
Retrain KNN with the chosen K on all training data
5.
Finally, evaluate on the test set to get unbiased performance estimate
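The same five steps with scikit-learn utilities, which handle the fold bookkeeping (a sketch; the K grid matches the example above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 1-2: 5-fold CV error for each candidate K, using the training data only.
candidate_ks = [1, 3, 5, 7, 9]
cv_errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_train, y_train, cv=5).mean()
             for k in candidate_ks]

# Steps 3-4: pick the K with the lowest average CV error, retrain on all training data.
best_k = candidate_ks[int(np.argmin(cv_errors))]
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)

# Step 5: a single, final test-set evaluation.
print("best K:", best_k, "| test accuracy:", final_model.score(X_test, y_test))
```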
•
Tuning via a validation dataset