• Binary classification uses linear classifiers with hyperplanes ($w^\top x + b = 0$) as decision boundaries, and nonlinear mappings handle non-linearly separable data
• Surrogate losses (hinge, logistic, exponential) replace the non-differentiable 0-1 loss to enable gradient-based optimization through Empirical Risk Minimization (ERM)
• Gradient Descent iteratively minimizes loss functions using $w_{t+1} = w_t - \eta \nabla F(w_t)$, with adaptive learning rates and optimizers like Adam improving convergence
• Stochastic Gradient Descent (SGD) computes gradients on random subsets, making iterations faster and helping escape local minima and saddle points in non-convex problems
• Newton's method uses second-order information (the Hessian matrix) for faster convergence but is computationally expensive for high-dimensional problems
• The Perceptron algorithm updates weights only on misclassified examples using $w \leftarrow w + \eta\, y_i x_i$, applying SGD to the perceptron loss
• Local minima, global minima, and saddle points are critical concepts in optimization, with saddle points being particularly challenging in high-dimensional spaces
• Distinguishing between local minima and saddle points is NP-Hard, making optimization in non-convex settings computationally difficult
• Logistic Regression applies a logistic (sigmoid) function to the linear combination $w^\top x$ to produce probabilities between 0 and 1 for binary classification
• The logistic loss (also called cross-entropy loss) is defined as $\ell(y, w^\top x) = \log(1 + e^{-y\, w^\top x})$ for labels $y \in \{-1, +1\}$, which is convex and differentiable, making it suitable for gradient-based optimization
• Logistic regression is trained by minimizing the negative log-likelihood (equivalent to minimizing the logistic loss)
• The gradient of the logistic loss is $\nabla_w \ell = (\sigma(w^\top x) - y)\, x$ for labels $y \in \{0, 1\}$, which can be used in gradient descent or SGD to find optimal parameters
• Unlike the perceptron, which only updates on misclassified examples, logistic regression updates on all examples with weights proportional to the prediction confidence
• Logistic regression provides probabilistic predictions $P(y = 1 \mid x) = \sigma(w^\top x)$, allowing for uncertainty quantification and threshold adjustment based on application requirements
Classification & Logistic Regression
Linear Classifiers (Binary Classification) and Surrogate Losses
Binary Classification
Let's assume we are doing binary classification: the number of classes is 2 and the labels are $\{-1, +1\}$ (a very simple case).
What we need to do first is to find the "model".
To build a binary classification model, the sign of a linear function can be used: $f(x) = \operatorname{sign}(w^\top x + b)$.
A hyperplane is a flat subspace that divides the feature space into two regions. In 2D, it's a line; in 3D, it's a plane; in higher dimensions, it generalizes accordingly. For binary classification, the hyperplane serves as the decision boundary that separates the two classes.
But what if the data is clearly not linearly separable (e.g., one class forming a ring around the other)?
We can apply a nonlinear mapping $\phi$ and classify with $\operatorname{sign}(w^\top \phi(x) + b)$.
Surrogate Losses
The next step is to define a loss function.
The 0-1 loss (also called the classification error or indicator loss) is the most intuitive loss function for classification problems. It directly measures whether a prediction is correct or incorrect.
The 0-1 loss is defined as:
$$\ell_{0\text{-}1}(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}]$$
Where:
• $y$ is the true label (either -1 or 1 in binary classification)
• $\hat{y}$ is the predicted label
• The loss equals 0 if the prediction is correct, and 1 if it's incorrect
For linear classifiers where $\hat{y} = \operatorname{sign}(w^\top x)$, we can rewrite this as:
$$\ell_{0\text{-}1}(y, w^\top x) = \mathbb{1}[y\, w^\top x \le 0]$$
This means the loss is 1 when $y\, w^\top x \le 0$ (misclassification) and 0 when $y\, w^\top x > 0$ (correct classification). The term $y\, w^\top x$ is called the margin, and a positive margin indicates correct classification.
However, the 0-1 loss has a critical problem: it's not differentiable and not convex, making it difficult to optimize using gradient-based methods.
To address this issue, we use surrogate losses that are convex and differentiable approximations of the 0-1 loss. These surrogate losses upper-bound the 0-1 loss and enable efficient optimization through gradient descent.
Common surrogate losses include the hinge loss (used in SVMs), logistic loss (used in logistic regression), and exponential loss (used in AdaBoost).
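To make the comparison concrete, here is a minimal numpy sketch (function names and sample margins are illustrative) that evaluates each surrogate alongside the 0-1 loss as a function of the margin $m = y\, w^\top x$:

```python
import numpy as np

# Each surrogate is a convex function of the margin m = y * w.x
def zero_one_loss(m):    return float(m <= 0)
def hinge_loss(m):       return np.maximum(0.0, 1.0 - m)   # SVM
def logistic_loss(m):    return np.log1p(np.exp(-m))       # logistic regression
def exponential_loss(m): return np.exp(-m)                 # AdaBoost

for m in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"m={m:+.1f}  0-1={zero_one_loss(m):.0f}  hinge={hinge_loss(m):.3f}  "
          f"logistic={logistic_loss(m):.3f}  exp={exponential_loss(m):.3f}")
```

Running this shows all three surrogates decreasing smoothly in the margin, unlike the 0-1 loss's hard jump at $m = 0$.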
Find ERM
With the model and the loss defined as above, we need to find the optimal parameters that minimize the empirical risk. This leads us to the Empirical Risk Minimization (ERM) framework:
$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i)$$
Where $\ell$ is our chosen surrogate loss function (such as hinge loss or logistic loss), and we sum over all training examples.
• Using this, we can apply general convex optimization methods
A Detour: Numerical Optimization Methods
Numerical optimization in machine learning refers to algorithms that iteratively find the parameters (like the weights $w$) that minimize a loss function. Since most ML loss functions can't be solved analytically, we use iterative methods.
The most fundamental approach is gradient descent, which updates parameters in the direction opposite to the gradient: $w_{t+1} = w_t - \eta \nabla F(w_t)$, where $\eta$ is the learning rate.
These optimization methods are crucial for training ML models efficiently, especially for large datasets and complex models like neural networks.
In brief, given a function $F(w)$, our goal is to minimize $F(w)$.
First-Order Optimization
• Gradient Descent
Gradient descent is an iterative optimization algorithm that finds the minimum of a function by repeatedly taking steps in the direction of steepest descent. Starting from an initial point $w_0$, the algorithm updates the parameters using the rule $w_{t+1} = w_t - \eta \nabla F(w_t)$, where $\eta$ is the learning rate (step size) and $\nabla F(w_t)$ is the gradient of the function at the current point.
The key intuition is that the gradient points in the direction of steepest ascent, so moving in the opposite direction (negative gradient) decreases the function value. The learning rate controls how large each step is—if it's too large, the algorithm might overshoot the minimum; if it's too small, convergence will be slow.
For convex functions, gradient descent is guaranteed to converge to the global minimum (given an appropriate learning rate). For non-convex functions, it converges to a local minimum or saddle point. The algorithm terminates when the gradient becomes sufficiently small (near zero) or after a fixed number of iterations.
In the context of machine learning, we apply gradient descent to minimize the empirical risk $F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i)$, where the gradient is computed as $\nabla F(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla_w \ell(y_i, w^\top x_i)$. This requires computing the gradient of the loss function with respect to all training examples at each iteration.
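In code, gradient descent is just this update rule in a loop. A minimal sketch (the stopping rule, toy objective, and default hyperparameters are illustrative choices, not part of the notes):

```python
import numpy as np

def gradient_descent(grad_F, w0, lr=0.1, n_iters=100, tol=1e-6):
    """Minimize F by repeatedly stepping against its gradient grad_F."""
    w = w0.copy()
    for _ in range(n_iters):
        g = grad_F(w)
        if np.linalg.norm(g) < tol:   # terminate when the gradient is near zero
            break
        w = w - lr * g                # step in the direction of steepest descent
    return w

# Example: minimize F(w) = ||w - c||^2, whose gradient is 2(w - c)
c = np.array([3.0, -1.0])
w_star = gradient_descent(lambda w: 2 * (w - c), w0=np.zeros(2))
print(w_star)  # approaches [3, -1]
```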
◦ Why use GD?
Gradient descent is appealing because of the first-order Taylor approximation. When we take a small step in the direction of the negative gradient, we're guaranteed to decrease the function value (assuming the step size is small enough). This is formalized by the Taylor expansion $F(w + \delta) \approx F(w) + \nabla F(w)^\top \delta$: choosing $\delta = -\eta \nabla F(w)$ gives $F(w + \delta) \approx F(w) - \eta \|\nabla F(w)\|^2 < F(w)$ when $\eta > 0$ is appropriately chosen.
◦ Variation: the learning rate might need to change over iterations
In practice, using a fixed learning rate throughout training can be suboptimal. A learning rate that is too large may cause the algorithm to overshoot the minimum and oscillate, while a learning rate that is too small leads to slow convergence. Therefore, adaptive learning rate schedules are often employed.
Common strategies include:
▪ Step decay: Reduce the learning rate by a factor every few epochs (e.g., $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$ with decay factor $\gamma < 1$ and step size $k$)
▪ Exponential decay: Decay the learning rate exponentially over time (e.g., $\eta_t = \eta_0 e^{-\lambda t}$)
▪ 1/t decay: Use $\eta_t = \frac{\eta_0}{1 + \lambda t}$, which decreases the learning rate as training progresses
▪ Adaptive methods: Algorithms like Adam, RMSprop, and AdaGrad automatically adjust the learning rate based on the history of gradients
These adaptive strategies help the algorithm make larger steps early in training when far from the minimum, and smaller, more precise steps as it approaches the optimal solution.
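A quick sketch comparing the three decay schedules above (the constants $\eta_0$, $\gamma$, $k$, and $\lambda$ are illustrative choices):

```python
import numpy as np

eta0 = 0.1            # initial learning rate
gamma, k = 0.5, 10    # step decay: halve every 10 epochs
lam = 0.01            # decay constant for the other two schedules

def step_decay(t):        return eta0 * gamma ** (t // k)
def exponential_decay(t): return eta0 * np.exp(-lam * t)
def inv_t_decay(t):       return eta0 / (1 + lam * t)

for t in [0, 10, 50, 100]:
    print(f"t={t:3d}  step={step_decay(t):.4f}  "
          f"exp={exponential_decay(t):.4f}  1/t={inv_t_decay(t):.4f}")
```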
▪ Adam Optimizer
Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the benefits of two other popular methods: momentum-based gradient descent and adaptive learning rates. Instead of using a single fixed learning rate for all parameters, Adam maintains separate adaptive learning rates for each parameter based on estimates of the first and second moments of the gradients.
The algorithm keeps track of two moving averages: $m_t$ (the first moment, which is the exponentially decaying average of past gradients) and $v_t$ (the second moment, which is the exponentially decaying average of past squared gradients). These are updated at each iteration using:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
where $g_t = \nabla F(w_t)$ and $\beta_1$, $\beta_2$ are decay rates (typically $\beta_1 = 0.9$ and $\beta_2 = 0.999$). Since $m$ and $v$ are initialized to zero, they are biased toward zero, especially during the initial time steps. To correct this bias, Adam uses bias-corrected estimates:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Finally, the parameter update rule becomes:
$$w_{t+1} = w_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where $\eta$ is the learning rate (often set to 0.001) and $\epsilon$ is a small constant (typically $10^{-8}$) added for numerical stability to prevent division by zero.
The key advantage of Adam is that it adapts the learning rate for each parameter individually. Parameters with large gradients get smaller effective learning rates (due to the division by $\sqrt{\hat{v}_t}$), while parameters with small gradients get larger effective learning rates. This makes Adam particularly effective for problems with sparse gradients or noisy objective functions, and it's widely used in training deep neural networks because it often converges faster than standard gradient descent and requires less manual tuning of the learning rate.
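A direct transcription of these update equations into numpy (a sketch, not a production optimizer; the toy objective reuses the earlier quadratic example):

```python
import numpy as np

def adam(grad_F, w0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=2000):
    """Adam: per-parameter adaptive steps from first/second moment estimates."""
    w = w0.copy()
    m = np.zeros_like(w)  # first moment (moving average of gradients)
    v = np.zeros_like(w)  # second moment (moving average of squared gradients)
    for t in range(1, n_iters + 1):
        g = grad_F(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Same toy objective as before: F(w) = ||w - c||^2, gradient 2(w - c)
c = np.array([3.0, -1.0])
print(adam(lambda w: 2 * (w - c), np.zeros(2), lr=0.1))  # approaches [3, -1]
```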
• Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of gradient descent that addresses the computational cost of calculating gradients over the entire training dataset. Instead of computing the gradient using all training examples at each iteration, SGD randomly selects a single example (or a small batch of examples) and computes the gradient based only on that subset.
The update rule for SGD with a single example is $w_{t+1} = w_t - \eta \nabla_w \ell(y_i, w_t^\top x_i)$, where the index $i$ is randomly selected from the training set at each iteration. This makes each iteration much faster than standard gradient descent, especially for large datasets.
So SGD usually needs more iterations, but each iteration takes less time.
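A minimal SGD sketch (the squared-loss example and hyperparameters are illustrative; the same loop works for any per-example gradient):

```python
import numpy as np

def sgd(grad_loss, X, y, w0, lr=0.01, n_iters=1000, seed=0):
    """SGD: each step uses the gradient of the loss on ONE random example."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(n_iters):
        i = rng.integers(len(X))                 # pick a random training example
        w = w - lr * grad_loss(w, X[i], y[i])    # cheap single-example step
    return w

# Example with squared loss l(w; x, y) = (w.x - y)^2, whose gradient is 2(w.x - y)x
grad_sq = lambda w, x, y: 2 * (w @ x - y) * x
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(sgd(grad_sq, X, y, np.zeros(2)))  # approaches [1, 2]
```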
And, even for nonconvex objectives, SGD (like GD) only converges to a stationary point, i.e., a point where $\nabla F(w) = 0$.
A stationary point is not necessarily a global minimizer; it may only be a local minimizer.
In optimization, a local minimizer (or local minimum) is a point where the function value is lower than all nearby points, but not necessarily the lowest point overall. Formally, a point $w^*$ is a local minimizer if there exists some neighborhood around $w^*$ where $F(w^*) \le F(w)$ for all $w$ in that neighborhood.
In contrast, a global minimizer (or global minimum) is the point where the function achieves its absolute lowest value across the entire domain: $F(w^*) \le F(w)$ for all possible $w$.
For convex functions, every local minimizer is also a global minimizer, which is why convex optimization is particularly desirable in machine learning. However, for non-convex functions (like those in deep learning), there can be many local minima, and standard gradient descent might get stuck in a poor local minimum instead of reaching the global minimum.
This is where the stochasticity of SGD becomes advantageous: the noise in gradient estimates can help the algorithm escape shallow local minima and explore different regions of the parameter space, potentially finding better solutions.
But a stationary point can also be neither a local minimizer nor a local maximizer → a saddle point.
A saddle point is a critical point in optimization where the gradient is zero, but it is neither a local minimum nor a local maximum. Instead, it's a point where the function curves up in some directions and down in others—like the center of a saddle on a horse.
Mathematically, at a saddle point, the Hessian matrix (the matrix of second derivatives) has both positive and negative eigenvalues. This means that in some directions, the function is locally convex (curving upward), while in other directions, it is locally concave (curving downward).
Saddle points are particularly problematic in high-dimensional optimization problems like deep learning, where they can slow down or even halt the progress of gradient-based optimization algorithms. However, algorithms like SGD can escape saddle points more easily than local minima because the stochastic noise in gradient estimates provides enough perturbation to move away from these flat regions.
Even worse, distinguishing local minima from saddle points is generally NP-Hard.
Second-Order Optimization
• Newton's Method
Newton's method is a second-order optimization algorithm that uses both the gradient and the Hessian matrix (the matrix of second derivatives) to find the optimal parameters. Unlike gradient descent which only uses first-order information, Newton's method incorporates curvature information to make more informed steps toward the minimum.
The update rule for Newton's method is $w_{t+1} = w_t - H_t^{-1} \nabla F(w_t)$, where $H_t = \nabla^2 F(w_t)$ is the Hessian matrix of second partial derivatives. The key advantage is that Newton's method can converge in fewer iterations than gradient descent, especially near the optimum, because it accounts for the curvature of the function.
◦ Hessian Matrix
The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function. For a function $F: \mathbb{R}^d \to \mathbb{R}$, the Hessian is a $d \times d$ matrix where the element at position $(i, j)$ is $\frac{\partial^2 F}{\partial w_i \partial w_j}$.
The Hessian captures the curvature of the function at a given point, providing information about how the gradient itself is changing. When the Hessian is positive definite (all eigenvalues are positive), the point is a local minimum; when it's negative definite (all eigenvalues are negative), the point is a local maximum; and when it has both positive and negative eigenvalues, the point is a saddle point.
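To make the eigenvalue test concrete, here is a check on the illustrative function $F(w) = w_1^2 - w_2^2$, whose critical point at the origin is a saddle:

```python
import numpy as np

# Hessian of F(w) = w1^2 - w2^2 is constant: diag(2, -2)
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)               # [-2.  2.]: mixed signs => saddle point
print(np.all(eigenvalues > 0))   # False: not a local minimum
```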
However, computing and inverting the Hessian matrix is computationally expensive, requiring $O(d^2)$ space and $O(d^3)$ time for matrix inversion, where $d$ is the number of parameters. This makes Newton's method impractical for high-dimensional problems like deep learning.
◦ vs. Gradient Descent
Newton's method has no learning rate to tune and converges very fast (quadratically near the optimum).
But computing the Hessian is very slow, and the method does not really make sense for nonconvex objectives (it may step toward saddle points or maxima).
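A minimal sketch of the Newton update (solving the linear system instead of explicitly inverting the Hessian; the quadratic test objective is illustrative):

```python
import numpy as np

def newton_method(grad_F, hess_F, w0, n_iters=20, tol=1e-10):
    """Newton's method: precondition the gradient step with the inverse Hessian."""
    w = w0.copy()
    for _ in range(n_iters):
        g = grad_F(w)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess_F(w), g)  # solve H d = g (cheaper than inverting H)
        w = w - d
    return w

# Quadratic F(w) = 0.5 w'Aw - b'w has gradient Aw - b and constant Hessian A;
# Newton's method converges in a single step for quadratics.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_star = newton_method(lambda w: A @ w - b, lambda w: A, np.zeros(2))
print(w_star, np.linalg.solve(A, b))  # the two should match
```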
Perceptron
• Recall the Perceptron loss: $\ell_{\text{perc}}(y, w^\top x) = \max(0, -y\, w^\top x)$
Applying (stochastic) gradient descent to the Perceptron loss gives the Perceptron algorithm.
The Perceptron algorithm is a simple online learning algorithm for binary classification. Given a training set of labeled examples, it iteratively updates the weight vector whenever it makes a prediction error on a training example.
The algorithm starts with an initial weight vector (often $w = 0$) and processes training examples one at a time. For each example $(x_i, y_i)$, if the prediction $\operatorname{sign}(w^\top x_i)$ matches the true label $y_i$, no update is made. However, if there's a misclassification (i.e., $y_i\, w^\top x_i \le 0$), the weight vector is updated according to $w \leftarrow w + \eta\, y_i x_i$, where $\eta$ is the learning rate (often set to 1).
• SGD view
In summary, the Perceptron algorithm is exactly SGD applied to the Perceptron loss: sampling one example and taking a gradient step changes $w$ only when that example is misclassified.
The Perceptron algorithm makes sense as an application of Stochastic Gradient Descent (SGD) to the Perceptron loss function for several important reasons (see the sketch after this list):
• Natural gradient descent interpretation: The Perceptron update rule emerges naturally from applying gradient descent to the Perceptron loss. When there's a misclassification (i.e., $y_i\, w^\top x_i \le 0$), the gradient of the loss with respect to $w$ is $-y_i x_i$, so moving in the opposite direction (adding $\eta\, y_i x_i$) reduces the loss.
• Online learning efficiency: By processing one example at a time (the SGD approach), the Perceptron algorithm can update weights immediately after each prediction error, rather than waiting to compute gradients over the entire dataset. This makes it extremely efficient for large-scale problems and allows the algorithm to start making better predictions quickly.
• Geometric intuition: Each update moves the decision boundary (defined by $w^\top x = 0$) in the direction that correctly classifies the misclassified example. Adding $y_i x_i$ to $w$ effectively rotates the decision boundary toward correctly classifying point $x_i$ with label $y_i$.
• Convergence guarantee for separable data: If the training data is linearly separable, the Perceptron algorithm is guaranteed to converge to a solution that correctly classifies all training examples in a finite number of steps. This theoretical guarantee makes the algorithm particularly attractive for problems where linear separability holds.
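Putting the pieces together, a minimal Perceptron implementation (a sketch; the toy data is illustrative, and labels are assumed in $\{-1, +1\}$):

```python
import numpy as np

def perceptron(X, y, lr=1.0, n_epochs=100):
    """Train a Perceptron: update w only on misclassified examples."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:       # misclassified (non-positive margin)
                w = w + lr * y_i * x_i     # SGD step on the Perceptron loss
                errors += 1
        if errors == 0:                    # converged: all points classified
            break
    return w

# Tiny linearly separable example (the last feature acts as a bias term)
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))  # should match y
```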
Logistic Regression
Logistic regression is a fundamental algorithm for binary classification that predicts the probability that an input belongs to a particular class. Unlike linear regression which outputs continuous values, logistic regression outputs probabilities between 0 and 1 by applying the logistic function (also called the sigmoid function) to a linear combination of features.
The logistic function is defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$, which takes any real-valued input and maps it to the range $(0, 1)$. In logistic regression, we compute $z = w^\top x$ and then apply the sigmoid function to get the predicted probability: $P(y = 1 \mid x) = \sigma(w^\top x)$.
• Why is it called Logistic Regression?
Despite its name, logistic regression is actually a classification algorithm, not a regression algorithm. The term "regression" comes from its historical development and the fact that it uses a regression-like approach (finding optimal weights) to predict probabilities, but the ultimate goal is classification rather than predicting continuous values.
• Key Properties of the Logistic Function
The sigmoid function has several important properties that make it suitable for binary classification:
◦ Bounded output: The output is always between 0 and 1, making it interpretable as a probability.
◦ Smooth and differentiable: This allows us to use gradient-based optimization methods.
◦ S-shaped curve: The function transitions smoothly from 0 to 1, with the steepest slope at $z = 0$.
◦ Symmetry: $\sigma(-z) = 1 - \sigma(z)$, which relates the probabilities of the two classes.
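These properties are easy to verify numerically (a quick sanity-check sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 5)
print(sigmoid(z))                                # all values strictly in (0, 1)
print(sigmoid(0.0))                              # 0.5: the steepest point
print(np.allclose(sigmoid(-z), 1 - sigmoid(z)))  # True: symmetry property
```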
• Decision Rule
For classification, we typically use a threshold (usually 0.5) to make predictions: predict $\hat{y} = 1$ if $\sigma(w^\top x) \ge 0.5$, and $\hat{y} = -1$ (or 0) otherwise. Since $\sigma(z) \ge 0.5$ exactly when $z \ge 0$, this is equivalent to predicting $\hat{y} = 1$ when $w^\top x \ge 0$, which gives us a linear decision boundary.
• Loss Function: Cross-Entropy Loss
To train a logistic regression model, we use the cross-entropy loss (also called log loss or logistic loss).
For a single example where $y \in \{0, 1\}$, the loss is:
$$\ell(y, \hat{p}) = -\left[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\right]$$
where $\hat{p} = \sigma(w^\top x)$ is the predicted probability.
This loss function penalizes confident wrong predictions heavily while giving small penalties for correct predictions.
For the entire training set, the objective is to minimize the average cross-entropy loss:
$$\min_{w}\; \frac{1}{n} \sum_{i=1}^{n} -\left[y_i \log \sigma(w^\top x_i) + (1 - y_i) \log(1 - \sigma(w^\top x_i))\right]$$
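To see the asymmetric penalties, evaluate the loss for a confident prediction $\hat{p} = 0.99$ under both labels (the specific value 0.99 is illustrative):

```python
import numpy as np

def cross_entropy(y, p_hat):
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(cross_entropy(1, 0.99))  # ~0.01: confident and correct -> tiny loss
print(cross_entropy(0, 0.99))  # ~4.61: confident and wrong -> large loss
```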
• How to regress with discrete labels?
Instead of directly predicting discrete class labels, logistic regression predicts continuous probability values between 0 and 1 using the sigmoid function, which can then be thresholded for classification.
This approach has several advantages: it provides a smooth, differentiable objective function (cross-entropy loss) that can be optimized using gradient descent, it naturally handles the constraint that probabilities must be between 0 and 1, and it gives us confidence estimates rather than just hard classifications. The sigmoid function acts as a bridge between the continuous linear combination and the discrete labels, allowing us to use regression-like optimization techniques for classification problems.
• MLE Solution
Training logistic regression can be viewed as maximum likelihood estimation: we maximize the likelihood $\prod_{i=1}^{n} P(y_i \mid x_i; w)$, which is equivalent to minimizing the negative log-likelihood. For the sigmoid model, the negative log-likelihood is exactly the average cross-entropy loss above, so the MLE solution is the minimizer of the cross-entropy objective. There is no closed-form solution, which is why we turn to iterative methods such as gradient descent.
• Gradient Descent for Logistic Regression
The gradient of the cross-entropy loss with respect to $w$ has a particularly elegant form. For a single example, the gradient is:
$$\nabla_w \ell = (\sigma(w^\top x) - y)\, x$$
This means the update rule for gradient descent is:
$$w_{t+1} = w_t - \eta \cdot \frac{1}{n} \sum_{i=1}^{n} (\sigma(w_t^\top x_i) - y_i)\, x_i$$
Or for SGD with a single example:
$$w_{t+1} = w_t - \eta\, (\sigma(w_t^\top x_i) - y_i)\, x_i$$
Notice the similarity to the Perceptron update rule, but with the key difference that the error term $(\sigma(w^\top x) - y)$ is continuous rather than binary, allowing for more nuanced updates based on the confidence of the prediction.
◦ SGD
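A minimal sketch of logistic regression trained with SGD, assuming labels $y \in \{0, 1\}$ (the toy data and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, lr=0.1, n_iters=5000, seed=0):
    """SGD on the cross-entropy loss; the gradient is (sigmoid(w.x) - y) x."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        i = rng.integers(len(X))
        error = sigmoid(w @ X[i]) - y[i]   # continuous error, not just +/-1
        w = w - lr * error * X[i]
    return w

# Toy data (the last feature is a bias term); labels in {0, 1}
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = logistic_sgd(X, y)
print(sigmoid(X @ w))  # probabilities near 1 for the first two, near 0 for the rest
```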
• Apply Newton's Method to Logistic Regression
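The derivation isn't reproduced in these notes, but the Hessian of the average cross-entropy loss has the standard closed form $H(w) = \frac{1}{n} \sum_{i=1}^{n} \sigma(w^\top x_i)\,(1 - \sigma(w^\top x_i))\, x_i x_i^\top$, which is positive semidefinite. Plugging it into the Newton update gives the classic iteratively reweighted least squares procedure. A minimal sketch (the toy data and the tiny ridge term are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    # clip the argument to avoid overflow in exp for large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def logistic_newton(X, y, n_iters=8):
    """Newton's method for logistic regression (iteratively reweighted least squares)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n
        # Hessian: (1/n) X' diag(p(1-p)) X, plus a tiny ridge so it stays invertible
        H = (X.T * (p * (1 - p))) @ X / n + 1e-9 * np.eye(d)
        w = w - np.linalg.solve(H, grad)   # Newton step: w <- w - H^{-1} grad
    return w

# Same toy data as the SGD sketch; on separable data the MLE is at infinity,
# so we deliberately cap the number of Newton iterations.
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
print(sigmoid(X @ logistic_newton(X, y)))  # probabilities pushed toward 1, 1, 0, 0
```

Note how few iterations Newton's method needs compared with the thousands of SGD steps above, at the cost of forming and solving a $d \times d$ system each step.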
• Advantages of Logistic Regression
◦ Probabilistic interpretation: Unlike the Perceptron, logistic regression provides probability estimates, not just class predictions.
◦ Convex optimization: The cross-entropy loss is convex, guaranteeing that gradient descent will find the global minimum.
◦ Well-calibrated probabilities: The output probabilities can be used for ranking and decision-making under uncertainty.
◦ Regularization: Easy to add regularization (L1 or L2) to prevent overfitting.
◦ Interpretability: The weights directly indicate the importance and direction of influence of each feature.
• Comparison with Perceptron
While both the Perceptron and logistic regression find linear decision boundaries, there are important differences:
◦ The Perceptron only updates on misclassified examples, while logistic regression updates on all examples based on the magnitude of error.
◦ The Perceptron's loss is not differentiable everywhere, while cross-entropy loss is smooth.
◦ Logistic regression converges to the maximum likelihood solution, while the Perceptron just finds any separating hyperplane (if one exists).
◦ Logistic regression works well even when data is not linearly separable, while the Perceptron may not converge in such cases.
Summary
Linear Models for Classification
• STEP 1: Find hyperplanes (set the model)
• STEP 2: Pick a surrogate loss
• STEP 3: Find the ERM