Changyu Lee

Classification & Logistic Regression

Published at
2026/01/02
Last edited time
2026/01/02 00:23
Created
2025/12/31 23:09
Section
ML
Status
Done
Series
From the Bottom
Tags
Lecture Summary
AI summary
This document explores binary classification using linear classifiers and logistic regression, discussing surrogate loss functions, optimization methods like gradient descent and its variants, and the mathematical formulation of logistic regression. Key concepts include hyperplane decision boundaries, the transition from non-differentiable 0-1 loss to differentiable surrogate losses, and the application of logistic regression for probabilistic predictions. The importance of adaptive learning rates and methods such as Stochastic Gradient Descent and Newton's method for optimization are also highlighted, along with the advantages of logistic regression over the Perceptron algorithm.
Keywords
ML
Language
ENG
Week
3
Binary classification uses linear classifiers with hyperplanes (w^\top x = 0) as decision boundaries, and nonlinear mappings handle non-linearly separable data
Surrogate losses (hinge, logistic, exponential) replace the non-differentiable 0-1 loss to enable gradient-based optimization through Empirical Risk Minimization (ERM)
Gradient Descent iteratively minimizes loss functions using w^{(t+1)} = w^{(t)} - \eta \nabla F(w^{(t)}), with adaptive learning rates and optimizers like Adam improving convergence
Stochastic Gradient Descent (SGD) computes gradients on random subsets, making iterations faster and helping escape local minima and saddle points in non-convex problems
Newton's method uses second-order information (Hessian matrix) for faster convergence but is computationally expensive for high-dimensional problems
The Perceptron algorithm updates weights only on misclassified examples using w^{(t+1)} = w^{(t)} + \eta y_i x_i, applying SGD to the perceptron loss
Local minima, global minima, and saddle points are critical concepts in optimization, with saddle points being particularly challenging in high-dimensional spaces
Distinguishing between local minima and saddle points is NP-Hard, making optimization in non-convex settings computationally difficult
Logistic Regression applies a logistic (sigmoid) function \sigma(z) = \frac{1}{1 + e^{-z}} to the linear combination w^\top x to produce probabilities between 0 and 1 for binary classification
The logistic loss (also called cross-entropy loss) is defined as L(y, w^\top x) = \log(1 + e^{-y \cdot w^\top x}), which is convex and differentiable, making it suitable for gradient-based optimization
Logistic regression is trained by minimizing the negative log-likelihood (equivalent to minimizing logistic loss): \hat{w} = \arg\min_w \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i \cdot w^\top x_i})
The gradient of logistic loss is \nabla_w L(y_i, w^\top x_i) = -\frac{y_i x_i}{1 + e^{y_i \cdot w^\top x_i}}, which can be used in gradient descent or SGD to find optimal parameters
Unlike the perceptron which only updates on misclassified examples, logistic regression updates on all examples with weights proportional to the prediction confidence
Logistic regression provides probabilistic predictions P(y=1|x) = \sigma(w^\top x), allowing for uncertainty quantification and threshold adjustment based on application requirements

Classification & Logistic Regression

Linear Classifiers (Binary Classification) and Surrogate Losses

Binary Classification

Let’s assume we are doing binary classification, which means the number of classes is 2 and the labels are -1 and 1 (a very simple case).
What we need to do first is to find the “model”.
To build a binary classification model, the sign of w^\top x can be used: \hat{y} = \text{sign}(w^\top x).
A hyperplane is a flat subspace that divides the feature space into two regions. In 2D, it's a line; in 3D, it's a plane; in higher dimensions, it generalizes accordingly. For binary classification, the hyperplane w^\top x = 0 serves as the decision boundary that separates the two classes.
But what if the data is clearly not linearly separable?
e.g., XOR-like data or two concentric rings of points
We can apply a nonlinear feature mapping \phi(x) and classify with \text{sign}(w^\top \phi(x)), as sketched below.
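A minimal sketch of this idea in numpy, using a hypothetical quadratic feature map and synthetic ring-shaped data (both are illustrative, not from the lecture):

```python
import numpy as np

# Toy data: inner disk labeled +1, outer ring labeled -1 (not linearly separable in 2D)
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
y = np.concatenate([np.ones(100), -np.ones(100)])

def phi(x):
    """Quadratic feature map: (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

# In the mapped space a hyperplane separates the two rings: with w = (0, 0, -1)
# and offset 2.25, w^T phi(x) + 2.25 is positive on the inner disk, negative outside.
w, b = np.array([0.0, 0.0, -1.0]), 2.25
preds = np.sign(np.array([w @ phi(x) + b for x in X]))
print("training accuracy:", np.mean(preds == y))
```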

Surrogate Losses

The next step is to define a loss function.
The 0-1 loss (also called the classification error or indicator loss) is the most intuitive loss function for classification problems. It directly measures whether a prediction is correct or incorrect.
The 0-1 loss is defined as:
L_{0-1}(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}]
Where:
y is the true label (either -1 or 1 in binary classification)
\hat{y} is the predicted label
The loss equals 0 if the prediction is correct, and 1 if it's incorrect
For linear classifiers where \hat{y} = \text{sign}(w^\top x), we can rewrite this as:
L_{0-1}(y, w^\top x) = \mathbb{1}[y \cdot w^\top x \leq 0]
This means the loss is 1 when y \cdot w^\top x \leq 0 (misclassification) and 0 when y \cdot w^\top x > 0 (correct classification). The term y \cdot w^\top x is called the margin, and a positive margin indicates correct classification.
However, the 0-1 loss has a critical problem: it's not differentiable and not convex, making it difficult to optimize using gradient-based methods.
To address this issue, we use surrogate losses that are convex and differentiable approximations of the 0-1 loss. These surrogate losses upper-bound the 0-1 loss and enable efficient optimization through gradient descent.
Common surrogate losses include the hinge loss (used in SVMs), logistic loss (used in logistic regression), and exponential loss (used in AdaBoost).
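A small numpy sketch of these losses, written as functions of the margin m = y \cdot w^\top x (the function names and sample margins are illustrative):

```python
import numpy as np

def zero_one_loss(margin):
    """0-1 loss as a function of the margin m = y * w^T x."""
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    """Hinge loss (SVMs): max(0, 1 - m)."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic loss (logistic regression): log(1 + e^{-m})."""
    return np.log1p(np.exp(-margin))

def exponential_loss(margin):
    """Exponential loss (AdaBoost): e^{-m}."""
    return np.exp(-margin)

margins = np.linspace(-2, 2, 5)
for name, fn in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                 ("logistic", logistic_loss), ("exponential", exponential_loss)]:
    print(name, np.round(fn(margins), 3))
# Each surrogate is convex in the margin; hinge and exponential upper-bound the
# 0-1 loss directly, while the logistic loss does so up to a 1/log(2) factor.
```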

Find ERM

With the model and loss defined as above,
we need to find the optimal parameters w that minimize the empirical risk. This leads us to the Empirical Risk Minimization (ERM) framework:
\hat{w} = \arg\min_w \frac{1}{n} \sum_{i=1}^n L(y_i, w^\top x_i)
Where L is our chosen surrogate loss function (such as hinge loss or logistic loss), and we sum over all n training examples.
Using this, we can apply general convex optimization methods

A Detour of Numerical Optimization Method

Numerical optimization in machine learning refers to algorithms that iteratively find the parameters (like weights ww) that minimize a loss function. Since most ML loss functions can't be solved analytically, we use iterative methods.
The most fundamental approach is gradient descent, which updates parameters in the direction opposite to the gradient: w^{(t+1)} = w^{(t)} - \eta \nabla L(w^{(t)}), where \eta is the learning rate.
These optimization methods are crucial for training ML models efficiently, especially for large datasets and complex models like neural networks.
In brief, given a function F(w), our goal is to find the w that minimizes F(w).

First-Order Optimization

Gradient Descent
Gradient descent is an iterative optimization algorithm that finds the minimum of a function by repeatedly taking steps in the direction of steepest descent. Starting from an initial point w^{(0)}, the algorithm updates the parameters using the rule w^{(t+1)} = w^{(t)} - \eta \nabla F(w^{(t)}), where \eta is the learning rate (step size) and \nabla F(w^{(t)}) is the gradient of the function at the current point.
The key intuition is that the gradient \nabla F(w) points in the direction of steepest ascent, so moving in the opposite direction (negative gradient) decreases the function value. The learning rate \eta controls how large each step is: if it's too large, the algorithm might overshoot the minimum; if it's too small, convergence will be slow.
For convex functions, gradient descent is guaranteed to converge to the global minimum (given an appropriate learning rate). For non-convex functions, it converges to a local minimum or saddle point. The algorithm terminates when the gradient becomes sufficiently small (near zero) or after a fixed number of iterations.
In the context of machine learning, we apply gradient descent to minimize the empirical risk F(w) = \frac{1}{n} \sum_{i=1}^n L(y_i, w^\top x_i), where the gradient is computed as \nabla F(w) = \frac{1}{n} \sum_{i=1}^n \nabla_w L(y_i, w^\top x_i). This requires computing the gradient of the loss function with respect to all training examples at each iteration.
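A minimal numpy sketch of full-batch gradient descent on the empirical risk with the logistic loss (the synthetic data and hyperparameters are illustrative assumptions):

```python
import numpy as np

def empirical_risk_grad(w, X, y):
    """Gradient of the average logistic loss (1/n) * sum_i log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))   # d/dw log(1 + e^{-m_i}) = -y_i x_i / (1 + e^{m_i})
    return (X * coeffs[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.5, num_iters=500):
    """Plain (full-batch) gradient descent: w <- w - eta * grad F(w)."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w -= eta * empirical_risk_grad(w, X, y)
    return w

# Tiny synthetic, linearly separable problem
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X @ np.array([2.0, -1.0]) > 0, 1.0, -1.0)
w_hat = gradient_descent(X, y)
print("training accuracy:", np.mean(np.sign(X @ w_hat) == y))
```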
Why use GD?
Gradient descent is appealing because of the first-order Taylor approximation. When we take a small step in the direction of the negative gradient, we're guaranteed to decrease the function value (assuming the step size is small enough). This is formalized by the Taylor expansion: F(w + \Delta w) \approx F(w) + \nabla F(w)^\top \Delta w, and choosing \Delta w = -\eta \nabla F(w) ensures F(w + \Delta w) < F(w) when \eta is appropriately chosen.
Variation: the learning rate may need to change over iterations
In practice, using a fixed learning rate throughout training can be suboptimal. A learning rate that is too large may cause the algorithm to overshoot the minimum and oscillate, while a learning rate that is too small leads to slow convergence. Therefore, adaptive learning rate schedules are often employed.
Common strategies include:
Step decay: Reduce the learning rate by a factor every few epochs (e.g., \eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor})
Exponential decay: Decay the learning rate exponentially over time (e.g., \eta_t = \eta_0 \cdot e^{-\lambda t})
1/t decay: Use \eta_t = \frac{\eta_0}{1 + \lambda t}, which decreases the learning rate as training progresses
Adaptive methods: Algorithms like Adam, RMSprop, and AdaGrad automatically adjust the learning rate based on the history of gradients
These adaptive strategies help the algorithm make larger steps early in training when far from the minimum, and smaller, more precise steps as it approaches the optimal solution.
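A quick sketch of the three decay schedules listed above (the hyperparameter values \eta_0, \gamma, k, \lambda are illustrative, not prescribed):

```python
import numpy as np

def step_decay(t, eta0=0.1, gamma=0.5, k=10):
    """eta_t = eta0 * gamma^floor(t / k): cut the rate by gamma every k steps."""
    return eta0 * gamma ** (t // k)

def exponential_decay(t, eta0=0.1, lam=0.05):
    """eta_t = eta0 * exp(-lam * t)."""
    return eta0 * np.exp(-lam * t)

def inverse_time_decay(t, eta0=0.1, lam=0.05):
    """eta_t = eta0 / (1 + lam * t)."""
    return eta0 / (1.0 + lam * t)

for t in [0, 10, 50, 100]:
    print(t, round(step_decay(t), 4), round(exponential_decay(t), 4),
          round(inverse_time_decay(t), 4))
```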
Adam Optimizer
Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the benefits of two other popular methods: momentum-based gradient descent and adaptive learning rates. Instead of using a single fixed learning rate for all parameters, Adam maintains separate adaptive learning rates for each parameter based on estimates of the first and second moments of the gradients.
The algorithm keeps track of two moving averages: m_t (the first moment, which is the exponentially decaying average of past gradients) and v_t (the second moment, which is the exponentially decaying average of past squared gradients). These are updated at each iteration t using:
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla F(w^{(t)})
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla F(w^{(t)}))^2
where \beta_1 and \beta_2 are decay rates (typically \beta_1 = 0.9 and \beta_2 = 0.999). Since m_t and v_t are initialized to zero, they are biased toward zero especially during the initial time steps. To correct this bias, Adam uses bias-corrected estimates:
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
Finally, the parameter update rule becomes:
w^{(t+1)} = w^{(t)} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
where \eta is the learning rate (often set to 0.001) and \epsilon is a small constant (typically 10^{-8}) added for numerical stability to prevent division by zero.
The key advantage of Adam is that it adapts the learning rate for each parameter individually. Parameters with large gradients get smaller effective learning rates (due to the division by \sqrt{\hat{v}_t}), while parameters with small gradients get larger effective learning rates. This makes Adam particularly effective for problems with sparse gradients or noisy objective functions, and it's widely used in training deep neural networks because it often converges faster than standard gradient descent and requires less manual tuning of the learning rate.
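A minimal numpy sketch of the update rules above; the toy quadratic objective and its step size are illustrative only:

```python
import numpy as np

def adam(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, num_iters=1000):
    """Adam: per-parameter adaptive steps from first/second moment estimates."""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)   # first moment: decaying average of gradients
    v = np.zeros_like(w)   # second moment: decaying average of squared gradients
    for t in range(1, num_iters + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Minimize the toy quadratic F(w) = ||w - 3||^2, whose gradient is 2 * (w - 3)
w_star = adam(lambda w: 2 * (w - 3.0), w0=np.zeros(2), eta=0.05, num_iters=2000)
print(w_star)  # approaches [3, 3]
```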
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of gradient descent that addresses the computational cost of calculating gradients over the entire training dataset. Instead of computing the gradient using all n training examples at each iteration, SGD randomly selects a single example (or a small batch of examples) and computes the gradient based only on that subset.
The update rule for SGD with a single example is w^{(t+1)} = w^{(t)} - \eta \nabla_w L(y_i, w^\top x_i), where i is randomly selected from the training set at each iteration. This makes each iteration much faster than standard gradient descent, especially for large datasets.
So, usually SGD needs more iterations, but each iteration takes less time.
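A minimal sketch of SGD with one example per step, again on the logistic loss with illustrative synthetic data:

```python
import numpy as np

def logistic_grad(w, x_i, y_i):
    """Per-example gradient of log(1 + exp(-y_i w^T x_i))."""
    return -y_i * x_i / (1.0 + np.exp(y_i * (x_i @ w)))

def sgd(X, y, eta=0.1, num_iters=5000, seed=0):
    """SGD: each step uses one randomly chosen example instead of the full dataset."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        i = rng.integers(len(X))                 # pick one example at random
        w -= eta * logistic_grad(w, X[i], y[i])
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.0, 2.0]) > 0, 1.0, -1.0)
print("training accuracy:", np.mean(np.sign(X @ sgd(X, y)) == y))
```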
And even for nonconvex objectives, (S)GD can still be applied, but then it can at best be expected to converge to a stationary point.
Such a stationary point may only be a local minimizer.
In optimization, a local minimizer (or local minimum) is a point where the function value is lower than all nearby points, but not necessarily the lowest point overall. Formally, a point w^* is a local minimizer if there exists some neighborhood around w^* where F(w^*) \leq F(w) for all w in that neighborhood.
In contrast, a global minimizer (or global minimum) is the point where the function achieves its absolute lowest value across the entire domain: F(w^*) \leq F(w) for all possible w.
For convex functions, every local minimizer is also a global minimizer, which is why convex optimization is particularly desirable in machine learning. However, for non-convex functions (like those in deep learning), there can be many local minima, and standard gradient descent might get stuck in a poor local minimum instead of reaching the global minimum.
This is where the stochasticity of SGD becomes advantageous: the noise in gradient estimates can help the algorithm escape shallow local minima and explore different regions of the parameter space, potentially finding better solutions.
But a stationary point can also be neither a local minimizer nor a local maximizer → Saddle Point
A saddle point is a critical point in optimization where the gradient is zero, but it is neither a local minimum nor a local maximum. Instead, it's a point where the function curves up in some directions and down in others—like the center of a saddle on a horse.
Mathematically, at a saddle point, the Hessian matrix (the matrix of second derivatives) has both positive and negative eigenvalues. This means that in some directions, the function is locally convex (curving upward), while in other directions, it is locally concave (curving downward).
Saddle points are particularly problematic in high-dimensional optimization problems like deep learning, where they can slow down or even halt the progress of gradient-based optimization algorithms. However, algorithms like SGD can escape saddle points more easily than local minima because the stochastic noise in gradient estimates provides enough perturbation to move away from these flat regions.
Even worse, distinguishing local minima from saddle points is generally NP-Hard.

Second-Order Optimization

Newton Method
Newton's method is a second-order optimization algorithm that uses both the gradient and the Hessian matrix (the matrix of second derivatives) to find the optimal parameters. Unlike gradient descent which only uses first-order information, Newton's method incorporates curvature information to make more informed steps toward the minimum.
The update rule for Newton's method is w^{(t+1)} = w^{(t)} - \eta [H(w^{(t)})]^{-1} \nabla F(w^{(t)}), where H(w) is the Hessian matrix of second partial derivatives. The key advantage is that Newton's method can converge in fewer iterations than gradient descent, especially near the optimum, because it accounts for the curvature of the function.
Hessian Matrix
The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function. For a function F(w) where w \in \mathbb{R}^d, the Hessian H(w) is a d \times d matrix where the element at position (i,j) is H_{ij} = \frac{\partial^2 F}{\partial w_i \partial w_j}.
The Hessian captures the curvature of the function at a given point, providing information about how the gradient itself is changing. When the Hessian is positive definite (all eigenvalues are positive), the point is a local minimum; when it's negative definite (all eigenvalues are negative), the point is a local maximum; and when it has both positive and negative eigenvalues, the point is a saddle point.
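A tiny numpy sketch of this eigenvalue test; the two hand-written Hessians are illustrative examples:

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the Hessian's eigenvalues (degenerate cases ignored)."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    return "saddle point"

# F(w) = w1^2 + w2^2 has Hessian 2*I at the origin -> local minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))
# F(w) = w1^2 - w2^2 has Hessian diag(2, -2) at the origin -> saddle point
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))
```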
However, computing and inverting the Hessian matrix is computationally expensive, requiring O(d^2) space and O(d^3) time for matrix inversion, where d is the number of parameters. This makes Newton's method impractical for high-dimensional problems like deep learning.
vs Gradient Descent
The Newton method has no learning rate in its pure form (a full step, \eta = 1, is used) and converges super fast near the optimum.
But computing the Hessian is very slow, and the method does not really make sense for nonconvex objectives.

Perceptron

Recall the Perceptron loss: L_{\text{perc}}(y, w^\top x) = \max(0, -y \cdot w^\top x)
Applying (stochastic) gradient descent to this loss gives the Perceptron algorithm.
The Perceptron algorithm is a simple online learning algorithm for binary classification. Given a training set of labeled examples, it iteratively updates the weight vector whenever it makes a prediction error on a training example.
The algorithm starts with an initial weight vector (often w^{(0)} = 0) and processes training examples one at a time. For each example (x_i, y_i), if the prediction \text{sign}(w^\top x_i) matches the true label y_i, no update is made. However, if there's a misclassification (i.e., y_i \cdot w^\top x_i \leq 0), the weight vector is updated according to w^{(t+1)} = w^{(t)} + \eta y_i x_i, where \eta is the learning rate (often set to 1).
SGD View
In summary, the Perceptron algorithm is SGD applied to the Perceptron loss (a minimal sketch follows after the points below).
The Perceptron algorithm makes sense as an application of Stochastic Gradient Descent (SGD) to the Perceptron loss function for several important reasons:
Natural gradient descent interpretation: The Perceptron update rule w^{(t+1)} = w^{(t)} + \eta y_i x_i emerges naturally from applying gradient descent to the Perceptron loss. When there's a misclassification (i.e., y_i \cdot w^\top x_i \leq 0), the gradient of the loss with respect to w is -y_i x_i, so moving in the opposite direction (adding \eta y_i x_i) reduces the loss.
Online learning efficiency: By processing one example at a time (SGD approach), the Perceptron algorithm can update weights immediately after each prediction error, rather than waiting to compute gradients over the entire dataset. This makes it extremely efficient for large-scale problems and allows the algorithm to start making better predictions quickly.
Geometric intuition: Each update moves the decision boundary (defined by w^\top x = 0) in the direction that correctly classifies the misclassified example. Adding y_i x_i to w effectively rotates the decision boundary toward correctly classifying point x_i with label y_i.
Convergence guarantee for separable data: If the training data is linearly separable, the Perceptron algorithm is guaranteed to converge to a solution that correctly classifies all training examples in a finite number of steps. This theoretical guarantee makes the algorithm particularly attractive for problems where linear separability holds.
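A minimal numpy sketch of the Perceptron loop described above (the synthetic separable data is illustrative):

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron: update w <- w + eta * y_i * x_i only on misclassified examples."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:        # non-positive margin: misclassified
                w += eta * y_i * x_i
                mistakes += 1
        if mistakes == 0:                   # all points correctly classified: stop
            break
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2))
y = np.where(X @ np.array([1.5, -0.5]) > 0, 1.0, -1.0)
w = perceptron(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```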

Logistic Regression

Logistic regression is a fundamental algorithm for binary classification that predicts the probability that an input belongs to a particular class. Unlike linear regression which outputs continuous values, logistic regression outputs probabilities between 0 and 1 by applying the logistic function (also called the sigmoid function) to a linear combination of features.
The logistic function is defined as \sigma(z) = \frac{1}{1 + e^{-z}}, which takes any real-valued input z and maps it to the range (0, 1). In logistic regression, we compute z = w^\top x and then apply the sigmoid function to get the predicted probability: P(y = 1 | x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}.
Why is it called Logistic Regression?
Despite its name, logistic regression is actually a classification algorithm, not a regression algorithm. The term "regression" comes from its historical development and the fact that it uses a regression-like approach (finding optimal weights) to predict probabilities, but the ultimate goal is classification rather than predicting continuous values.
Key Properties of the Logistic Function
The sigmoid function has several important properties that make it suitable for binary classification (illustrated in the short sketch after this list):
Bounded output: The output is always between 0 and 1, making it interpretable as a probability.
Smooth and differentiable: This allows us to use gradient-based optimization methods.
S-shaped curve: The function transitions smoothly from 0 to 1, with the steepest slope at z = 0.
Symmetry: \sigma(-z) = 1 - \sigma(z), which relates the probabilities of the two classes.
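A few-line numpy check of the boundedness and symmetry properties (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))                 # values stay in (0, 1), equal to 0.5 at z = 0
print(sigmoid(-z) + sigmoid(z))   # symmetry sigma(-z) = 1 - sigma(z): all entries are 1
```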
Decision Rule
For classification, we typically use a threshold (usually 0.5) to make predictions: predict y = 1 if \sigma(w^\top x) \geq 0.5, and y = -1 (or 0) otherwise. Since \sigma(z) \geq 0.5 when z \geq 0, this is equivalent to predicting y = 1 when w^\top x \geq 0, which gives us a linear decision boundary.
Loss Function: Cross-Entropy Loss
To train a logistic regression model, we use the cross-entropy loss (also called log loss or logistic loss).
For a single example (x_i, y_i) where y_i \in \{0, 1\}, the loss is:
L(y_i, \hat{y}_i) = -y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i)
where \hat{y}_i = \sigma(w^\top x_i) is the predicted probability.
This loss function penalizes confident wrong predictions heavily while giving small penalties for correct predictions.
For the entire training set, the objective is to minimize the average cross-entropy loss:
F(w) = -\frac{1}{n} \sum_{i=1}^n [y_i \log(\sigma(w^\top x_i)) + (1 - y_i) \log(1 - \sigma(w^\top x_i))]
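A short sketch of evaluating this objective in numpy (the toy X, y, and w values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    """Average cross-entropy loss F(w), with labels y in {0, 1}."""
    p = sigmoid(X @ w)                      # predicted probabilities P(y = 1 | x)
    eps = 1e-12                             # guard against log(0) on confident predictions
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 0.0, 0.0])
print(cross_entropy(np.zeros(2), X, y))           # log(2) ~ 0.693 for an uninformative w
print(cross_entropy(np.array([1.0, 1.0]), X, y))  # a better-aligned w gives a lower loss
```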
How to regress with discrete labels?
Instead of directly predicting discrete class labels, logistic regression predicts continuous probability values between 0 and 1 using the sigmoid function, which can then be thresholded for classification.
This approach has several advantages: it provides a smooth, differentiable objective function (cross-entropy loss) that can be optimized using gradient descent, it naturally handles the constraint that probabilities must be between 0 and 1, and it gives us confidence estimates rather than just hard classifications. The sigmoid function acts as a bridge between the continuous linear combination w^\top x and the discrete labels, allowing us to use regression-like optimization techniques for classification problems.
MLE Solution
Maximizing the likelihood \prod_{i=1}^n P(y_i | x_i; w) is equivalent to minimizing the average cross-entropy loss above, so the maximum likelihood estimate of w is exactly the ERM solution under the logistic loss.
Gradient Descent for Logistic Regression
The gradient of the cross-entropy loss with respect to w has a particularly elegant form. For a single example, the gradient is:
\nabla_w L = (\sigma(w^\top x_i) - y_i) x_i
This means the update rule for gradient descent is:
w^{(t+1)} = w^{(t)} - \eta \sum_{i=1}^n (\sigma(w^\top x_i) - y_i) x_i
Or for SGD with a single example:
w^{(t+1)} = w^{(t)} - \eta (\sigma(w^\top x_i) - y_i) x_i
Notice the similarity to the Perceptron update rule, but with the key difference that the error term (\sigma(w^\top x_i) - y_i) is continuous rather than binary, allowing for more nuanced updates based on the confidence of the prediction.
SGD
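A minimal sketch of this single-example SGD update for logistic regression (illustrative synthetic data, labels in {0, 1}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sgd(X, y, eta=0.1, num_iters=10000, seed=0):
    """SGD with the update w <- w - eta * (sigma(w^T x_i) - y_i) * x_i."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        i = rng.integers(len(X))
        w -= eta * (sigmoid(X[i] @ w) - y[i]) * X[i]
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)   # labels in {0, 1}
w = logistic_regression_sgd(X, y)
probs = sigmoid(X @ w)                               # P(y = 1 | x)
print("training accuracy:", np.mean((probs >= 0.5) == (y == 1)))
```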
Apply Newton Method to Logistic Regression
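A sketch of one way to do this, using the logistic-regression gradient X^\top(p - y)/n and Hessian X^\top \text{diag}(p(1-p)) X / n; the small ridge term and the noisy synthetic data are illustrative additions for numerical stability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_newton(X, y, num_iters=10):
    """Newton's method for logistic regression with labels y in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n                               # gradient of the average loss
        H = (X.T * (p * (1 - p))) @ X / n + 1e-6 * np.eye(d)   # Hessian + tiny ridge
        w -= np.linalg.solve(H, grad)                          # Newton step: H^{-1} grad
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
noise = 0.5 * rng.normal(size=200)
y = (X @ np.array([1.0, 1.0]) + noise > 0).astype(float)   # noisy labels keep w finite
w = logistic_regression_newton(X, y)
print("training accuracy:", np.mean((sigmoid(X @ w) >= 0.5) == (y == 1)))
```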
Advantages of Logistic Regression
Probabilistic interpretation: Unlike Perceptron, logistic regression provides probability estimates, not just class predictions.
Convex optimization: The cross-entropy loss is convex, guaranteeing that gradient descent will find the global minimum.
Well-calibrated probabilities: The output probabilities can be used for ranking and decision-making under uncertainty.
Regularization: Easy to add regularization (L1 or L2) to prevent overfitting.
Interpretability: The weights directly indicate the importance and direction of influence of each feature.
Comparison with Perceptron
While both Perceptron and logistic regression find linear decision boundaries, there are important differences:
Perceptron only updates on misclassified examples, while logistic regression updates on all examples based on the magnitude of error.
Perceptron's loss is not differentiable everywhere, while cross-entropy loss is smooth.
Logistic regression converges to the maximum likelihood solution, while Perceptron just finds any separating hyperplane (if one exists).
Logistic regression works well even when data is not linearly separable, while Perceptron may not converge in such cases.

Summary

Linear Models for Classification
STEP 1: Find hyperplanes (Set Models)
STEP 2: Pick surrogate loss
STEP 3: Find ERM