Changyu Lee

Classification & Logistic Regression

Published at
2026/01/02
Last edited time
2026/01/02 00:23
Created
2025/12/31 23:09
Section
ML
Status
Done
Series
From the Bottom
Tags
Lecture Summary
AI summary
This document explores binary classification using linear classifiers and logistic regression, discussing surrogate loss functions, optimization methods like gradient descent and its variants, and the mathematical formulation of logistic regression. Key concepts include hyperplane decision boundaries, the transition from non-differentiable 0-1 loss to differentiable surrogate losses, and the application of logistic regression for probabilistic predictions. The importance of adaptive learning rates and methods such as Stochastic Gradient Descent and Newton's method for optimization are also highlighted, along with the advantages of logistic regression over the Perceptron algorithm.
Keywords
ML
Language
ENG
Week
3
Binary classification uses linear classifiers with hyperplanes (w^\top x = 0) as decision boundaries, and nonlinear mappings handle non-linearly separable data
Surrogate losses (hinge, logistic, exponential) replace the non-differentiable 0-1 loss to enable gradient-based optimization through Empirical Risk Minimization (ERM)
Gradient Descent iteratively minimizes loss functions using w^{(t+1)} = w^{(t)} - \eta \nabla F(w^{(t)}), with adaptive learning rates and optimizers like Adam improving convergence
Stochastic Gradient Descent (SGD) computes gradients on random subsets, making iterations faster and helping escape local minima and saddle points in non-convex problems
Newton's method uses second-order information (Hessian matrix) for faster convergence but is computationally expensive for high-dimensional problems
The Perceptron algorithm updates weights only on misclassified examples using w^{(t+1)} = w^{(t)} + \eta y_i x_i, applying SGD to the perceptron loss
Local minima, global minima, and saddle points are critical concepts in optimization, with saddle points being particularly challenging in high-dimensional spaces
Distinguishing between local minima and saddle points is NP-Hard, making optimization in non-convex settings computationally difficult
Logistic Regression applies a logistic (sigmoid) function \sigma(z) = \frac{1}{1 + e^{-z}} to the linear combination w^\top x to produce probabilities between 0 and 1 for binary classification
The logistic loss (also called cross-entropy loss) is defined as L(y, w^\top x) = \log(1 + e^{-y \cdot w^\top x}), which is convex and differentiable, making it suitable for gradient-based optimization
Logistic regression is trained by minimizing the negative log-likelihood (equivalent to minimizing logistic loss): \hat{w} = \arg\min_w \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i \cdot w^\top x_i})
The gradient of logistic loss is \nabla_w L(y_i, w^\top x_i) = -\frac{y_i x_i}{1 + e^{y_i \cdot w^\top x_i}}, which can be used in gradient descent or SGD to find optimal parameters
Unlike the perceptron which only updates on misclassified examples, logistic regression updates on all examples with weights proportional to the prediction confidence
Logistic regression provides probabilistic predictions P(y=1|x) = \sigma(w^\top x), allowing for uncertainty quantification and threshold adjustment based on application requirements

Classification & Logistic Regression

Linear Classifiers (Binary Classification) and Surrogate Losses

Binary Classification

Let’s assume we are doing binary classification, which means the number of classes is 2 and the labels are -1 and 1 (a very simple case).
What we need to do first is to find the “model”.
To build a binary classification model, the sign of w^\top x can be used: \hat{y} = \text{sign}(w^\top x).
A hyperplane is a flat subspace that divides the feature space into two regions. In 2D, it's a line; in 3D, it's a plane; in higher dimensions, it generalizes accordingly. For binary classification, the hyperplane w^\top x = 0 serves as the decision boundary that separates the two classes.
But what if the data is clearly not linearly separable?
e.g., XOR-like data or two concentric rings of points
We can apply a nonlinear feature mapping \phi(x) and classify with \text{sign}(w^\top \phi(x)), as sketched below.
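A minimal sketch of this idea in numpy, using a hypothetical quadratic feature map and synthetic ring-shaped data (both are illustrative, not from the lecture):

```python
import numpy as np

# Toy data: inner disk labeled +1, outer ring labeled -1 (not linearly separable in 2D)
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
y = np.concatenate([np.ones(100), -np.ones(100)])

def phi(x):
    """Quadratic feature map: (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

# In the mapped space a hyperplane separates the two rings: with w = (0, 0, -1)
# and offset 2.25, w^T phi(x) + 2.25 is positive on the inner disk, negative outside.
w, b = np.array([0.0, 0.0, -1.0]), 2.25
preds = np.sign(np.array([w @ phi(x) + b for x in X]))
print("training accuracy:", np.mean(preds == y))
```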

Surrogate Losses

The next step is to define a loss function.
The 0-1 loss (also called the classification error or indicator loss) is the most intuitive loss function for classification problems. It directly measures whether a prediction is correct or incorrect.
The 0-1 loss is defined as:
L_{0-1}(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}]
Where:
y is the true label (either -1 or 1 in binary classification)
\hat{y} is the predicted label
The loss equals 0 if the prediction is correct, and 1 if it's incorrect
For linear classifiers where \hat{y} = \text{sign}(w^\top x), we can rewrite this as:
L_{0-1}(y, w^\top x) = \mathbb{1}[y \cdot w^\top x \leq 0]
This means the loss is 1 when y \cdot w^\top x \leq 0 (misclassification) and 0 when y \cdot w^\top x > 0 (correct classification). The term y \cdot w^\top x is called the margin, and a positive margin indicates correct classification.
However, the 0-1 loss has a critical problem: it's not differentiable and not convex, making it difficult to optimize using gradient-based methods.
To address this issue, we use surrogate losses that are convex and differentiable approximations of the 0-1 loss. These surrogate losses upper-bound the 0-1 loss and enable efficient optimization through gradient descent.
Common surrogate losses include the hinge loss (used in SVMs), logistic loss (used in logistic regression), and exponential loss (used in AdaBoost).
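A small numpy sketch of these losses, written as functions of the margin m = y \cdot w^\top x (the function names and sample margins are illustrative):

```python
import numpy as np

def zero_one_loss(margin):
    """0-1 loss as a function of the margin m = y * w^T x."""
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    """Hinge loss (SVMs): max(0, 1 - m)."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic loss (logistic regression): log(1 + e^{-m})."""
    return np.log1p(np.exp(-margin))

def exponential_loss(margin):
    """Exponential loss (AdaBoost): e^{-m}."""
    return np.exp(-margin)

margins = np.linspace(-2, 2, 5)
for name, fn in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                 ("logistic", logistic_loss), ("exponential", exponential_loss)]:
    print(name, np.round(fn(margins), 3))
# Each surrogate is convex in the margin; hinge and exponential upper-bound the
# 0-1 loss directly, while the logistic loss does so up to a 1/log(2) factor.
```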

Find ERM

With the model and loss defined as above,
we need to find the optimal parameters w that minimize the empirical risk. This leads us to the Empirical Risk Minimization (ERM) framework:
\hat{w} = \arg\min_w \frac{1}{n} \sum_{i=1}^n L(y_i, w^\top x_i)
Where L is our chosen surrogate loss function (such as hinge loss or logistic loss), and we sum over all n training examples.
Using this, we can apply general convex optimization methods

A Detour of Numerical Optimization Method

Numerical optimization in machine learning refers to algorithms that iteratively find the parameters (like weights ww) that minimize a loss function. Since most ML loss functions can't be solved analytically, we use iterative methods.
The most fundamental approach is gradient descent, which updates parameters in the direction opposite to the gradient: w^{(t+1)} = w^{(t)} - \eta \nabla L(w^{(t)}), where \eta is the learning rate.
These optimization methods are crucial for training ML models efficiently, especially for large datasets and complex models like neural networks.
In brief, given a function F(w), our goal is to find the w that minimizes F(w).

First-Order Optimization

Gradient Descent
Gradient descent is an iterative optimization algorithm that finds the minimum of a function by repeatedly taking steps in the direction of steepest descent. Starting from an initial point w^{(0)}, the algorithm updates the parameters using the rule w^{(t+1)} = w^{(t)} - \eta \nabla F(w^{(t)}), where \eta is the learning rate (step size) and \nabla F(w^{(t)}) is the gradient of the function at the current point.
The key intuition is that the gradient \nabla F(w) points in the direction of steepest ascent, so moving in the opposite direction (negative gradient) decreases the function value. The learning rate \eta controls how large each step is: if it's too large, the algorithm might overshoot the minimum; if it's too small, convergence will be slow.
For convex functions, gradient descent is guaranteed to converge to the global minimum (given an appropriate learning rate). For non-convex functions, it converges to a local minimum or saddle point. The algorithm terminates when the gradient becomes sufficiently small (near zero) or after a fixed number of iterations.
In the context of machine learning, we apply gradient descent to minimize the empirical risk F(w) = \frac{1}{n} \sum_{i=1}^n L(y_i, w^\top x_i), where the gradient is computed as \nabla F(w) = \frac{1}{n} \sum_{i=1}^n \nabla_w L(y_i, w^\top x_i). This requires computing the gradient of the loss function with respect to all training examples at each iteration.
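A minimal numpy sketch of full-batch gradient descent on the empirical risk with the logistic loss (the synthetic data and hyperparameters are illustrative assumptions):

```python
import numpy as np

def empirical_risk_grad(w, X, y):
    """Gradient of the average logistic loss (1/n) * sum_i log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))   # d/dw log(1 + e^{-m_i}) = -y_i x_i / (1 + e^{m_i})
    return (X * coeffs[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.5, num_iters=500):
    """Plain (full-batch) gradient descent: w <- w - eta * grad F(w)."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w -= eta * empirical_risk_grad(w, X, y)
    return w

# Tiny synthetic, linearly separable problem
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X @ np.array([2.0, -1.0]) > 0, 1.0, -1.0)
w_hat = gradient_descent(X, y)
print("training accuracy:", np.mean(np.sign(X @ w_hat) == y))
```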
Why use GD?
Gradient descent is appealing because of the first-order Taylor approximation. When we take a small step in the direction of the negative gradient, we're guaranteed to decrease the function value (assuming the step size is small enough). This is formalized by the Taylor expansion: F(w + \Delta w) \approx F(w) + \nabla F(w)^\top \Delta w, and choosing \Delta w = -\eta \nabla F(w) ensures F(w + \Delta w) < F(w) when \eta is appropriately chosen.
Variation: the learning rate may need to change over iterations
In practice, using a fixed learning rate throughout training can be suboptimal. A learning rate that is too large may cause the algorithm to overshoot the minimum and oscillate, while a learning rate that is too small leads to slow convergence. Therefore, adaptive learning rate schedules are often employed.
Common strategies include:
Step decay: Reduce the learning rate by a factor every few epochs (e.g., \eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor})
Exponential decay: Decay the learning rate exponentially over time (e.g., \eta_t = \eta_0 \cdot e^{-\lambda t})
1/t decay: Use \eta_t = \frac{\eta_0}{1 + \lambda t}, which decreases the learning rate as training progresses
Adaptive methods: Algorithms like Adam, RMSprop, and AdaGrad automatically adjust the learning rate based on the history of gradients
These adaptive strategies help the algorithm make larger steps early in training when far from the minimum, and smaller, more precise steps as it approaches the optimal solution.
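A quick sketch of the three decay schedules listed above (the hyperparameter values \eta_0, \gamma, k, \lambda are illustrative, not prescribed):

```python
import numpy as np

def step_decay(t, eta0=0.1, gamma=0.5, k=10):
    """eta_t = eta0 * gamma^floor(t / k): cut the rate by gamma every k steps."""
    return eta0 * gamma ** (t // k)

def exponential_decay(t, eta0=0.1, lam=0.05):
    """eta_t = eta0 * exp(-lam * t)."""
    return eta0 * np.exp(-lam * t)

def inverse_time_decay(t, eta0=0.1, lam=0.05):
    """eta_t = eta0 / (1 + lam * t)."""
    return eta0 / (1.0 + lam * t)

for t in [0, 10, 50, 100]:
    print(t, round(step_decay(t), 4), round(exponential_decay(t), 4),
          round(inverse_time_decay(t), 4))
```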
Adam Optimizer
Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the benefits of two other popular methods: momentum-based gradient descent and adaptive learning rates. Instead of using a single fixed learning rate for all parameters, Adam maintains separate adaptive learning rates for each parameter based on estimates of the first and second moments of the gradients.
The algorithm keeps track of two moving averages: m_t (the first moment, which is the exponentially decaying average of past gradients) and v_t (the second moment, which is the exponentially decaying average of past squared gradients). These are updated at each iteration t using:
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla F(w^{(t)})
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla F(w^{(t)}))^2
where \beta_1 and \beta_2 are decay rates (typically \beta_1 = 0.9 and \beta_2 = 0.999). Since m_t and v_t are initialized to zero, they are biased toward zero especially during the initial time steps. To correct this bias, Adam uses bias-corrected estimates:
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
Finally, the parameter update rule becomes:
w^{(t+1)} = w^{(t)} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
where \eta is the learning rate (often set to 0.001) and \epsilon is a small constant (typically 10^{-8}) added for numerical stability to prevent division by zero.
The key advantage of Adam is that it adapts the learning rate for each parameter individually. Parameters with large gradients get smaller effective learning rates (due to the division by \sqrt{\hat{v}_t}), while parameters with small gradients get larger effective learning rates. This makes Adam particularly effective for problems with sparse gradients or noisy objective functions, and it's widely used in training deep neural networks because it often converges faster than standard gradient descent and requires less manual tuning of the learning rate.
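A minimal numpy sketch of the update rules above; the toy quadratic objective and its step size are illustrative only:

```python
import numpy as np

def adam(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, num_iters=1000):
    """Adam: per-parameter adaptive steps from first/second moment estimates."""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)   # first moment: decaying average of gradients
    v = np.zeros_like(w)   # second moment: decaying average of squared gradients
    for t in range(1, num_iters + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Minimize the toy quadratic F(w) = ||w - 3||^2, whose gradient is 2 * (w - 3)
w_star = adam(lambda w: 2 * (w - 3.0), w0=np.zeros(2), eta=0.05, num_iters=2000)
print(w_star)  # approaches [3, 3]
```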
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of gradient descent that addresses the computational cost of calculating gradients over the entire training dataset. Instead of computing the gradient using all n training examples at each iteration, SGD randomly selects a single example (or a small batch of examples) and computes the gradient based only on that subset.
The update rule for SGD with a single example is w^{(t+1)} = w^{(t)} - \eta \nabla_w L(y_i, w^\top x_i), where i is randomly selected from the training set at each iteration. This makes each iteration much faster than standard gradient descent, especially for large datasets.
So, usually SGD needs more iterations, but each iteration takes less time.
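A minimal sketch of SGD with one example per step, again on the logistic loss with illustrative synthetic data:

```python
import numpy as np

def logistic_grad(w, x_i, y_i):
    """Per-example gradient of log(1 + exp(-y_i w^T x_i))."""
    return -y_i * x_i / (1.0 + np.exp(y_i * (x_i @ w)))

def sgd(X, y, eta=0.1, num_iters=5000, seed=0):
    """SGD: each step uses one randomly chosen example instead of the full dataset."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        i = rng.integers(len(X))                 # pick one example at random
        w -= eta * logistic_grad(w, X[i], y[i])
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.0, 2.0]) > 0, 1.0, -1.0)
print("training accuracy:", np.mean(np.sign(X @ sgd(X, y)) == y))
```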
And even for nonconvex objectives, (S)GD can still be applied, but then it can at best be expected to converge to a stationary point.
Such a stationary point may only be a local minimizer.
In optimization, a local minimizer (or local minimum) is a point where the function value is lower than all nearby points, but not necessarily the lowest point overall. Formally, a point w^* is a local minimizer if there exists some neighborhood around w^* where F(w^*) \leq F(w) for all w in that neighborhood.
In contrast, a global minimizer (or global minimum) is the point where the function achieves its absolute lowest value across the entire domain: F(w^*) \leq F(w) for all possible w.
For convex functions, every local minimizer is also a global minimizer, which is why convex optimization is particularly desirable in machine learning. However, for non-convex functions (like those in deep learning), there can be many local minima, and standard gradient descent might get stuck in a poor local minimum instead of reaching the global minimum.
This is where the stochasticity of SGD becomes advantageous: the noise in gradient estimates can help the algorithm escape shallow local minima and explore different regions of the parameter space, potentially finding better solutions.
But a stationary point can also be neither a local minimizer nor a local maximizer → Saddle Point
A saddle point is a critical point in optimization where the gradient is zero, but it is neither a local minimum nor a local maximum. Instead, it's a point where the function curves up in some directions and down in others—like the center of a saddle on a horse.
Mathematically, at a saddle point, the Hessian matrix (the matrix of second derivatives) has both positive and negative eigenvalues. This means that in some directions, the function is locally convex (curving upward), while in other directions, it is locally concave (curving downward).
Saddle points are particularly problematic in high-dimensional optimization problems like deep learning, where they can slow down or even halt the progress of gradient-based optimization algorithms. However, algorithms like SGD can escape saddle points more easily than local minima because the stochastic noise in gradient estimates provides enough perturbation to move away from these flat regions.
Even worse, distinguishing local minima from saddle points is generally NP-Hard.

Second-Order Optimization

Newton Method
Newton's method is a second-order optimization algorithm that uses both the gradient and the Hessian matrix (the matrix of second derivatives) to find the optimal parameters. Unlike gradient descent which only uses first-order information, Newton's method incorporates curvature information to make more informed steps toward the minimum.
The update rule for Newton's method is w^{(t+1)} = w^{(t)} - \eta [H(w^{(t)})]^{-1} \nabla F(w^{(t)}), where H(w) is the Hessian matrix of second partial derivatives. The key advantage is that Newton's method can converge in fewer iterations than gradient descent, especially near the optimum, because it accounts for the curvature of the function.
Hessian Matrix
The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function. For a function F(w) where w \in \mathbb{R}^d, the Hessian H(w) is a d \times d matrix where the element at position (i,j) is H_{ij} = \frac{\partial^2 F}{\partial w_i \partial w_j}.
The Hessian captures the curvature of the function at a given point, providing information about how the gradient itself is changing. When the Hessian is positive definite (all eigenvalues are positive), the point is a local minimum; when it's negative definite (all eigenvalues are negative), the point is a local maximum; and when it has both positive and negative eigenvalues, the point is a saddle point.
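A tiny numpy sketch of this eigenvalue test; the two hand-written Hessians are illustrative examples:

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the Hessian's eigenvalues (degenerate cases ignored)."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    return "saddle point"

# F(w) = w1^2 + w2^2 has Hessian 2*I at the origin -> local minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))
# F(w) = w1^2 - w2^2 has Hessian diag(2, -2) at the origin -> saddle point
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))
```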
However, computing and inverting the Hessian matrix is computationally expensive, requiring O(d^2) space and O(d^3) time for matrix inversion, where d is the number of parameters. This makes Newton's method impractical for high-dimensional problems like deep learning.
vs Gradient Descent
The Newton method has no learning rate in its pure form (a full step, \eta = 1, is used) and converges super fast near the optimum.
But computing the Hessian is very slow, and the method does not really make sense for nonconvex objectives.

Perceptron

Recall the Perceptron loss: L_{\text{perc}}(y, w^\top x) = \max(0, -y \cdot w^\top x)
Applying (stochastic) gradient descent to this loss gives the Perceptron algorithm.
The Perceptron algorithm is a simple online learning algorithm for binary classification. Given a training set of labeled examples, it iteratively updates the weight vector whenever it makes a prediction error on a training example.
The algorithm starts with an initial weight vector (often w^{(0)} = 0) and processes training examples one at a time. For each example (x_i, y_i), if the prediction \text{sign}(w^\top x_i) matches the true label y_i, no update is made. However, if there's a misclassification (i.e., y_i \cdot w^\top x_i \leq 0), the weight vector is updated according to w^{(t+1)} = w^{(t)} + \eta y_i x_i, where \eta is the learning rate (often set to 1).
SGD View
In summary, the Perceptron algorithm is SGD applied to the Perceptron loss (a minimal sketch follows after the points below).
The Perceptron algorithm makes sense as an application of Stochastic Gradient Descent (SGD) to the Perceptron loss function for several important reasons:
Natural gradient descent interpretation: The Perceptron update rule w^{(t+1)} = w^{(t)} + \eta y_i x_i emerges naturally from applying gradient descent to the Perceptron loss. When there's a misclassification (i.e., y_i \cdot w^\top x_i \leq 0), the gradient of the loss with respect to w is -y_i x_i, so moving in the opposite direction (adding \eta y_i x_i) reduces the loss.
Online learning efficiency: By processing one example at a time (SGD approach), the Perceptron algorithm can update weights immediately after each prediction error, rather than waiting to compute gradients over the entire dataset. This makes it extremely efficient for large-scale problems and allows the algorithm to start making better predictions quickly.
Geometric intuition: Each update moves the decision boundary (defined by w^\top x = 0) in the direction that correctly classifies the misclassified example. Adding y_i x_i to w effectively rotates the decision boundary toward correctly classifying point x_i with label y_i.
Convergence guarantee for separable data: If the training data is linearly separable, the Perceptron algorithm is guaranteed to converge to a solution that correctly classifies all training examples in a finite number of steps. This theoretical guarantee makes the algorithm particularly attractive for problems where linear separability holds.
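A minimal numpy sketch of the Perceptron loop described above (the synthetic separable data is illustrative):

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron: update w <- w + eta * y_i * x_i only on misclassified examples."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:        # non-positive margin: misclassified
                w += eta * y_i * x_i
                mistakes += 1
        if mistakes == 0:                   # all points correctly classified: stop
            break
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2))
y = np.where(X @ np.array([1.5, -0.5]) > 0, 1.0, -1.0)
w = perceptron(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```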

Logistic Regression

Logistic regression is a fundamental algorithm for binary classification that predicts the probability that an input belongs to a particular class. Unlike linear regression which outputs continuous values, logistic regression outputs probabilities between 0 and 1 by applying the logistic function (also called the sigmoid function) to a linear combination of features.
The logistic function is defined as \sigma(z) = \frac{1}{1 + e^{-z}}, which takes any real-valued input z and maps it to the range (0, 1). In logistic regression, we compute z = w^\top x and then apply the sigmoid function to get the predicted probability: P(y = 1 | x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}.
Why is it called Logistic Regression?
Despite its name, logistic regression is actually a classification algorithm, not a regression algorithm. The term "regression" comes from its historical development and the fact that it uses a regression-like approach (finding optimal weights) to predict probabilities, but the ultimate goal is classification rather than predicting continuous values.
Key Properties of the Logistic Function
The sigmoid function has several important properties that make it suitable for binary classification (illustrated in the short sketch after this list):
Bounded output: The output is always between 0 and 1, making it interpretable as a probability.
Smooth and differentiable: This allows us to use gradient-based optimization methods.
S-shaped curve: The function transitions smoothly from 0 to 1, with the steepest slope at z = 0.
Symmetry: \sigma(-z) = 1 - \sigma(z), which relates the probabilities of the two classes.
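A few-line numpy check of the boundedness and symmetry properties (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))                 # values stay in (0, 1), equal to 0.5 at z = 0
print(sigmoid(-z) + sigmoid(z))   # symmetry sigma(-z) = 1 - sigma(z): all entries are 1
```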
Decision Rule
For classification, we typically use a threshold (usually 0.5) to make predictions: predict y = 1 if \sigma(w^\top x) \geq 0.5, and y = -1 (or 0) otherwise. Since \sigma(z) \geq 0.5 when z \geq 0, this is equivalent to predicting y = 1 when w^\top x \geq 0, which gives us a linear decision boundary.
Loss Function: Cross-Entropy Loss
To train a logistic regression model, we use the cross-entropy loss (also called log loss or logistic loss).
For a single example (x_i, y_i) where y_i \in \{0, 1\}, the loss is:
L(y_i, \hat{y}_i) = -y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i)
where \hat{y}_i = \sigma(w^\top x_i) is the predicted probability.
This loss function penalizes confident wrong predictions heavily while giving small penalties for correct predictions.
For the entire training set, the objective is to minimize the average cross-entropy loss:
F(w) = -\frac{1}{n} \sum_{i=1}^n [y_i \log(\sigma(w^\top x_i)) + (1 - y_i) \log(1 - \sigma(w^\top x_i))]
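A short sketch of evaluating this objective in numpy (the toy X, y, and w values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    """Average cross-entropy loss F(w), with labels y in {0, 1}."""
    p = sigmoid(X @ w)                      # predicted probabilities P(y = 1 | x)
    eps = 1e-12                             # guard against log(0) on confident predictions
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 0.0, 0.0])
print(cross_entropy(np.zeros(2), X, y))           # log(2) ~ 0.693 for an uninformative w
print(cross_entropy(np.array([1.0, 1.0]), X, y))  # a better-aligned w gives a lower loss
```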
How to regress with discrete labels?
Instead of directly predicting discrete class labels, logistic regression predicts continuous probability values between 0 and 1 using the sigmoid function, which can then be thresholded for classification.
This approach has several advantages: it provides a smooth, differentiable objective function (cross-entropy loss) that can be optimized using gradient descent, it naturally handles the constraint that probabilities must be between 0 and 1, and it gives us confidence estimates rather than just hard classifications. The sigmoid function acts as a bridge between the continuous linear combination w^\top x and the discrete labels, allowing us to use regression-like optimization techniques for classification problems.
MLE Solution
Maximizing the likelihood \prod_{i=1}^n P(y_i | x_i; w) is equivalent to minimizing the average cross-entropy loss above, so the maximum likelihood estimate of w is exactly the ERM solution under the logistic loss.
Gradient Descent for Logistic Regression
The gradient of the cross-entropy loss with respect to w has a particularly elegant form. For a single example, the gradient is:
\nabla_w L = (\sigma(w^\top x_i) - y_i) x_i
This means the update rule for gradient descent is:
w^{(t+1)} = w^{(t)} - \eta \sum_{i=1}^n (\sigma(w^\top x_i) - y_i) x_i
Or for SGD with a single example:
w^{(t+1)} = w^{(t)} - \eta (\sigma(w^\top x_i) - y_i) x_i
Notice the similarity to the Perceptron update rule, but with the key difference that the error term (\sigma(w^\top x_i) - y_i) is continuous rather than binary, allowing for more nuanced updates based on the confidence of the prediction.
SGD
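A minimal sketch of this single-example SGD update for logistic regression (illustrative synthetic data, labels in {0, 1}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sgd(X, y, eta=0.1, num_iters=10000, seed=0):
    """SGD with the update w <- w - eta * (sigma(w^T x_i) - y_i) * x_i."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        i = rng.integers(len(X))
        w -= eta * (sigmoid(X[i] @ w) - y[i]) * X[i]
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)   # labels in {0, 1}
w = logistic_regression_sgd(X, y)
probs = sigmoid(X @ w)                               # P(y = 1 | x)
print("training accuracy:", np.mean((probs >= 0.5) == (y == 1)))
```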
Apply Newton Method to Logistic Regression
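A sketch of one way to do this, using the logistic-regression gradient X^\top(p - y)/n and Hessian X^\top \text{diag}(p(1-p)) X / n; the small ridge term and the noisy synthetic data are illustrative additions for numerical stability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_newton(X, y, num_iters=10):
    """Newton's method for logistic regression with labels y in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n                               # gradient of the average loss
        H = (X.T * (p * (1 - p))) @ X / n + 1e-6 * np.eye(d)   # Hessian + tiny ridge
        w -= np.linalg.solve(H, grad)                          # Newton step: H^{-1} grad
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
noise = 0.5 * rng.normal(size=200)
y = (X @ np.array([1.0, 1.0]) + noise > 0).astype(float)   # noisy labels keep w finite
w = logistic_regression_newton(X, y)
print("training accuracy:", np.mean((sigmoid(X @ w) >= 0.5) == (y == 1)))
```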
Advantages of Logistic Regression
Probabilistic interpretation: Unlike Perceptron, logistic regression provides probability estimates, not just class predictions.
Convex optimization: The cross-entropy loss is convex, guaranteeing that gradient descent will find the global minimum.
Well-calibrated probabilities: The output probabilities can be used for ranking and decision-making under uncertainty.
Regularization: Easy to add regularization (L1 or L2) to prevent overfitting.
Interpretability: The weights directly indicate the importance and direction of influence of each feature.
Comparison with Perceptron
While both Perceptron and logistic regression find linear decision boundaries, there are important differences:
Perceptron only updates on misclassified examples, while logistic regression updates on all examples based on the magnitude of error.
Perceptron's loss is not differentiable everywhere, while cross-entropy loss is smooth.
Logistic regression converges to the maximum likelihood solution, while Perceptron just finds any separating hyperplane (if one exists).
Logistic regression works well even when data is not linearly separable, while Perceptron may not converge in such cases.

Summary

Linear Models for Classification
STEP 1: Find hyperplanes (Set Models)
STEP 2: Pick surrogate loss
STEP 3: Find ERM