Changyu Lee

Linear Regression

Published at
2025/12/29
Last edited time
2025/12/29 09:43
Created
2025/12/23 08:24
Section
ML
Status
Done
Series
From the Bottom
Tags
Lecture Summary
AI summary
Linear regression predicts continuous outcomes, contrasting with classification's discrete outputs. It involves minimizing prediction errors using metrics like squared and absolute errors. The model's parameters, intercept and slope, are optimized to reduce the Residual Sum of Squares (RSS). Overfitting can occur with excessive model complexity, leading to poor generalization. Solutions include collecting more data, model selection through cross-validation, and regularization to prevent large parameter values, ensuring a balance between fitting the data and maintaining model simplicity.
Keywords
ML
Overfitting
Linear Regression
Language
ENG
Week
Linear regression predicts continuous numerical values by finding optimal parameters (intercept and slope) that minimize prediction error
The Residual Sum of Squares (RSS) serves as the objective function, which is convex and guarantees a unique global minimum
Closed-form solution exists for finding optimal parameters: w* = (X̃ᵀX̃)⁻¹X̃ᵀy
Nonlinear basis functions (polynomial, RBF, Fourier) allow linear regression to model nonlinear relationships while remaining linear in parameters
Overfitting occurs when models become too complex, fitting training data noise rather than underlying patterns, resulting in poor generalization
Solutions to overfitting include collecting more data, model selection through cross-validation, and regularization (ridge regression)
Regularization adds penalty term λ∑w²ⱼ to discourage large parameters, balancing data fitting with model simplicity
Linear feature maps do not increase model expressiveness since they can be absorbed into parameter redefinition

Linear Regression

Key differences from classification
Continuous vs. discrete output values
Classification outputs discrete class labels (e.g., spam/not spam), while regression outputs continuous numerical values (e.g., predicting house prices or temperature)
Different error measurement approaches (linear regression inherently produces nonzero prediction errors, so error is measured by how far predictions are from the true values rather than by exact match)
Distinct learning algorithms and optimization techniques
Intercept & Slope
The intercept (b) represents the y-value when x=0, while the slope (w) indicates how much y changes for each unit increase in x
In linear regression, we aim to find optimal values for these parameters that minimize prediction error
How to measure error
The 0/1 classification error is inappropriate for continuous outcomes.
1.
Squared Error:
\text{SE}(y, \hat{y}) = (y - \hat{y})^2
where y is the actual value and ŷ is the predicted value. Squaring ensures all errors are positive and penalizes larger errors more heavily.
2.
Absolute Error:
\text{AE}(y, \hat{y}) = |y - \hat{y}|
where y is the actual value and ŷ is the predicted value. The absolute value ensures all errors are positive, but unlike squared error, it penalizes all errors linearly regardless of their magnitude.
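A minimal NumPy sketch of these two error measures (the toy values for y and ŷ are made up for illustration):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # actual values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

squared_error  = (y_true - y_pred) ** 2     # SE: penalizes large errors quadratically
absolute_error = np.abs(y_true - y_pred)    # AE: penalizes all errors linearly

print(squared_error.mean())    # mean squared error over the toy points
print(absolute_error.mean())   # mean absolute error over the toy points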

Setup & Algorithm

Goal: Pick the model (i.e., its unknown parameters) that minimizes the average/total prediction error on the training dataset
→ we cannot use test dataset while training
Setups
Training Data: A set of n input-output pairs {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)} where xᵢ represents features and yᵢ represents the target value
Model: Linear function f(x) = wx + b, where w is the weight (slope) and b is the bias (intercept)
Note: for notational convenience, append 1 to each x as the first feature:
\tilde{x} = [\,1 \;\; x_1 \;\; x_2 \;\; \ldots \;\; x_D\,]^{\top}
and let
\tilde{w} = [\,w_0 \;\; w_1 \;\; w_2 \;\; \ldots \;\; w_D\,]^{\top}
be a concise representation of all D + 1 parameters.
The model becomes simply
f(x) = \tilde{w}^{\top}\tilde{x}
Goal: Minimize total squared error
\mathrm{RSS}(\tilde{w}) = \|\tilde{X}\tilde{w} - y\|_2^2
\tilde{w}^* = \arg\min_{\tilde{w} \in \mathbb{R}^{D+1}} \mathrm{RSS}(\tilde{w})
\mathrm{RSS}(\tilde{w}) = \sum_{i=1}^{n} (y_i - f(x_i))^2 = \sum_{i=1}^{n} (y_i - \tilde{w}^{\top}\tilde{x}_i)^2
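A minimal sketch of computing RSS with the augmented design matrix X̃; the toy data and the candidate w̃ are made up:

import numpy as np

# toy data: n = 4 points, D = 1 feature (values are illustrative)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# append a leading 1 to each x, giving X_tilde of shape (n, D + 1)
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

def rss(w_tilde):
    # RSS(w~) = || X~ w~ - y ||_2^2
    residual = X_tilde @ w_tilde - y
    return residual @ residual

print(rss(np.array([0.0, 2.0])))  # RSS of the candidate model f(x) = 0 + 2x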
Objective Function
The objective function (also called loss function or cost function) is a mathematical function that quantifies how well our model's predictions match the actual target values in the training data.
For linear regression, we use the Residual Sum of Squares (RSS) defined above as our objective function.
Key points about the objective function:
Measures prediction error: It calculates the sum of squared differences between predicted values and actual values
Optimization target: Our goal is to find parameters that minimize this function
Convex property: RSS is a convex function in linear regression, which guarantees a unique global minimum
→ because of this, we can efficiently find the optimal solution using closed-form equations or gradient-based methods without worrying about getting stuck in suboptimal local minima.
In the case D = 1:
We consider a simple linear regression model with one input variable:
\hat{y}_n = w_0 + w_1 x_n
where w_0 is the intercept and w_1 is the slope.
Since the training objective is to minimize the Residual Sum of Squares (RSS), we search for stationary points, i.e., parameter values where the gradient of the objective is zero:
\nabla_{\tilde{w}} \mathrm{RSS}(\tilde{w}) = 0
In linear regression, the RSS objective function is convex, which means it has a bowl-shaped curve with a single global minimum. This convex property guarantees that any stationary point (where the gradient equals zero) must be the global minimum, not a local minimum or maximum. Therefore, when we solve ∇RSS = 0, we can be confident that the solution we find is the optimal one that minimizes prediction error.
This is done by taking partial derivatives of RSS with respect to each parameter.
Setting the derivatives to zero gives:
\frac{\partial \mathrm{RSS}}{\partial w_0} = \sum_n (w_0 + w_1 x_n - y_n) = 0
\frac{\partial \mathrm{RSS}}{\partial w_1} = \sum_n (w_0 + w_1 x_n - y_n)\,x_n = 0
which yields the normal equations
n\,w_0 + w_1 \sum_n x_n = \sum_n y_n
w_0 \sum_n x_n + w_1 \sum_n x_n^2 = \sum_n x_n y_n
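A small sketch that builds and solves these two normal equations for (w0, w1); the toy x and y values are illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])       # toy inputs
y = np.array([2.1, 3.9, 6.2, 8.1])       # toy targets

n = len(x)
# normal equations:  n*w0      + (sum x)*w1   = sum y
#                    (sum x)*w0 + (sum x^2)*w1 = sum x*y
A = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
b = np.array([y.sum(), (x * y).sum()])

w0, w1 = np.linalg.solve(A, b)           # intercept and slope
print(w0, w1)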
Why are the stationary points the minimum?
In linear regression with a convex RSS objective function, any stationary point where the gradient equals zero is guaranteed to be a global minimum, not a local minimum or saddle point. This is because convex functions have the mathematical property that they curve upward everywhere, creating a single bowl-shaped surface with no valleys or peaks. Therefore, when we solve the normal equations by setting ∇RSS = 0, we can be confident that the solution represents the optimal parameters that minimize prediction error across all possible parameter values.
In general (extending from D = 1 to arbitrary D), the same argument applies, and the minimizer is
\tilde{w}^* = (\tilde{X}^{\top}\tilde{X})^{-1}\tilde{X}^{\top} y
What if X̃ᵀX̃ is not invertible?
Then there is no unique solution (there may be infinitely many minimizers).
To resolve this issue:
1.
Add a small positive constant to the diagonal so that no eigenvalue is zero.
→ After eigendecomposition, inverting X̃ᵀX̃ amounts to inverting its eigenvalues, so any zero eigenvalue makes the inverse undefined.
If the matrix is singular (not invertible), we therefore add a small positive constant λ to the diagonal elements, forming (X̃ᵀX̃ + λI)⁻¹. This technique, known as ridge regression or L2 regularization, ensures that all eigenvalues are positive and the matrix becomes invertible.
λ is a hyper-parameter that can be tuned by cross-validation.
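A rough sketch of the resulting closed form, with an optional λ on the diagonal; the helper name fit_linear and the toy data are my own, not from the lecture:

import numpy as np

def fit_linear(X, y, lam=0.0):
    # augment with a column of ones for the intercept
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    d = X_tilde.shape[1]
    # w* = (X~^T X~ + lam * I)^{-1} X~^T y   (lam = 0 gives plain least squares)
    A = X_tilde.T @ X_tilde + lam * np.eye(d)
    return np.linalg.solve(A, X_tilde.T @ y)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(fit_linear(X, y))            # ordinary least squares
print(fit_linear(X, y, lam=1.0))   # ridge: shrinks parameters, matrix always invertible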

Parametric Methods vs. Non-parametric Methods

Parametric Methods
Parametric methods assume a specific functional form for the model (e.g., linear function f(x) = wx + b) with a fixed number of parameters that do not grow with the size of the training data. Once trained, the model is fully defined by these parameters, making predictions fast and memory-efficient, but potentially limited if the assumed form doesn't match the true underlying relationship.
Non-parametric Methods
Non-parametric methods do not assume a fixed functional form and instead let the data determine the model's complexity, with the number of parameters typically growing with the training set size. Examples include k-Nearest Neighbors (k-NN) and kernel regression, which can capture more complex patterns but require storing training data and may be computationally expensive during prediction.
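For contrast, a minimal k-NN regression sketch: the "model" is just the stored training set, and prediction means averaging the k nearest targets (k and the toy data are made up):

import numpy as np

def knn_regress(x_query, X_train, y_train, k=3):
    # non-parametric: keep all training data and average the k nearest targets
    dists = np.abs(X_train - x_query)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

X_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(knn_regress(2.5, X_train, y_train, k=3))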

Linear Regression with Nonlinear Basis

When a linear model (a straight line) is not a good fit, the solution is to use nonlinearly transformed features:
1.
Use a nonlinear mapping
2.
Then apply linear regression
Nonlinear Mapping
Regression with Nonlinear Basis
We use essentially the same equations as in linear regression; the only difference is that the inputs are first passed through a nonlinear mapping function.
Example Applications
Polynomial Basis: φ(x) = [1, x, x², x³, ..., xᴹ] allows fitting polynomial curves of degree M
Radial Basis Functions (RBF): φ(x) = exp(-||x - μ||²/2σ²) places Gaussian-shaped basis functions centered at different locations
Fourier Basis: φ(x) = [1, sin(x), cos(x), sin(2x), cos(2x), ...] useful for periodic patterns
Key Insight
The model remains linear in the parameters w̃, even though the relationship between input x and output y is nonlinear. This allows us to still use the closed-form solution w̃* = (X̃ᵀX̃)⁻¹X̃ᵀy, where X̃ now contains the transformed features φ(x) instead of the original features x.
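A sketch using the polynomial basis: the features are nonlinear in x, but the fit is still the same closed-form solution (the degree, toy data, and helper names are illustrative):

import numpy as np

def poly_features(x, M):
    # phi(x) = [1, x, x^2, ..., x^M] for a 1-D input array x
    return np.vstack([x**m for m in range(M + 1)]).T

def fit(Phi, y, lam=0.0):
    # same closed form as before, with Phi in place of the raw features
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy nonlinear toy data

Phi = poly_features(x, M=3)
w = fit(Phi, y)
y_hat = Phi @ w      # predictions: linear in w, nonlinear in x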
Example
Why not use a FANCY linear mapping?
For example,
A linear feature map, no matter how complicated it looks, is still just a linear transformation given by multiplying by a matrix (A). In other words,
\phi(x) = Ax
Linear regression or linear classification models fundamentally use expressions of the form (w^T x). When we apply a linear feature map, it becomes
w^T A x = (A^T w)^T x
This is equivalent to simply redefining the parameter as w' = A^T w, which means that the set of functions the model can represent does not actually change.
Therefore, even with a “fancy” linear feature map, the model can still only represent linear decision boundaries such as lines or hyperplanes in higher dimensions.
Operations like rotating coordinates, scaling, or taking linear combinations all stay within the same linear function class, so the model still cannot learn nonlinear patterns such as XOR or circular/donut-shaped structures.
In conclusion, a linear feature map may look sophisticated, but it does not add new expressive power to the model.
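A quick numerical check of this identity, with a random matrix A standing in for a "fancy" linear feature map:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # linear feature map phi(x) = A x
w = rng.standard_normal(5)        # parameters in the feature space
x = rng.standard_normal(3)

lhs = w @ (A @ x)                 # w^T A x
rhs = (A.T @ w) @ x               # (A^T w)^T x  -- the same function of x
print(np.isclose(lhs, rhs))       # True: the feature map is absorbed into w' = A^T w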

Overfitting

Overfitting occurs when a model learns the training data too well, including its noise and random fluctuations, rather than capturing the underlying pattern. This results in excellent performance on training data but poor generalization to new, unseen data.
Understanding Overfitting through the Polynomial Example:
Low Complexity (M=0, straight line): The model is too simple and cannot capture the curved pattern in the data. This is called underfitting - the model performs poorly on both training and test data because it lacks the capacity to represent the true relationship.
Appropriate Complexity (M=3): The polynomial of degree 3 captures the general trend of the data well without fitting to every individual point. This model balances bias (underfitting) and variance (overfitting), resulting in good performance on both training and test data.
High Complexity (M=9): The 9th-degree polynomial passes through or very close to every training point, achieving near-zero training error. However, between the training points, the curve oscillates wildly to accommodate the noise. When applied to test data, this model performs poorly because it has memorized the training data rather than learned the underlying pattern.
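To make the M = 0 / 3 / 9 comparison concrete, a rough sketch using np.polyfit on noisy sinusoidal toy data (the dataset is an assumption, not the lecture's exact example); typically M = 9 gives near-zero training error but a much larger test error:

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test  = np.linspace(0, 1, 100)
y_test  = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(100)

for M in (0, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=M)             # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    print(M, train_err, test_err)   # M = 9: tiny training error, larger test error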
Key Indicators of Overfitting:
Training error becomes very small or zero
Test error remains high or increases
Large gap between training and test performance
Model parameters have very large magnitudes
The fitted function shows unrealistic oscillations or extreme behavior
Why Does Overfitting Happen?
Model Complexity: Using too many features or too high polynomial degree gives the model excessive flexibility to fit noise
Insufficient Training Data: With few data points relative to model complexity, the model can perfectly fit the training data without learning the true pattern
Training for Too Long: In iterative methods, continuing to train beyond the optimal point can lead to overfitting
Solutions to Prevent Overfitting:
1.
More Training Data: Collect additional data to better represent the true underlying pattern
2.
Model Selection:
Cross-validation: Use validation data to select the appropriate model complexity
3.
Regularization: Add a penalty term (like ridge regression with λ) to discourage large parameter values
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function that discourages overly complex models. Instead of just minimizing the RSS (Residual Sum of Squares), we now minimize:
\text{Objective} = \text{RSS} + \lambda \sum_{j=1}^{M} w_j^2
where λ (lambda) is the regularization parameter that controls the strength of the penalty.
How Regularization Works
The Penalty Term: The term λ∑w²ⱼ penalizes large parameter values. When parameters become very large (as in the M=9 overfitting case), this penalty term increases significantly, forcing the optimization to find a balance between fitting the data well and keeping parameters small.
Effect of λ:
When λ = 0: No regularization, which can lead to overfitting with high-degree polynomials
When λ is small: Light penalty, allowing some flexibility while reducing extreme parameter values
When λ is large: Strong penalty, forcing parameters toward zero and creating a simpler, smoother model
When λ → ∞: Parameters are forced to near zero, resulting in underfitting
Regularization transforms the problem from "fit the training data as perfectly as possible" to "fit the training data reasonably well while keeping the model simple." This trade-off, controlled by λ, is crucial for building models that generalize well to new data. The optimal λ value is typically found using cross-validation on a validation dataset.
Mathematically, with regularization, the closed-form solution becomes:
w^* = (\tilde{X}^{\top}\tilde{X} + \lambda I)^{-1}\tilde{X}^{\top} y
Notice how adding λI to the matrix not only helps with invertibility (as mentioned earlier) but also shrinks the parameter values toward zero, creating a smoother, more generalizable model.
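A sketch of this regularized closed form on degree-9 polynomial features, showing how the parameter norm shrinks as λ grows (the toy data and λ values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)

Phi = np.vstack([x**m for m in range(10)]).T        # degree-9 polynomial features

for lam in (0.0, 1e-3, 1.0):
    # lam = 0 is unregularized and numerically ill-conditioned here, giving huge weights
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)
    print(lam, np.linalg.norm(w))    # parameter norm shrinks as lambda grows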
Why Regularization Doesn't Change the Form of the Solution
When finding w*, the equation has the same form as in plain linear regression; λI is simply added so that X̃ᵀX̃ is always invertible.
Summary

General Idea to derive ML algorithms

Let's review all the key concepts we've covered so far.