Changyu Lee

Linear Regression

Published at
2025/12/29
Last edited time
2025/12/29 09:43
Created
2025/12/23 08:24
Section
ML
Status
Done
Series
From the Bottom
Tags
Lecture Summary
AI summary
Linear regression predicts continuous outcomes, contrasting with classification's discrete outputs. It involves minimizing prediction errors using metrics like squared and absolute errors. The model's parameters, intercept and slope, are optimized to reduce the Residual Sum of Squares (RSS). Overfitting can occur with excessive model complexity, leading to poor generalization. Solutions include collecting more data, model selection through cross-validation, and regularization to prevent large parameter values, ensuring a balance between fitting the data and maintaining model simplicity.
Keywords
ML
Overfitting
Linear Regression
Language
ENG
Week
Linear regression predicts continuous numerical values by finding optimal parameters (intercept and slope) that minimize prediction error
The Residual Sum of Squares (RSS) serves as the objective function, which is convex and guarantees a unique global minimum
Closed-form solution exists for finding optimal parameters: w* = (X̃ᵀX̃)⁻¹X̃ᵀy
Nonlinear basis functions (polynomial, RBF, Fourier) allow linear regression to model nonlinear relationships while remaining linear in parameters
Overfitting occurs when models become too complex, fitting training data noise rather than underlying patterns, resulting in poor generalization
Solutions to overfitting include collecting more data, model selection through cross-validation, and regularization (ridge regression)
Regularization adds penalty term λ∑w²ⱼ to discourage large parameters, balancing data fitting with model simplicity
Linear feature maps do not increase model expressiveness since they can be absorbed into parameter redefinition

Linear Regression

Key differences from classification
Continuous vs. discrete output values
Classification outputs discrete class labels (e.g., spam/not spam), while regression outputs continuous numerical values (e.g., predicting house prices or temperature)
Different error measurement approaches (linear regression inherently produces nonzero prediction errors, so error is measured by how far predictions are from the true values rather than by exact match)
Distinct learning algorithms and optimization techniques
Intercept & Slope
The intercept (b) represents the y-value when x=0, while the slope (w) indicates how much y changes for each unit increase in x
In linear regression, we aim to find optimal values for these parameters that minimize prediction error
How to measure error
The 0/1 classification error is inappropriate for continuous outcomes.
1.
Squared Error:
\text{SE}(y, \hat{y}) = (y - \hat{y})^2
where y is the actual value and ŷ is the predicted value. Squaring ensures all errors are positive and penalizes larger errors more heavily.
2.
Absolute Error:
\text{AE}(y, \hat{y}) = |y - \hat{y}|
where y is the actual value and ŷ is the predicted value. The absolute value ensures all errors are positive, but unlike squared error, it penalizes all errors linearly regardless of their magnitude.
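A minimal NumPy sketch of these two error measures (the toy values for y and ŷ are made up for illustration):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # actual values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

squared_error  = (y_true - y_pred) ** 2     # SE: penalizes large errors quadratically
absolute_error = np.abs(y_true - y_pred)    # AE: penalizes all errors linearly

print(squared_error.mean())    # mean squared error over the toy points
print(absolute_error.mean())   # mean absolute error over the toy points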

Setup & Algorithm

Goal: Pick the model (i.e., its unknown parameters) that minimizes the average/total prediction error on the training dataset
→ we cannot use test dataset while training
Setups
Training Data: A set of n input-output pairs {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)} where xᵢ represents features and yᵢ represents the target value
Model: Linear function f(x) = wx + b, where w is the weight (slope) and b is the bias (intercept)
Note: for notational convenience, append 1 to each x as the first feature:
\tilde{x} = [\,1 \;\; x_1 \;\; x_2 \;\; \ldots \;\; x_D\,]^{\top}
and let
\tilde{w} = [\,w_0 \;\; w_1 \;\; w_2 \;\; \ldots \;\; w_D\,]^{\top}
be a concise representation of all D + 1 parameters.
The model becomes simply
f(x) = \tilde{w}^{\top}\tilde{x}
Goal: Minimize total squared error
\mathrm{RSS}(\tilde{w}) = \|\tilde{X}\tilde{w} - y\|_2^2
\tilde{w}^* = \arg\min_{\tilde{w} \in \mathbb{R}^{D+1}} \mathrm{RSS}(\tilde{w})
\mathrm{RSS}(\tilde{w}) = \sum_{i=1}^{n} (y_i - f(x_i))^2 = \sum_{i=1}^{n} (y_i - \tilde{w}^{\top}\tilde{x}_i)^2
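A minimal sketch of computing RSS with the augmented design matrix X̃; the toy data and the candidate w̃ are made up:

import numpy as np

# toy data: n = 4 points, D = 1 feature (values are illustrative)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# append a leading 1 to each x, giving X_tilde of shape (n, D + 1)
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

def rss(w_tilde):
    # RSS(w~) = || X~ w~ - y ||_2^2
    residual = X_tilde @ w_tilde - y
    return residual @ residual

print(rss(np.array([0.0, 2.0])))  # RSS of the candidate model f(x) = 0 + 2x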
Objective Function
The objective function (also called loss function or cost function) is a mathematical function that quantifies how well our model's predictions match the actual target values in the training data.
For linear regression, we use the Residual Sum of Squares (RSS) defined above as our objective function.
Key points about the objective function:
Measures prediction error: It calculates the sum of squared differences between predicted values and actual values
Optimization target: Our goal is to find parameters that minimize this function
Convex property: RSS is a convex function in linear regression, which guarantees a unique global minimum
→ because of this, we can efficiently find the optimal solution using closed-form equations or gradient-based methods without worrying about getting stuck in suboptimal local minima.
In the case D = 1:
We consider a simple linear regression model with one input variable:
\hat{y}_n = w_0 + w_1 x_n
where w_0 is the intercept and w_1 is the slope.
Since the training objective is to minimize the Residual Sum of Squares (RSS), we search for stationary points, i.e., parameter values where the gradient of the objective is zero:
\nabla_{\tilde{w}} \mathrm{RSS}(\tilde{w}) = 0
In linear regression, the RSS objective function is convex, which means it has a bowl-shaped curve with a single global minimum. This convex property guarantees that any stationary point (where the gradient equals zero) must be the global minimum, not a local minimum or maximum. Therefore, when we solve ∇RSS = 0, we can be confident that the solution we find is the optimal one that minimizes prediction error.
This is done by taking partial derivatives of RSS with respect to each parameter.
Setting the derivatives to zero gives:
\frac{\partial \mathrm{RSS}}{\partial w_0} = \sum_n (w_0 + w_1 x_n - y_n) = 0
\frac{\partial \mathrm{RSS}}{\partial w_1} = \sum_n (w_0 + w_1 x_n - y_n)\,x_n = 0
which yields the normal equations
n\,w_0 + w_1 \sum_n x_n = \sum_n y_n
w_0 \sum_n x_n + w_1 \sum_n x_n^2 = \sum_n x_n y_n
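A small sketch that builds and solves these two normal equations for (w0, w1); the toy x and y values are illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])       # toy inputs
y = np.array([2.1, 3.9, 6.2, 8.1])       # toy targets

n = len(x)
# normal equations:  n*w0      + (sum x)*w1   = sum y
#                    (sum x)*w0 + (sum x^2)*w1 = sum x*y
A = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
b = np.array([y.sum(), (x * y).sum()])

w0, w1 = np.linalg.solve(A, b)           # intercept and slope
print(w0, w1)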
Why are the stationary points the minimum?
In linear regression with a convex RSS objective function, any stationary point where the gradient equals zero is guaranteed to be a global minimum, not a local minimum or saddle point. This is because convex functions have the mathematical property that they curve upward everywhere, creating a single bowl-shaped surface with no valleys or peaks. Therefore, when we solve the normal equations by setting ∇RSS = 0, we can be confident that the solution represents the optimal parameters that minimize prediction error across all possible parameter values.
In general (extending from D = 1 to arbitrary D), the same argument applies, and the minimizer is
\tilde{w}^* = (\tilde{X}^{\top}\tilde{X})^{-1}\tilde{X}^{\top} y
What if X̃ᵀX̃ is not invertible?
Then there is no unique solution (there may be infinitely many minimizers).
To resolve this issue:
1.
Add a small positive constant to the diagonal so that no eigenvalue is zero.
→ After eigendecomposition, inverting X̃ᵀX̃ amounts to inverting its eigenvalues, so any zero eigenvalue makes the inverse undefined.
If the matrix is singular (not invertible), we therefore add a small positive constant λ to the diagonal elements, forming (X̃ᵀX̃ + λI)⁻¹. This technique, known as ridge regression or L2 regularization, ensures that all eigenvalues are positive and the matrix becomes invertible.
λ is a hyper-parameter that can be tuned by cross-validation.
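A rough sketch of the resulting closed form, with an optional λ on the diagonal; the helper name fit_linear and the toy data are my own, not from the lecture:

import numpy as np

def fit_linear(X, y, lam=0.0):
    # augment with a column of ones for the intercept
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    d = X_tilde.shape[1]
    # w* = (X~^T X~ + lam * I)^{-1} X~^T y   (lam = 0 gives plain least squares)
    A = X_tilde.T @ X_tilde + lam * np.eye(d)
    return np.linalg.solve(A, X_tilde.T @ y)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(fit_linear(X, y))            # ordinary least squares
print(fit_linear(X, y, lam=1.0))   # ridge: shrinks parameters, matrix always invertible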

Parametric Methods vs. Non-parametric Methods

Parametric Methods
Parametric methods assume a specific functional form for the model (e.g., linear function f(x) = wx + b) with a fixed number of parameters that do not grow with the size of the training data. Once trained, the model is fully defined by these parameters, making predictions fast and memory-efficient, but potentially limited if the assumed form doesn't match the true underlying relationship.
Non-parametric Methods
Non-parametric methods do not assume a fixed functional form and instead let the data determine the model's complexity, with the number of parameters typically growing with the training set size. Examples include k-Nearest Neighbors (k-NN) and kernel regression, which can capture more complex patterns but require storing training data and may be computationally expensive during prediction.
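For contrast, a minimal k-NN regression sketch: the "model" is just the stored training set, and prediction means averaging the k nearest targets (k and the toy data are made up):

import numpy as np

def knn_regress(x_query, X_train, y_train, k=3):
    # non-parametric: keep all training data and average the k nearest targets
    dists = np.abs(X_train - x_query)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

X_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(knn_regress(2.5, X_train, y_train, k=3))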

Linear Regression with Nonlinear Basis

When a linear model (a straight line) is not a good fit, the solution is to use nonlinearly transformed features:
1.
Use a nonlinear mapping
2.
Then apply linear regression
Nonlinear Mapping
Regression with Nonlinear Basis
We use essentially the same equations as in linear regression; the only difference is that the inputs are first passed through a nonlinear mapping function.
Example Applications
Polynomial Basis: φ(x) = [1, x, x², x³, ..., xᴹ] allows fitting polynomial curves of degree M
Radial Basis Functions (RBF): φ(x) = exp(-||x - μ||²/2σ²) places Gaussian-shaped basis functions centered at different locations
Fourier Basis: φ(x) = [1, sin(x), cos(x), sin(2x), cos(2x), ...] useful for periodic patterns
Key Insight
The model remains linear in the parameters w̃, even though the relationship between input x and output y is nonlinear. This allows us to still use the closed-form solution w̃* = (X̃ᵀX̃)⁻¹X̃ᵀy, where X̃ now contains the transformed features φ(x) instead of the original features x.
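A sketch using the polynomial basis: the features are nonlinear in x, but the fit is still the same closed-form solution (the degree, toy data, and helper names are illustrative):

import numpy as np

def poly_features(x, M):
    # phi(x) = [1, x, x^2, ..., x^M] for a 1-D input array x
    return np.vstack([x**m for m in range(M + 1)]).T

def fit(Phi, y, lam=0.0):
    # same closed form as before, with Phi in place of the raw features
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy nonlinear toy data

Phi = poly_features(x, M=3)
w = fit(Phi, y)
y_hat = Phi @ w      # predictions: linear in w, nonlinear in x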
Example
Why not use a FANCY linear mapping?
For example,
A linear feature map, no matter how complicated it looks, is still just a linear transformation given by multiplying by a matrix (A). In other words,
\phi(x) = Ax
Linear regression or linear classification models fundamentally use expressions of the form (w^T x). When we apply a linear feature map, it becomes
w^T A x = (A^T w)^T x
This is equivalent to simply redefining the parameter as w' = A^T w, which means that the set of functions the model can represent does not actually change.
Therefore, even with a “fancy” linear feature map, the model can still only represent linear decision boundaries such as lines or hyperplanes in higher dimensions.
Operations like rotating coordinates, scaling, or taking linear combinations all stay within the same linear function class, so the model still cannot learn nonlinear patterns such as XOR or circular/donut-shaped structures.
In conclusion, a linear feature map may look sophisticated, but it does not add new expressive power to the model.
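A quick numerical check of this identity, with a random matrix A standing in for a "fancy" linear feature map:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # linear feature map phi(x) = A x
w = rng.standard_normal(5)        # parameters in the feature space
x = rng.standard_normal(3)

lhs = w @ (A @ x)                 # w^T A x
rhs = (A.T @ w) @ x               # (A^T w)^T x  -- the same function of x
print(np.isclose(lhs, rhs))       # True: the feature map is absorbed into w' = A^T w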

Overfitting

Overfitting occurs when a model learns the training data too well, including its noise and random fluctuations, rather than capturing the underlying pattern. This results in excellent performance on training data but poor generalization to new, unseen data.
Understanding Overfitting through the Polynomial Example:
Low Complexity (M=0, straight line): The model is too simple and cannot capture the curved pattern in the data. This is called underfitting - the model performs poorly on both training and test data because it lacks the capacity to represent the true relationship.
Appropriate Complexity (M=3): The polynomial of degree 3 captures the general trend of the data well without fitting to every individual point. This model balances bias (underfitting) and variance (overfitting), resulting in good performance on both training and test data.
High Complexity (M=9): The 9th-degree polynomial passes through or very close to every training point, achieving near-zero training error. However, between the training points, the curve oscillates wildly to accommodate the noise. When applied to test data, this model performs poorly because it has memorized the training data rather than learned the underlying pattern.
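To make the M = 0 / 3 / 9 comparison concrete, a rough sketch using np.polyfit on noisy sinusoidal toy data (the dataset is an assumption, not the lecture's exact example); typically M = 9 gives near-zero training error but a much larger test error:

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test  = np.linspace(0, 1, 100)
y_test  = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(100)

for M in (0, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=M)             # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    print(M, train_err, test_err)   # M = 9: tiny training error, larger test error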
Key Indicators of Overfitting:
Training error becomes very small or zero
Test error remains high or increases
Large gap between training and test performance
Model parameters have very large magnitudes
The fitted function shows unrealistic oscillations or extreme behavior
Why Does Overfitting Happen?
Model Complexity: Using too many features or too high polynomial degree gives the model excessive flexibility to fit noise
Insufficient Training Data: With few data points relative to model complexity, the model can perfectly fit the training data without learning the true pattern
Training for Too Long: In iterative methods, continuing to train beyond the optimal point can lead to overfitting
Solutions to Prevent Overfitting:
1.
More Training Data: Collect additional data to better represent the true underlying pattern
2.
Model Selection:
Cross-validation: Use validation data to select the appropriate model complexity
3.
Regularization: Add a penalty term (like ridge regression with λ) to discourage large parameter values
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function that discourages overly complex models. Instead of just minimizing the RSS (Residual Sum of Squares), we now minimize:
\text{Objective} = \text{RSS} + \lambda \sum_{j=1}^{M} w_j^2
where λ (lambda) is the regularization parameter that controls the strength of the penalty.
How Regularization Works
The Penalty Term: The term λ∑w²ⱼ penalizes large parameter values. When parameters become very large (as in the M=9 overfitting case), this penalty term increases significantly, forcing the optimization to find a balance between fitting the data well and keeping parameters small.
Effect of λ:
When λ = 0: No regularization, which can lead to overfitting with high-degree polynomials
When λ is small: Light penalty, allowing some flexibility while reducing extreme parameter values
When λ is large: Strong penalty, forcing parameters toward zero and creating a simpler, smoother model
When λ → ∞: Parameters are forced to near zero, resulting in underfitting
Regularization transforms the problem from "fit the training data as perfectly as possible" to "fit the training data reasonably well while keeping the model simple." This trade-off, controlled by λ, is crucial for building models that generalize well to new data. The optimal λ value is typically found using cross-validation on a validation dataset.
Mathematically, with regularization, the closed-form solution becomes:
w^* = (\tilde{X}^{\top}\tilde{X} + \lambda I)^{-1}\tilde{X}^{\top} y
Notice how adding λI to the matrix not only helps with invertibility (as mentioned earlier) but also shrinks the parameter values toward zero, creating a smoother, more generalizable model.
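A sketch of this regularized closed form on degree-9 polynomial features, showing how the parameter norm shrinks as λ grows (the toy data and λ values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)

Phi = np.vstack([x**m for m in range(10)]).T        # degree-9 polynomial features

for lam in (0.0, 1e-3, 1.0):
    # lam = 0 is unregularized and numerically ill-conditioned here, giving huge weights
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)
    print(lam, np.linalg.norm(w))    # parameter norm shrinks as lambda grows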
Why Regularization Doesn't Change the Form of the Solution
When finding w*, the equation has the same form as in plain linear regression; λI is simply added so that X̃ᵀX̃ is always invertible.
Summary

General Idea to derive ML algorithms

Let's review all the key concepts we've covered so far.