Changyu Lee

Multiclass Classification & Kernel Method

Published at
2026/01/03
Section
ML
Status
Done
Series
From the Bottom
Tags
Lecture Summary
AI summary
Multiclass classification extends binary classification to multiple classes using softmax/cross-entropy loss, with strategies like One-vs-All and One-vs-One. Kernel methods enable linear algorithms to efficiently handle high-dimensional data through the kernel trick, allowing computations in implicit feature spaces without explicit transformations. Valid kernels must be symmetric and positive semi-definite, and the Gram matrix encodes pairwise evaluations essential for kernel-based algorithms.
Keywords
ML
Language
ENG
Week
4
Multiclass classification extends binary classification to $K$ classes, with each class $k$ having its own weight vector $w_k$ and predictions made by $\hat{y} = \arg\max_k w_k^\top x$
Cross-entropy loss (softmax loss) generalizes logistic loss to multiple classes: $L(y, x) = -\log P(y \mid x) = -w_y^\top x + \log\left(\sum_{j=1}^K e^{w_j^\top x}\right)$, providing probability distributions over all classes
Reduction to binary classification enables multiclass algorithms through One-vs-All ($K$ classifiers), One-vs-One ($K(K-1)/2$ classifiers), or tree-based approaches ($\log K$ evaluations)
The kernel method allows linear algorithms to work in high-dimensional feature spaces without explicit computation through kernel functions $K(x, x') = \langle \phi(x), \phi(x') \rangle$
The Gram matrix encodes all pairwise kernel evaluations and must be positive semi-definite for valid kernels, enabling kernel-based algorithms to operate entirely through inner products
The kernel trick leverages the representer theorem to express solutions as $w = \sum_{i=1}^n \alpha_i y_i \phi(x_i)$, making predictions via $f(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x)$ without computing $\phi(x)$ explicitly
Valid kernels must satisfy Mercer's theorem (symmetric and positive semi-definite), with common examples including linear, polynomial, and RBF (Gaussian) kernels
Kernel methods enable efficient computation in infinite-dimensional spaces but face computational challenges with $O(n^2)$ storage and $O(n^3)$ inversion costs for large datasets

Multiclass Classification & Kernel Method

Multiclass Classification

Multinomial Classification

Recall
How can we turn binary classification into multiclass classification?
1. What should a linear model look like for multiclass tasks?
Think of $w_k^\top x$ as a score for class $k$
Each class $k$ has its own weight vector $w_k$, and we predict the class with the highest score: $\hat{y} = \arg\max_k w_k^\top x$
This approach naturally extends the binary linear classifier to handle multiple classes simultaneously
The decision boundaries are now defined by pairwise hyperplanes between classes: $w_i^\top x = w_j^\top x$
2. How do we generalize perceptron / hinge / logistic loss? (Focus on logistic loss)
Multiclass logistic loss a.k.a. Cross-Entropy Loss
To estimate the probability distribution over all $K$ classes, we define the probability of class $k$ for a given input $x$ as:
$$P(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}$$
This is the softmax function, which generalizes the logistic sigmoid to multiple classes. It ensures all class probabilities are positive and sum to 1, creating a valid probability distribution.
For multiclass classification with $K$ classes, the cross-entropy loss can be written in several equivalent forms:
$$L(y, x) = -\log P(y \mid x) = -\log\left(\frac{e^{w_y^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}\right)$$
Expanding using logarithm properties:
$$L(y, x) = -w_y^\top x + \log\left(\sum_{j=1}^K e^{w_j^\top x}\right)$$
With one-hot encoding where $y_k = 1$ for the true class and $y_j = 0$ otherwise:
$$L(y, \hat{y}) = -\sum_{k=1}^K y_k \log(\hat{y}_k) = -\sum_{k=1}^K y_k \log\left(\frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}\right)$$
For the entire training set:
$$F(w) = -\frac{1}{n} \sum_{i=1}^n \log P(y_i \mid x_i) = \frac{1}{n} \sum_{i=1}^n \left[-w_{y_i}^\top x_i + \log\left(\sum_{j=1}^K e^{w_j^\top x_i}\right)\right]$$
Alternatively, we can write the loss in terms of the difference between weight vectors:
$$F(W) = -w_y^\top x + \log\left(\sum_{j=1}^K e^{w_j^\top x}\right) = \log\left(\sum_{j=1}^K e^{(w_j - w_y)^\top x}\right)$$
This formulation emphasizes that only the relative differences between weight vectors matter for the predicted probabilities, not their absolute values.
The training objective minimizes the negative log-likelihood over all training examples:
$$\hat{W} = \arg\min_W \frac{1}{n} \sum_{i=1}^n \left[-\log P(y_i \mid x_i)\right] = \arg\min_W \frac{1}{n} \sum_{i=1}^n \log\left(\sum_{j=1}^K e^{(w_j - w_{y_i})^\top x_i}\right)$$
This minimizes the sum of multiclass logistic losses. Each example contributes a loss based on how well the model predicts its true class relative to all other classes.
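As a concrete check of these formulas, here is a minimal NumPy sketch (the helper names are my own) that computes the softmax probabilities and the average cross-entropy loss for a weight matrix $W$ of shape $(K, d)$:

```python
import numpy as np

def softmax(scores):
    # subtract the row-wise max for numerical stability
    z = scores - scores.max(axis=1, keepdims=True)
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, X, y):
    """Average multiclass logistic loss F(W).
    W: (K, d) weights, X: (n, d) inputs, y: (n,) integer labels in {0, ..., K-1}."""
    probs = softmax(X @ W.T)                              # (n, K) softmax probabilities
    return -np.mean(np.log(probs[np.arange(len(X)), y]))  # -1/n * sum_i log P(y_i | x_i)
```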
3. Optimization (Apply SGD)
What is the gradient of $F_n(W)$?
To find the gradient of the multiclass logistic loss function $F_n(W)$ with respect to the weight matrix $W$, we need to compute the derivative of the loss with respect to each class's weight vector $w_k$.
Starting from the loss function:
$$F_n(W) = -w_y^\top x + \log\left(\sum_{j=1}^K e^{w_j^\top x}\right)$$
The gradient with respect to $w_k$ is:
$$\nabla_{w_k} F_n(W) = \begin{cases} -x + \dfrac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}} \, x & \text{if } k = y \\[2ex] \dfrac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}} \, x & \text{if } k \neq y \end{cases}$$
This can be written more compactly as:
$$\nabla_{w_k} F_n(W) = \left(P(y = k \mid x) - \mathbb{1}[k = y]\right) x$$
where $P(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}$ is the softmax probability for class $k$.
So the SGD algorithm repeatedly samples a training example, computes its softmax probabilities, and takes a gradient step on every $w_k$ using the expression above; a sketch follows below.
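A minimal, self-contained sketch of the resulting SGD loop (the learning rate and epoch count are arbitrary illustrative choices):

```python
import numpy as np

def softmax_vec(s):
    # numerically stable softmax of a single score vector
    e = np.exp(s - s.max())
    return e / e.sum()

def sgd_softmax(W, X, y, lr=0.1, epochs=10):
    """SGD on the multiclass logistic loss.
    W: (K, d) weights, X: (n, d) inputs, y: (n,) integer labels."""
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            p = softmax_vec(W @ X[i])      # softmax probabilities for x_i
            p[y[i]] -= 1.0                 # P(y = k | x_i) - 1[k = y_i]
            W -= lr * np.outer(p, X[i])    # gradient step on every w_k at once
    return W
```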
Once we have learned the weight matrix $W$ (where each row $w_k$ corresponds to the weight vector for class $k$), we have several options for making predictions on new data points:
Deterministic Prediction: The most straightforward approach is to make a deterministic prediction by selecting the class with the highest score: $\hat{y} = \arg\max_{k \in [K]} w_k^\top x$. This means we compute the score $w_k^\top x$ for each class $k$ and predict the class that achieves the maximum score. This approach is simple, fast, and gives a single definite answer.
Randomized (Probabilistic) Prediction: Alternatively, we can make a randomized prediction by sampling from the probability distribution defined by the softmax function: $P(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}$. In this approach, we compute the probability for each class using the softmax function, then draw a random sample from this distribution. Classes with higher probabilities are more likely to be selected, but there's still a chance of selecting lower-probability classes.
This probabilistic approach is useful when we want to incorporate uncertainty into our predictions, need diverse outputs for applications like exploration in reinforcement learning, want to avoid always choosing the same class when multiple classes have similar probabilities, or are using the model as part of a larger probabilistic system.
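A short sketch contrasting the two prediction rules, assuming the same $(K, d)$ weight matrix as above (helper names are my own):

```python
import numpy as np

def softmax_vec(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def predict_argmax(W, x):
    # deterministic: pick the class with the highest score w_k^T x
    return int(np.argmax(W @ x))

def predict_sampled(W, x, rng=np.random.default_rng(0)):
    # randomized: sample a class from the softmax distribution over the scores
    p = softmax_vec(W @ x)
    return int(rng.choice(len(p), p=p))
```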
Generalization of Cross-Entropy Loss
Scoring Function
A scoring function generalizes the notion of class scores beyond simple linear combinations. Instead of just computing $w_k^\top x$ for each class $k$, we can define a more flexible scoring function $s_k(x; w) = f_k(x, w)$ that could be any parameterized function. This allows us to capture more complex relationships between inputs and class predictions, such as using neural networks where $s_k(x; w) = w_k^\top \phi(x)$ with $\phi(x)$ being a learned feature representation.

Reduction to Binary Classification

Is there an even more general and simpler approach to derive multiclass classification algorithms?
→ Given a binary classification algorithm, can we turn it into a multiclass algorithm?
The answer is YES
One-vs-All (OvA): Train $K$ binary classifiers where each classifier $k$ distinguishes class $k$ from all other classes. For prediction, run all $K$ classifiers and choose the class with the highest confidence score: $\hat{y} = \arg\max_k w_k^\top x$. Simple and efficient but suffers from class imbalance since each classifier sees many more negative examples than positive ones (see the sketch after these three strategies).
One-vs-One (OvO): Train $\frac{K(K-1)}{2}$ binary classifiers, one for each pair of classes. Each classifier is trained only on examples from its two classes, ignoring all others. For prediction, use majority voting across all pairwise classifiers. More accurate for well-separated classes and naturally balanced, but requires many more classifiers and slower prediction time.
Tree-based: Organize classes in a hierarchical tree structure where each internal node represents a binary decision between groups of classes. At each node, train a binary classifier that splits the remaining classes into two subsets. For prediction, traverse the tree from root to leaf by making binary decisions at each node. Requires only $\log K$ classifier evaluations at test time, making it very efficient for large $K$, but the tree structure must be carefully designed and errors propagate down the tree.
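As an illustration of the One-vs-All strategy, here is a minimal from-scratch sketch; the base binary learner is a simple logistic classifier trained by gradient descent, and all names and hyperparameters are my own choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ova(X, y, num_classes, lr=0.1, epochs=200):
    """One-vs-All: one binary logistic classifier per class (class k vs. rest)."""
    n, d = X.shape
    W = np.zeros((num_classes, d))
    for k in range(num_classes):
        yk = np.where(y == k, 1.0, -1.0)              # relabel: +1 for class k, -1 otherwise
        for _ in range(epochs):
            margins = yk * (X @ W[k])                 # y_i * w_k^T x_i
            grad = -(yk * sigmoid(-margins)) @ X / n  # gradient of the binary logistic loss
            W[k] -= lr * grad
    return W

def predict_ova(W, X):
    # choose the class whose binary classifier gives the highest score
    return np.argmax(X @ W.T, axis=1)
```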
Comparison

Kernel Method

The Kernel Method is a powerful technique in machine learning that allows us to apply linear algorithms to non-linearly separable data without explicitly computing high-dimensional feature transformations. The key idea is to use a kernel function that computes inner products in a high-dimensional feature space implicitly, avoiding the computational cost of working directly in that space.

Motivation: From Linear to Nonlinear

Recall that linear classifiers work well when data is linearly separable, but many real-world problems require nonlinear decision boundaries. We previously saw that we can apply a nonlinear mapping $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^D$ to transform the input features into a higher-dimensional space where the data becomes linearly separable.
For example, mapping $x = (x_1, x_2)$ to $\phi(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$ allows us to capture polynomial relationships. However, if we want to capture very complex patterns, we might need $D$ to be very large (even infinite), making explicit computation of $\phi(x)$ intractable.

The Kernel Trick

The kernel trick provides an elegant solution: instead of explicitly computing $\phi(x)$, we define a kernel function $K(x, x')$ (often also written with a lowercase $k$) that directly computes the inner product in the feature space:
$$K(x, x') = \langle \phi(x), \phi(x') \rangle = \phi(x)^\top \phi(x')$$
The remarkable property is that we can compute $K(x, x')$ efficiently without ever explicitly computing $\phi(x)$. This allows us to work in very high-dimensional (even infinite-dimensional) feature spaces while maintaining computational efficiency.
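As a quick numerical check of this idea, the sketch below (my own illustration) compares the homogeneous quadratic kernel $K(x, x') = (x^\top x')^2$ on $\mathbb{R}^2$ with an explicit inner product under the feature map $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$; the two agree exactly:

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map for 2-D inputs
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def quad_kernel(x, xp):
    # the same inner product, computed implicitly in the input space
    return (x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp), quad_kernel(x, xp))   # both print 1.0
```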
Gram Matrix
In the context of kernel methods, the Gram matrix (also called the kernel matrix) is a fundamental concept that captures all pairwise kernel evaluations between training examples.
For a training set $\{x_1, x_2, \ldots, x_n\}$ and a kernel function $K(x, x')$, the Gram matrix $G$ is an $n \times n$ symmetric matrix where each entry $G_{ij}$ is defined as:
$$G_{ij} = K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$
In matrix form, this can be written as:
$$G = \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_n) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n, x_1) & K(x_n, x_2) & \cdots & K(x_n, x_n) \end{bmatrix}$$
Key Properties of the Gram Matrix:
Symmetry: Since $K(x_i, x_j) = K(x_j, x_i)$ for any valid kernel, the Gram matrix is symmetric: $G = G^\top$.
Positive Semi-Definiteness: For any valid kernel function, the Gram matrix must be positive semi-definite (PSD), meaning for any vector $c \in \mathbb{R}^n$, we have $c^\top G c \geq 0$. This is equivalent to Mercer's condition and ensures the kernel corresponds to a valid inner product.
Encodes all training information: The Gram matrix contains all the information needed to compute the solution in kernel-based algorithms. Once we have $G$, we never need to explicitly access the feature representations $\phi(x_i)$.
Role in Kernel Algorithms:
Many kernel-based algorithms can be expressed entirely in terms of the Gram matrix. For example, in kernelized ridge regression, the solution takes the form:
$$\alpha = (G + \lambda I)^{-1} y$$
where $\alpha$ is the vector of dual coefficients, $\lambda$ is the regularization parameter, $I$ is the identity matrix, and $y$ is the vector of labels.
To make predictions on a new point $x$, we compute:
$$f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) = k(x)^\top \alpha$$
where $k(x) = [K(x_1, x), K(x_2, x), \ldots, K(x_n, x)]^\top$ is the vector of kernel evaluations between the new point and all training points.
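A compact NumPy sketch of these two formulas (the RBF kernel and the regularization value are illustrative assumptions, and a linear solve is used instead of an explicit inverse):

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    # pairwise RBF kernel values K(a, b) = exp(-gamma * ||a - b||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam=0.1, gamma=1.0):
    G = rbf_gram(X, X, gamma)                            # n x n Gram matrix
    return np.linalg.solve(G + lam * np.eye(len(X)), y)  # alpha = (G + lambda I)^{-1} y

def predict_kernel_ridge(X_train, alpha, X_new, gamma=1.0):
    k = rbf_gram(X_new, X_train, gamma)                  # kernel evaluations k(x)
    return k @ alpha                                     # f(x) = k(x)^T alpha
```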
Computational Considerations:
Computing the Gram matrix requires $O(n^2)$ kernel evaluations and $O(n^2)$ storage, which can be prohibitive for very large datasets.
Many algorithms require inverting or decomposing the Gram matrix, which takes $O(n^3)$ time.
For large-scale problems, approximate methods like random features or Nyström approximation are often used to avoid computing the full Gram matrix.
Example: For a simple 2D dataset with three points and a linear kernel $K(x, x') = x^\top x'$:
If $x_1 = [1, 0]^\top$, $x_2 = [0, 1]^\top$, $x_3 = [1, 1]^\top$, then:
$$G = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}$$
This matrix is symmetric and positive semi-definite, confirming that the linear kernel is valid.
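This can be verified in a few lines (a minimal sketch):

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
G = X @ X.T                    # linear-kernel Gram matrix
print(G)                       # [[1. 0. 1.] [0. 1. 1.] [1. 1. 2.]]
print(np.linalg.eigvalsh(G))   # eigenvalues ~ [0, 1, 3], all >= 0, so G is PSD
```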

Why Kernels Work: The Representer Theorem

Many machine learning algorithms can be expressed entirely in terms of inner products between data points. For example, in the dual formulation of many problems, the solution takes the form:
$$w = \sum_{i=1}^n \alpha_i y_i \phi(x_i)$$
To make a prediction on a new point $x$, we compute:
$$f(x) = w^\top \phi(x) = \sum_{i=1}^n \alpha_i y_i \phi(x_i)^\top \phi(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x)$$
Notice that we only need to compute kernel evaluations $K(x_i, x)$, never the explicit feature vectors $\phi(x)$. This is the essence of the kernel trick.

Valid Kernels: Mercer's Theorem

Not every function can be a valid kernel. For $K(x, x')$ to correspond to an inner product in some feature space, it must satisfy certain conditions. Mercer's theorem states that a function $K$ is a valid kernel if and only if it is:
Symmetric: $K(x, x') = K(x', x)$
Positive semi-definite: For any set of points $\{x_1, \ldots, x_n\}$ and any coefficients $\{c_1, \ldots, c_n\}$, we have $\sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \geq 0$

Kernelized Perceptron

With feature maps
When we apply a feature map $\phi(x)$ to the Perceptron algorithm, we need to work in the transformed feature space. The standard Perceptron update becomes $w^{(t+1)} = w^{(t)} + \eta y_i \phi(x_i)$ when there's a misclassification.
However, if the feature space is high-dimensional (or even infinite-dimensional, as with some kernel functions), explicitly computing and storing $w$ in this space becomes impractical or impossible. This is where the dual formulation of the Perceptron becomes essential.
Dual Representation of the Weight Vector
The key insight is that the weight vector $w$ can always be expressed as a linear combination of the feature-mapped training examples. Starting from $w^{(0)} = 0$ and applying the Perceptron update rule repeatedly, we can write:
$$w = \sum_{i=1}^n \alpha_i y_i \phi(x_i)$$
where $\alpha_i$ represents how many times example $i$ was involved in an update (or more generally, the weight given to that example). This is called the dual representation because instead of storing the primal variables (the weight vector $w$ in feature space), we store the dual variables (the $\alpha_i$ coefficients).
Prediction in Dual Form
To make a prediction on a new point $x$, we need to compute $\text{sign}(w^\top \phi(x))$. Substituting the dual representation:
$$w^\top \phi(x) = \left(\sum_{i=1}^n \alpha_i y_i \phi(x_i)\right)^\top \phi(x) = \sum_{i=1}^n \alpha_i y_i \phi(x_i)^\top \phi(x)$$
Notice that this only requires computing inner products $\phi(x_i)^\top \phi(x)$ in the feature space. This is exactly where the kernel trick comes in: we can replace these inner products with kernel evaluations:
$$w^\top \phi(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x)$$
So the prediction becomes $\hat{y} = \text{sign}\left(\sum_{i=1}^n \alpha_i y_i K(x_i, x)\right)$ without ever explicitly computing $\phi(x)$.
Dual Perceptron Update Rule
In the dual formulation, we maintain and update the coefficients $\alpha_i$ instead of the weight vector $w$. The algorithm proceeds as follows:
Initialize $\alpha_i = 0$ for all $i = 1, \ldots, n$
For each training example $(x_j, y_j)$, compute the prediction: $\hat{y}_j = \text{sign}\left(\sum_{i=1}^n \alpha_i y_i K(x_i, x_j)\right)$
If $\hat{y}_j \neq y_j$ (misclassification), update: $\alpha_j \leftarrow \alpha_j + 1$
Repeat until convergence (no more misclassifications) or for a fixed number of iterations
Each time we misclassify example $j$, we increment $\alpha_j$ by 1, which is equivalent to adding $y_j \phi(x_j)$ to the weight vector in the primal formulation.
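The procedure above translates almost line for line into code; here is a minimal NumPy sketch, using an RBF kernel as one illustrative choice:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def train_kernel_perceptron(X, y, kernel=rbf_kernel, epochs=20):
    """Dual perceptron. X: (n, d) inputs, y: (n,) labels in {-1, +1}."""
    n = len(X)
    alpha = np.zeros(n)
    # precompute the Gram matrix so each prediction is a weighted sum
    G = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        mistakes = 0
        for j in range(n):
            y_hat = np.sign(np.sum(alpha * y * G[:, j]))
            if y_hat != y[j]:             # misclassification: alpha_j <- alpha_j + 1
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:                 # converged: no more misclassifications
            break
    return alpha

def predict(alpha, X_train, y_train, x, kernel=rbf_kernel):
    k = np.array([kernel(xi, x) for xi in X_train])
    return np.sign(np.sum(alpha * y_train * k))
```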
Computational Advantages
The dual formulation with kernels provides several key advantages:
Works with infinite-dimensional features: We never need to explicitly compute or store $\phi(x)$, only kernel evaluations $K(x_i, x_j)$
Efficient for certain problems: When the number of training examples $n$ is much smaller than the feature dimension $d$, storing $n$ coefficients is more efficient than storing a $d$-dimensional weight vector
Enables nonlinear decision boundaries: By choosing appropriate kernels (like RBF or polynomial), we can learn complex decision boundaries while maintaining the simplicity of the linear Perceptron algorithm
Two further properties of the kernelized Perceptron are worth noting.
The kernelized Perceptron is $M$-independent because its computational complexity and memory requirements do not depend on the dimensionality $M$ of the feature space $\phi(x)$. In the dual formulation, we only store the coefficients $\alpha_i$ for each training example, requiring $O(n)$ space where $n$ is the number of training examples, regardless of the feature dimension $M$. All computations are performed using kernel evaluations $K(x_i, x_j)$, which can be computed directly from the input space without ever explicitly constructing the feature vectors $\phi(x_i)$ in the (potentially infinite-dimensional) feature space.
For example, with the RBF kernel $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, the corresponding feature space is infinite-dimensional, yet we can still train and make predictions efficiently because we only need to evaluate the kernel function. This is in stark contrast to the primal formulation, which would require storing and updating a weight vector $w$ of dimension $M$, which becomes impractical or impossible when $M$ is very large or infinite.
The kernelized Perceptron is a non-parametric method because the complexity of the model grows with the amount of training data rather than being fixed in advance. The decision function is expressed as $f(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x)$, which explicitly involves all training examples (or at least those with non-zero $\alpha_i$). The model "memorizes" the training examples rather than summarizing them into a fixed set of parameters.
While parametric models like standard linear regression have a fixed number of parameters (the dimension of $w$), the kernelized Perceptron's representation involves $n$ dual coefficients $\alpha_i$, where $n$ is the number of training examples. As we collect more data, the model's capacity grows. Non-parametric methods can represent increasingly complex functions as more training data becomes available, allowing the kernelized Perceptron to capture more intricate decision boundaries without changing the model structure.