Changyu Lee

Multiclass Classification & Kernel Method

Published at
2026/01/03
Section
ML
Status
Done
Series
From the Bottom
Tags
Lecture Summary
AI summary
Multiclass classification extends binary classification to multiple classes using softmax/cross-entropy loss, with strategies like One-vs-All and One-vs-One. Kernel methods enable linear algorithms to efficiently handle high-dimensional data through the kernel trick, allowing computations in implicit feature spaces without explicit transformations. Valid kernels must be symmetric and positive semi-definite, and the Gram matrix encodes pairwise evaluations essential for kernel-based algorithms.
Keywords
ML
Language
ENG
Week
4
Multiclass classification extends binary classification to $K$ classes, with each class $k$ having its own weight vector $w_k$ and predictions made by $\hat{y} = \arg\max_k w_k^\top x$
Cross-entropy loss (softmax loss) generalizes logistic loss to multiple classes: $L(y, x) = -\log P(y \mid x) = -w_y^\top x + \log\left(\sum_{j=1}^K e^{w_j^\top x}\right)$, providing probability distributions over all classes
Reduction to binary classification enables multiclass algorithms through One-vs-All ($K$ classifiers), One-vs-One ($K(K-1)/2$ classifiers), or tree-based approaches ($\log K$ evaluations)
The kernel method allows linear algorithms to work in high-dimensional feature spaces without explicit computation through kernel functions $K(x, x') = \langle \phi(x), \phi(x') \rangle$
The Gram matrix encodes all pairwise kernel evaluations and must be positive semi-definite for valid kernels, enabling kernel-based algorithms to operate entirely through inner products
The kernel trick leverages the representer theorem to express solutions as $w = \sum_{i=1}^n \alpha_i y_i \phi(x_i)$, making predictions via $f(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x)$ without computing $\phi(x)$ explicitly
Valid kernels must satisfy Mercer's theorem (symmetric and positive semi-definite), with common examples including linear, polynomial, and RBF (Gaussian) kernels
Kernel methods enable efficient computation in infinite-dimensional spaces but face computational challenges with $O(n^2)$ storage and $O(n^3)$ inversion costs for large datasets

Multiclass Classification & Kernel Method

Multiclass Classification

Multinomial Classification

Recall
How can we turn binary classification into multiclass classification?
1. What should a linear model look like for multiclass tasks?
Think of $w_k^\top x$ as a score for class $k$
Each class $k$ has its own weight vector $w_k$, and we predict the class with the highest score: $\hat{y} = \arg\max_k w_k^\top x$
This approach naturally extends the binary linear classifier to handle multiple classes simultaneously
The decision boundaries are now defined by pairwise hyperplanes between classes: $w_i^\top x = w_j^\top x$
2. How do we generalize perceptron / hinge / logistic loss? (Focus on logistic loss)
Multiclass logistic loss a.k.a. Cross-Entropy Loss
To estimate the probability distribution over all $K$ classes, we define the probability of class $k$ for a given input $x$ as:
$$P(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}$$
This is the softmax function, which generalizes the logistic sigmoid to multiple classes. It ensures all class probabilities are positive and sum to 1, creating a valid probability distribution.
For multiclass classification with $K$ classes, the cross-entropy loss can be written in several equivalent forms:
$$L(y, x) = -\log P(y \mid x) = -\log\left(\frac{e^{w_y^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}\right)$$
Expanding using logarithm properties:
$$L(y, x) = -w_y^\top x + \log\left(\sum_{j=1}^K e^{w_j^\top x}\right)$$
With one-hot encoding where $y_k = 1$ for the true class and $y_j = 0$ otherwise:
$$L(y, \hat{y}) = -\sum_{k=1}^K y_k \log(\hat{y}_k) = -\sum_{k=1}^K y_k \log\left(\frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}\right)$$
For the entire training set:
$$F(w) = -\frac{1}{n} \sum_{i=1}^n \log P(y_i \mid x_i) = \frac{1}{n} \sum_{i=1}^n \left[-w_{y_i}^\top x_i + \log\left(\sum_{j=1}^K e^{w_j^\top x_i}\right)\right]$$
Alternatively, we can write the loss in terms of the difference between weight vectors:
$$F(W) = -w_y^\top x + \log\left(\sum_{j=1}^K e^{w_j^\top x}\right) = \log\left(\sum_{j=1}^K e^{(w_j - w_y)^\top x}\right)$$
This formulation emphasizes that only the relative differences between weight vectors matter for the predicted probabilities, not their absolute values.
The training objective minimizes the negative log-likelihood over all training examples:
$$\hat{W} = \arg\min_W \frac{1}{n} \sum_{i=1}^n \left[-\log P(y_i \mid x_i)\right] = \arg\min_W \frac{1}{n} \sum_{i=1}^n \log\left(\sum_{j=1}^K e^{(w_j - w_{y_i})^\top x_i}\right)$$
This minimizes the sum of multiclass logistic losses. Each example contributes a loss based on how well the model predicts its true class relative to all other classes.
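As a concrete check of these formulas, here is a minimal NumPy sketch (the helper names are my own) that computes the softmax probabilities and the average cross-entropy loss for a weight matrix $W$ of shape $(K, d)$:

```python
import numpy as np

def softmax(scores):
    # subtract the row-wise max for numerical stability
    z = scores - scores.max(axis=1, keepdims=True)
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, X, y):
    """Average multiclass logistic loss F(W).
    W: (K, d) weights, X: (n, d) inputs, y: (n,) integer labels in {0, ..., K-1}."""
    probs = softmax(X @ W.T)                              # (n, K) softmax probabilities
    return -np.mean(np.log(probs[np.arange(len(X)), y]))  # -1/n * sum_i log P(y_i | x_i)
```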
3. Optimization (Apply SGD)
What is the gradient of $F_n(W)$?
To find the gradient of the multiclass logistic loss function $F_n(W)$ with respect to the weight matrix $W$, we need to compute the derivative of the loss with respect to each class's weight vector $w_k$.
Starting from the loss function:
$$F_n(W) = -w_y^\top x + \log\left(\sum_{j=1}^K e^{w_j^\top x}\right)$$
The gradient with respect to $w_k$ is:
$$\nabla_{w_k} F_n(W) = \begin{cases} -x + \dfrac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}} \, x & \text{if } k = y \\[2ex] \dfrac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}} \, x & \text{if } k \neq y \end{cases}$$
This can be written more compactly as:
$$\nabla_{w_k} F_n(W) = \left(P(y = k \mid x) - \mathbb{1}[k = y]\right) x$$
where $P(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}$ is the softmax probability for class $k$.
So the SGD algorithm repeatedly samples a training example, computes its softmax probabilities, and takes a gradient step on every $w_k$ using the expression above; a sketch follows below.
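A minimal, self-contained sketch of the resulting SGD loop (the learning rate and epoch count are arbitrary illustrative choices):

```python
import numpy as np

def softmax_vec(s):
    # numerically stable softmax of a single score vector
    e = np.exp(s - s.max())
    return e / e.sum()

def sgd_softmax(W, X, y, lr=0.1, epochs=10):
    """SGD on the multiclass logistic loss.
    W: (K, d) weights, X: (n, d) inputs, y: (n,) integer labels."""
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            p = softmax_vec(W @ X[i])      # softmax probabilities for x_i
            p[y[i]] -= 1.0                 # P(y = k | x_i) - 1[k = y_i]
            W -= lr * np.outer(p, X[i])    # gradient step on every w_k at once
    return W
```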
Once we have learned the weight matrix $W$ (where each row $w_k$ corresponds to the weight vector for class $k$), we have several options for making predictions on new data points:
Deterministic Prediction: The most straightforward approach is to make a deterministic prediction by selecting the class with the highest score: $\hat{y} = \arg\max_{k \in [K]} w_k^\top x$. This means we compute the score $w_k^\top x$ for each class $k$ and predict the class that achieves the maximum score. This approach is simple, fast, and gives a single definite answer.
Randomized (Probabilistic) Prediction: Alternatively, we can make a randomized prediction by sampling from the probability distribution defined by the softmax function: $P(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}$. In this approach, we compute the probability for each class using the softmax function, then draw a random sample from this distribution. Classes with higher probabilities are more likely to be selected, but there's still a chance of selecting lower-probability classes.
This probabilistic approach is useful when we want to incorporate uncertainty into our predictions, need diverse outputs for applications like exploration in reinforcement learning, want to avoid always choosing the same class when multiple classes have similar probabilities, or are using the model as part of a larger probabilistic system.
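A short sketch contrasting the two prediction rules, assuming the same $(K, d)$ weight matrix as above (helper names are my own):

```python
import numpy as np

def softmax_vec(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def predict_argmax(W, x):
    # deterministic: pick the class with the highest score w_k^T x
    return int(np.argmax(W @ x))

def predict_sampled(W, x, rng=np.random.default_rng(0)):
    # randomized: sample a class from the softmax distribution over the scores
    p = softmax_vec(W @ x)
    return int(rng.choice(len(p), p=p))
```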
Generalization of Cross-Entropy Loss
Scoring Function
A scoring function generalizes the notion of class scores beyond simple linear combinations. Instead of just computing $w_k^\top x$ for each class $k$, we can define a more flexible scoring function $s_k(x; w) = f_k(x, w)$ that could be any parameterized function. This allows us to capture more complex relationships between inputs and class predictions, such as using neural networks where $s_k(x; w) = w_k^\top \phi(x)$ with $\phi(x)$ being a learned feature representation.

Reduction to Binary Classification

Is there an even more general and simpler approach to derive multiclass classification algorithms?
→ Given a binary classification algorithm, can we turn it into a multiclass algorithm?
The answer is YES
One-vs-All (OvA): Train $K$ binary classifiers where each classifier $k$ distinguishes class $k$ from all other classes. For prediction, run all $K$ classifiers and choose the class with the highest confidence score: $\hat{y} = \arg\max_k w_k^\top x$. Simple and efficient but suffers from class imbalance since each classifier sees many more negative examples than positive ones (see the sketch after these three strategies).
One-vs-One (OvO): Train $\frac{K(K-1)}{2}$ binary classifiers, one for each pair of classes. Each classifier is trained only on examples from its two classes, ignoring all others. For prediction, use majority voting across all pairwise classifiers. More accurate for well-separated classes and naturally balanced, but requires many more classifiers and slower prediction time.
Tree-based: Organize classes in a hierarchical tree structure where each internal node represents a binary decision between groups of classes. At each node, train a binary classifier that splits the remaining classes into two subsets. For prediction, traverse the tree from root to leaf by making binary decisions at each node. Requires only $\log K$ classifier evaluations at test time, making it very efficient for large $K$, but the tree structure must be carefully designed and errors propagate down the tree.
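As an illustration of the One-vs-All strategy, here is a minimal from-scratch sketch; the base binary learner is a simple logistic classifier trained by gradient descent, and all names and hyperparameters are my own choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ova(X, y, num_classes, lr=0.1, epochs=200):
    """One-vs-All: one binary logistic classifier per class (class k vs. rest)."""
    n, d = X.shape
    W = np.zeros((num_classes, d))
    for k in range(num_classes):
        yk = np.where(y == k, 1.0, -1.0)              # relabel: +1 for class k, -1 otherwise
        for _ in range(epochs):
            margins = yk * (X @ W[k])                 # y_i * w_k^T x_i
            grad = -(yk * sigmoid(-margins)) @ X / n  # gradient of the binary logistic loss
            W[k] -= lr * grad
    return W

def predict_ova(W, X):
    # choose the class whose binary classifier gives the highest score
    return np.argmax(X @ W.T, axis=1)
```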
Comparison

Kernel Method

The Kernel Method is a powerful technique in machine learning that allows us to apply linear algorithms to non-linearly separable data without explicitly computing high-dimensional feature transformations. The key idea is to use a kernel function that computes inner products in a high-dimensional feature space implicitly, avoiding the computational cost of working directly in that space.

Motivation: From Linear to Nonlinear

Recall that linear classifiers work well when data is linearly separable, but many real-world problems require nonlinear decision boundaries. We previously saw that we can apply a nonlinear mapping $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^D$ to transform the input features into a higher-dimensional space where the data becomes linearly separable.
For example, mapping $x = (x_1, x_2)$ to $\phi(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$ allows us to capture polynomial relationships. However, if we want to capture very complex patterns, we might need $D$ to be very large (even infinite), making explicit computation of $\phi(x)$ intractable.

The Kernel Trick

The kernel trick provides an elegant solution: instead of explicitly computing $\phi(x)$, we define a kernel function $K(x, x')$ (often also written with a lowercase $k$) that directly computes the inner product in the feature space:
$$K(x, x') = \langle \phi(x), \phi(x') \rangle = \phi(x)^\top \phi(x')$$
The remarkable property is that we can compute $K(x, x')$ efficiently without ever explicitly computing $\phi(x)$. This allows us to work in very high-dimensional (even infinite-dimensional) feature spaces while maintaining computational efficiency.
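As a quick numerical check of this idea, the sketch below (my own illustration) compares the homogeneous quadratic kernel $K(x, x') = (x^\top x')^2$ on $\mathbb{R}^2$ with an explicit inner product under the feature map $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$; the two agree exactly:

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map for 2-D inputs
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def quad_kernel(x, xp):
    # the same inner product, computed implicitly in the input space
    return (x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp), quad_kernel(x, xp))   # both print 1.0
```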
Gram Matrix
In the context of kernel methods, the Gram matrix (also called the kernel matrix) is a fundamental concept that captures all pairwise kernel evaluations between training examples.
For a training set $\{x_1, x_2, \ldots, x_n\}$ and a kernel function $K(x, x')$, the Gram matrix $G$ is an $n \times n$ symmetric matrix where each entry $G_{ij}$ is defined as:
$$G_{ij} = K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$
In matrix form, this can be written as:
$$G = \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_n) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n, x_1) & K(x_n, x_2) & \cdots & K(x_n, x_n) \end{bmatrix}$$
Key Properties of the Gram Matrix:
Symmetry: Since $K(x_i, x_j) = K(x_j, x_i)$ for any valid kernel, the Gram matrix is symmetric: $G = G^\top$.
Positive Semi-Definiteness: For any valid kernel function, the Gram matrix must be positive semi-definite (PSD), meaning for any vector $c \in \mathbb{R}^n$, we have $c^\top G c \geq 0$. This is equivalent to Mercer's condition and ensures the kernel corresponds to a valid inner product.
Encodes all training information: The Gram matrix contains all the information needed to compute the solution in kernel-based algorithms. Once we have $G$, we never need to explicitly access the feature representations $\phi(x_i)$.
Role in Kernel Algorithms:
Many kernel-based algorithms can be expressed entirely in terms of the Gram matrix. For example, in kernelized ridge regression, the solution takes the form:
$$\alpha = (G + \lambda I)^{-1} y$$
where $\alpha$ is the vector of dual coefficients, $\lambda$ is the regularization parameter, $I$ is the identity matrix, and $y$ is the vector of labels.
To make predictions on a new point $x$, we compute:
$$f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) = k(x)^\top \alpha$$
where $k(x) = [K(x_1, x), K(x_2, x), \ldots, K(x_n, x)]^\top$ is the vector of kernel evaluations between the new point and all training points.
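A compact NumPy sketch of these two formulas (the RBF kernel and the regularization value are illustrative assumptions, and a linear solve is used instead of an explicit inverse):

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    # pairwise RBF kernel values K(a, b) = exp(-gamma * ||a - b||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam=0.1, gamma=1.0):
    G = rbf_gram(X, X, gamma)                            # n x n Gram matrix
    return np.linalg.solve(G + lam * np.eye(len(X)), y)  # alpha = (G + lambda I)^{-1} y

def predict_kernel_ridge(X_train, alpha, X_new, gamma=1.0):
    k = rbf_gram(X_new, X_train, gamma)                  # kernel evaluations k(x)
    return k @ alpha                                     # f(x) = k(x)^T alpha
```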
Computational Considerations:
Computing the Gram matrix requires $O(n^2)$ kernel evaluations and $O(n^2)$ storage, which can be prohibitive for very large datasets.
Many algorithms require inverting or decomposing the Gram matrix, which takes $O(n^3)$ time.
For large-scale problems, approximate methods like random features or Nyström approximation are often used to avoid computing the full Gram matrix.
Example: For a simple 2D dataset with three points and a linear kernel $K(x, x') = x^\top x'$:
If $x_1 = [1, 0]^\top$, $x_2 = [0, 1]^\top$, $x_3 = [1, 1]^\top$, then:
$$G = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}$$
This matrix is symmetric and positive semi-definite, confirming that the linear kernel is valid.
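This can be verified in a few lines (a minimal sketch):

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
G = X @ X.T                    # linear-kernel Gram matrix
print(G)                       # [[1. 0. 1.] [0. 1. 1.] [1. 1. 2.]]
print(np.linalg.eigvalsh(G))   # eigenvalues ~ [0, 1, 3], all >= 0, so G is PSD
```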

Why Kernels Work: The Representer Theorem

Many machine learning algorithms can be expressed entirely in terms of inner products between data points. For example, in the dual formulation of many problems, the solution takes the form:
$$w = \sum_{i=1}^n \alpha_i y_i \phi(x_i)$$
To make a prediction on a new point $x$, we compute:
$$f(x) = w^\top \phi(x) = \sum_{i=1}^n \alpha_i y_i \phi(x_i)^\top \phi(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x)$$
Notice that we only need to compute kernel evaluations $K(x_i, x)$, never the explicit feature vectors $\phi(x)$. This is the essence of the kernel trick.

Valid Kernels: Mercer's Theorem

Not every function can be a valid kernel. For $K(x, x')$ to correspond to an inner product in some feature space, it must satisfy certain conditions. Mercer's theorem states that a function $K$ is a valid kernel if and only if it is:
Symmetric: $K(x, x') = K(x', x)$
Positive semi-definite: For any set of points $\{x_1, \ldots, x_n\}$ and any coefficients $\{c_1, \ldots, c_n\}$, we have $\sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \geq 0$

Kernelized Perceptron

With feature maps
When we apply a feature map $\phi(x)$ to the Perceptron algorithm, we need to work in the transformed feature space. The standard Perceptron update becomes $w^{(t+1)} = w^{(t)} + \eta y_i \phi(x_i)$ when there's a misclassification.
However, if the feature space is high-dimensional (or even infinite-dimensional, as with some kernel functions), explicitly computing and storing $w$ in this space becomes impractical or impossible. This is where the dual formulation of the Perceptron becomes essential.
Dual Representation of the Weight Vector
The key insight is that the weight vector $w$ can always be expressed as a linear combination of the feature-mapped training examples. Starting from $w^{(0)} = 0$ and applying the Perceptron update rule repeatedly, we can write:
$$w = \sum_{i=1}^n \alpha_i y_i \phi(x_i)$$
where $\alpha_i$ represents how many times example $i$ was involved in an update (or more generally, the weight given to that example). This is called the dual representation because instead of storing the primal variables (the weight vector $w$ in feature space), we store the dual variables (the $\alpha_i$ coefficients).
Prediction in Dual Form
To make a prediction on a new point $x$, we need to compute $\text{sign}(w^\top \phi(x))$. Substituting the dual representation:
$$w^\top \phi(x) = \left(\sum_{i=1}^n \alpha_i y_i \phi(x_i)\right)^\top \phi(x) = \sum_{i=1}^n \alpha_i y_i \phi(x_i)^\top \phi(x)$$
Notice that this only requires computing inner products $\phi(x_i)^\top \phi(x)$ in the feature space. This is exactly where the kernel trick comes in: we can replace these inner products with kernel evaluations:
$$w^\top \phi(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x)$$
So the prediction becomes $\hat{y} = \text{sign}\left(\sum_{i=1}^n \alpha_i y_i K(x_i, x)\right)$ without ever explicitly computing $\phi(x)$.
Dual Perceptron Update Rule
In the dual formulation, we maintain and update the coefficients $\alpha_i$ instead of the weight vector $w$. The algorithm proceeds as follows:
Initialize $\alpha_i = 0$ for all $i = 1, \ldots, n$
For each training example $(x_j, y_j)$, compute the prediction: $\hat{y}_j = \text{sign}\left(\sum_{i=1}^n \alpha_i y_i K(x_i, x_j)\right)$
If $\hat{y}_j \neq y_j$ (misclassification), update: $\alpha_j \leftarrow \alpha_j + 1$
Repeat until convergence (no more misclassifications) or for a fixed number of iterations
Each time we misclassify example $j$, we increment $\alpha_j$ by 1, which is equivalent to adding $y_j \phi(x_j)$ to the weight vector in the primal formulation.
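The procedure above translates almost line for line into code; here is a minimal NumPy sketch, using an RBF kernel as one illustrative choice:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def train_kernel_perceptron(X, y, kernel=rbf_kernel, epochs=20):
    """Dual perceptron. X: (n, d) inputs, y: (n,) labels in {-1, +1}."""
    n = len(X)
    alpha = np.zeros(n)
    # precompute the Gram matrix so each prediction is a weighted sum
    G = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        mistakes = 0
        for j in range(n):
            y_hat = np.sign(np.sum(alpha * y * G[:, j]))
            if y_hat != y[j]:             # misclassification: alpha_j <- alpha_j + 1
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:                 # converged: no more misclassifications
            break
    return alpha

def predict(alpha, X_train, y_train, x, kernel=rbf_kernel):
    k = np.array([kernel(xi, x) for xi in X_train])
    return np.sign(np.sum(alpha * y_train * k))
```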
Computational Advantages
The dual formulation with kernels provides several key advantages:
Works with infinite-dimensional features: We never need to explicitly compute or store $\phi(x)$, only kernel evaluations $K(x_i, x_j)$
Efficient for certain problems: When the number of training examples $n$ is much smaller than the feature dimension $d$, storing $n$ coefficients is more efficient than storing a $d$-dimensional weight vector
Enables nonlinear decision boundaries: By choosing appropriate kernels (like RBF or polynomial), we can learn complex decision boundaries while maintaining the simplicity of the linear Perceptron algorithm
Two further properties of the kernelized Perceptron are worth noting.
The kernelized Perceptron is $M$-independent because its computational complexity and memory requirements do not depend on the dimensionality $M$ of the feature space $\phi(x)$. In the dual formulation, we only store the coefficients $\alpha_i$ for each training example, requiring $O(n)$ space where $n$ is the number of training examples, regardless of the feature dimension $M$. All computations are performed using kernel evaluations $K(x_i, x_j)$, which can be computed directly from the input space without ever explicitly constructing the feature vectors $\phi(x_i)$ in the (potentially infinite-dimensional) feature space.
For example, with the RBF kernel $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, the corresponding feature space is infinite-dimensional, yet we can still train and make predictions efficiently because we only need to evaluate the kernel function. This is in stark contrast to the primal formulation, which would require storing and updating a weight vector $w$ of dimension $M$, which becomes impractical or impossible when $M$ is very large or infinite.
The kernelized Perceptron is a non-parametric method because the complexity of the model grows with the amount of training data rather than being fixed in advance. The decision function is expressed as $f(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x)$, which explicitly involves all training examples (or at least those with non-zero $\alpha_i$). The model "memorizes" the training examples rather than summarizing them into a fixed set of parameters.
While parametric models like standard linear regression have a fixed number of parameters (the dimension of $w$), the kernelized Perceptron's representation involves $n$ dual coefficients $\alpha_i$, where $n$ is the number of training examples. As we collect more data, the model's capacity grows. Non-parametric methods can represent increasingly complex functions as more training data becomes available, allowing the kernelized Perceptron to capture more intricate decision boundaries without changing the model structure.