•
Multiclass classification extends binary classification to K classes, with each class $k$ having its own weight vector $w_k$; predictions are made by $\hat{y} = \arg\max_{k} w_k^\top x$
•
Cross-entropy loss (softmax loss) generalizes logistic loss to multiple classes: $\ell(W; x, y) = -\log \frac{e^{w_y^\top x}}{\sum_{k=1}^{K} e^{w_k^\top x}}$, providing probability distributions over all classes
•
Reduction to binary classification enables multiclass algorithms through One-vs-All ($K$ classifiers), One-vs-One ($K(K-1)/2$ classifiers), or tree-based approaches ($O(\log K)$ evaluations)
•
The kernel method allows linear algorithms to work in high-dimensional feature spaces without explicitly computing the feature map, by evaluating kernel functions $k(x, x') = \phi(x)^\top \phi(x')$
•
The Gram matrix encodes all pairwise kernel evaluations and must be positive semi-definite for valid kernels, enabling kernel-based algorithms to operate entirely through inner products
•
The kernel trick leverages the representer theorem to express solutions as $w = \sum_i \alpha_i \phi(x_i)$, making predictions via $f(x) = \sum_i \alpha_i k(x_i, x)$ without computing $\phi(x)$ explicitly
•
Valid kernels must satisfy Mercer's theorem (symmetric and positive semi-definite), with common examples including linear, polynomial, and RBF (Gaussian) kernels
•
Kernel methods enable efficient computation in infinite-dimensional spaces but face computational challenges for large datasets: $O(n^2)$ storage for the Gram matrix and $O(n^3)$ inversion cost
Multiclass Classification & Kernel Method
Multiclass Classification
Multinomial Classification
Recall
How can we extend binary classification to multiclass classification?
1.
What should a linear model look like for multiclass tasks?
•
Think of $w_k^\top x$ as a score for class $k$
•
Each class $k$ has its own weight vector $w_k$, and we predict the class with the highest score: $\hat{y} = \arg\max_{k \in \{1, \dots, K\}} w_k^\top x$
•
This approach naturally extends the binary linear classifier to handle multiple classes simultaneously
•
The decision boundaries are now defined by pairwise hyperplanes between classes: class $k$ and class $k'$ are separated by $\{x : (w_k - w_{k'})^\top x = 0\}$
2.
How do we generalize perceptron / hinge / logistic loss? (Focus on logistic loss)
•
Multiclass logistic loss a.k.a. Cross-Entropy Loss
To estimate the probability distribution over all classes, we define the probability of class $k$ for a given input $x$ as: $P(y = k \mid x; W) = \frac{e^{w_k^\top x}}{\sum_{k'=1}^{K} e^{w_{k'}^\top x}}$
This is the softmax function, which generalizes the logistic sigmoid to multiple classes. It ensures all class probabilities are positive and sum to 1, creating a valid probability distribution.
For multiclass classification with $K$ classes, the cross-entropy loss can be written in several equivalent forms. For an example $(x, y)$: $\ell(W; x, y) = -\log P(y \mid x; W) = -\log \frac{e^{w_y^\top x}}{\sum_{k=1}^{K} e^{w_k^\top x}}$
Expanding using logarithm properties: $\ell(W; x, y) = \log\left(\sum_{k=1}^{K} e^{w_k^\top x}\right) - w_y^\top x$
With one-hot encoding, where $y_k = 1$ for the true class and $y_k = 0$ otherwise: $\ell(W; x, y) = -\sum_{k=1}^{K} y_k \log P(y = k \mid x; W)$
For the entire training set: $F(W) = \frac{1}{n} \sum_{i=1}^{n} \ell(W; x_i, y_i) = -\frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid x_i; W)$
Alternatively, we can write the loss in terms of the differences between weight vectors: $\ell(W; x, y) = \log\left(1 + \sum_{k \neq y} e^{(w_k - w_y)^\top x}\right)$
This formulation emphasizes that only the relative differences between weight vectors matter for the predicted probabilities, not their absolute values.
The training objective minimizes the negative log-likelihood over all training examples: $\min_W \; -\frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid x_i; W)$
This minimizes the sum of multiclass logistic losses. Each example contributes a loss based on how well the model predicts its true class relative to all other classes.
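As a concrete illustration, here is a minimal NumPy sketch of the softmax probabilities and the resulting cross-entropy loss for a single example; the variable names and the toy numbers are made up for illustration, not taken from the notes.

```python
import numpy as np

def softmax(scores):
    """Convert a vector of class scores into probabilities that sum to 1."""
    scores = scores - np.max(scores)      # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

def cross_entropy_loss(W, x, y):
    """Multiclass logistic (cross-entropy) loss for one example.

    W : (K, d) weight matrix, one row per class
    x : (d,)   input features
    y : int    true class index in {0, ..., K-1}
    """
    probs = softmax(W @ x)                # P(k | x; W) for every class k
    return -np.log(probs[y])              # -log P(y | x; W)

# Toy usage with made-up numbers (K = 3 classes, d = 2 features)
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])
x = np.array([0.5, 1.5])
print(cross_entropy_loss(W, x, y=1))
```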
3.
Optimization (Apply SGD)
•
What is the gradient of the multiclass logistic loss $\ell(W; x, y)$?
To find the gradient of the multiclass logistic loss function with respect to the weight matrix $W$, we need to compute the derivative of the loss with respect to each class's weight vector $w_k$.
Starting from the loss function: $\ell(W; x, y) = \log\left(\sum_{k'=1}^{K} e^{w_{k'}^\top x}\right) - w_y^\top x$
The gradient with respect to $w_k$ is: $\nabla_{w_k} \ell(W; x, y) = \frac{e^{w_k^\top x}}{\sum_{k'=1}^{K} e^{w_{k'}^\top x}}\, x - \mathbb{1}[y = k]\, x$
This can be written more compactly as: $\nabla_{w_k} \ell(W; x, y) = \left(P(k \mid x; W) - \mathbb{1}[y = k]\right) x$
where $P(k \mid x; W)$ is the softmax probability for class $k$.
So the SGD algorithm is: repeatedly sample a training example $(x_i, y_i)$ and update every class's weight vector as $w_k \leftarrow w_k - \eta \left(P(k \mid x_i; W) - \mathbb{1}[y_i = k]\right) x_i$, where $\eta$ is the step size.
Once we have learned the weight matrix $W$ (where each row $w_k$ corresponds to the weight vector for class $k$), we have several options for making predictions on new data points:
Deterministic Prediction: The most straightforward approach is to make a deterministic prediction by selecting the class with the highest score: $\hat{y} = \arg\max_k w_k^\top x$. This means we compute the score for each class and predict the class that achieves the maximum score. This approach is simple, fast, and gives a single definite answer.
Randomized (Probabilistic) Prediction: Alternatively, we can make a randomized prediction by sampling from the probability distribution defined by the softmax function: $\hat{y} \sim P(\cdot \mid x; W)$. In this approach, we compute the probability of each class using the softmax function, then draw a random sample from this distribution. Classes with higher probabilities are more likely to be selected, but there's still a chance of selecting lower-probability classes.
This probabilistic approach is useful when we want to incorporate uncertainty into our predictions, need diverse outputs for applications like exploration in reinforcement learning, want to avoid always choosing the same class when multiple classes have similar probabilities, or are using the model as part of a larger probabilistic system.
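A minimal SGD sketch based on the gradient derived above, together with both prediction modes (argmax and sampling); the data, step size, and number of steps are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    scores = scores - np.max(scores)
    e = np.exp(scores)
    return e / e.sum()

def sgd_multiclass_logistic(X, y, K, eta=0.1, steps=500):
    """Train W by SGD on the multiclass logistic loss.

    X : (n, d) inputs, y : (n,) integer labels in {0, ..., K-1}.
    """
    n, d = X.shape
    W = np.zeros((K, d))
    for _ in range(steps):
        i = rng.integers(n)                    # sample one training example
        probs = softmax(W @ X[i])
        probs[y[i]] -= 1.0                     # P(k | x) - 1[y = k]
        W -= eta * np.outer(probs, X[i])       # w_k <- w_k - eta * (P(k|x) - 1[y=k]) x
    return W

def predict_deterministic(W, x):
    return int(np.argmax(W @ x))               # argmax_k w_k^T x

def predict_randomized(W, x):
    return int(rng.choice(len(W), p=softmax(W @ x)))  # sample from the softmax distribution

# Toy usage with made-up, well-separated data (3 classes in 2D)
X = np.array([[2.0, 0.1], [1.8, -0.2], [0.0, 2.0], [0.2, 1.7], [-2.0, -2.0]])
y = np.array([0, 0, 1, 1, 2])
W = sgd_multiclass_logistic(X, y, K=3)
x_new = np.array([1.9, 0.0])
print(predict_deterministic(W, x_new), predict_randomized(W, x_new))
```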
•
Generalization of Cross-Entropy Loss
◦
Scoring Function
A scoring function generalizes the notion of class scores beyond simple linear combinations. Instead of just computing $w_k^\top x$ for each class $k$, we can define a more flexible scoring function $s(x, k)$ that could be any parameterized function. This allows us to capture more complex relationships between inputs and class predictions, such as using neural networks where $s(x, k) = w_k^\top \phi(x)$ with $\phi(x)$ being a learned feature representation.
Reduction to Binary Classification
Is there an even more general and simpler approach to derive multiclass classification algorithms?
→ Given a binary classification algorithm, can we turn it into a multiclass algorithm?
•
The answer is YES
•
One-vs-All (OvA): Train K binary classifiers, where classifier $k$ distinguishes class $k$ from all other classes. For prediction, run all K classifiers and choose the class with the highest confidence score: $\hat{y} = \arg\max_k f_k(x)$, where $f_k$ is the score of classifier $k$. Simple and efficient, but it suffers from class imbalance since each classifier sees many more negative examples than positive ones. (A minimal code sketch appears after this list.)
•
One-vs-One (OvO): Train $K(K-1)/2$ binary classifiers, one for each pair of classes. Each classifier is trained only on examples from its two classes, ignoring all others. For prediction, use majority voting across all pairwise classifiers. More accurate for well-separated classes and naturally balanced, but requires many more classifiers and slower prediction time.
•
Tree-based: Organize classes in a hierarchical tree structure where each internal node represents a binary decision between groups of classes. At each node, train a binary classifier that splits the remaining classes into two subsets. For prediction, traverse the tree from root to leaf by making binary decisions at each node. Requires only $O(\log K)$ classifier evaluations at test time, making it very efficient for large K, but the tree structure must be carefully designed and errors propagate down the tree.
•
Comparison: OvA trains $K$ classifiers and evaluates all of them at prediction time, but each classifier faces imbalanced data; OvO trains $K(K-1)/2$ classifiers on naturally balanced pairwise problems, at the cost of more training and slower prediction; tree-based reduction needs only $O(\log K)$ evaluations per prediction, but it is sensitive to the tree design and to error propagation.
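As referenced in the One-vs-All bullet above, here is a minimal OvA sketch. The base binary learner here is a tiny Perceptron and the data is made up; any binary classifier that exposes a real-valued score could be substituted.

```python
import numpy as np

def train_binary_perceptron(X, y, epochs=50):
    """Tiny binary Perceptron (labels in {-1, +1}); returns a weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:     # misclassified (or exactly on the boundary)
                w += yi * xi
    return w

def train_one_vs_all(X, y, K):
    """Train K binary classifiers; classifier k treats class k as +1, the rest as -1."""
    return [train_binary_perceptron(X, np.where(y == k, 1, -1)) for k in range(K)]

def predict_one_vs_all(classifiers, x):
    """Predict the class whose binary classifier gives the highest score."""
    scores = [w @ x for w in classifiers]
    return int(np.argmax(scores))

# Toy usage: 3 classes in 2D, with a constant bias feature appended
X = np.array([[ 2.0,  0.0, 1.0], [ 2.5,  0.5, 1.0],
              [ 0.0,  2.0, 1.0], [-0.5,  2.5, 1.0],
              [-2.0, -2.0, 1.0], [-2.5, -1.5, 1.0]])
y = np.array([0, 0, 1, 1, 2, 2])
clfs = train_one_vs_all(X, y, K=3)
print(predict_one_vs_all(clfs, np.array([2.2, 0.1, 1.0])))  # expected: class 0 for this toy data
```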
Kernel Method
The Kernel Method is a powerful technique in machine learning that allows us to apply linear algorithms to non-linearly separable data without explicitly computing high-dimensional feature transformations. The key idea is to use a kernel function that computes inner products in a high-dimensional feature space implicitly, avoiding the computational cost of working directly in that space.
Motivation: From Linear to Nonlinear
Recall that linear classifiers work well when data is linearly separable, but many real-world problems require nonlinear decision boundaries. We previously saw that we can apply a nonlinear mapping to transform the input features into a higher-dimensional space where the data becomes linearly separable.
For example, mapping $x$ to a vector $\phi(x)$ of polynomial features (e.g., all monomials of the coordinates up to some degree) allows us to capture polynomial relationships. However, if we want to capture very complex patterns, we might need the dimension of $\phi(x)$ to be very large (even infinite), making explicit computation of $\phi(x)$ intractable.
The Kernel Trick
The kernel trick provides an elegant solution: instead of explicitly computing $\phi(x)$, we define a kernel function $k(x, x')$ (lowercase $k$) that directly computes the inner product in the feature space: $k(x, x') = \phi(x)^\top \phi(x')$
The remarkable property is that we can often compute $k(x, x')$ efficiently without ever explicitly computing $\phi(x)$. This allows us to work in very high-dimensional (even infinite-dimensional) feature spaces while maintaining computational efficiency.
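To make this concrete, the following sketch (with made-up inputs) checks numerically that the quadratic kernel $k(x, z) = (x^\top z)^2$ equals the inner product of the explicit degree-2 feature maps $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$ for 2D inputs.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D input: (x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def quadratic_kernel(x, z):
    """Computes phi(x)^T phi(z) without ever building phi explicitly."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))         # explicit inner product in the feature space
print(quadratic_kernel(x, z))  # same value, computed directly in the input space
```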
•
Gram Matrix
In the context of kernel methods, the Gram matrix (also called the kernel matrix) is a fundamental concept that captures all pairwise kernel evaluations between training examples.
For a training set $\{x_1, \dots, x_n\}$ and a kernel function $k$, the Gram matrix $G$ is an $n \times n$ symmetric matrix whose entries are defined as: $G_{ij} = k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$
In matrix form, this can be written as $G = \Phi \Phi^\top$, where $\Phi$ is the matrix whose $i$-th row is $\phi(x_i)^\top$.
Key Properties of the Gram Matrix:
◦
Symmetry: Since $k(x_i, x_j) = k(x_j, x_i)$ for any valid kernel, the Gram matrix is symmetric: $G = G^\top$.
◦
Positive Semi-Definiteness: For any valid kernel function, the Gram matrix must be positive semi-definite (PSD), meaning for any vector $c \in \mathbb{R}^n$, we have $c^\top G c \geq 0$. This is equivalent to Mercer's condition and ensures the kernel corresponds to a valid inner product.
◦
Encodes all training information: The Gram matrix contains all the information needed to compute the solution in kernel-based algorithms. Once we have $G$, we never need to explicitly access the feature representations $\phi(x_i)$.
Role in Kernel Algorithms:
Many kernel-based algorithms can be expressed entirely in terms of the Gram matrix. For example, in kernelized ridge regression, the solution takes the form: $\alpha = (G + \lambda I)^{-1} y$
where $\alpha \in \mathbb{R}^n$ is the vector of dual coefficients, $\lambda$ is the regularization parameter, $I$ is the $n \times n$ identity matrix, and $y$ is the vector of labels.
To make predictions on a new point $x$, we compute: $f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x) = k(x)^\top \alpha$
where $k(x) = (k(x_1, x), \dots, k(x_n, x))^\top$ is the vector of kernel evaluations between the new point and all training points.
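A minimal kernel ridge regression sketch based on the two formulas above, $\alpha = (G + \lambda I)^{-1} y$ and $f(x) = \sum_i \alpha_i k(x_i, x)$; the RBF kernel, the bandwidth, and the noisy sine data are all illustrative choices.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def gram_matrix(X, kernel):
    """G_ij = k(x_i, x_j) over all pairs of training points."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def fit_kernel_ridge(X, y, kernel, lam=0.1):
    """Solve (G + lam I) alpha = y for the dual coefficients alpha."""
    G = gram_matrix(X, kernel)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, kernel, x_new):
    """f(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X_train))

# Toy usage: fit a noisy sine curve with made-up data
X = np.linspace(0, 2 * np.pi, 20).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).standard_normal(20)
alpha = fit_kernel_ridge(X, y, rbf_kernel)
print(predict_kernel_ridge(X, alpha, rbf_kernel, np.array([np.pi / 2])))  # should be close to 1.0
```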
Computational Considerations:
◦
Computing the Gram matrix requires $O(n^2)$ kernel evaluations and $O(n^2)$ storage, which can be prohibitive for very large datasets.
◦
Many algorithms require inverting or decomposing the Gram matrix, which takes $O(n^3)$ time.
◦
For large-scale problems, approximate methods like random features or Nyström approximation are often used to avoid computing the full Gram matrix.
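As a sketch of the "random features" idea mentioned above, the snippet below approximates the RBF kernel with random Fourier features, replacing the full $n \times n$ Gram matrix with an explicit low-dimensional feature map; the bandwidth, the number of features, and the test points are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def random_fourier_features(X, D=500, sigma=1.0):
    """Map inputs to D random features z(x) such that z(x)^T z(x') ~ k_RBF(x, x')."""
    d = X.shape[1]
    Wf = rng.normal(scale=1.0 / sigma, size=(D, d))  # random frequencies ~ N(0, sigma^-2 I)
    b = rng.uniform(0, 2 * np.pi, size=D)            # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ Wf.T + b)

# Compare the exact kernel value to the random-feature approximation on made-up points
X = rng.standard_normal((2, 3))
Z = random_fourier_features(X)
print(rbf_kernel(X[0], X[1]))   # exact k(x_0, x_1)
print(Z[0] @ Z[1])              # approximate k(x_0, x_1)
```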
Example: For a simple 2D dataset with three points $x_1, x_2, x_3$ and a linear kernel $k(x, x') = x^\top x'$, the Gram matrix has entries $G_{ij} = x_i^\top x_j$, i.e. $G = X X^\top$ for the data matrix $X$ whose rows are the points. This matrix is symmetric and positive semi-definite, confirming that the linear kernel is valid.
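The same check can be carried out numerically; the three 2D points below are made up for illustration (the original example's values are not shown in the notes).

```python
import numpy as np

# Three illustrative 2D points (not the ones from the original example)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

G = X @ X.T                                      # linear kernel: G_ij = x_i^T x_j

print(np.allclose(G, G.T))                       # symmetry
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))   # all eigenvalues >= 0, hence PSD
```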
Why Kernels Work: The Representer Theorem
Many machine learning algorithms can be expressed entirely in terms of inner products between data points. For example, in the dual formulation of many problems, the solution takes the form: $w = \sum_{i=1}^{n} \alpha_i \phi(x_i)$
To make a prediction on a new point $x$, we compute: $f(x) = w^\top \phi(x) = \sum_{i=1}^{n} \alpha_i \phi(x_i)^\top \phi(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)$
Notice that we only need to compute kernel evaluations $k(x_i, x)$, never the explicit feature vectors $\phi(x)$. This is the essence of the kernel trick.
Valid Kernels: Mercer's Theorem
Not every function $k(x, x')$ can be a valid kernel. For $k$ to correspond to an inner product in some feature space, it must satisfy certain conditions. Mercer's theorem states that a function $k$ is a valid kernel if and only if it is:
•
Symmetric: $k(x, x') = k(x', x)$
•
Positive semi-definite: For any set of points $x_1, \dots, x_n$ and any coefficients $c_1, \dots, c_n \in \mathbb{R}$, we have $\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(x_i, x_j) \geq 0$
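For reference, here are the three common kernels named in the summary (linear, polynomial, RBF) as small NumPy functions; the hyperparameter values are illustrative defaults, not prescriptions.

```python
import numpy as np

def linear_kernel(x, z):
    """k(x, z) = x^T z  (the feature map is the identity)."""
    return x @ z

def polynomial_kernel(x, z, c=1.0, degree=3):
    """k(x, z) = (x^T z + c)^degree."""
    return (x @ z + c) ** degree

def rbf_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); infinite-dimensional feature space."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```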
Kernelized Perceptron
•
With feature maps
When we apply a feature map $\phi$ to the Perceptron algorithm, we need to work in the transformed feature space. The standard Perceptron update becomes $w \leftarrow w + y_i \phi(x_i)$ when example $(x_i, y_i)$ is misclassified.
However, if the feature space is high-dimensional (or even infinite-dimensional, as with some kernel functions), explicitly computing and storing $w$ in this space becomes impractical or impossible. This is where the dual formulation of the Perceptron becomes essential.
◦
Dual Representation of the Weight Vector
The key insight is that the weight vector can always be expressed as a linear combination of the feature-mapped training examples. Starting from $w = 0$ and applying the Perceptron update rule repeatedly, we can write: $w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i)$
where $\alpha_i$ represents how many times example $i$ was involved in an update (or, more generally, the weight given to that example). This is called the dual representation because instead of storing the primal variable $w$ (the weight vector in feature space), we store the dual variables (the $\alpha_i$ coefficients).
◦
Prediction in Dual Form
To make a prediction on a new point $x$, we need to compute $\operatorname{sign}(w^\top \phi(x))$. Substituting the dual representation: $w^\top \phi(x) = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i)^\top \phi(x)$
Notice that this only requires computing inner products in the feature space. This is exactly where the kernel trick comes in: we can replace these inner products with kernel evaluations $\phi(x_i)^\top \phi(x) = k(x_i, x)$.
So the prediction becomes $\hat{y} = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i k(x_i, x)\right)$ without ever explicitly computing $\phi(x)$ or $w$.
◦
Dual Perceptron Update Rule
In the dual formulation, we maintain and update the coefficients $\alpha_1, \dots, \alpha_n$ instead of the weight vector $w$. The algorithm proceeds as follows:
▪
Initialize $\alpha_i = 0$ for all $i = 1, \dots, n$
▪
For each training example $(x_j, y_j)$, compute the prediction: $\hat{y}_j = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i k(x_i, x_j)\right)$
▪
If $\hat{y}_j \neq y_j$ (misclassification), update: $\alpha_j \leftarrow \alpha_j + 1$
▪
Repeat until convergence (no more misclassifications) or for a fixed number of iterations
Each time we misclassify example $j$, we increment $\alpha_j$ by 1, which is equivalent to adding $y_j \phi(x_j)$ to the weight vector in the primal formulation.
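Putting the pieces above together, here is a minimal sketch of the dual (kernelized) Perceptron; the RBF kernel, the XOR-style toy data, and the epoch count are illustrative, and labels are assumed to be in {-1, +1}.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def train_kernel_perceptron(X, y, kernel, epochs=20):
    """Dual Perceptron: maintain one mistake count alpha_i per training example."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for j in range(n):
            # prediction score for x_j using the current dual coefficients
            score = sum(alpha[i] * y[i] * kernel(X[i], X[j]) for i in range(n))
            if y[j] * score <= 0:      # misclassification (or zero score)
                alpha[j] += 1          # equivalent to w <- w + y_j * phi(x_j) in the primal
    return alpha

def predict(X_train, y_train, alpha, kernel, x_new):
    score = sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y_train, X_train))
    return 1 if score >= 0 else -1

# Toy XOR-like data: not linearly separable in the input space, separable with the RBF kernel
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = train_kernel_perceptron(X, y, rbf_kernel)
print([predict(X, y, alpha, rbf_kernel, x) for x in X])  # should recover [1, 1, -1, -1]
```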
◦
Computational Advantages
The dual formulation with kernels provides several key advantages:
▪
Works with infinite-dimensional features: we never need to explicitly compute or store $\phi(x)$ or $w$, only kernel evaluations $k(x_i, x_j)$
▪
Efficient for certain problems: when the number of training examples $n$ is much smaller than the feature dimension $M$, storing $n$ coefficients is more efficient than storing an $M$-dimensional weight vector
▪
Enables nonlinear decision boundaries: By choosing appropriate kernels (like RBF or polynomial), we can learn complex decision boundaries while maintaining the simplicity of the linear Perceptron algorithm
For example, consider two properties of the kernelized Perceptron:
The kernelized Perceptron is M-independent because its computational complexity and memory requirements do not depend on the dimensionality $M$ of the feature space. In the dual formulation, we only store the coefficients $\alpha_i$ for each training example, requiring $O(n)$ space where $n$ is the number of training examples, regardless of the feature dimension $M$. All computations are performed using kernel evaluations $k(x_i, x_j)$, which can be computed directly from the input space without ever explicitly constructing the feature vectors in the (potentially infinite-dimensional) feature space.
For example, with the RBF kernel $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$, the corresponding feature space is infinite-dimensional, yet we can still train and make predictions efficiently because we only need to evaluate the kernel function. This is in stark contrast to the primal formulation, which would require storing and updating a weight vector of dimension $M$, which becomes impractical or impossible when $M$ is very large or infinite.
The kernelized Perceptron is a non-parametric method because the complexity of the model grows with the amount of training data rather than being fixed in advance. The decision function is expressed as $f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i k(x_i, x)\right)$, which explicitly involves all training examples (or at least those with non-zero $\alpha_i$). The model "memorizes" the training examples rather than summarizing them into a fixed set of parameters.
While parametric models like standard linear regression have a fixed number of parameters (the dimension of $w$), the kernelized Perceptron's representation involves $n$ dual coefficients $\alpha_1, \dots, \alpha_n$, where $n$ is the number of training examples. As we collect more data, the model's capacity grows. Non-parametric methods can represent increasingly complex functions as more training data becomes available, allowing the kernelized Perceptron to capture more intricate decision boundaries without changing the model structure.