SVM assignment#
Additional Resources#
SVM and Duality#
Mathematical Foundations:#
Hypothesis Representation:
SVM aims to find a hyperplane that separates data points of different classes in a feature space:
\[ w^T x + b = 0 \]
where \(w\) is the weight vector, \(x\) is the feature vector, and \(b\) is the bias term.
Margin and Support Vectors:
SVM maximizes the margin, the distance from the hyperplane to the closest data points (the support vectors). With the functional margin of those closest points scaled to 1, this distance is the reciprocal of the norm of the weight vector:
\[ \text{Margin} = \frac{1}{\|w\|} \]
SVM Objective Function:
Maximizing the margin is therefore equivalent to minimizing \(\frac{1}{2} \|w\|^2\) subject to the constraint that \(\forall i\), \(y^{(i)}(w^T x^{(i)} + b) \geq 1\). This leads to the optimization problem:
\[ \text{Minimize } \frac{1}{2} \|w\|^2 \text{ subject to } y^{(i)}(w^T x^{(i)} + b) \geq 1, \forall i \]
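To make the formulation concrete, here is a minimal sketch that hands this quadratic program to a generic convex solver; it assumes `cvxpy` is available, and the small linearly separable dataset is hypothetical.

```python
import numpy as np
import cvxpy as cp

# Hypothetical, linearly separable toy data: two points per class.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)   # weight vector
b = cp.Variable()    # bias term

# Minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print("w =", w.value, "b =", b.value)
```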
Modified SVM Objective Function with Slack Variables:
Introduction to Slack Variables:
In real-world scenarios, data is often not perfectly separable, so some points end up misclassified or inside the margin.
Slack variables (\(\epsilon_i\)) are introduced to handle such situations.
Trade-off with Margin and Misclassifications:
The modified objective function combines the margin term, \(\frac{1}{2}\|w\|^2\), with a term that penalizes misclassifications and points within the margin, \(C \sum_{i=1}^{m} \epsilon_i\).
The hyperparameter \(C\) allows you to control the trade-off between maximizing the margin and allowing classification errors (misclassifications and points within the margin).
Practical Concepts:#
Constraints with Slack Variables:
The constraints ensure that data points are correctly classified (or within the margin) based on the value of \(\epsilon_i\).
The inequality \(y^{(i)}(w^T x^{(i)} + b) \geq 1 - \epsilon_i\), together with \(\epsilon_i \geq 0\), enforces correct classification with a full margin when \(\epsilon_i = 0\) and tolerates margin violations and misclassifications when \(\epsilon_i > 0\).
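Putting the objective and the constraints together, the full soft-margin problem (stated here for reference) is:
\[ \min_{w,\, b,\, \epsilon} \;\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \epsilon_i \quad \text{subject to} \quad y^{(i)}\left(w^T x^{(i)} + b\right) \geq 1 - \epsilon_i,\;\; \epsilon_i \geq 0,\;\; \forall i \]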
Hyperparameter \(C\):
The hyperparameter \(C\) controls the balance between maximizing the margin and minimizing misclassifications.
Smaller \(C\) values emphasize maximizing the margin, potentially allowing more misclassifications (\(\epsilon_i > 0\)).
Larger \(C\) values emphasize minimizing misclassifications, possibly reducing the margin width and resulting in smaller \(\epsilon_i\) values.
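To see this trade-off in practice, here is a minimal sketch assuming scikit-learn is available; the blob dataset and the two values of \(C\) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical toy data: two overlapping Gaussian blobs.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)   # geometric margin = 1 / ||w||
    n_sv = int(clf.n_support_.sum())           # points on or inside the margin
    print(f"C = {C:>6}: margin width = {margin:.3f}, support vectors = {n_sv}")
```

The smaller \(C\) typically yields a wider margin with more support vectors, while the larger \(C\) penalizes violations more heavily and tightens the margin.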
Optimization:
The optimization process aims to find optimal values for \(w\), \(b\), and \(\epsilon_i\) by minimizing the modified objective function while satisfying the constraints.
The optimal solution balances margin width, misclassifications, and the penalty associated with \(\epsilon_i\), based on the value of \(C\).
Functional & Geometric Margin#
| Concept | Definition | Interpretation |
|---|---|---|
| Functional Margin (\(f(x_i)\)) | \(f(x_i) = y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\) | Positive \(f(x_i)\) indicates correct classification; negative \(f(x_i)\) indicates misclassification; the magnitude of \(f(x_i)\) measures confidence in the classification. |
| Geometric Margin | \(\frac{f(x_i)}{\|\mathbf{w}\|}\) | The perpendicular distance from the data point to the hyperplane; normalized by \(\|\mathbf{w}\|\) so it is scale-independent; represents the actual gap between the data point and the hyperplane. |
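A small numeric illustration of both quantities; the vectors below are hypothetical, chosen only to show the computation.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the two margin definitions.
w = np.array([2.0, -1.0])   # weight vector
b = 0.5                     # bias term
x_i = np.array([1.5, 0.5])  # a data point
y_i = +1                    # its label

functional_margin = y_i * (w @ x_i + b)                    # y_i (w . x_i + b)
geometric_margin = functional_margin / np.linalg.norm(w)   # normalized by ||w||

print(functional_margin)  # 3.0
print(geometric_margin)   # 3.0 / sqrt(5) ≈ 1.342
```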
Parameter Update Rules#
The update rules for the weights \(w\) and bias \(b\) in stochastic gradient descent with a learning rate \(\eta\) (eta) and per-instance loss \(L_i(w, b)\) are as follows:
\[ w \leftarrow w - \eta \frac{\partial L_i(w, b)}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial L_i(w, b)}{\partial b} \]
Here, \(L_i(w, b)\) is the loss for the \(i\)-th example (the hinge loss), and the total loss \(L(w, b)\) is typically defined as the regularization term plus the sum of the per-example losses:
\[ L(w, b) = \frac{1}{2\lambda}\|w\|^2 + \sum_{i=1}^{m} L_i(w, b), \qquad L_i(w, b) = \max\!\left(0,\; 1 - y^{(i)}\left(w^T x^{(i)} + b\right)\right) \]
In this context:
\(\eta\) (eta) is the learning rate.
\(C\) is the regularization parameter from the soft-margin objective.
The factor \(\frac{1}{2}\) in the regularization term \(\frac{1}{2\lambda}\|w\|^2\) simplifies the computation of its derivative with respect to \(w\).
\(\lambda\) (lambda) is related to the regularization parameter \(C\): dividing the soft-margin objective \(\frac{1}{2}\|w\|^2 + C \sum_{i} \epsilon_i\) by \(C\) gives the form above with \(\lambda = C\).
Derivative of Loss Function#
The (sub)derivative of the per-example hinge loss with respect to \( w \):
\[ \frac{\partial L_i}{\partial w} = \begin{cases} -\,y^{(i)} x^{(i)} & \text{if } y^{(i)}\left(w^T x^{(i)} + b\right) < 1 \\ 0 & \text{otherwise} \end{cases} \]
And the derivative with respect to \( b \):
\[ \frac{\partial L_i}{\partial b} = \begin{cases} -\,y^{(i)} & \text{if } y^{(i)}\left(w^T x^{(i)} + b\right) < 1 \\ 0 & \text{otherwise} \end{cases} \]
The regularization term \(\frac{1}{2\lambda}\|w\|^2\) contributes an additional \(\frac{w}{\lambda}\) to the derivative with respect to \(w\) and does not depend on \(b\).
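The sketch below implements these updates under the hinge-loss formulation above; the dataset, learning rate, number of epochs, and the choice to spread the regularizer evenly over the \(m\) per-example steps are all illustrative assumptions.

```python
import numpy as np

def train_linear_svm_sgd(X, y, lam=1.0, eta=0.01, epochs=100, seed=0):
    """SGD on L(w, b) = 1/(2*lam) * ||w||^2 + sum_i max(0, 1 - y_i (w.x_i + b))."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(m):
            # Regularizer gradient, spread evenly over the m per-example steps.
            grad_w = w / (lam * m)
            grad_b = 0.0
            if y[i] * (X[i] @ w + b) < 1:        # hinge loss is active for this example
                grad_w = grad_w - y[i] * X[i]    # dL_i/dw = -y_i x_i
                grad_b = -y[i]                   # dL_i/db = -y_i
            w -= eta * grad_w                    # w <- w - eta * gradient
            b -= eta * grad_b                    # b <- b - eta * gradient
    return w, b

# Hypothetical toy data: two points per class.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm_sgd(X, y, lam=10.0, eta=0.05, epochs=200)
print("predictions:", np.sign(X @ w + b))
```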
Key Concepts#
Kernel Functions: To keep the mathematics tractable, Support Vector Machines use kernel functions, which compute inner products between data points as if they had been mapped into a higher-dimensional feature space, without ever constructing that mapping explicitly. This is what allows SVMs to systematically find support vector classifiers in higher dimensions.
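As a small illustration of the idea, the degree-2 polynomial kernel \((x \cdot z + 1)^2\) gives the same value as an inner product under an explicit quadratic feature map, without ever constructing that map; the feature map below is one possible choice for 2-D inputs.

```python
import numpy as np

def phi(x):
    # One explicit feature map for 2-D input whose inner product equals (x.z + 1)^2.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    # Degree-2 polynomial kernel: K(x, z) = (x . z + 1)^2
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))    # 4.0 -- inner product in the 6-dimensional feature space
print(poly_kernel(x, z))  # 4.0 -- same value, computed directly in 2 dimensions
```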