SVM assignment#

Additional Resources#

SVM and Duality#

Mathematical Foundations:#

  1. Hypothesis Representation:

    • SVM aims to find a hyperplane that separates data points of different classes in a feature space:

    \[ w^T x + b = 0 \]

    where \(w\) is the weight vector, \(x\) is the feature vector, and \(b\) is the bias term.

  2. Margin and Support Vectors:

    • SVM maximizes the margin. Under the canonical scaling in which the closest points satisfy \(|w^T x + b| = 1\), their distance to the hyperplane, and hence the margin, is the reciprocal of the magnitude of the weight vector:

    \[ \text{Margin} = \frac{1}{\|w\|} \]
  3. SVM Objective Function:

    • The objective is to minimize \(\frac{1}{2} \|w\|^2\), which is equivalent to maximizing the margin, subject to the constraint that \(\forall i\), \(y^{(i)}(w^T x^{(i)} + b) \geq 1\). This leads to the optimization problem (a small solver sketch follows this list):

    \[ \text{Minimize } \frac{1}{2} \|w\|^2 \text{ subject to } y^{(i)}(w^T x^{(i)} + b) \geq 1, \forall i \]
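
As an illustration, the hard-margin problem above can be handed directly to a generic convex solver. The sketch below assumes the CVXPY library and uses a made-up, linearly separable toy dataset; it is not part of the assignment itself.

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy set; labels are in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```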

Modified SVM Objective Function with Slack Variables:#

  1. Introduction to Slack Variables:

    • In real-world scenarios, the data often cannot be separated perfectly, which leads to misclassifications or to data points that fall inside the margin.

    • Slack variables (\(\epsilon_i\)) are introduced to handle such situations.

  2. Trade-off with Margin and Misclassifications:

    • The modified objective function combines the margin term, \(\frac{1}{2}\|w\|^2\), with a term that penalizes misclassifications and points within the margin, \(C \sum_{i=1}^{m} \epsilon_i\).

    • The hyperparameter \(C\) allows you to control the trade-off between maximizing the margin and allowing classification errors (misclassifications and points within the margin); the combined optimization problem is written out right after this list.
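
Written out, the soft-margin problem combines the margin term and the slack penalty (same notation as above, with \(m\) training examples):

\[ \text{Minimize } \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \epsilon_i \text{ subject to } y^{(i)}(w^T x^{(i)} + b) \geq 1 - \epsilon_i, \; \epsilon_i \geq 0, \; \forall i \]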

Practical Concepts:#

  1. Constraints with Slack Variables:

    • The constraints ensure that data points are correctly classified (or within the margin) based on the value of \(\epsilon_i\).

    • The inequality \(y^{(i)}(w^T x^{(i)} + b) \geq 1 - \epsilon_i\), together with \(\epsilon_i \geq 0\), enforces correct classification outside the margin when \(\epsilon_i = 0\); a value \(0 < \epsilon_i \leq 1\) allows the point to fall inside the margin, and \(\epsilon_i > 1\) allows an outright misclassification.

  2. Hyperparameter \(C\):

    • The hyperparameter \(C\) controls the balance between maximizing the margin and minimizing misclassifications (illustrated in the scikit-learn sketch after this list).

    • Smaller \(C\) values emphasize maximizing the margin, potentially allowing more misclassifications (\(\epsilon_i > 0\)).

    • Larger \(C\) values emphasize minimizing misclassifications, possibly reducing the margin width and resulting in smaller \(\epsilon_i\) values.

  3. Optimization:

    • The optimization process aims to find optimal values for \(w\), \(b\), and \(\epsilon_i\) by minimizing the modified objective function while satisfying the constraints.

    • The optimal solution balances margin width, misclassifications, and the penalty associated with \(\epsilon_i\), based on the value of \(C\).
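
To see the effect of \(C\) concretely, here is a minimal scikit-learn sketch. The dataset, the particular \(C\) values, and the use of a linear kernel are illustrative choices, not requirements of the assignment.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class dataset with some overlap between the classes.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    # Linear soft-margin SVM; C is the slack penalty from the objective above.
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries
    print(f"C={C:>6}: margin width = {margin_width:.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
```

Smaller \(C\) should produce a wider margin with more support vectors; larger \(C\) should shrink the margin and tolerate fewer violations.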


Functional & Geometric Margin#

| Concept | Definition | Interpretation |
| --- | --- | --- |
| Functional Margin \(f(x_i)\) | \(f(x_i) = y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\) | Positive \(f(x_i)\) indicates correct classification; negative \(f(x_i)\) indicates misclassification; the magnitude of \(f(x_i)\) measures the confidence of the classification. |
| Geometric Margin | \(\text{Geometric Margin} = \dfrac{f(x_i)}{\Vert\mathbf{w}\Vert}\) | The perpendicular distance from a data point to the hyperplane; normalized by \(\Vert\mathbf{w}\Vert\) so it is scale-independent; represents the actual margin, or gap, between the data point and the hyperplane. |
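
A small NumPy sketch of both quantities; the hyperplane \((w, b)\) and the labeled point below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical separating hyperplane and one labeled point.
w = np.array([2.0, -1.0])
b = 0.5
x_i = np.array([1.0, 1.0])
y_i = +1

functional_margin = y_i * (np.dot(w, x_i) + b)             # y_i (w . x_i + b)
geometric_margin = functional_margin / np.linalg.norm(w)   # normalized by ||w||

print(functional_margin, geometric_margin)  # 1.5, about 0.671
```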

Parameter Update Rules#

The update rules for the weights \(w\) and bias \(b\) in stochastic gradient descent with a learning rate \(\eta\) (eta) and per-instance loss \(L_i(w, b)\) are as follows:

\[ w_{t+1} = w_t - \eta \nabla_w L_i(w, b) \]
\[ b_{t+1} = b_t - \eta \nabla_b L_i(w, b) \]

Here, \(L_i(w, b)\) is the loss function for the \(i\)-th example, and the total loss \(L(w, b)\) is typically defined as:

\[ L(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} L_i(w, b) \]

In this context:

  • \(\eta\) (eta) is the learning rate

  • \(C\) is the penalty applied to the per-example losses; it acts as an inverse regularization strength (larger \(C\) means weaker regularization of \(w\))

  • The factor of \(\frac{1}{2}\) in the regularization term \(\frac{\lambda}{2}\|w\|^2\) (used in the next section) simplifies the computation of its derivative with respect to \(w\)

  • \(\lambda\) (lambda) is inversely related to the regularization parameter \(C\): dividing the total loss by \(C\) gives the \(\lambda\)-form used below, with \(\lambda = 1/C\).
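
For reference, dividing the total loss by \(C\) (which does not change the minimizing \(w\) and \(b\)) gives:

\[ \frac{1}{C} L(w, b) = \sum_{i=1}^{n} L_i(w, b) + \frac{1}{2C}\|w\|^2 = \sum_{i=1}^{n} L_i(w, b) + \frac{\lambda}{2}\|w\|^2, \quad \lambda = \frac{1}{C} \]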

Derivative of Loss Function#

\[ L(w,b) = \sum_{i} \max(0, 1 - y_i (w^T x_i + b)) + \frac{\lambda}{2} \| w \|_2^2 \]

The (sub)derivative with respect to \( w \), evaluated on a single example \((x_i, y_i)\) as used in the stochastic updates above:

\[\begin{split} \frac{\partial L}{\partial w} = \left\{ \begin{array}{ll} -y_i x_i + \lambda w & \mbox{if } 1 - y_i (w^T x_i + b) > 0 \\ \lambda w & \mbox{otherwise} \end{array} \right. \end{split}\]

And the (sub)derivative with respect to \( b \) for the same example:

\[\begin{split} \frac{\partial L}{\partial b} = \left\{ \begin{array}{ll} -y_i & \mbox{if } 1 - y_i (w^T x_i + b) > 0 \\ 0 & \mbox{otherwise} \end{array} \right. \end{split}\]
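
Putting the update rules and these per-example (sub)gradients together, a minimal NumPy training loop might look like the sketch below. The function name, learning rate, and epoch count are illustrative choices, not part of the assignment specification.

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, eta=0.001, epochs=100, seed=0):
    """Train a linear SVM with hinge loss + (lambda/2)||w||^2 via SGD.

    X : (n, d) feature matrix; y : (n,) labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Hinge loss is active: subgradients are -y_i x_i + lambda*w and -y_i.
                grad_w = -y[i] * X[i] + lam * w
                grad_b = -y[i]
            else:
                # Only the regularizer contributes to the w-gradient.
                grad_w = lam * w
                grad_b = 0.0
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b
```

Predictions are then made with \(\operatorname{sign}(w^T x + b)\).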

Key Concepts#

Kernel Functions : Kernel functions compute inner products between data points as if they had been mapped into a higher-dimensional feature space, without ever constructing that mapping explicitly (the kernel trick). This is what lets Support Vector Machines systematically find support vector classifiers in higher dimensions, even when the data are not linearly separable in the original space.
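
For instance, the RBF kernel corresponds to an inner product in an implicit, very high-dimensional feature space, computed directly from the original vectors. The sketch below, assuming scikit-learn and an arbitrary toy dataset, contrasts it with a linear kernel on data that is not linearly separable.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf").fit(X, y)  # kernel trick: implicit high-dimensional mapping

print("linear accuracy:", linear_clf.score(X, y))
print("rbf accuracy:   ", rbf_clf.score(X, y))
```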