Introduction to Support Vector Machines

This notebook try to give a quick intro to SVM and how to use it in particular with Scikit-learn. It is based and adapted from different sources, in particular from the Notebook on SVM of the course on Advanced Statistical Computing at Vanderbilt University's Department of Biostatistic (Bios366). Please see the reference section for the complete list of sources.

Loading the basics

Let's first load what we need to run

In [298]:
import numpy as np
import pandas as pd
import pylab as plt
pylab.rcParams['figure.figsize'] = (8.0,5.0)
plt.xkcd() 
Out[298]:
<matplotlib.rc_context at 0x11a6a9050>
In [299]:
def remove_axis(plot):
    ax = plot.gca()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    

Intuition behind Support Vector Machines

Let's assume we have the follwing data which is linearly separable

In [300]:
g1 = np.random.multivariate_normal((-1,-1), np.eye(2)*0.2, 10)
g2 = np.random.multivariate_normal((1,1), np.eye(2)*0.2, 10)
remove_axis(plt)
plt.scatter(*g1.T, color='r')
plt.scatter(*g2.T, color='b')
plt.xlim(-2,2); plt.ylim(-2,2)
Out[300]:
(-2, 2)

We want so separte the with an hyperplane (line in 2D). There are many possibilities:

One possible separation is a line that passes very close to points in both groups. We could also place some lines by graphic intuition.

In [315]:
plt.xkcd()
x,y = np.transpose([g1[np.where(g1.T[1]==g1.max(0)[1])[0][0]], 
                       g2[np.where(g2.T[1]==g2.min(0)[1])[0][0]]])

remove_axis(plt)
plt.scatter(*g1.T, color='r')
plt.scatter(*g2.T, color='b')
x,y = np.transpose([g1[np.where(g1.T[1]==g1.max(0)[1])[0][0]], 
                       g2[np.where(g2.T[1]==g2.min(0)[1])[0][0]]])
b0,b1 = np.linalg.lstsq(np.c_[[1,1],x], y)[0]
xspace = np.linspace(-3,3,100)
plt.plot(xspace, b0 + 0.05 + (b1-.2)*xspace, 'k--')
y = g2[np.where(g2.T[1]==g2.min(0)[1])[0][0]]
plt.plot([0,0],[-2,2] , 'k--')
plt.plot([-2,2],[2,-2] , 'k--')
plt.xlim(-2,2); plt.ylim(-2,2)
Out[315]:
(-2, 2)

Which separating hyperplane is the best?

The one based on LS seems like a poor choice of decision boundary, even though it separates the groups, because it may not be a robust solution. The others seem better, but which one should pick? Intuitively, a wide margin around the dividing line seems to make this more likely. Let's look at an example:

In [303]:
frame = plt.gca()
frame.axes.yaxis.set_ticklabels([])
frame.axes.xaxis.set_ticklabels([])

x1 = -1, 0
x2 = 1, 1
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-1.5, 1.5],[0,1], 'k-')
Out[303]:
[<matplotlib.lines.Line2D at 0x12485d790>]

SVM avoids this by establishing a margin between the decision boundary and the nearest points in each group. This margin is maximized under SVM.

We look for a vector \(w\) is perpendicular to this decision boundary:

In [304]:
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-1.5, 1.5],[0,1], 'k-')
plt.plot([-.5, .75], [1, 0], 'k--')
plt.annotate(r"w", xy=(-0.4, 1), fontsize=20)
Out[304]:
<matplotlib.text.Annotation at 0x124857850>
In [305]:
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-1.5, 1.5],[0,1], 'k-')
plt.plot([-.5, .75], [1, 0], 'k--')
plt.plot([0.1,-0.05],[0.52,0.64], linewidth=4, color='r')
plt.plot([0.1,0.28],[0.52,0.4],linewidth=4)

plt.annotate(r"w", xy=(-0.4, 1),  fontsize=20)
plt.arrow(-1, 0, 3*(.35), .35, fc="b", ec="b", head_width=0.07, head_length=0.2)
plt.arrow(1, 1, -3*(.28), -.28, fc="r", ec="r", head_width=0.07, head_length=0.2)
Out[305]:
<matplotlib.patches.FancyArrow at 0x1248c4910>

Now consider another decision boundary, which intuitively appears to be a better choice.

In [310]:
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')

plt.plot([-.5,.5], [1.5,-.5], 'k-')
Out[310]:
[<matplotlib.lines.Line2D at 0x1250c6d10>]
In [311]:
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-.5,.5], [1.5,-.5], 'k-')
plt.plot([-1, 1], [-.5, 1.7], 'k--')
plt.annotate(r"w", xy=(0.6, 1.5), fontsize=20)
plt.plot([0.7,-0.05],[1.4,0.55], linewidth=4, color='r')
plt.plot([-0.8,-0.05],[-0.3,0.55], linewidth=4, color='b')

#plt.axis('tight')
plt.arrow(-1, 0, .1, -.2, fc="b", ec="b", head_width=0.07, head_length=0.1)
plt.arrow(1, 1, -.2, .37, fc="r", ec="r", head_width=0.07, head_length=0.1)
Out[311]:
<matplotlib.patches.FancyArrow at 0x12547ba50>

Problem Definition

The hyperplane that divides the training set \[ \mathbf{w}\cdot\mathbf{x} + b = 0\]

The classifier becomes:

\[ class(x) = sign(\mathbf{w}\cdot\mathbf{x} + b)\]

In [312]:
plt.rcdefaults()
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-.5,.5], [1.5,-.5], 'k-')

plt.plot([-1.5,-0.5],[1,-1.0],'k--', color ='b')
plt.plot([0.5,1.5],[2,0.0],'k--', color ='r')
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b  = 0$", xy=(-0.6, 1.8), fontsize=16)
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b  = 1$", xy=(0.8, 1.8), fontsize=16)
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b  = -1$", xy=(-1.7, 1.5), fontsize=16)
Out[312]:
<matplotlib.text.Annotation at 0x1254c7810>

By using geometry, we find the distance between these two hyperplanes is \(\tfrac{2}{\|w\|}\), so we want to minimize \(\|w\|\) As we also have to prevent data points from falling into the margin, we add the following constraint:

\[y_i(w \cdot x_i - b) \ge 1, \text{ for all } 1 \le i \le n.\]

This can be re-written as:

Problem Definition

\[\arg\min_{(w,b)}\frac{1}{2}\|w\|^2\]

subject to (for any \(i = 1, \dots, n\)):

\[y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1.\]

To understand how SVM incorporates a margin into its decision boundary, it helps to re-write our objective function in terms of the norm (length) of the \(w\)

\[\min_{w} \frac{1}{2} \sum_{j=1}^k w^2_j = \min_{w} \frac{1}{2} ||w||^2\]

When we take the inner product of two vectors, we are essentially projecting the values of one vector onto the other, in order to add them. In the case of our inner product \(w\prime x_i\), we are projecting the ith component of \(x\) onto the vector \(w\). We can therefore re-write this inner product in terms of multiplying vector lengths:

\[w \cdot x_i = p_i ||w||\]

where \(p_i\) is the projection of \(x_i\) onto \(w\). The objective function now becomes:

\[\min_{w} \frac{1}{2} ||w||^2\] \[\begin{aligned} \text{subject to }p_i ||w|| \ge 1 &\text{ if } y_i=1 \\ p_i ||w|| \le -1 &\text{ if } y_i=0 \end{aligned}\]

In order to see what the \(p_i\) values will be, we drop perpendicular lines down to the parameter vector \(w\). Notice that for this decision boundary, the resulting \(p_i\) are quite small (either positive or negative). In order to satisfy our constraint, this will force \(||w||\) to be large, which is not desirable given our objective.

We can confirm this in terms of our objective function, by showing the corresponding projections \(p_i\) to be large, which allows our parameter vector norm to be smaller.

Thus, the values of \(\{p_i\}\) define a margin that we are attempting to maximize to aid robust classificaction.

Soft Margin

If there exists no hyperplane that can split the "yes" and "no" examples, the ''Soft Margin'' method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces non-negative slack variables, \(\xi_i\), which measure the degree of misclassification of the data \(x_i\) \[y_i(w \cdot x_i - b) \ge 1 - \xi_i \]

The objective function is then increased by a function which penalizes non-zero \(\xi_i\), and the optimization becomes a trade off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes: \[\arg\min_{\mathbf{w},\mathbf{\xi}, b } \left\{\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \right\}\]

subject to (for any \(i=1,\dots n\)): \[y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1 - \xi_i, ~~~~\xi_i \ge 0\]

In [313]:
plt.rcdefaults()
remove_axis(plt)
points_blue = np.array([[-0.9,-0.9,-1.5,-1.5],[0.5,0.8,0.5,0]])
points_red = np.array([[0.9,1.6,1.5,1.5],[0.5,0.8,0.5,0]])

plt.scatter(*x1, s=300, marker='+')
plt.scatter(*points_blue, s=300, marker='+')
plt.scatter(*points_red, s = 300, marker = '+', color = 'r')

plt.scatter(*x2, s=300, marker='+', color='r')

plt.plot([-1,1], [2.5,-1.5], 'k-')
#plt.plot([1,0.5], [-1,-.5], 'k-')
plt.plot([-2.0,-0.5],[2.0,-1.0],'k--', color ='b')
plt.plot([0.5,2],[2,-1],'k--', color ='r')
plt.axis('tight')
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b  = 0$", xy=(-0.6, 1.8), fontsize=16)
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b  = 1$", xy=(0.8, 1.8), fontsize=16)
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b  = -1$", xy=(-1.7, 1.5), fontsize=16)

#plt.xlim(-2,2); plt.ylim(-1,2)
Out[313]:
<matplotlib.text.Annotation at 0x12576f990>

This choice of \(C\) involves a bias-variance tradeoff:

Enough Math! - SVM in Python with Scikit-Learn

Scikit-Learns

SVM Implementation of Scikit Learn

Data Representation

Everthing is a numpy array:

In [40]:
from  sklearn.datasets import load_digits 
digits = load_digits()
In [41]:
print("images shape %s" % str(digits.images.shape))
print("labels shape %s" % str(digits.target.shape))
print("target names %s" % str(digits.target_names))
images shape (1797, 8, 8)
labels shape (1797,)
target names [0 1 2 3 4 5 6 7 8 9]

In [42]:
plt.matshow(digits.images[0], cmap=plt.cm.Greys)
Out[42]:
<matplotlib.image.AxesImage at 0x10a013110>

Let's use a SVM to Classify the digits

Preprocessing

We need to reshape the data 8x8 images to 1x64 vectors

In [43]:
X = digits.data.reshape(-1,64)
print(X.shape)
y = digits.target
print(y.shape)
(1797, 64)
(1797,)

Split data into training and testing set

In [48]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)

print("X_train shape %s" % repr(X_train.shape))
print("y_train shape %s" % repr(y_train.shape))
print("X_test shape %s" % repr(X_test.shape))
print("y_test shape %s" % repr(y_test.shape))
X_train shape (1347, 64)
y_train shape (1347,)
X_test shape (450, 64)
y_test shape (450,)

Let's use the SVM (Finally)

Import the model and instantiate it (will explain the the parameter kernel later)

In [45]:
from sklearn.svm import SVC
svm = SVC(kernel='linear')

Train the model (fit in Scikit-learn lingo) using the known labels:

In [49]:
svm.fit(X_train, y_train)
Out[49]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Apply the model - that is predict the label of testing data

In [53]:
plt.matshow(X_test[0,:].reshape(8,8),cmap=plt.cm.Greys)
Out[53]:
<matplotlib.image.AxesImage at 0x103b74110>
In [54]:
svm.predict(X_test[0,:])
Out[54]:
array([2])

Evaluate the model on the training and testing data

In [55]:
svm.score(X_train,y_train)
Out[55]:
1.0
In [56]:
svm.score(X_test,y_test)
Out[56]:
0.97111111111111115

What is happening ?

In this example extracted from Scikit learn samples we can see how the SVM finds and crete the separating surface:

In [374]:
from sklearn import svm
plt.xkcd()
# Source:  http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html#example-svm-plot-separating-hyperplane-py

remove_axis(plt)
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
Y = ['r'] * 20 + ['b'] * 20

# fit the model
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)

# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]

# plot the parallels to the separating hyperplane that pass through the
# support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

# plot the line, the points, and the nearest vectors to the plane
plt.plot(xx, yy, 'k-')
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')

plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
           s=80, facecolors='none')
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)

plt.axis('tight')
plt.show()

Kernels

So far we have worked with lineary separable data. What happnens if the data cannot be linearly separable?

In [404]:
g1 = np.random.multivariate_normal((0,0), np.eye(2) * np.array([0.8,1.2]), 100)
g2 = np.random.multivariate_normal((0,0), np.eye(2)*0.1, 100)


X  = np.vstack((g1,g2))
y  = np.array([1] * 100 + [0] * 100)
remove_axis(plt)

plt.scatter(*X.T, c=y, cmap=plt.cm.Paired )
#plt.scatter(*g2.T, color=[1]*100, cmap=plt.cm.Paired)

plt.axis('tight')
plt.xlim(-2,2); plt.ylim(-2,2)
plt.annotate("So...what now?",(1,2))
Out[404]:
<matplotlib.text.Annotation at 0x12af78c10>

Dual Form

Writing the classification rule in its unconstrained dual problem|dual form reveals that the classification task is only a function of the support vectors, the subset of the training data that lie on the margin.

Using the fact that \(\|\mathbf{w}\|^2 = \mathbf{w}\cdot \mathbf{w}\) and substituting \(\mathbf{w} = \sum_{i=1}^n{\alpha_i y_i\mathbf{x_i}}\), one can show that the dual of the SVM reduces to the following optimization problem:

Maximize (in \(\alpha_i\) ): \[\tilde{L}(\mathbf{\alpha})=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)\]

subject to (for any \(i = 1, \dots, n\)):

\[\alpha_i \geq 0,\, \]

and to the constraint from the minimization in \(b\)

\[ \sum_{i=1}^n \alpha_i y_i = 0.\]

Here the kernel is defined by \(k(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{x}_i\cdot\mathbf{x}_j\).

\(W\) can be computed thanks to the \(\alpha\) terms:

\[\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i.\]

In [407]:
from IPython.display import YouTubeVideo
YouTubeVideo('3liCbRZPrZA')
Out[407]:

Kernels - Usage

Let's see how using can help us solving this problem

In [409]:
for fig_num, kernel in enumerate(('linear', 'rbf')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)
    
    
    plt.figure(fig_num)
    plt.clf()
    
    remove_axis(plt)
    # Circle out the test data
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10,  cmap=plt.cm.Paired)
    
    #plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
               levels=[-.5, 0, .5])

    plt.title(kernel)

Multiclass Classification

Example with One vs One the number of classifiers is = \(n * (n - 1) / 2\)

In [416]:
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC()
clf.fit(X, Y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
dec = clf.decision_function([[1]]) # The SVC method decision_function gives per-class scores for each sample (or a single score per sample in the binary case).
print("Number of classifiers: %s" % str(dec.shape[1]))
Number of classifiers: 6

Unbalanced Datasets

Many times in real life datasets we encounter ourselves with unbalanced datasets. Luckly this SVM implementation can handle this situation using a different weight the parameter \(C\) of SVM formulation.

In [433]:
g1 = np.random.multivariate_normal((0,0), np.eye(2) * 0.1, 100)
g2 = np.random.multivariate_normal((0.7,0.5), np.eye(2) * 0.1, 500)

X  = np.vstack((g1,g2))
y  = np.array([1] * 100 + [0] * 500)
remove_axis(plt)

plt.scatter(*X.T, c=y, cmap=plt.cm.Paired )

clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-1, 1)
yy = a * xx - clf.intercept_[0] / w[1]

wclf = svm.SVC(kernel='linear', class_weight='auto')
wclf.fit(X, y)
ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

h0 = pl.plot(xx, yy, 'k-', label='no weights')
h1 = pl.plot(xx, wyy, 'k--', label='with weights')
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.Paired)
pl.legend()

pl.axis('tight')
pl.show()