This notebook try to give a quick intro to SVM and how to use it in particular with Scikit-learn. It is based and adapted from different sources, in particular from the Notebook on SVM of the course on Advanced Statistical Computing at Vanderbilt University's Department of Biostatistic (Bios366). Please see the reference section for the complete list of sources.
Let's first load what we need to run
import numpy as np
import pandas as pd
import pylab as plt
pylab.rcParams['figure.figsize'] = (8.0,5.0)
plt.xkcd()
def remove_axis(plot):
ax = plot.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')
Let's assume we have the follwing data which is linearly separable
g1 = np.random.multivariate_normal((-1,-1), np.eye(2)*0.2, 10)
g2 = np.random.multivariate_normal((1,1), np.eye(2)*0.2, 10)
remove_axis(plt)
plt.scatter(*g1.T, color='r')
plt.scatter(*g2.T, color='b')
plt.xlim(-2,2); plt.ylim(-2,2)
We want so separte the with an hyperplane (line in 2D). There are many possibilities:
One possible separation is a line that passes very close to points in both groups. We could also place some lines by graphic intuition.
plt.xkcd()
x,y = np.transpose([g1[np.where(g1.T[1]==g1.max(0)[1])[0][0]],
g2[np.where(g2.T[1]==g2.min(0)[1])[0][0]]])
remove_axis(plt)
plt.scatter(*g1.T, color='r')
plt.scatter(*g2.T, color='b')
x,y = np.transpose([g1[np.where(g1.T[1]==g1.max(0)[1])[0][0]],
g2[np.where(g2.T[1]==g2.min(0)[1])[0][0]]])
b0,b1 = np.linalg.lstsq(np.c_[[1,1],x], y)[0]
xspace = np.linspace(-3,3,100)
plt.plot(xspace, b0 + 0.05 + (b1-.2)*xspace, 'k--')
y = g2[np.where(g2.T[1]==g2.min(0)[1])[0][0]]
plt.plot([0,0],[-2,2] , 'k--')
plt.plot([-2,2],[2,-2] , 'k--')
plt.xlim(-2,2); plt.ylim(-2,2)
The one based on LS seems like a poor choice of decision boundary, even though it separates the groups, because it may not be a robust solution. The others seem better, but which one should pick? Intuitively, a wide margin around the dividing line seems to make this more likely. Let's look at an example:
frame = plt.gca()
frame.axes.yaxis.set_ticklabels([])
frame.axes.xaxis.set_ticklabels([])
x1 = -1, 0
x2 = 1, 1
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-1.5, 1.5],[0,1], 'k-')
SVM avoids this by establishing a margin between the decision boundary and the nearest points in each group. This margin is maximized under SVM.
We look for a vector \(w\) is perpendicular to this decision boundary:
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-1.5, 1.5],[0,1], 'k-')
plt.plot([-.5, .75], [1, 0], 'k--')
plt.annotate(r"w", xy=(-0.4, 1), fontsize=20)
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-1.5, 1.5],[0,1], 'k-')
plt.plot([-.5, .75], [1, 0], 'k--')
plt.plot([0.1,-0.05],[0.52,0.64], linewidth=4, color='r')
plt.plot([0.1,0.28],[0.52,0.4],linewidth=4)
plt.annotate(r"w", xy=(-0.4, 1), fontsize=20)
plt.arrow(-1, 0, 3*(.35), .35, fc="b", ec="b", head_width=0.07, head_length=0.2)
plt.arrow(1, 1, -3*(.28), -.28, fc="r", ec="r", head_width=0.07, head_length=0.2)
Now consider another decision boundary, which intuitively appears to be a better choice.
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-.5,.5], [1.5,-.5], 'k-')
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-.5,.5], [1.5,-.5], 'k-')
plt.plot([-1, 1], [-.5, 1.7], 'k--')
plt.annotate(r"w", xy=(0.6, 1.5), fontsize=20)
plt.plot([0.7,-0.05],[1.4,0.55], linewidth=4, color='r')
plt.plot([-0.8,-0.05],[-0.3,0.55], linewidth=4, color='b')
#plt.axis('tight')
plt.arrow(-1, 0, .1, -.2, fc="b", ec="b", head_width=0.07, head_length=0.1)
plt.arrow(1, 1, -.2, .37, fc="r", ec="r", head_width=0.07, head_length=0.1)
The hyperplane that divides the training set \[ \mathbf{w}\cdot\mathbf{x} + b = 0\]
The classifier becomes:
\[ class(x) = sign(\mathbf{w}\cdot\mathbf{x} + b)\]
plt.rcdefaults()
remove_axis(plt)
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-.5,.5], [1.5,-.5], 'k-')
plt.plot([-1.5,-0.5],[1,-1.0],'k--', color ='b')
plt.plot([0.5,1.5],[2,0.0],'k--', color ='r')
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b = 0$", xy=(-0.6, 1.8), fontsize=16)
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b = 1$", xy=(0.8, 1.8), fontsize=16)
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b = -1$", xy=(-1.7, 1.5), fontsize=16)
By using geometry, we find the distance between these two hyperplanes is \(\tfrac{2}{\|w\|}\), so we want to minimize \(\|w\|\) As we also have to prevent data points from falling into the margin, we add the following constraint:
\[y_i(w \cdot x_i - b) \ge 1, \text{ for all } 1 \le i \le n.\]
This can be re-written as:
\[\arg\min_{(w,b)}\frac{1}{2}\|w\|^2\]
subject to (for any \(i = 1, \dots, n\)):
\[y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1.\]
To understand how SVM incorporates a margin into its decision boundary, it helps to re-write our objective function in terms of the norm (length) of the \(w\)
\[\min_{w} \frac{1}{2} \sum_{j=1}^k w^2_j = \min_{w} \frac{1}{2} ||w||^2\]
When we take the inner product of two vectors, we are essentially projecting the values of one vector onto the other, in order to add them. In the case of our inner product \(w\prime x_i\), we are projecting the ith component of \(x\) onto the vector \(w\). We can therefore re-write this inner product in terms of multiplying vector lengths:
\[w \cdot x_i = p_i ||w||\]
where \(p_i\) is the projection of \(x_i\) onto \(w\). The objective function now becomes:
\[\min_{w} \frac{1}{2} ||w||^2\] \[\begin{aligned} \text{subject to }p_i ||w|| \ge 1 &\text{ if } y_i=1 \\ p_i ||w|| \le -1 &\text{ if } y_i=0 \end{aligned}\]
In order to see what the \(p_i\) values will be, we drop perpendicular lines down to the parameter vector \(w\). Notice that for this decision boundary, the resulting \(p_i\) are quite small (either positive or negative). In order to satisfy our constraint, this will force \(||w||\) to be large, which is not desirable given our objective.
We can confirm this in terms of our objective function, by showing the corresponding projections \(p_i\) to be large, which allows our parameter vector norm to be smaller.
Thus, the values of \(\{p_i\}\) define a margin that we are attempting to maximize to aid robust classificaction.
If there exists no hyperplane that can split the "yes" and "no" examples, the ''Soft Margin'' method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces non-negative slack variables, \(\xi_i\), which measure the degree of misclassification of the data \(x_i\) \[y_i(w \cdot x_i - b) \ge 1 - \xi_i \]
The objective function is then increased by a function which penalizes non-zero \(\xi_i\), and the optimization becomes a trade off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes: \[\arg\min_{\mathbf{w},\mathbf{\xi}, b } \left\{\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \right\}\]
subject to (for any \(i=1,\dots n\)): \[y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1 - \xi_i, ~~~~\xi_i \ge 0\]
plt.rcdefaults()
remove_axis(plt)
points_blue = np.array([[-0.9,-0.9,-1.5,-1.5],[0.5,0.8,0.5,0]])
points_red = np.array([[0.9,1.6,1.5,1.5],[0.5,0.8,0.5,0]])
plt.scatter(*x1, s=300, marker='+')
plt.scatter(*points_blue, s=300, marker='+')
plt.scatter(*points_red, s = 300, marker = '+', color = 'r')
plt.scatter(*x2, s=300, marker='+', color='r')
plt.plot([-1,1], [2.5,-1.5], 'k-')
#plt.plot([1,0.5], [-1,-.5], 'k-')
plt.plot([-2.0,-0.5],[2.0,-1.0],'k--', color ='b')
plt.plot([0.5,2],[2,-1],'k--', color ='r')
plt.axis('tight')
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b = 0$", xy=(-0.6, 1.8), fontsize=16)
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b = 1$", xy=(0.8, 1.8), fontsize=16)
plt.annotate(r"$\mathbf{w}\cdot\mathbf{x} + b = -1$", xy=(-1.7, 1.5), fontsize=16)
#plt.xlim(-2,2); plt.ylim(-1,2)
This choice of \(C\) involves a bias-variance tradeoff:
Everthing is a numpy array:
from sklearn.datasets import load_digits
digits = load_digits()
print("images shape %s" % str(digits.images.shape))
print("labels shape %s" % str(digits.target.shape))
print("target names %s" % str(digits.target_names))
plt.matshow(digits.images[0], cmap=plt.cm.Greys)
We need to reshape the data 8x8 images to 1x64 vectors
X = digits.data.reshape(-1,64)
print(X.shape)
y = digits.target
print(y.shape)
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
print("X_train shape %s" % repr(X_train.shape))
print("y_train shape %s" % repr(y_train.shape))
print("X_test shape %s" % repr(X_test.shape))
print("y_test shape %s" % repr(y_test.shape))
Import the model and instantiate it (will explain the the parameter kernel later)
from sklearn.svm import SVC
svm = SVC(kernel='linear')
Train the model (fit in Scikit-learn lingo) using the known labels:
svm.fit(X_train, y_train)
Apply the model - that is predict the label of testing data
plt.matshow(X_test[0,:].reshape(8,8),cmap=plt.cm.Greys)
svm.predict(X_test[0,:])
Evaluate the model on the training and testing data
svm.score(X_train,y_train)
svm.score(X_test,y_test)
In this example extracted from Scikit learn samples we can see how the SVM finds and crete the separating surface:
from sklearn import svm
plt.xkcd()
# Source: http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html#example-svm-plot-separating-hyperplane-py
remove_axis(plt)
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
Y = ['r'] * 20 + ['b'] * 20
# fit the model
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)
# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]
# plot the parallels to the separating hyperplane that pass through the
# support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])
# plot the line, the points, and the nearest vectors to the plane
plt.plot(xx, yy, 'k-')
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
s=80, facecolors='none')
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
plt.axis('tight')
plt.show()
So far we have worked with lineary separable data. What happnens if the data cannot be linearly separable?
g1 = np.random.multivariate_normal((0,0), np.eye(2) * np.array([0.8,1.2]), 100)
g2 = np.random.multivariate_normal((0,0), np.eye(2)*0.1, 100)
X = np.vstack((g1,g2))
y = np.array([1] * 100 + [0] * 100)
remove_axis(plt)
plt.scatter(*X.T, c=y, cmap=plt.cm.Paired )
#plt.scatter(*g2.T, color=[1]*100, cmap=plt.cm.Paired)
plt.axis('tight')
plt.xlim(-2,2); plt.ylim(-2,2)
plt.annotate("So...what now?",(1,2))
Writing the classification rule in its unconstrained dual problem|dual form reveals that the classification task is only a function of the support vectors, the subset of the training data that lie on the margin.
Using the fact that \(\|\mathbf{w}\|^2 = \mathbf{w}\cdot \mathbf{w}\) and substituting \(\mathbf{w} = \sum_{i=1}^n{\alpha_i y_i\mathbf{x_i}}\), one can show that the dual of the SVM reduces to the following optimization problem:
Maximize (in \(\alpha_i\) ): \[\tilde{L}(\mathbf{\alpha})=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)\]
subject to (for any \(i = 1, \dots, n\)):
\[\alpha_i \geq 0,\, \]
and to the constraint from the minimization in \(b\)
\[ \sum_{i=1}^n \alpha_i y_i = 0.\]
Here the kernel is defined by \(k(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{x}_i\cdot\mathbf{x}_j\).
\(W\) can be computed thanks to the \(\alpha\) terms:
\[\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i.\]
from IPython.display import YouTubeVideo
YouTubeVideo('3liCbRZPrZA')
Let's see how using can help us solving this problem
for fig_num, kernel in enumerate(('linear', 'rbf')):
clf = svm.SVC(kernel=kernel, gamma=10)
clf.fit(X_train, y_train)
plt.figure(fig_num)
plt.clf()
remove_axis(plt)
# Circle out the test data
plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired)
#plt.axis('tight')
x_min = X[:, 0].min()
x_max = X[:, 0].max()
y_min = X[:, 1].min()
y_max = X[:, 1].max()
XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
# Put the result into a color plot
Z = Z.reshape(XX.shape)
plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
levels=[-.5, 0, .5])
plt.title(kernel)
Example with One vs One the number of classifiers is = \(n * (n - 1) / 2\)
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC()
clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
dec = clf.decision_function([[1]]) # The SVC method decision_function gives per-class scores for each sample (or a single score per sample in the binary case).
print("Number of classifiers: %s" % str(dec.shape[1]))
Many times in real life datasets we encounter ourselves with unbalanced datasets. Luckly this SVM implementation can handle this situation using a different weight the parameter \(C\) of SVM formulation.
g1 = np.random.multivariate_normal((0,0), np.eye(2) * 0.1, 100)
g2 = np.random.multivariate_normal((0.7,0.5), np.eye(2) * 0.1, 500)
X = np.vstack((g1,g2))
y = np.array([1] * 100 + [0] * 500)
remove_axis(plt)
plt.scatter(*X.T, c=y, cmap=plt.cm.Paired )
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-1, 1)
yy = a * xx - clf.intercept_[0] / w[1]
wclf = svm.SVC(kernel='linear', class_weight='auto')
wclf.fit(X, y)
ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]
h0 = pl.plot(xx, yy, 'k-', label='no weights')
h1 = pl.plot(xx, wyy, 'k--', label='with weights')
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.Paired)
pl.legend()
pl.axis('tight')
pl.show()
Scikit Learns makes easy to implement a SVM based text classification algorithm:
We are going to use the patologically used 20-News Group dataset: From the official page:
"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. (...) The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."
# Fetch the dataset
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
from pprint import pprint
pprint(list(newsgroups_train.target_names))
Now we are going to load the data and transform it into TF-IDF representation:
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train',categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',categories=categories, remove=('headers', 'footers', 'quotes'))
vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')
print(" Shampe: (Samples x Features): %s" % str( vectors.shape))
print("Average non-zero components by sample: %f " % (vectors.nnz / float(vectors.shape[0] )))
Now we train using cross-validation:
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
X_train = vectorizer.fit_transform(newsgroups_train.data)
y_train = newsgroups_train.target
clf = SVC()
C_range = 10.0 ** np.arange(-2, 9)
gamma_range = 10.0 ** np.arange(-5, 4)
param_grid = dict(C=C_range)
cv = StratifiedKFold(y=y_train, n_folds=10)
grid = GridSearchCV(SVC(kernel='linear', class_weight='auto',probability=False), param_grid=param_grid, cv=cv,verbose=1)
grid.fit(X_train, y_train)
Before going further let's us just revivew the performance metrics we are going to use: if we have the following definitions:
\[ precision = \frac{TP}{TP+FP} \] In other words, from the classified as positive (or in specific class) which of them truly belong to that class.
\[ recall = \frac{TP}{TP + FN} \] Or in other words from all the positive cases (that is belonging to that class) how much of them were actually classified as such.
\[ F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}\]
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.
With this le'ts evaluate the model on the testing data
from sklearn import metrics
X_test = vectorizer.transform(newsgroups_test.data)
y_test = newsgroups_test.target
clf = grid.best_estimator_
print("The score on the test set %f" % clf.score(X_test,y_test))
pred = clf.predict(X_test)
#score = metrics.f1_score(y_test, pred)
# print("confusion matrix:" , sorted(categories))
# print(metrics.confusion_matrix(y_test, pred))
print(metrics.classification_report(y_test, pred,target_names= newsgroups_test.target_names ))
Supervised Learning: Support Vector Machines - Section of Bios366 Course
Coursera's Machine Learning course by Andrew Ng
scikit-learn
User's Guide SVM section
Scikit-learn tutorials for the Scipy 2013 conference by Jake Vanderplas