Machine Learning Basics with the Support Vector Machine Algorithm

Wuraolaifeoluwa · Published in Geek Culture · May 8, 2021 · 8 min read

Image: KDnuggets

Machine learning is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.

The learning process begins with observations or data, such as examples, direct experience, or instruction, so that the system can look for patterns in the data and make better decisions in the future based on the examples we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.

Now let’s break machine learning down into two sub-categories:

Supervised learning is the kind where the learning is guided by a teacher. We have a dataset that acts as the teacher, and its role is to train the computer/machine. Once the machine is trained, it can start making predictions or decisions when completely new data is given to it.

For example, imagine the computer is a student and we are its teacher, and we want the student (the computer) to learn what a mango looks like. We show the student several pictures of different fruits: some mangoes, and other fruits like bananas, apples, etc.

When we see a mango, we identify it as “a mango!”, and when it’s not a mango, we identify it as “no, not a mango!” After doing this several times with the student, we show the student a new picture and ask, “Is this a mango?” The student will answer correctly most of the time, saying “mango!” or “no, not mango!” depending on what the picture shows. This style of learning is supervised machine learning.

Supervised machine learning algorithms are used to solve classification or regression problems.

A classification problem has a discrete value as its output. Take the analogy above of teaching a student to identify a mango: the output is either “mango” or “not mango”, with no middle ground.

Image: randomly generated classification data (a predictor and a 0/1 label)

The image above shows an example of what classification data might look like. We have a predictor (or set of predictors) and a label. In the image, we are trying to predict “mango” (1) or “not mango” (0) based on the student’s identification of fruits (the predictor).
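As a minimal sketch of the same idea in code (the numbers below are made up for illustration and are not the data in the figure), a classifier maps each set of predictors to a discrete 0/1 label:

from sklearn.linear_model import LogisticRegression

# hypothetical predictors (e.g. fruit weight and a colour score) and 0/1 labels
X = [[150, 0.9], [160, 0.8], [120, 0.2], [110, 0.3]]
y = [1, 1, 0, 0]  # 1 = "mango", 0 = "not mango"

clf = LogisticRegression().fit(X, y)
print(clf.predict([[155, 0.85]]))  # a discrete answer: 1 or 0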

A regression problem has a real number (a number with a decimal point) as its output: regression predicts a continuous target variable Y. It allows you to estimate a value, such as a housing price or a human’s weight, based on input data X.

We could use the data in the table below to estimate house prices given the number of rooms.

Image: randomly generated data of number of rooms vs. house price
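As a minimal sketch of this (using made-up numbers in place of the table, which is not reproduced here), a linear regression fit on rooms versus price could look like:

from sklearn.linear_model import LinearRegression

# hypothetical table: number of rooms (X) and house price (y)
rooms = [[1], [2], [3], [4], [5]]
prices = [100_000, 150_000, 200_000, 250_000, 300_000]

reg = LinearRegression().fit(rooms, prices)
print(reg.predict([[6]]))  # an estimated price, a real number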

Unsupervised learning operates on the input data alone, without outputs or labels. Unlike supervised learning, there is no teacher correcting the machine; the model learns through observation and finds structures in the data. Once the model is given a dataset, it automatically finds patterns and relationships by creating clusters in it.

Suppose we present images of apples, bananas, and mangoes to the model. Based on the patterns and relationships it finds, it creates clusters and divides the dataset among them. Now, if new data is fed to the model, it adds it to one of the created clusters.
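A minimal sketch of that idea, assuming we already have numeric features for each image (k-means is just one common clustering algorithm, chosen here for illustration):

from sklearn.cluster import KMeans
import numpy as np

# hypothetical unlabeled features, e.g. (weight, colour score) per fruit image
X = np.array([[150, 0.9], [155, 0.85], [120, 0.2],
              [118, 0.25], [30, 0.6], [32, 0.65]])

model = KMeans(n_clusters=3, n_init=10)  # note: no labels are provided
model.fit(X)
print(model.labels_)                 # the cluster assigned to each point
print(model.predict([[152, 0.88]]))  # new data lands in an existing cluster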

SUPPORT VECTOR MACHINE

Support Vector Machines (SVMs) are supervised learning models for classification and regression problems, as support vector classification (SVC) and support vector regression (SVR) respectively. They are typically used on smaller datasets, because training takes too long on large ones. They can solve linear and nonlinear problems and use the concept of a margin to classify data points.
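Both flavours are available directly in scikit-learn; a tiny sketch of the two class names (standard scikit-learn, not code from a specific example):

from sklearn.svm import SVC, SVR

clf = SVC()  # support vector classification: predicts discrete labels
reg = SVR()  # support vector regression: predicts continuous values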

How do SVMs work?

The SVM algorithm is based on the idea of finding a hyperplane that distinctly separates the data points into different groups. In SVM, we plot each data point in the dataset in N-dimensional space (N being the number of features). To separate the data points into two groups, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between the data points of both groups. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.

Image: Packt Subscription
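To make the maximum-margin idea concrete, here is a minimal sketch on toy 2D data of my own (not from the article). For a linear SVM the margin width works out to 2/||w||, which we can compute from the fitted coefficients:

import numpy as np
from sklearn import svm

# two small, linearly separable groups in 2D (made-up points)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [6, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel="linear", C=1000)  # a large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]
print(2 / np.linalg.norm(w))  # the margin width the algorithm has maximized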

Let’s explore some of the terminology used in SVM.

Hyperplanes

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different groups. The dimension of the hyperplane depends on the number of features: if the number of input features is 2, the hyperplane is just a line; if the number of input features is 3, the hyperplane becomes a two-dimensional plane. When the number of features exceeds 3, it becomes difficult to imagine the hyperplane. See the image below to understand this concept.

Image: Towards Data Science
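Continuing with the toy points from the earlier sketch, with two input features the fitted hyperplane is just a line, and we can read its equation off the trained model (coef_ and intercept_ are standard scikit-learn attributes):

import numpy as np
from sklearn import svm

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [6, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = svm.SVC(kernel="linear").fit(X, y)

# with 2 input features, the decision boundary is the line w1*x1 + w2*x2 + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print(f"{w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")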

What are Support Vectors?

Support vectors are the data points that lie on or closest to the hyperplane and determine its position. Using these support vectors, we maximize the margin; deleting a support vector will change the position of the hyperplane. The support vectors are equidistant from the hyperplane and help in structuring the SVM. They are called support vectors because they support the hyperplane: if their position shifts, the hyperplane shifts as well.
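Scikit-learn exposes the support vectors of a fitted model directly, so we can inspect which points ended up defining the boundary (same toy data as above):

import numpy as np
from sklearn import svm

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [6, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = svm.SVC(kernel="linear").fit(X, y)

print(clf.support_vectors_)  # the points lying closest to the hyperplane
print(clf.support_)          # their indices in the training data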

Going further, let’s work through an example to understand the SVM ideology:

For example, suppose your boss receives a ton of messages on his work email and asks you to differentiate the work from the non-work messages. You want to design a function (a hyperplane) that differentiates the two cases, such that whenever a non-work email is received it is classified as spam, and whenever a work email is received it is classified as not spam.

Now, we will find a line that splits the data into spam and not spam. This will be the line such that the distance to the closest point in each of the two groups is as large as possible. Look at the plot below and pick the line that best separates the two classes (blue signifies spam and red signifies not spam).

Image: AnalyticsVidhya.com

If you picked line C, your intuition was correct. Above, you can see that the margin for hyperplane C has the maximum distance between the data points of both classes compared to both A and B. Hence, C is the right hyperplane.

The above illustration is a linearly separable case, where SVM tries to find the hyperplane that maximizes the margin with the condition that both classes are classified correctly. But in reality, datasets are almost never linearly separable, so the condition of 100% correct classification by a hyperplane will likely never be met.

Now let’s consider a non-linearly separable case of spam and not spam messages in your boss’s email.

Image: AnalyticsVidhya.com

In the plot above, we are unable to segregate the two classes of spam and not spam using a straight line, as one of the blue (“spam”) points lies in the territory of the “not spam” (red) class as an outlier. The SVM algorithm has the ability to ignore outliers and still find the hyperplane with the maximum margin. Therefore, we can say that SVM classification is robust to outliers, as the sketch below illustrates.
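That tolerance is controlled by the C parameter of scikit-learn’s SVC (a standard knob, not something specific to this article): a small C widens the margin and lets the model shrug off the stray point, while a large C pushes it to classify every training point correctly. A minimal sketch:

import numpy as np
from sklearn import svm

# toy data with one "spam" point sitting inside the "not spam" region
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [6, 6], [6.5, 6.5]])
y = np.array([0, 0, 0, 1, 1, 1, 0])  # the last point is an outlier

soft = svm.SVC(kernel="linear", C=0.1).fit(X, y)   # tolerant: ignores the outlier
hard = svm.SVC(kernel="linear", C=1000).fit(X, y)  # strict: distorts the margin
print(soft.score(X, y), hard.score(X, y))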

To find the hyperplane in a non-linearly separable case like the one above, the SVM algorithm uses a technique called the kernel trick. The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable one. It is mostly useful in non-linear separation problems. In effect, it performs some complex data transformations, then works out how to differentiate the data based on the labels or outputs you’ve defined.

When we look at the hyperplane in the original input space, it looks like a circle:

Image: AnalyticsVidhya.com

Implementing SVM using Python

Use the code below to implement SVM using the scikit-learn library:

# importing scikit-learn and other important libraries
from sklearn.datasets import make_circles
from sklearn import svm
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# generate a dataset of two concentric circles (not linearly separable)
X, Y = make_circles(n_samples=500, noise=0.02)
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.show()

def phi(X):
    """Non-linear transformation: add a third feature X3 = X1^2 + X2^2."""
    X1 = X[:, 0]
    X2 = X[:, 1]
    X3 = X1**2 + X2**2
    X_ = np.zeros((X.shape[0], 3))
    print(X_.shape)
    X_[:, :-1] = X
    X_[:, -1] = X3
    return X_

def plot3d(X, show=True):
    """Plot the lifted 3D points, coloured by their class label Y."""
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection='3d')
    X1 = X[:, 0]
    X2 = X[:, 1]
    X3 = X[:, 2]
    ax.scatter(X1, X2, X3, zdir='z', s=20, c=Y, depthshade=True)
    if show:
        plt.show()
    return ax

X_ = phi(X)      # lift the 2D points into 3D, where they become separable
ax = plot3d(X_)

# using the rbf kernel function to apply the kernel trick
svc = svm.SVC(kernel="rbf")
svc.fit(X, Y)
svc.score(X, Y)
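As a quick sanity check (my own addition, continuing from the snippet above), comparing a linear kernel on the same circles shows why the kernel trick matters here:

# a linear kernel cannot separate the concentric circles,
# while the rbf kernel handles them easily
linear_svc = svm.SVC(kernel="linear")
linear_svc.fit(X, Y)
print(linear_svc.score(X, Y))  # typically near 0.5, i.e. chance level
print(svc.score(X, Y))         # close to 1.0 with the rbf kernel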

Advantages and Disadvantages of SVM

Advantages:

  • It works really well with a clear margin of separation.
  • It is effective in high dimensional spaces.
  • It is effective in cases where the number of dimensions is greater than the number of samples.
  • It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

Disadvantages:

  • It doesn’t perform well when we have a large dataset, because the required training time is higher.
  • It also doesn’t perform very well when the dataset has more noise, i.e. when the target classes overlap.
  • SVM doesn’t directly provide probability estimates; these are calculated using an expensive five-fold cross-validation, available through the related SVC class of the Python scikit-learn library (see the sketch after this list).
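For completeness, a minimal sketch of how those probability estimates are requested in scikit-learn (probability=True is what triggers the internal cross-validation mentioned above):

from sklearn import svm
from sklearn.datasets import make_circles

X, Y = make_circles(n_samples=500, noise=0.02)
clf = svm.SVC(kernel="rbf", probability=True)  # enables Platt scaling via 5-fold CV
clf.fit(X, Y)
print(clf.predict_proba(X[:3]))  # class probabilities instead of hard labels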

In this article, we have familiarised ourselves with the basics of Machine Learning and the SVM Algorithm. I hope you found this helpful. Thanks for reading!!

Kindly connect with me on LinkedIn and Twitter.

Resources

https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/

https://www.edureka.co/blog/what-is-machine-learning/

https://www.expert.ai/blog/machine-learning-definition/
