Support Vector Machine
A support vector machine (SVM) is a supervised machine learning algorithm that finds the optimal hyperplane separating data points of different classes by maximising the margin between the boundary and the nearest training examples.
A support vector machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. Introduced by Vladimir Vapnik and colleagues at AT&T Bell Laboratories in the early 1990s, SVMs became one of the most widely applied machine learning algorithms in the decade preceding the deep learning era, and they remain practically relevant today for tabular data problems, text classification, and scenarios where training data is limited.[^1] The core principle of an SVM is to identify the decision boundary, a hyperplane in the feature space, that separates data points of different classes with the largest possible margin between the boundary and the closest training examples from each class.
Geometric Intuition
To understand SVMs, it is helpful to begin with a two-dimensional example. Consider a set of data points in a plane, each coloured either red or blue, and the task of drawing a line that separates all red points from all blue points. Many such lines may exist. An SVM selects the line whose distance to the nearest point of either class is as large as possible; that is, the line that maximises the margin between the two classes. The data points lying exactly on the margin boundaries are called support vectors, and they alone determine the position of the decision boundary: removing any other training point, or moving it without letting it cross into the margin, leaves the boundary unchanged.[^2]
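The idea of comparing candidate lines by their margin can be sketched in a few lines of Python. The toy points, the labels (+1 for blue, -1 for red) and the helper name `geometric_margin` are illustrative assumptions, not part of any library. The sketch only scores two hand-picked lines; it does not search for the best one.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the line w.x + b = 0.

    A positive result means every point lies on its correct side of the line.
    """
    norm = math.hypot(*w)
    return min(
        y * (w[0] * x[0] + w[1] * x[1] + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: two blue points (+1) and two red points (-1).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate separating lines, both of which classify every point correctly.
margin_a = geometric_margin((1.0, 1.0), -2.0, points, labels)  # line x1 + x2 = 2
margin_b = geometric_margin((1.0, 0.0), -1.5, points, labels)  # line x1 = 1.5

print(f"margin of line A: {margin_a:.3f}")
print(f"margin of line B: {margin_b:.3f}")
```

Both lines separate the classes, but line A keeps every point further away. In this toy set, line A is the perpendicular bisector of the closest opposite-class pair, (2, 2) and (0, 0), so those two points play the role of support vectors.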
In a higher-dimensional feature space (with, say, ten features rather than two) the decision boundary is a hyperplane: a flat surface whose dimension is one less than the number of features. The maximum-margin hyperplane is found by solving a constrained quadratic optimisation problem. The mathematical formulation leads to a dual problem in which the solution depends only on dot products between pairs of training examples, a property that becomes critical when the kernel trick is applied.
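For reference, the optimisation problem mentioned above is usually written as follows for the separable (hard-margin) case. The notation is an assumption introduced here rather than taken from the text: training examples $x_i$ with labels $y_i \in \{-1, +1\}$, weight vector $w$, offset $b$, and dual variables $\alpha_i$.

```latex
% Primal problem: maximising the margin 1/||w|| is equivalent to minimising ||w||^2 / 2.
\min_{w,\,b} \; \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n

% Dual problem: the training examples enter only through dot products x_i . x_j.
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0

% Recovering the hyperplane: only examples with alpha_i > 0 (the support vectors) contribute.
w = \sum_{i=1}^{n} \alpha_i \, y_i \, x_i
```

Because the dual objective touches the data only through $x_i \cdot x_j$, any function that behaves like a dot product in some feature space can be substituted for it.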
The Kernel Trick
Many real-world datasets are not linearly separable — no hyperplane can cleanly divide the two classes in the original feature space. The kernel trick addresses this by implicitly projecting data into a higher-dimensional space where linear separation becomes possible, without explicitly computing the coordinates in that higher space. This is achieved by replacing the dot products in the SVM optimisation with a kernel function that computes similarity between pairs of points.[^3]
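A small numerical check may make the "without explicitly computing the coordinates" claim concrete. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D points as an illustrative case; the feature map `phi` and the sample points are assumptions chosen for the example.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map taking a 2-D point to 3-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: (u . v)^2."""
    return dot(u, v) ** 2

u, v = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(u), phi(v))  # map to 3-D first, then take the dot product
implicit = poly2_kernel(u, v)   # stay in 2-D and apply the kernel directly

print(explicit, implicit)
```

Up to floating-point rounding, the two values agree. The kernel returns the same dot product the explicit map would give while working only with the original 2-D coordinates. For kernels such as the RBF the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.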
Common kernel functions include the Radial Basis Function (RBF, also known as the Gaussian kernel), the polynomial kernel, and the sigmoid kernel. The RBF kernel is the most widely used default because it introduces a notion of local similarity (points that are nearby in the original feature space have high kernel values) and produces smooth, non-linear decision boundaries. By selecting an appropriate kernel, SVMs can model highly complex class boundaries while still solving a maximum-margin problem, now posed in the feature space induced by the kernel.
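The local-similarity behaviour of the RBF kernel can be illustrated directly from its usual definition, exp(-gamma * ||u - v||^2). The parameter name `gamma`, its value of 0.5 and the sample points are assumptions made for this sketch.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

anchor = (0.0, 0.0)

same = rbf_kernel(anchor, (0.0, 0.0))  # identical points
near = rbf_kernel(anchor, (0.1, 0.0))  # a close neighbour
far = rbf_kernel(anchor, (3.0, 4.0))   # a point at distance 5

print(f"same={same:.4f}  near={near:.4f}  far={far:.2e}")
```

Identical points score exactly 1, a close neighbour scores just under 1, and a distant point scores close to 0. Larger values of `gamma` make the similarity fall off faster with distance, which tends to produce more tightly curved decision boundaries.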
Soft Margin and Regularisation
The formulations described above assume that perfect separation is possible. In practice, real datasets contain noise and overlapping classes, so perfect separation is often impossible and, where it is possible, insisting on it tends to cause overfitting. The soft-margin SVM introduces a regularisation parameter C that controls the trade-off between maximising the margin and minimising training errors. A small value of C allows more training examples to fall inside the margin or on the wrong side of the boundary (a wider, more tolerant margin), reducing overfitting at the cost of some training accuracy. A large value of C penalises such violations heavily, producing a narrower margin that fits the training data more tightly. The value of C is typically selected via cross-validation.[^4]
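The trade-off controlled by C can be seen by scoring two candidate boundaries with the usual soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses). The sketch below is a minimal illustration under stated assumptions: the 1-D toy data, the two hand-picked candidates and the function name are hypothetical, and the code compares candidates rather than solving the optimisation problem.

```python
def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * sum over examples of max(0, 1 - y * (w.x + b))."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    hinge_total = sum(
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in zip(points, labels)
    )
    return regulariser + C * hinge_total

# Hypothetical 1-D data: the positive example at x = -1 sits close to the negatives.
points = [(-3.0,), (-2.0,), (-1.0,), (2.0,), (3.0,)]
labels = [-1, -1, 1, 1, 1]

wide = ((0.5,), 0.0)   # boundary at x = 0, margin 2, misclassifies the point at -1
tight = ((2.0,), 3.0)  # boundary at x = -1.5, margin 0.5, classifies every point

for C in (0.1, 10.0):
    wide_cost = soft_margin_objective(*wide, points, labels, C)
    tight_cost = soft_margin_objective(*tight, points, labels, C)
    preferred = "wide" if wide_cost < tight_cost else "tight"
    print(f"C={C}: wide={wide_cost:.3f} tight={tight_cost:.3f} -> {preferred}")
```

With the small C the wide boundary has the lower objective even though it misclassifies one point. With the large C the penalty for that violation dominates and the tight boundary is preferred.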
SVM for Regression
The standard SVM was designed for binary classification, but its principles extend to regression tasks under the name Support Vector Regression (SVR). In SVR, the goal is to find a function whose predictions stay within a specified tolerance, conventionally written ε (epsilon), of the training targets. Data points whose errors exceed the tolerance contribute to the loss in proportion to the excess; those within it contribute nothing. This produces regression models that are insensitive to small errors, potentially offering better generalisation than least-squares regression in noisy settings.
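The band described above corresponds to what is commonly called the epsilon-insensitive loss. A minimal sketch follows, assuming a tolerance of 0.5 and a handful of made-up targets and predictions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Zero inside the tolerance band, growing linearly with the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical targets and predictions with absolute errors of 0.25, 0.5 and 1.5.
targets = [3.0, 1.0, 5.0]
predictions = [2.75, 1.5, 3.5]

losses = [epsilon_insensitive_loss(t, p) for t, p in zip(targets, predictions)]
squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]

print("epsilon-insensitive:", losses)
print("squared error:      ", squared)
```

The first two predictions fall inside or on the edge of the band and incur no loss, whereas squared error penalises every non-zero residual. The third prediction is penalised only for the part of its error that exceeds the tolerance.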
SVMs vs. Deep Learning
SVMs were a dominant algorithm for many classification tasks from the late 1990s through the 2000s, particularly for text categorisation, image recognition, and bioinformatics. With the rise of deep learning from 2012 onward, neural networks superseded SVMs on large-scale perceptual tasks involving raw images, audio, and text. However, SVMs retain advantages in settings with small or medium-sized tabular datasets, where deep learning models tend to overfit. SVMs have also proven valuable in high-stakes domains such as medical diagnosis, where margin maximisation and the kernel trick provide a theoretically grounded way to handle feature spaces with thousands of measured variables but only hundreds of patient samples.
| Dimension | SVM | Deep Learning |
|---|---|---|
| Data requirements | Works well with limited data | Typically requires large datasets |
| Interpretability | Moderate (support vectors identifiable) | Low (black box) |
| Scalability | Slow on very large datasets | Scales well with data and compute |
| Feature engineering | Often requires manual features | Learns features automatically |
| Best use case | Tabular data, small datasets | Images, audio, text at scale |
References
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
- Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 121–167.
- Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
- Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2016). A Practical Guide to Support Vector Classification. National Taiwan University Technical Report.