Classification and Regularization Using Linear Models in Machine Learning

Salman Ibne Eunus
Published in CodeX
Jun 17, 2021 · 4 min read


In this blog, we will talk about how to apply linear models to classification problems in machine learning. First, we will cover binary classification, and then we will move on to multi-class classification.

The mathematical formula used to make a prediction in binary classification is given below:

ŷ = x[0] * z[0] + x[1] * z[1] + … + x[p] * z[p] + b > 0

The formula is quite similar to the one used in linear regression, but instead of simply returning the weighted sum of the features, the predicted value is thresholded at zero. If the weighted sum is less than zero, the class is predicted as -1; if it is greater than zero, the class is predicted as +1. This prediction rule is common to all linear models for classification.
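
To make the rule concrete, here is a tiny NumPy sketch of the thresholded prediction; the coefficient, feature, and intercept values are made up purely for illustration.

import numpy as np

# x holds the coefficients, z the input features, b the intercept
# (same notation as the formula above; the numbers are invented)
x = np.array([0.5, -1.2, 0.8])   # coefficients x[0..p]
z = np.array([1.0, 0.3, 2.0])    # features z[0..p]
b = -0.1                         # intercept

score = np.dot(x, z) + b         # weighted sum of the features
y_hat = 1 if score > 0 else -1   # threshold at zero

print(score, y_hat)              # prints the score and the predicted class (+1 here)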

The decision boundary is a linear function of the input, which means a linear binary classifier separates the two classes using a line, a plane, or a hyperplane. Algorithms for learning linear models mainly differ in how they measure how well a particular combination of coefficients and intercept fits the training data, and in whether and what kind of regularization they use.

Different algorithms use different ways of measuring how well the model fits the training set. For mathematical reasons, it is not possible to adjust the coefficients x and the intercept b so as to directly minimize the number of misclassifications the algorithm produces. For many applications, the particular choice of this measure (called the loss function) is of little practical importance.

The two most commonly used linear classification algorithms are logistic regression and linear support vector machines. Despite its name, logistic regression is a classification algorithm, not a regression algorithm.

[Figure: decision boundary of a linear SVM]
[Figure: decision boundary of a logistic regression model]

Any new data point that lies above the black line will be classified as class 1, while any point that lies below the line will be classified as class 0.

Now let us discuss multi-class classification using linear models.

A common technique to extend a binary classification algorithm to a multi-class classification algorithm is the one-vs.-rest approach. In this approach, a binary model is learned for each class that tries to separate that class from all other classes, resulting in as many binary models as there are classes. To make a prediction, all binary classifiers are run on a test point. The classifier with the highest score on its single class wins, and this class label is returned as the prediction.

Having one binary classifier per class results in one vector of coefficients (x) and one intercept (b) per class. The class for which the classification confidence formula given below is highest is assigned as the class label:

x[0] * z[0] + x[1] * z[1] + … + x[p] * z[p] + b

The mathematics behind multi-class logistic regression differs somewhat from the one-vs.-rest approach, but it also results in one coefficient vector and one intercept per class, and the same method of making a prediction is applied.

[Figure: decision boundaries of a multi-class linear classifier]
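
As a concrete illustration, the sketch below fits a one-vs.-rest multi-class linear model using scikit-learn's LinearSVC (which applies one-vs.-rest by default) on a made-up three-class dataset; the dataset and its parameters are assumptions for illustration only.

from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# synthetic 3-class, 2-feature dataset, purely illustrative
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

clf = LinearSVC().fit(X, y)   # LinearSVC uses one-vs.-rest by default

# one coefficient vector and one intercept per class
print(clf.coef_.shape)        # (3, 2): 3 classes, 2 features
print(clf.intercept_.shape)   # (3,)

# decision_function returns the per-class confidence scores;
# predict returns the class with the highest score
print(clf.decision_function(X[:1]))
print(clf.predict(X[:1]))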

Take a look at my Jupyter notebook here to see the code implementation of a binary classifier using linear models.
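
In case the notebook is not accessible, here is a minimal, self-contained sketch of a binary linear classifier with scikit-learn; the synthetic dataset and the choice of models are my own illustrative assumptions, not necessarily what the notebook contains.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# two-feature binary dataset, purely illustrative
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit both of the commonly used linear classifiers and compare test accuracy
for model in (LogisticRegression(), LinearSVC()):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test accuracy:", model.score(X_test, y_test))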

As you will see in the above code, the main parameter of linear models is the regularization parameter, called alpha in the regression models and C in linear support vector machines and logistic regression models. Large values of alpha or small values of C correspond to simpler models. Tuning these parameters is quite important, particularly for the regression models. In most cases, C and alpha are searched for on a logarithmic scale. Another important decision is whether to use L1 or L2 regularization. If you believe that only a few of the features are actually important, you should use L1; otherwise, use L2 by default. L1 can also be useful when interpretability of the model matters: since L1 uses only a few features, it is easier to describe which features are significant to the model and what their effects are.
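
To illustrate, the sketch below searches C on a logarithmic scale and compares L1 and L2 penalties with LogisticRegression; the dataset and the grid of C values are only examples, not recommendations.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in [0.01, 0.1, 1, 10, 100]:          # logarithmic grid for C
    for penalty in ("l1", "l2"):
        # the liblinear solver supports both L1 and L2 penalties
        clf = LogisticRegression(C=C, penalty=penalty,
                                 solver="liblinear").fit(X_train, y_train)
        used = np.sum(clf.coef_ != 0)      # L1 drives many coefficients to zero
        print("C =", C, penalty, "test accuracy:",
              round(clf.score(X_test, y_test), 3), "features used:", used)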

Before we end this blog, you should know that linear models are very fast to train and also very fast to predict. They scale to very large datasets and work well with sparse data. If your data consists of hundreds of thousands or millions of samples, you might want to investigate the solver='sag' option in LogisticRegression and Ridge, which can be faster than the default on large datasets. Other options are the SGDClassifier and SGDRegressor classes, which implement even more scalable versions of linear models. Another strength of linear models is that it is relatively easy to understand how a prediction is made, using the formulas we have seen above for classification. Linear models often perform well when the number of features is large compared to the number of samples. They are also often used on very large datasets, simply because it is not feasible to train other models. However, in lower-dimensional spaces, other models might yield better generalization performance.
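
As a rough illustration of these large-dataset options, the sketch below fits LogisticRegression with solver='sag' and an SGDClassifier on a synthetic dataset; the dataset, its size, and the hyperparameters are arbitrary assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

# synthetic dataset standing in for a "large" one, purely illustrative
X, y = make_classification(n_samples=50_000, n_features=30, random_state=0)

# Stochastic Average Gradient solver, often faster on large datasets
log_reg = LogisticRegression(solver="sag", max_iter=1000).fit(X, y)

# SGDClassifier trains a linear model with stochastic gradient descent;
# loss="log_loss" (named "log" in older scikit-learn) gives logistic regression
sgd = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)

print(log_reg.score(X, y), sgd.score(X, y))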
