Five Machine Learning Algorithms Every Data Scientist Must Know

Data Science relies on Machine Learning algorithms to make predictions about future values of key variables from past data. In Machine Learning, a distinction is usually made between Supervised Learning, where the training dataset includes output labels that the model learns from and is evaluated against, and Unsupervised Learning, where these output labels are not available. This post focuses on using Supervised Learning techniques to make predictions about datasets. These predictions can take the form of either classification, where a record/instance is assigned to one of several classes, or regression, where a continuous numerical value (real number) is predicted for that record based on its values for the set of predictor variables. In essence, Machine Learning algorithms learn a general rule that maps the values of the features (predictor variables) to the values of the labels (predicted variables). They then use this rule to predict the label value for new records, for which the feature values are given but the label is unknown.

A large number of predictive Machine Learning algorithms have now been developed, each with its own strengths and weaknesses. The task of choosing the right algorithm to solve a given problem, and tuning its parameters to obtain a result that generalises well while avoiding overfitting, is an art unto itself. To choose the right algorithm, it is necessary to have at least a cursory understanding of the most common algorithms in the Data Science space, so as to have an idea for an initial solution. Here are five of the most frequently applied predictive Machine Learning algorithms in Data Science:

Linear Regression

Graphical representation of the results from an example of simple linear regression

Fig 1: An example of simple linear regression in 2D space. This example uses one feature variable (on the X-axis) to predict values for one output variable (on the Y-axis).

Linear Regression (specifically simple/multiple univariate linear regression) is perhaps the most basic regression-based Machine Learning technique that still finds great utility in Data Science. Simple linear regression refers to the use case where there is only one output variable (label) being predicted from one input variable (feature). In Machine Learning, however, the most common use case is Multiple Univariate Linear Regression, where multiple features are used to predict a value for a single output label. This is a key first tool in Regression-type Machine Learning problems, and Multiple Linear Regression makes a good baseline against which Regression models with better prediction accuracy can be measured.

Univariate Linear Regression, by default, uses the Method of Least Squares to fit the line of best fit, where the objective is to minimise the sum of the squared deviations of the training instances from the line, although this “Cost Function” can be modified depending on the nature of the problem. The “Univariate” term in Multiple Univariate Regression refers to the fact that the output variable is a scalar, i.e. only a single output value is being predicted. This happens to be the most common type of Regression Machine Learning problem, with examples like predicting the price of a house or the value of a stock, where only a single output scalar is being forecast. Multivariate Linear Regression (a case of what is usually called the General Linear Model), however, is the situation where the output variable is a Vector quantity, and this more complex type of regression is generally not encountered in introductory Machine Learning.
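To make the least-squares objective concrete, here is a minimal sketch of simple linear regression in plain Python. It uses the standard closed-form solution for one feature. The function names and the toy house-size/price numbers are made up for illustration and do not come from any real dataset.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single feature.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_of_squares(xs, ys, slope, intercept):
    """The least-squares cost: sum of squared deviations from the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data: house size (arbitrary units) against price.
sizes = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_simple_linear_regression(sizes, prices)
cost = sum_of_squares(sizes, prices, slope, intercept)
predicted = slope * 6.0 + intercept  # prediction for an unseen size
```

Any other line through the same points, for example one with a slightly different slope, gives a larger value of sum_of_squares, which is what "line of best fit" means here.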

k-Nearest Neighbours (kNN)

Illustration of the kNN Algorithm

Fig 2: An example of classification using the k-Nearest Neighbours algorithm. The objective is to classify the test instance (Green circle) as belonging to either the Red or Blue class depending on a number of its nearest neighbours. As seen from the circular boundaries, the class label could change depending on k, the number of nearest neighbours chosen to compare with.

k-Nearest Neighbours (kNN) is another example of a relatively simple algorithm that finds great utility in Data Science. kNN can be used for both Classification and Regression problems. The k-Nearest Neighbours method belongs to a class of algorithms called Lazy Learning, or Instance-Based Learning. In this method, the set of values for the feature variables is plotted in multi-dimensional vector space, and each point vector is either labelled in a Classification-type problem or assigned its numerical label value in a Regression-type problem. The new records to be predicted are also plotted in this same vector space, and they are either classified with the same label as the majority of their k nearest neighbours (k is usually chosen to be odd so that a two-class vote cannot tie), or regressed with the average of the numerical values of these k nearest neighbours. This is another simple but powerful method to perform classification and regression, and can also be used as a benchmark for more complex algorithms.
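The voting and averaging steps described above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes Euclidean distance. The function names and the toy points are hypothetical, and the data is arranged so that, as in Fig 2, the predicted class changes with k.

```python
import math
from collections import Counter


def k_nearest(train_points, query, k):
    """Return indices of the k training points closest to the query (Euclidean)."""
    distances = [(math.dist(p, query), i) for i, p in enumerate(train_points)]
    distances.sort()
    return [i for _, i in distances[:k]]


def knn_classify(train_points, train_labels, query, k):
    """Classification: majority vote among the k nearest neighbours."""
    neighbours = k_nearest(train_points, query, k)
    votes = Counter(train_labels[i] for i in neighbours)
    return votes.most_common(1)[0][0]


def knn_regress(train_points, train_values, query, k):
    """Regression: average of the values of the k nearest neighbours."""
    neighbours = k_nearest(train_points, query, k)
    return sum(train_values[i] for i in neighbours) / k


# Hypothetical toy data: one nearby red point, two slightly farther blue points,
# and two distant red points.
points = [(3.0, 3.0), (4.0, 4.5), (4.5, 4.0), (0.0, 0.0), (0.0, 1.0)]
labels = ["red", "blue", "blue", "red", "red"]
values = [1.0, 2.0, 3.0, 10.0, 12.0]
query = (3.4, 3.4)

label_k1 = knn_classify(points, labels, query, k=1)
label_k3 = knn_classify(points, labels, query, k=3)
label_k5 = knn_classify(points, labels, query, k=5)
value_k3 = knn_regress(points, values, query, k=3)
```

There is no training step: the "model" is just the stored data, which is why this family is called Lazy Learning.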

The k-Nearest Neighbours algorithm finds great value in the field of Recommendation Engines, which are used by companies like Netflix and Amazon to suggest movies or products their users may be interested in. This is because these use cases tend to have continuously updating datasets, where training data from even a short while ago can be rendered obsolete by new entries. That necessitates a system that captures local variations in the data structure so it can deal with changes in the problem domain, and kNN has shown success in that regard.

Decision Trees and Random Forests

Simple decision tree about food

Fig 3: An example of a simple decision tree that predicts what a user should do about food depending on how much money they have and how hungry they are.

Decision Trees are used to create models that predict the value of the target variable from the answers to a sequence of questions about the values of the predictor variables. Decision trees essentially aim to establish a sequence of feature-based rules, following which most instances in the dataset can be classified or regressed with the right value depending on the values that these instances possess for their feature variables.

Decision Trees are hence equivalent to a collection of nested IF statements in programming. They require minimal to no data pre-processing, and can be generated directly from tabular data. In multi-dimensional vector space, decision trees essentially partition the dataset with axis-aligned separations, each one a threshold on a single feature, and classify any new instance with the most frequently occurring class label among the training instances in the same region.
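The correspondence with IF statements can be shown directly. The sketch below is a hand-written tree in the spirit of Fig 3. The thresholds (25 and 10) and the outcomes are hypothetical values chosen for illustration, not taken from the figure; a real tree would learn them from training data.

```python
def decide_about_food(money, hungry):
    """A hand-written decision tree: every branch is an IF test on one feature.

    All thresholds and outcomes below are hypothetical illustrations.
    """
    if not hungry:
        return "go to sleep"
    # From here on the person is hungry, so split on the money feature.
    if money >= 25:
        return "go to a restaurant"
    if money >= 10:
        return "buy a hamburger"
    return "cook at home"


decisions = [
    decide_about_food(money=40, hungry=True),
    decide_about_food(money=12, hungry=True),
    decide_about_food(money=3, hungry=True),
    decide_about_food(money=40, hungry=False),
]
```

Because each test compares a single feature with a constant, every rule corresponds to a separation parallel to one axis of the feature space.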

One big advantage of Decision Trees is their completely transparent nature. The user can check the exact rules the system comes up with in order to classify or regress the input dataset, which makes the model easy to understand. A disadvantage of Decision Trees, however, is that because they can keep adding rules until they fit the training set almost perfectly, they are among the machine learning models most prone to overfitting the data. That means they may not generalize well to new test datasets, as they have been configured only for the training set. This problem is mitigated by the Random Forest technique, which is basically an aggregation of Decision Trees, each trained on a random sample of the training data and using a random subset of feature variables, in order to limit the errors due to variance. Random Forests, while taking longer to train, are usually an improvement over single Decision Trees, and can be used to solve both classification and regression problems.
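The aggregation idea can be sketched as follows. This is a deliberately simplified stand-in for a Random Forest, not the full algorithm: each "tree" is a one-split stump, trained on a bootstrap sample and a single randomly chosen feature, and the forest predicts by majority vote. A real Random Forest grows deeper trees and picks a random feature subset at every split. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [l for row, l in zip(rows, labels) if row[feature] <= threshold]
        right = [l for row, l in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees, rng):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    forest = []
    n_features = len(rows[0])
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        feature = rng.randrange(n_features)  # random feature subset of size 1
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

forest = fit_forest(rows, labels, n_trees=25, rng=random.Random(0))
low_prediction = predict_forest(forest, (1, 1))
high_prediction = predict_forest(forest, (9, 9))
```

Each stump sees a slightly different version of the data, so their individual mistakes tend to differ, and the vote averages those mistakes out.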

Naïve Bayes Classifiers & Bayesian Networks

Description of the Bayes' Theorem

Fig 4: Bayes’ Theorem from Probability and Statistics.

Naïve Bayes and Bayesian Networks are a special kind of Machine Learning algorithm based on Bayes’ Theorem from Probability and Statistics. Hence, these algorithms, unlike the other examples, are Probabilistic Machine Learning models. Bayes’ Theorem provides a way to calculate the “posterior” probability of an event occurring, given the occurrence of a specific piece of “evidence”, by combining the “prior” probability of the event with the likelihood of observing that evidence when the event occurs. The “Naïve” Bayes Classifier is so called because it operates under the assumption that the feature variables individually contributing to the target value are all independent of each other given the target class, whereas Bayesian Networks take into account the more complex possibility of co-dependence among the feature variables.
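As a rough illustration of how the naïve assumption simplifies the calculation, the sketch below computes P(class | evidence) as proportional to P(class) multiplied by each P(feature value | class), then normalises. The spam/ham classes, the feature names and every probability in the tables are invented for this example. In practice they would be estimated from training data.

```python
def naive_bayes_posterior(priors, likelihoods, evidence):
    """Posterior P(class | evidence) under the naive independence assumption.

    priors:      {class: P(class)}
    likelihoods: {class: {feature: {value: P(value | class)}}}
    evidence:    {feature: observed value}
    """
    unnormalised = {}
    for cls, prior in priors.items():
        score = prior
        # Naive step: multiply the per-feature likelihoods as if the features
        # were independent of each other given the class.
        for feature, value in evidence.items():
            score *= likelihoods[cls][feature][value]
        unnormalised[cls] = score
    total = sum(unnormalised.values())  # P(evidence), the normalising constant
    return {cls: score / total for cls, score in unnormalised.items()}


# Hypothetical probability tables for a two-feature spam filter.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"has_offer": {True: 0.7, False: 0.3}, "has_link": {True: 0.8, False: 0.2}},
    "ham": {"has_offer": {True: 0.1, False: 0.9}, "has_link": {True: 0.3, False: 0.7}},
}

posterior = naive_bayes_posterior(
    priors, likelihoods, {"has_offer": True, "has_link": True}
)
predicted_class = max(posterior, key=posterior.get)
```

A Bayesian Network would replace the simple product with factors that condition each feature on its parent features as well as on the class.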

Depiction of the relationship structure of a Naïve Bayes classifier

Fig 5: The data structure of a Naïve Bayes Classifier. Each random variable Xi contributes to the target class variable C, but the Xi are all independent of each other given C.

Depiction of the relationship data structure of an example Bayesian Network.

Fig 6: The data structure of a Bayesian Network. In this case, there are interdependencies among the feature variables D, G, I & S used to predict the target variable L.

As probabilistic graphical models, the Bayesian algorithms are useful for forecasting values when the outcome is uncertain and a probability for each possible outcome is better suited to the task than a single hard prediction. The Naïve Bayes approach is much faster and less computationally expensive, so it can be used in Machine Learning problems where co-dependence among the feature variables is absent or weak enough to ignore. The Bayesian Network can be utilised for more complex cause-effect relationships between the predictor variables themselves, which most other Machine Learning algorithms have no provision to model explicitly.

Support Vector Machines

Support Vector Machines (SVM) are one of the most elegant and powerful plug-and-play Machine Learning models available for Data Science. Although originally designed for Classification problems, they can also be implemented for Regression. The idea behind Support Vector Machines is to use Kernels, which implicitly map each set of feature values to a point in a higher-dimensional vector space, where the points are more easily separable than in their original space. This allows many different kinds of datasets to be classified well, not just those which are linearly separable. Once the points are mapped into the higher-dimensional space, a “hyperplane” is created to distinguish between the points belonging to different classes, and the hyperplane with the largest margin, i.e. the largest separation between the two classes, is selected to make the classification. This can then be projected back onto the original dimensions of the dataset to visualise what the classification really looks like.
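The mapping step can be illustrated without training an SVM at all. The sketch below assumes a degree-2 polynomial kernel in 2D and writes out its feature map explicitly. It shows that points which no straight line can separate in 2D become separable by a plane in the mapped 3D space, and that the kernel gives the same inner product as the explicit map. The hyperplane here is picked by hand for illustration; it is not the maximum-margin hyperplane an SVM would learn, and the toy points are hypothetical.

```python
import math


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Kernel k(x, y) = (x . y)^2, computed without leaving the original space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy data: class -1 lies inside the unit circle, class +1 outside,
# so no straight line separates the classes in the original 2D space.
points = [(0.2, 0.1), (-0.3, 0.4), (0.5, -0.5), (1.5, 0.2), (-1.2, 1.1), (0.3, -1.8)]
labels = [-1, -1, -1, 1, 1, 1]

# In the mapped space the plane z1 + z3 = 1 (that is, x1^2 + x2^2 = 1)
# separates the classes. These weights are chosen by hand, not learned.
w, b = (1.0, 0.0, 1.0), -1.0
predictions = [1 if dot(w, phi(p)) + b > 0 else -1 for p in points]

# The kernel equals the inner product in the mapped space.
kernel_value = poly_kernel(points[3], points[4])
mapped_value = dot(phi(points[3]), phi(points[4]))
```

Projected back to 2D, the separating plane becomes the unit circle, which is the kind of non-linear boundary the kernel makes available.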

Example of a Support Vector Machine at work.

Fig 7: The initially non-linearly separable dataset can be mapped onto a higher dimension using a Support Vector Machine (SVM), where a simple linear hyperplane can be used to separate the labels and classify the dataset accurately. That separation can then be mapped back to the original dataset, where the non-linear separation can be visualised.

The Support Vector Machine algorithm, together with its modifications for regression and different kernel functions, is perhaps the single most successful Supervised Machine Learning technique used in Data Science. SVMs are known to approach the accuracies of even Artificial Neural Networks on many classification and regression problems, often at a fraction of the computational cost, so they are used extensively across several domains with messy, difficult-to-separate datasets.

These five algorithms represent some of the most commonly used approaches to solve Data Science problems using Machine Learning. However, each of these techniques is suitable only for specific kinds of problems, so the creativity of choosing the right solution for any given problem has to come from the data scientist alone. Machine learning involves a lot of manual parameter tuning in order to obtain the best model, so selecting the right model and then tuning its parameters well are both important for getting the best results for the project.