Machine Learning has become very popular over the past decade or so with many industries adopting new algorithms to automate processes and increase productivity.
Machine Learning is a branch of Artificial Intelligence (AI) and involves machines/computers learning from data and generating results without being explicitly programmed to do so. In traditional programming, we provide the input data and the rules to generate an output. In machine learning we let the algorithm figure out the rules by providing the input data and the answers.
Once these algorithms have been trained we can use them to make predictions on any new data presented to them.
In this article, we will look at six commonly used machine learning algorithms that you need to know.
Types of Machine Learning Models
Machine Learning algorithms can be categorised into four main types: supervised, unsupervised, semi-supervised and reinforcement. The first two, supervised and unsupervised, are the categories that the algorithms discussed in this article fall into.
Supervised learning is designed to learn from example by using labelled datasets. This is where the input data has been paired with the correct outputs. Depending on the data type for the target (output) variable, supervised learning can be further broken down into classification for categorical data or regression for continuous data.
Unsupervised learning is the opposite and does not rely on labelled datasets. Instead, these algorithms are used to identify hidden patterns within the data that may not be easily visible to the human eye. Unsupervised learning is commonly applied during exploratory data analysis (EDA) and includes clustering of data based on similarities, and dimensionality reduction, where many inputs are reduced to a smaller, more meaningful set of inputs whilst retaining as much of the variability in the data as possible.
Linear Regression
Linear regression is a supervised machine learning algorithm that you will come across early in your data science journey, and you may well have used it in the past.
In simple terms, you are attempting to model the relationship between two variables (simple linear regression) or more than two variables (multiple linear regression). This allows us to understand the relationship between the variables, and also to derive an equation that we can use to predict our target variable.
With simple linear regression, we are attempting to understand the relationship between two variables: an independent variable (x), which is our explanatory variable, and a dependent variable (y), which is the variable being investigated. For example, we may be trying to work out how house prices (dependent variable) relate to the square footage of the house (independent variable).
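As a rough illustration of the house price example, the sketch below fits a straight line using the ordinary least-squares formulas for slope and intercept. The data is made up and deliberately perfectly linear so the result is easy to check by hand; the function and variable names are assumptions for this example only.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_mean, y_mean = mean(x), mean(y)
    # Slope: how x and y vary together, divided by how much x varies.
    covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    variance = sum((xi - x_mean) ** 2 for xi in x)
    slope = covariance / variance
    # The fitted line always passes through the point (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Made-up illustrative data: square footage (x) and sale price (y).
square_feet = [800, 1000, 1200, 1500, 2000]
prices = [160_000, 200_000, 240_000, 300_000, 400_000]

b0, b1 = fit_simple_linear_regression(square_feet, prices)
predicted = b0 + b1 * 1800  # predicted price for an unseen 1,800 sq ft house

print(f"price = {b0:.2f} + {b1:.2f} * square_feet")
print(f"predicted price for 1800 sq ft: {predicted:.0f}")
```

Real data will not sit exactly on a line, so the fitted coefficients would then describe the line that minimises the squared prediction errors rather than one that passes through every point.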
In the case of multiple linear regression, we have a many-to-one relationship where we have many independent variables (x1 to xn) and a single dependent variable (y).
When we run a linear regression we get back an equation of the following form:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
where
- y is our dependent variable, in other words our target variable,
- x1, x2, x3 ... xn are the independent variables,
- b0 is the y-intercept, and
- b1, b2, b3 ... bn are the coefficients (multipliers) for each variable.
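To make the roles of the intercept and coefficients concrete, here is a minimal sketch that evaluates y = b0 + b1*x1 + ... + bn*xn for a single observation. The coefficient values and feature meanings are hypothetical and have not been fitted to any real data.

```python
def predict_price(intercept, coefficients, features):
    """Evaluate y = b0 + b1*x1 + ... + bn*xn for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical values chosen for illustration only.
b0 = 50_000                    # y-intercept
b = [150, 10_000, -500]        # per square foot, per bedroom, per year of age
house = [1_200, 3, 20]         # 1,200 sq ft, 3 bedrooms, 20 years old

price = predict_price(b0, b, house)
print(price)  # 50,000 + 180,000 + 30,000 - 10,000 = 250000
```

A negative coefficient, such as the one attached to age here, simply means the prediction falls as that variable increases.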
Even though linear regression is simple to implement, you do have to be careful with outliers as they can have a huge impact on the derived regression results.
Decision Trees
Decision trees are a supervised machine learning algorithm and are fairly intuitive to use. We use the same kind of reasoning every day to make decisions, even though we don't refer to it as a decision tree.
For example, we may open our curtains in the morning and check what the weather is doing. If it is raining, we may want to take an umbrella, or wear a jacket instead if it is too windy for one. If it is not raining but sunny, we may want to protect ourselves with a hat; if it is neither, we may decide whether to wear a warm jacket based on the temperature.
Decision trees are models that resemble a tree-like structure containing decisions and possible outcomes. They consist of a root node, which forms the start of our tree, decision nodes, which are used to split the data based on a condition, and leaf nodes, which form the terminal points of the tree and hold the final outcome.
Once a decision tree has been formed, we can use it to predict values when new data is presented to it.
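The structure described above can be sketched as a small hand-built tree for the weather example. This is only an illustration of root, decision and leaf nodes and of how a prediction walks down the tree; a real decision tree algorithm would learn the conditions from data. The class names, the 10 degree threshold and the "no extra layer" outcome are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    """Terminal node holding the final outcome."""
    outcome: str


@dataclass
class Decision:
    """Node that routes a sample left or right based on a condition."""
    condition: Callable[[dict], bool]
    if_true: Union["Decision", Leaf]
    if_false: Union["Decision", Leaf]


def predict_outcome(node, sample):
    """Follow the conditions from the root until a leaf is reached."""
    while isinstance(node, Decision):
        node = node.if_true if node.condition(sample) else node.if_false
    return node.outcome


tree = Decision(  # root node
    condition=lambda s: s["raining"],
    if_true=Decision(
        condition=lambda s: s["windy"],
        if_true=Leaf("jacket"),
        if_false=Leaf("umbrella"),
    ),
    if_false=Decision(
        condition=lambda s: s["sunny"],
        if_true=Leaf("hat"),
        if_false=Decision(
            condition=lambda s: s["temperature_c"] < 10,  # hypothetical threshold
            if_true=Leaf("warm jacket"),
            if_false=Leaf("no extra layer"),
        ),
    ),
)

today = {"raining": False, "windy": False, "sunny": False, "temperature_c": 4}
print(predict_outcome(tree, today))  # warm jacket
```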
Decision trees do have a few issues: they can overfit or underfit the training data, and they become harder to interpret the larger they get.
Random Forest
Random Forest is a supervised ensemble machine learning algorithm that aggregates the results from multiple decision trees, and can be applied to classification and regression based problems.
Using the results from multiple decision trees is a simple concept and allows us to reduce the problem of overfitting and underfitting experienced with a single decision tree.
To create a Random Forest we first randomly select a subset of samples from the main dataset, drawn with replacement in a process known as "bootstrapping", along with a random subset of features. This data is then used to build a decision tree. Training each tree on a different random subset avoids the decision trees being highly correlated and improves model performance.
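A minimal sketch of this sampling step, assuming one random draw of rows and features per tree as described above. Some implementations instead draw the feature subset at every split within a tree, so treat this as one possible variant. The function name and the tiny dataset are made up for illustration.

```python
import random


def bootstrap_sample(features, labels, n_features, rng):
    """Sample rows with replacement and pick a random subset of feature columns."""
    n_rows = len(features)
    # Rows are drawn with replacement, so some repeat and some are left out.
    row_idx = rng.choices(range(n_rows), k=n_rows)
    # Feature columns are drawn without replacement.
    col_idx = sorted(rng.sample(range(len(features[0])), n_features))
    sampled_features = [[features[r][c] for c in col_idx] for r in row_idx]
    sampled_labels = [labels[r] for r in row_idx]
    return sampled_features, sampled_labels, col_idx


# Made-up dataset: 5 rows, 4 features, one label per row.
X = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.2, 2.9, 4.3, 1.3],
    [5.9, 3.0, 5.1, 1.8],
    [6.7, 3.1, 4.7, 1.5],
]
y = ["a", "a", "b", "c", "b"]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
X_boot, y_boot, used_columns = bootstrap_sample(X, y, n_features=2, rng=rng)
print(used_columns, y_boot)
```

Each tree in the forest would be trained on its own call to `bootstrap_sample`, which is what makes the trees differ from one another.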
Each decision tree may produce a different result. These results are then aggregated, and through the process of majority voting (in classification) or averaging (in regression) we end up with the final result.
Additionally, it is worth bearing in mind that as the number of decision trees within a Random Forest increases, so too will the computational time and resources used.
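The aggregation step can be sketched in a few lines. The per-tree predictions below are invented for illustration, and the function names are assumptions; only the voting and averaging logic is the point.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Classification: return the class predicted by the most trees."""
    # On a tie, Counter.most_common returns the class that was seen first.
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(predictions)


# Hypothetical outputs from individual trees.
tree_class_predictions = ["spam", "not spam", "spam", "spam", "not spam"]
tree_value_predictions = [250_000, 262_000, 247_000, 255_000]

final_class = majority_vote(tree_class_predictions)
final_value = average_prediction(tree_value_predictions)
print(final_class)  # spam (3 votes against 2)
print(final_value)  # 253500
```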
Artificial Neural Networks
At some point in your data science and machine learning journey you will eventually come across an article mentioning Artificial Neural Networks.
These are a very popular family of (mostly) supervised machine learning algorithms that are built from a series of functions and interconnected nodes (neurons). They are inspired by the way that the human brain functions: they take input, process the data by identifying patterns within it, and then output the final result.
A typical artificial neural network consists of three main components: an input layer, a hidden layer and an output layer.
The input layer is the first layer in the network and is where we pass in our input features.
The hidden layer exists between the input and output layers. It consists of multiple nodes which take inputs and transform them using activation functions, weights and biases. There can be more than one hidden layer, and the more hidden layers a network has, the deeper it is said to be.
The output layer is the final layer within the network and represents the final result from running data through the network. This layer can consist of a single node or of multiple nodes, as in the case of a multi-class classification problem.
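To show how data flows through the three layers, here is a small sketch of a single forward pass with two inputs, three hidden nodes and one output node. The weights and biases are hand-picked, hypothetical numbers (a real network learns them during training), and the sigmoid activation is just one common choice.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """Compute activation(sum(w * x) + b) for every node in the layer."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Hypothetical parameters, chosen by hand for illustration.
inputs = [0.5, -1.0]                                     # input layer: 2 features
hidden_weights = [[0.1, 0.4], [-0.3, 0.2], [0.5, -0.6]]  # one row per hidden node
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.2, 0.3]]                      # a single output node
output_biases = [0.05]

hidden = dense_layer(inputs, hidden_weights, hidden_biases, sigmoid)
output = dense_layer(hidden, output_weights, output_biases, sigmoid)
print(hidden)
print(output)
```

Adding more hidden layers amounts to chaining further `dense_layer` calls, each taking the previous layer's outputs as its inputs.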
Artificial neural networks have been used in a number of applications including: bank fraud detection, image processing, recognising items from images, time-series forecasting and much more.
Support Vector Machines
Support Vector Machines (SVMs) are supervised machine learning algorithms that can be used for both classification and regression (Support Vector Regression, SVR). They are robust, are grounded in statistical learning theory, and can be applied to both linear and non-linear problems. They also provide an alternative data-driven methodology to traditional Artificial Neural Networks.
In classification, Support Vector Machines attempt to find a hyperplane (a line in 2D space) that optimally separates the data and allows it to be categorised.
The closest points on either side of the hyperplane are known as the support vectors. These points have the most influence on the location and orientation of the hyperplane.
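The sketch below illustrates these ideas for a 2D case: which side of a hyperplane a point falls on, how far it is from the hyperplane, and which points are closest on each side. The hyperplane x1 + x2 - 5 = 0 and the points are chosen by hand for illustration and are not the output of an SVM solver.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b; its sign tells us which side x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the hyperplane w . x + b = 0."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane (a line in 2D) and made-up points.
w, b = [1.0, 1.0], -5.0
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (3.0, 3.0), (6.0, 5.0)]

# Track the closest point on each side: the candidate support vectors.
closest = {}
for p in points:
    label = classify(w, b, p)
    d = distance_to_hyperplane(w, b, p)
    if label not in closest or d < closest[label][1]:
        closest[label] = (p, d)

for label, (p, d) in sorted(closest.items()):
    print(f"class {label:+d}: closest point {p}, distance {d:.3f}")
```

Moving any of the other points further from the line would leave the hyperplane unchanged, whereas moving one of the two closest points would shift it.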
In the case of Support Vector Regression, it uses the same underlying principles and attempts to find the line of best fit through the data that has the maximum number of points within a defined threshold.
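One way to picture the "defined threshold" is as a tube of half-width epsilon around the fitted line, where points inside the tube incur no penalty. The sketch below assumes this epsilon-insensitive view and uses a hand-chosen line and made-up data; it does not fit the line, it only shows how points are counted and penalised.

```python
def points_within_tube(xs, ys, slope, intercept, epsilon):
    """Return the points whose distance from the line is at most epsilon."""
    return [
        (x, y)
        for x, y in zip(xs, ys)
        if abs(y - (slope * x + intercept)) <= epsilon
    ]


def epsilon_insensitive_loss(xs, ys, slope, intercept, epsilon):
    """Sum of errors beyond the tube; errors inside the tube count as zero."""
    return sum(
        max(0.0, abs(y - (slope * x + intercept)) - epsilon)
        for x, y in zip(xs, ys)
    )


# Made-up data and a hypothetical line y = 2x with a threshold of 0.5.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 11.5]

inside = points_within_tube(xs, ys, slope=2.0, intercept=0.0, epsilon=0.5)
loss = epsilon_insensitive_loss(xs, ys, slope=2.0, intercept=0.0, epsilon=0.5)
print(len(inside), "of", len(xs), "points lie within the threshold")
print("penalty from points outside the tube:", loss)
```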
SVMs have a number of advantages, such as being able to handle data that has a large number of features compared to observations and to deal with non-linear data without becoming unstable. However, they can become computationally expensive when working with large datasets, which is due to the use of quadratic programming algorithms during optimisation.
K-Means Clustering
K-means clustering is a very commonly used unsupervised machine learning algorithm and is relatively easy to understand. It is essentially used to group similar data points together based on their properties, and from that, we can identify any potentially hidden patterns within the data.
Given a dataset, k-means clustering can be used to divide the data into ‘k’ number of clusters through minimisation of the distance between data points and the cluster centre point (centroid).
The centroids are initialised at k random points within the data. The remaining data points are then each assigned to the cluster of their nearest centroid.
Each centroid is then moved to the central point (mean) of its cluster, and the data points are reassigned to whichever centroid is now nearest.
This process continues until:
- There is no change in the centroids, in other words, they are stable
- The points remain in the same cluster
- The maximum number of iterations defined by the user has been reached
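The loop described above can be sketched with the standard library alone. This is a minimal illustration rather than a production implementation: the function name, the fixed seed and the choice to keep an old centroid when a cluster ends up empty are all assumptions made here.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups; return (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise at k random data points
    assignments = None
    for _ in range(max_iterations):  # stop at the user-defined maximum
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # points stayed in the same clusters, so centroids are stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignments


# Made-up 2D data with two visibly separate groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(data, k=2)
print(centroids)
print(assignments)
```

The cluster numbers themselves are arbitrary and depend on the random initialisation; what matters is which points end up grouped together.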
K-means clustering is simple to implement and understand, and scales well to large datasets. However, the algorithm requires you to select the number of clusters up front and may not be suitable for all datasets.
The algorithms covered here are just a small sample of ones that you will come across in your data science journey. There are many videos, articles and books that go into great detail about these algorithms and I highly recommend diving deeper into the algorithms that interest you.