Touted as the next big thing, predictive analytics is set to dominate the advanced analytics landscape in the next few years. The Analytics India Salary Study 2017, conducted by AnalytixLabs and Analytics India Magazine (AIM), reveals that advanced analytics and predictive modeling professionals are better paid than their peers.
[Figure: Annual salary in lakhs. Source: AnalytixLabs & AIM]
So let us look in detail at how to build a predictive model and at the most important algorithms to learn in predictive analytics.
Predictive Analytics is a branch of advanced data analytics that involves the use of various techniques such as machine learning, statistical algorithms and other data mining techniques to forecast future events based on historical data.
A model is built from the historical data and then applied to current data to predict the likely outcome or to suggest the next course of action.
When you assemble a predictive analysis model, you can choose from many algorithms across data mining, machine learning and statistics. As you explore the data, it becomes easier to make further decisions.
How to build a predictive model?
Constructing a predictive model is simple:
- Get the data from the relevant sources, for example through an ETL tool
Example: the iris data set – https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
- Split the dataset into two parts (sample and verification data)
Build the model from the sample data:
The sample data provides the species of each flower together with its measurements.
The columns in the data contain four flower measurements in centimetres (sepal length, sepal width, petal length and petal width), followed by the species.
Apply an algorithm to the sample data (the training set) to create a set of rules that will be used to fill in the target variable (the variable we are trying to predict, here the species).
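The first two steps (getting the data and splitting it) can be sketched in a few lines of standard-library Python. This is only an illustration: the six rows below are in the same comma-separated layout as the iris file, and the function names `parse_iris` and `split_dataset`, the 70/30 split and the fixed seed are assumptions made for the example, not part of any prescribed method.

```python
import csv
import io
import random

# A few rows in the same layout as iris.data:
# sepal length, sepal width, petal length, petal width (cm), species.
SAMPLE_TEXT = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""


def parse_iris(text):
    """Return a list of (measurements, species) pairs."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != 5:  # skip blank or malformed lines
            continue
        features = [float(value) for value in record[:4]]
        rows.append((features, record[4]))
    return rows


def split_dataset(rows, sample_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into (sample, verification)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * sample_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = parse_iris(SAMPLE_TEXT)
sample, verification = split_dataset(rows)
print(len(sample), "sample rows,", len(verification), "verification rows")
```

With the full file, the same `parse_iris` function could be given the downloaded text instead of `SAMPLE_TEXT`.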
Generally, predictive analysis models can be categorized into two types:
- Classification – predicting a discrete value: a category drawn from a finite set with no inherent order
- Regression – predicting a continuous value: a numeric quantity with a natural ordering and an unbounded range of possible values
Two widely used algorithms in data analysis are linear regression and neural networks.
Linear regression: The simple regression model assumes that a linear relationship exists between the input and the output variables.
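As a rough illustration of that linear assumption, the sketch below fits a straight line to a handful of points by ordinary least squares. The data points are made up for the example (they lie exactly on y = 2x + 1), and `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Apply the fitted line to a new input value."""
    return slope * x + intercept


# Made-up illustrative data lying on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("prediction for x = 5:", predict(slope, intercept, 5.0))
```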
Neural network: Inspired by the human brain, a neural network is a set of interconnected computational units (neurons) that take a set of inputs and transform them into an output. The computational units are arranged in layers so that the features of an input vector can be connected with the features of an output vector.
The idea is to train the neural network to model the relationships within the provided data.
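To make the layered structure concrete, here is a minimal sketch of a forward pass through a network with two inputs, two hidden units and one output unit. The weights are hand-picked for illustration so that the network computes XOR; in practice they would be learned during training, which this sketch does not show. All names are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One layer: each unit takes a weighted sum of the inputs, adds its bias,
    and applies the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


# Hand-picked (not learned) weights, chosen so the network computes XOR.
HIDDEN_WEIGHTS = [[20.0, 20.0], [-20.0, -20.0]]  # roughly OR and NAND
HIDDEN_BIASES = [-10.0, 30.0]
OUTPUT_WEIGHTS = [[20.0, 20.0]]                  # roughly AND of the two hidden units
OUTPUT_BIASES = [-30.0]


def network_forward(inputs):
    """Pass the input vector through the hidden layer, then the output layer."""
    hidden = layer_forward(inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer_forward(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", round(network_forward([a, b])))
```

Training would adjust the numbers in the weight and bias lists until the outputs match the known answers in the sample data.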
- Create a model which is based on the rules established by the algorithm during the training phase.
- Test the model on the verification data set – the data is fed to the model and the predicted values are compared to the actual values. Thus the model is tested for accuracy.
- Use the model on new incoming data and take action based on the output of the model.
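The train, test and use steps above can be sketched end to end as follows. The algorithm here is a nearest-centroid classifier, swapped in only because it is short enough to write by hand; it is not one of the algorithms discussed in this article. The rows are taken from the iris data set, the new measurement is hypothetical, and accuracy on three verification rows is of course not a meaningful estimate.

```python
import math
from collections import defaultdict

# Sample (training) rows: four measurements in cm, then the species.
sample = [
    ([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
    ([4.9, 3.0, 1.4, 0.2], "Iris-setosa"),
    ([7.0, 3.2, 4.7, 1.4], "Iris-versicolor"),
    ([6.4, 3.2, 4.5, 1.5], "Iris-versicolor"),
    ([6.3, 3.3, 6.0, 2.5], "Iris-virginica"),
    ([5.8, 2.7, 5.1, 1.9], "Iris-virginica"),
]

# Verification rows, kept out of training.
verification = [
    ([4.7, 3.2, 1.3, 0.2], "Iris-setosa"),
    ([6.9, 3.1, 4.9, 1.5], "Iris-versicolor"),
    ([7.1, 3.0, 5.9, 2.1], "Iris-virginica"),
]


def train_centroids(rows):
    """Training phase: the 'rules' are the mean measurements of each species."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(column) for column in zip(*members)]
        for label, members in grouped.items()
    }


def predict(model, features):
    """Predict the species whose centroid is closest to the measurements."""
    return min(model, key=lambda label: math.dist(features, model[label]))


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the actual label."""
    correct = sum(predict(model, features) == label for features, label in rows)
    return correct / len(rows)


model = train_centroids(sample)
print("accuracy on verification data:", accuracy(model, verification))

# Use the model on new incoming data (a hypothetical measurement).
new_flower = [5.0, 3.4, 1.5, 0.2]
print("prediction for new flower:", predict(model, new_flower))
```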
Other important algorithms:
Predictive models come in various forms. There are different methods that can be used to create a model, and more are being developed all the time.
The most common predictive models are:
Linear models: A very widely used family of statistical algorithms that build a relationship model between two variables. One is called the predictor variable, whose value is gathered through experiments, while the other is called the response variable, whose value is derived from the predictor variable.
Decision trees (also known as Classification and Regression Trees or CART): A decision tree is a graph that represents possibilities and their outcomes in the form of a tree. The nodes in the graph represent an event or choice, and the edges represent the decision rules or conditions.
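A small sketch of how such a tree is used for prediction: each internal node tests one measurement, each branch is a decision rule, and each leaf holds an outcome. The tree below is written by hand with illustrative thresholds for the iris flowers; a real CART implementation would learn the structure and thresholds from the sample data.

```python
# Hand-written tree with illustrative (not learned) thresholds.
# Internal nodes test one feature; leaves carry a label.
TREE = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "Iris-setosa"},          # petal_length < 2.5
    "right": {                                 # petal_length >= 2.5
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "Iris-versicolor"},  # petal_width < 1.7
        "right": {"label": "Iris-virginica"},  # petal_width >= 1.7
    },
}


def classify(node, observation):
    """Walk from the root to a leaf, following the rule on each edge."""
    while "label" not in node:
        go_left = observation[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


flower = {"petal_length": 4.7, "petal_width": 1.4}
print(classify(TREE, flower))
```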
Support Vector Machines (SVMs): The support vector machine looks for the points of each class that lie closest to the other class. These points are known as "support vectors": the points can be treated as vectors, and the best separating line "depends on", or is "supported by", these nearest points.
Once it finds the closest points, it draws a line connecting them by vector subtraction (point A – point B). The support vector machine then declares the best separating line to be the line that bisects, and is perpendicular to, the connecting line.
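The geometric picture above can be sketched directly for the simplest case of two closest points, one from each class. This is only an illustration of that picture: a real SVM finds the separating line by solving an optimization problem over all the data, and the two points and helper names below are made up for the example.

```python
def separating_line(point_a, point_b):
    """Return (w, bias) for the perpendicular bisector of the segment A-B.

    w is the connecting vector A - B; the line is all x with w . x + bias = 0.
    """
    w = [a - b for a, b in zip(point_a, point_b)]
    midpoint = [(a + b) / 2 for a, b in zip(point_a, point_b)]
    # The line passes through the midpoint, so w . midpoint + bias = 0.
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias


def side(w, bias, point):
    """Report which side of the line a point falls on."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + bias
    return "A" if score > 0 else "B"


# Made-up closest points of two classes.
closest_a = [3.0, 3.0]
closest_b = [1.0, 1.0]

w, bias = separating_line(closest_a, closest_b)
print("w =", w, "bias =", bias)
print(side(w, bias, [4.0, 2.0]))
print(side(w, bias, [0.0, 1.0]))
```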
Naive Bayes: A machine learning algorithm mostly used for classification problems. It is based on Bayes' theorem, also known as Bayes' rule or Bayes' law. It is often used for text classification, which involves high-dimensional training data sets.
It is a simple algorithm, known for building models and making predictions quickly. Because Naive Bayes is a primary choice for text classification problems, it is worth learning thoroughly.
Examples: spam filtering, classifying news articles and sentiment analysis
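As a minimal sketch of the spam filtering case, the code below implements a multinomial Naive Bayes classifier with add-one (Laplace) smoothing on word counts. The four training messages are made up for the example, and the function names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(documents):
    """Count how often each label and each word per label occurs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Pick the label with the highest log prior plus summed log word likelihoods."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probability for unseen pairs.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training messages for illustration.
training = [
    ("win free prize now", "spam"),
    ("free offer claim prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with team", "ham"),
]

model = train(training)
print(classify(model, "free prize offer"))
print(classify(model, "team meeting monday"))
```

The "naive" part is the assumption that words occur independently of each other given the label, which is what lets the word likelihoods simply be multiplied (added, in log space).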
Bayes' theorem is represented by the following equation:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
- $P(A \mid B)$: Probability (conditional probability) of occurrence of event $A$ given that event $B$ is true
- $P(A)$ and $P(B)$: Probabilities of the occurrence of events $A$ and $B$ respectively
- $P(B \mid A)$: Probability of the occurrence of event $B$ given that event $A$ is true
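A short worked example may help here. Take A to be "the message is spam" and B to be "the message contains the word free". The three input probabilities below are hypothetical numbers chosen for illustration, not measured values; the denominator P(B) is obtained from the law of total probability.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A) and P(B | not A)."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A).
    evidence_b = (likelihood_b_given_a * prior_a
                  + likelihood_b_given_not_a * (1.0 - prior_a))
    # Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical numbers for illustration only.
p_spam = 0.2             # P(A): share of messages that are spam
p_free_given_spam = 0.5  # P(B | A): spam messages containing "free"
p_free_given_ham = 0.05  # P(B | not A): non-spam messages containing "free"

posterior = bayes_posterior(p_spam, p_free_given_spam, p_free_given_ham)
print("P(spam | contains 'free') =", round(posterior, 3))
```

Under these assumed numbers, seeing the word raises the probability of spam from 0.2 to roughly 0.71.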
In the near future, increasing demand for Predictive Analytics may see professionals from other streams joining the bandwagon. If you want to obtain an edge over your peers and be a part of this new growth avenue, you can explore our NSE Certified Business Analytics course as well as PGD in Data Science.