
Random Forest in Bytes

 

Photo by Jens Lelie on Unsplash

This is the second post in the In Bytes series. Refer to this link to read the first post.


Random Forest:

A random forest is an ensemble built by combining a large number of decision trees.

Ensemble:

An ensemble means a group of things viewed as a whole rather than individually. In an ensemble, a collection of models makes the predictions rather than a single model.

In principle, ensembles can be made by combining all types of models. An ensemble can have a logistic regression model, a neural network, and a few decision trees working in unison.

While choosing the models, we need to check for two things: Diversity and Acceptability.

Diversity ensures that the models serve complementary purposes, which means that the individual models make predictions independently of each other. The advantages of this differ depending on the type of ensemble.

Diversity also ensures that even if some trees overfit, the other trees in the ensemble will neutralise the effect. The independence among the trees results in a lower variance of the ensemble compared to that of a single tree.

Acceptability implies that each model is at least better than a random model, i.e. its probability of making a correct prediction is p > 0.50.

Now, to understand how an ensemble makes decisions, consider an ensemble of 100 models comprising decision trees, logistic regression models, etc. Given a new data point, each model predicts an output y for it. For binary classification, you simply take the majority vote: if more than 50% of the models say y = 0, you go with 0, and vice versa.
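As a minimal sketch of this voting step (the three hard-coded models below are illustrative stand-ins, not anything from the post), the majority vote can be computed in a few lines of Python:

```python
# A minimal sketch of majority voting for binary labels.
import numpy as np

def majority_vote(predictions):
    """predictions: array of shape (n_models, n_samples) with 0/1 labels."""
    votes_for_one = np.asarray(predictions).sum(axis=0)        # models predicting 1
    return (votes_for_one > len(predictions) / 2).astype(int)  # majority wins

preds = [
    [0, 1, 1, 0],   # model 1 (e.g. a decision tree)
    [0, 1, 0, 0],   # model 2 (e.g. logistic regression)
    [1, 1, 1, 0],   # model 3 (e.g. another tree)
]
print(majority_vote(preds))   # -> [0 1 1 0]
```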

If each of the individual models is acceptable, i.e. wrong with a probability of less than 50%, you can show that the probability of the ensemble being wrong (i.e. the majority vote going wrong) is far lower than that of any individual model.
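Here is a hedged back-of-the-envelope check of that claim, assuming 100 independent models that are each wrong with probability 0.4 (both numbers are illustrative, not from the post):

```python
# If each of 100 independent models is wrong with probability 0.4, the majority
# vote is wrong only when 51 or more models are wrong at the same time.
from math import comb

n, p = 100, 0.4
p_majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(51, n + 1))

print(f"P(single model wrong)    = {p}")
print(f"P(majority of {n} wrong) = {p_majority_wrong:.4f}")   # ~0.017, far lower
```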


Ensembles also avoid being misled by the assumptions made by individual models. For example, ensembles (particularly random forests) successfully reduce the problem of overfitting. If a decision tree in an ensemble overfits, you let it: the chances are extremely low that more than 50% of the models have overfitted.

Bagging:


Bagging stands for bootstrapped aggregation. It is a technique for choosing random samples of observations from a dataset. Each of these samples is then used to train one tree in the forest.

You create a large number of models (say, 100 decision trees), each one on a different bootstrap sample drawn from the training set. To get the final result, you aggregate the decisions taken by all the trees in the ensemble.

Bootstrapping means creating bootstrap samples from a given data set. A bootstrap sample is created by sampling the given data set uniformly and with replacement. A bootstrap sample typically contains about 30-70% of the data from the data set. Aggregation implies combining the results of the different models present in the ensemble.
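A minimal NumPy sketch of the bootstrapping step (the 10,000-row size is illustrative):

```python
# Sample row indices uniformly and with replacement; rows never drawn form the
# "out-of-bag" set for this particular tree.
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000

bootstrap_idx = rng.integers(0, n_rows, size=n_rows)      # with replacement
oob_mask = ~np.isin(np.arange(n_rows), bootstrap_idx)     # rows left out

print(f"unique rows in bootstrap sample: {np.unique(bootstrap_idx).size / n_rows:.1%}")
print(f"out-of-bag rows:                 {oob_mask.mean():.1%}")
# With a sample the same size as the data set, roughly 63% of the rows appear
# at least once and roughly 37% end up out-of-bag (the familiar 1 - 1/e split).
```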



Random forest selects a random sample of data points (bootstrap sample) to build each tree, and a random sample of features while splitting a node. Randomly selecting features ensures that each tree is diverse and is not impacted by the prominent features present in the dataset.
E.g. in a dataset for heart attacks, blood pressure and weight will have a high correlation with the target; since only a few randomly chosen attributes are considered at each split, the model will not be dominated by these features.
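As an illustrative sketch of the feature-sampling step at a single split (the 13-feature heart-attack dataset and the square-root rule are assumptions for illustration, not from the post):

```python
# At every split, only a small random subset of the features is considered.
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["age", "blood_pressure", "weight", "cholesterol", "heart_rate",
                 "glucose", "smoking", "exercise", "bmi", "sex",
                 "family_history", "alcohol", "stress"]          # hypothetical

m = int(np.sqrt(len(feature_names)))                             # common rule of thumb
candidates = rng.choice(len(feature_names), size=m, replace=False)
print([feature_names[i] for i in candidates])                    # e.g. 3 of 13 features
```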


Advantages:

Diversity arises because you create each tree with a subset of the attributes/features/variables, i.e. you don’t consider all the attributes while making each tree. The choice of the attributes considered for each tree is random. This ensures that the trees are independent of each other. 

Stability arises because the answers given by a large number of trees average out. A random forest has a lower model variance than an ordinary individual tree. 

Immunity to the curse of dimensionality: Since each tree does not consider all the features, the feature space (the number of features a model has to consider) reduces. This makes the algorithm immune to the curse of dimensionality. A large feature space causes computational and complexity issues.

Parallelizability: You need a number of trees to make a forest. Since the trees are built independently, each on its own data and attributes, they can be built in parallel. This means you can make full use of your multi-core CPU to build a random forest. Suppose there are 4 cores and 100 trees to be built; each core can build 25 trees to make the forest.
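As a hedged example of this, scikit-learn's RandomForestClassifier exposes parallel tree building through its n_jobs parameter (the synthetic dataset below is just a stand-in, not data from the post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A synthetic stand-in dataset; any tabular X, y would do.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# 100 trees spread across 4 worker processes -- roughly 25 trees per core.
forest = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=0)
forest.fit(X, y)
```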
 
Testing/training data and the OOB (out-of-bag) error: You always want to avoid violating the fundamental tenet of learning: never test a model on the data it has been trained on. While building each individual tree, you choose a random subset of the observations to train it. If you have 10,000 observations, each tree may be built from only 7,000 (70%) randomly chosen observations. The OOB error is the mean prediction error on each training sample xᵢ, using only the trees that do not have xᵢ in their bootstrap sample. If you think about it, this is very similar to a cross-validation error: in cross-validation, you measure performance on a subset of the data that the model hasn't seen before.

In fact, it has been proven that using an OOB estimate is as accurate as using a test data set of a size equal to the training set.
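A hedged scikit-learn sketch of that comparison, using a synthetic dataset and an illustrative 50/50 split rather than anything from the post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# oob_score=True asks the forest to score each training row using only the
# trees whose bootstrap samples did not contain that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

print(f"OOB accuracy:      {forest.oob_score_:.3f}")
print(f"Test-set accuracy: {forest.score(X_test, y_test):.3f}")
# The two numbers typically come out very close, which is the point of the OOB estimate.
```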

