
Predicting Car Prices using Linear Regression


Linear regression is one of the most widely used statistical techniques for analysis. It models the relationship between the target and the predictors.

The below graph shows the linear relationship between the “dependent” variable Y, whose values we wish to predict, and the “independent” variable X from which we wish to predict it.


There are two types of linear regression:

1. Simple Linear Regression:

  • It denotes the relationship between one independent variable and the target variable. The above graph represents Simple Linear Regression.

2. Multiple Linear Regression:

  • It denotes the relationship between two or more independent variables and the corresponding dependent variable. The independent variables can be continuous or categorical.
  • The below graph shows multiple linear regression.



Assumptions of Linear Regression


Below are the main assumptions of linear regression:
  1. There is a linear relationship between X and Y.
  2. Error terms are normally distributed.
  3. Error terms are independent of each other. Change in one error term should not impact the other error terms.
  4. Error terms have constant variance i.e. variance shouldn’t increase or decrease as the error values change. Also, they shouldn’t follow any pattern.
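In equation form (the standard formulation, not specific to this dataset), the model is:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε

where the error term ε is assumed to be normally distributed with zero mean and constant variance σ².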
Now let's jump to our problem statement.

Problem Statement

To understand the variables which impact the pricing of cars.

You can download the raw data from the below link

Raw Data: CarData.csv

Python Implementation

Importing the libraries

Let's first import all the required libraries.
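A minimal sketch of the imports, assuming the standard pandas, seaborn, sklearn, and statsmodels stack used in the rest of this article:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score, mean_squared_error
from statsmodels.stats.outliers_influence import variance_inflation_factor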

Reading the data source
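Something along these lines, assuming CarData.csv (the file from the link above) sits in the working directory and the DataFrame is named cars:

# Read the raw car data downloaded from the link above.
cars = pd.read_csv("CarData.csv")

# Preview the first few rows.
cars.head()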

Here is the preview of the data





Columns and data type
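A quick way to inspect them, assuming the cars DataFrame from above:

# List every column with its data type and non-null count.
cars.info()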



Before any data analysis, we should perform Cleaning and Pre-Processing. You can refer to the Data Cleaning section in my previous EDA article.

Correlation and Data Visualization

The term “correlation” refers to a mutual relationship between two variables. If two variables have a high correlation, we can keep only one, as they will have the same impact on the target value. This also helps us reduce the dimensionality of the dataset.

There are multiple ways to visualize this correlation.

Heatmap
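A typical way to draw it with seaborn (a sketch; the original figure may have used different styling):

# Correlation heatmap across the numeric columns.
plt.figure(figsize=(16, 10))
sns.heatmap(cars.select_dtypes(include=np.number).corr(), annot=True, cmap="YlGnBu")
plt.show()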


Pairplot

We plot a pairplot graph for each independent variable against the dependent variable (here, the price column). This shows whether there is a linear relationship between the variables.

Note: In the below graphs we are trying to see the relationship of each variable with the dependent variable, hence we use a for loop to iterate through all the independent variables.
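A sketch of that loop, plotting each numeric variable against price:

# Plot each numeric independent variable against the dependent variable price.
numeric_cols = cars.select_dtypes(include=np.number).columns.drop("price")
for col in numeric_cols:
    sns.pairplot(cars, x_vars=[col], y_vars=["price"], height=4)
    plt.show()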

   

Pearson correlation coefficient

It is a measure of the linear correlation between two variables X and Y. According to the Cauchy–Schwarz inequality, it has a value between +1 and −1, where 1 is a total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
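One way to compute it for every numeric column against price, assuming the cars DataFrame:

# Pearson correlation of each numeric column with price.
cars.select_dtypes(include=np.number).corr()["price"].sort_values(ascending=False)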



Line Graph




Numerical vs Categorical Columns

Regression analysis needs numerical variables. Categorical variables are variables that classify observations into groups. There are many ways to convert categorical columns to numerical ones.

Binary Mapping

If the column has only two values, we can map one value to 1 and the other to 0.

The below columns have two types of values.
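A sketch of such a mapping; the column name fueltype and its values gas/diesel are purely illustrative, since the exact two-valued columns depend on the dataset:

# Tag one value as 1 and the other as 0.
cars["fueltype"] = cars["fueltype"].map({"gas": 1, "diesel": 0})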

Creating Dummy Encoding

In this approach, we convert each category value into a new column and assign a 1 or 0 (True/False) value to it. This has the benefit of not weighting a value improperly, but does have the downside of adding more columns to the data set.
In Python, we can use pandas' get_dummies method to convert a categorical column into dummy variables.
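A sketch using carbody as an illustrative multi-valued column (drop_first=True drops one redundant dummy column):

# Convert a categorical column into 0/1 dummy columns.
dummies = pd.get_dummies(cars["carbody"], drop_first=True, dtype=int)
cars = pd.concat([cars.drop(columns="carbody"), dummies], axis=1)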


After converting to dummy columns



Custom Mapping

We can also map the values of the column based on the data and business needs.
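For instance, an ordinal column spelling out numbers could be mapped to the numbers themselves; cylindernumber below is an illustrative assumption:

# Custom mapping chosen to match the business meaning of the values.
cars["cylindernumber"] = cars["cylindernumber"].map(
    {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "eight": 8, "twelve": 12})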



Data post changes


Model Building

We are using the sklearn library for creating the model. You can import the train_test_split method from the sklearn.model_selection module.

Since we don't have a separate validation data set here, we are splitting the data into two parts: train and test.

train: The data on which we train the model.
test: The data on which we test the model after the model is trained.

Also, we are performing min-max scaling so that all the variables are on the same scale.
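A sketch of both steps; the 70/30 split ratio, the random state, and the scaled column list are illustrative assumptions:

# Split the data into train and test sets.
df_train, df_test = train_test_split(cars, train_size=0.7, random_state=100)

# Min-max scale the numeric columns (including price) so everything lies between 0 and 1.
scaler = MinMaxScaler()
num_cols = ["enginesize", "curbweight", "carlength", "price"]
df_train[num_cols] = scaler.fit_transform(df_train[num_cols])
df_test[num_cols] = scaler.transform(df_test[num_cols])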

Understanding the relationship of variables with Dependent Variable

Let's start with one of the variables to see the linear relationship with the dependent variable.
Based on the prior knowledge obtained from the EDA above, we know that "enginesize" is in direct relation with the price.

So let's fit the model and see the metrics
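A sketch with statsmodels, assuming the scaled training data from above:

# Fit a simple OLS model of price on enginesize alone.
y_train = df_train["price"]
X_train_sm = sm.add_constant(df_train[["enginesize"]])
lr = sm.OLS(y_train, X_train_sm).fit()
lr.params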

The above code returns the below parameters:

const        -0.080076
enginesize    1.253896

This corresponds to the linear regression equation:

price = −0.0801 + 1.2539 × enginesize


We can see the OLS (Ordinary Least Squares) summary using the below code.
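Along these lines:

# Full OLS summary: R-squared, coefficient p-values, AIC, BIC, etc.
print(lr.summary())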


As per the above, we have a decent R-squared of 0.76, a p-value of 0.0, and negative AIC and BIC values. Now we need to understand how well the columns behave with each other.

So let's add more columns, look at the OLS summary, and see their impact.

"carlength" has too much of p-value i.e. greater than > 0.0.5, hence we will check how it behaves alone.

We can see that "carlength" has multicollinearity with other columns: its p-value increases under the influence of the others. Let's now see how we can remove this multicollinearity.

Identifying multicollinearity

A model may be created from different independent variables, but some of them might be interrelated, which makes those variables redundant.
  1. Correlation: We can look at the different pairs to check if the variables are correlated.
  2. Variance Inflation Factor (VIF): A variable might depend on a combination of other variables. VIF measures how well all the other independent variables combined describe one independent variable.

Variance Inflation Factor (VIF)

VIF is given by the below formula:

VIFᵢ = 1 / (1 − Rᵢ²)

where 'i' refers to the i-th variable which is being represented as a linear combination of the rest of the independent variables.
The general rules for VIF are:
  • Greater than 10: High VIF value. The variable should be eliminated.
  • Between 5 and 10: The variable should be inspected and, based on the context, a decision taken on whether to remove it.
  • Between 1 and 5: Good VIF value. No elimination required.
  • 1: Not correlated. No elimination required.
In Python, we can use the variance_inflation_factor method from the statsmodels.stats.outliers_influence module to get the VIF values of the feature variables.
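A sketch, assuming X_train is a DataFrame holding the candidate feature columns:

# Compute the VIF of every feature.
vif = pd.DataFrame()
vif["Features"] = X_train.columns
vif["VIF"] = [variance_inflation_factor(X_train.values, i)
              for i in range(X_train.shape[1])]
vif.sort_values(by="VIF", ascending=False)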

Sample VIF values for a few of the columns:


Thus we need to identify the column which has a high VIF value and remove it. If two columns have the same VIF, we can also consider the p-value: the variable with the higher p-value can be eliminated. Once the column is removed, we need to perform the OLS regression again and re-check the result summary. This process is repeated until all the multicollinearity is removed. If we have too many columns, performing VIF checks manually can be cumbersome. There is another technique, RFE, which does this work for us.

Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is an approach to selecting features. It works by recursively removing attributes and building a model on the attributes that remain.
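A sketch using sklearn's RFE on top of LinearRegression; the target of 10 features is an arbitrary illustrative choice:

# Feature matrix and target from the training data.
X_train = df_train.drop(columns="price")
y_train = df_train["price"]

# Recursively eliminate features down to the 10 most important ones.
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)

# Which columns survived, and the ranking of the eliminated ones.
list(zip(X_train.columns, rfe.support_, rfe.ranking_))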


Result of RFE:
rfe.support_ tells us which variables are selected as important for the model.

Let's see the summary of our linear model

Prediction

We can predict the y-value as below
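A sketch, assuming selected_cols holds the final RFE-selected column names and lr is the fitted statsmodels model:

# Build the test design matrix with the same columns as the training one.
X_test_sm = sm.add_constant(df_test[selected_cols])
y_test = df_test["price"]

# Predict the (scaled) prices for the test set.
y_pred = lr.predict(X_test_sm)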


Error Term

It's better to visualize the error terms to see if they follow the normal curve.
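For instance (X_train_sm here is assumed to be the design matrix of the final model):

# Residuals on the training data; ideally a bell curve centred at zero.
residuals = y_train - lr.predict(X_train_sm)
sns.histplot(residuals, kde=True)
plt.xlabel("Errors")
plt.show()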

The error seems to be fine, as it has a normal distribution and is centered at zero.

Model

Now let's see our model and how well it's performing.



The model seems to be performing well; let's dive into the performance metrics now.

R2 score

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean, while 100% indicates that it explains all of it.
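Computing it on the test set with sklearn:

# R-squared between the actual and predicted test prices.
r2_score(y_true=y_test, y_pred=y_pred)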




Mean Squared Error

The mean squared error or mean squared deviation of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.
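Again with sklearn:

# Average squared difference between actual and predicted prices.
mean_squared_error(y_true=y_test, y_pred=y_pred)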



Actual vs Predicted

Now let's see how well the model is predicting the values. The below graph shows that even though we have missed some points (error), overall we have a good prediction model.
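A sketch of such a plot:

# Overlay the actual and predicted prices for the test set.
plt.figure(figsize=(10, 5))
plt.plot(range(len(y_test)), y_test.values, label="Actual")
plt.plot(range(len(y_test)), y_pred, label="Predicted")
plt.legend()
plt.title("Actual vs Predicted")
plt.show()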



Error Terms



The errors do not follow any pattern and are randomly scattered, as expected for multiple linear regression.


Conclusion

Thus, based on the above, we conclude that the below multilinear equation is the best-fitted line for the model:

price = 0.7380 × curbweight + 0.5438 × enginefront − 0.0873 × wagon + 0.2759 × bmw + 0.1981 × buick + 0.1860 × jaguar − 0.0985

