
Predicting Car Prices using Linear Regression


Linear regression is one of the most widely used statistical techniques for analysis. It models the relationship between the target and the predictors.

The below graph shows the linear relationship between the “dependent” variable Y, whose values we wish to predict, and the “independent” variable X from which we wish to predict it.


There are two types of linear regression:

1. Simple Linear Regression:

  • It denotes the relationship between one independent variable and the target variable. The above graph represents Simple Linear Regression.

2. Multiple Linear Regression:

  • It denotes the relationship between two or more independent variables and the corresponding dependent variable. The independent variables can be continuous or categorical.
  • The below graph shows multiple linear regression.



Assumptions of Linear Regression


Below are the main assumptions of linear regression:
  1. There is a linear relationship between X and Y.
  2. Error terms are normally distributed.
  3. Error terms are independent of each other. Change in one error term should not impact the other error terms.
  4. Error terms have constant variance i.e. variance shouldn’t increase or decrease as the error values change. Also, they shouldn’t follow any pattern.
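In equation form (the standard formulation, not specific to this dataset), the model is:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε

where the error term ε is assumed to be normally distributed with zero mean and constant variance σ².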
Now let's jump to our problem statement.

Problem Statement

To understand the variables which impact the pricing of cars.

You can download the raw data from the below link

Raw Data: CarData.csv

Python Implementation

Importing the libraries

Let's first import all the required libraries.
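A minimal sketch of the imports, assuming the standard pandas, seaborn, sklearn, and statsmodels stack used in the rest of this article:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score, mean_squared_error
from statsmodels.stats.outliers_influence import variance_inflation_factor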

Reading the data source
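Something along these lines, assuming CarData.csv (the file from the link above) sits in the working directory and the DataFrame is named cars:

# Read the raw car data downloaded from the link above.
cars = pd.read_csv("CarData.csv")

# Preview the first few rows.
cars.head()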

Here is the preview of the data





Columns and data type
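A quick way to inspect them, assuming the cars DataFrame from above:

# List every column with its data type and non-null count.
cars.info()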



Before any data analysis, we should perform Cleaning and Pre-Processing. You can refer to the Data Cleaning section in my previous EDA article.

Correlation and Data Visualization

The term “correlation” refers to a mutual relationship between two variables. If two variables have a high correlation, we can keep only one, as they will have the same impact on the target value. This also helps us reduce the dimensionality of the dataset.

There are multiple ways to visualize this correlation.

Heatmap
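A typical way to draw it with seaborn (a sketch; the original figure may have used different styling):

# Correlation heatmap across the numeric columns.
plt.figure(figsize=(16, 10))
sns.heatmap(cars.select_dtypes(include=np.number).corr(), annot=True, cmap="YlGnBu")
plt.show()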


Pairplot

We plot a pairplot graph for each independent variable against the dependent variable (here, the price column). This shows whether there is a linear relationship between the variables.

Note: In the below graphs we are trying to see the relationship of each variable with the dependent variable, hence we use a for loop to iterate through all the independent variables.
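A sketch of that loop, plotting each numeric variable against price:

# Plot each numeric independent variable against the dependent variable price.
numeric_cols = cars.select_dtypes(include=np.number).columns.drop("price")
for col in numeric_cols:
    sns.pairplot(cars, x_vars=[col], y_vars=["price"], height=4)
    plt.show()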

   

Pearson correlation coefficient

It is a measure of the linear correlation between two variables X and Y. According to the Cauchy–Schwarz inequality, it has a value between +1 and −1, where 1 is a total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
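One way to compute it for every numeric column against price, assuming the cars DataFrame:

# Pearson correlation of each numeric column with price.
cars.select_dtypes(include=np.number).corr()["price"].sort_values(ascending=False)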



Line Graph




Numerical vs Categorical Columns

Regression analysis needs numerical variables. Categorical variables are variables that classify observations into groups. There are many ways to convert categorical columns to numerical ones.

Binary Mapping

If the column has only two values, we can map one value to 1 and the other to 0.

The below columns have two types of values.
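A sketch of such a mapping; the column name fueltype and its values gas/diesel are purely illustrative, since the exact two-valued columns depend on the dataset:

# Tag one value as 1 and the other as 0.
cars["fueltype"] = cars["fueltype"].map({"gas": 1, "diesel": 0})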

Creating Dummy Encoding

In this approach, we convert each category value into a new column and assign a 1 or 0 (True/False) value to it. This has the benefit of not weighting a value improperly, but does have the downside of adding more columns to the data set.
In Python, we can use pandas' get_dummies method to convert a categorical column into dummy variables.
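A sketch using carbody as an illustrative multi-valued column (drop_first=True drops one redundant dummy column):

# Convert a categorical column into 0/1 dummy columns.
dummies = pd.get_dummies(cars["carbody"], drop_first=True, dtype=int)
cars = pd.concat([cars.drop(columns="carbody"), dummies], axis=1)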


After converting to dummy columns



Custom Mapping

We can also map the values of the column based on the data and business needs.
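For instance, an ordinal column spelling out numbers could be mapped to the numbers themselves; cylindernumber below is an illustrative assumption:

# Custom mapping chosen to match the business meaning of the values.
cars["cylindernumber"] = cars["cylindernumber"].map(
    {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "eight": 8, "twelve": 12})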



Data post changes


Model Building

We are using the sklearn library for creating the model. You can import the train_test_split method from the sklearn.model_selection module.

Since we don't have a separate validation data set here, we are splitting the data into two parts: train and test.

train: The data on which we train the model.
test: The data on which we test the model after the model is trained.

Also, we are performing min-max scaling so that all the variables are on the same scale.
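A sketch of both steps; the 70/30 split ratio, the random state, and the scaled column list are illustrative assumptions:

# Split the data into train and test sets.
df_train, df_test = train_test_split(cars, train_size=0.7, random_state=100)

# Min-max scale the numeric columns (including price) so everything lies between 0 and 1.
scaler = MinMaxScaler()
num_cols = ["enginesize", "curbweight", "carlength", "price"]
df_train[num_cols] = scaler.fit_transform(df_train[num_cols])
df_test[num_cols] = scaler.transform(df_test[num_cols])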

Understanding the relationship of variables with Dependent Variable

Let's start with one of the variables to see the linear relationship with the dependent variable.
Based on the prior knowledge obtained from the EDA above, we know that "enginesize" is in direct relation with the price.

So let's fit the model and see the metrics
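A sketch with statsmodels, assuming the scaled training data from above:

# Fit a simple OLS model of price on enginesize alone.
y_train = df_train["price"]
X_train_sm = sm.add_constant(df_train[["enginesize"]])
lr = sm.OLS(y_train, X_train_sm).fit()
lr.params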

The above code returns the below parameters:

const        -0.080076
enginesize    1.253896

This corresponds to the linear regression equation:

price = −0.0801 + 1.2539 × enginesize


We can see the OLS (Ordinary Least Squares) summary using the below code.
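Along these lines:

# Full OLS summary: R-squared, coefficient p-values, AIC, BIC, etc.
print(lr.summary())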


As per the above, we have a decent R-squared of 0.76, a p-value of 0.0, and negative AIC and BIC values. Now we need to understand how well the columns behave with each other.

So let's add more columns, look at the OLS summary, and see their impact.

"carlength" has too much of p-value i.e. greater than > 0.0.5, hence we will check how it behaves alone.

We can see that "carlength" has multicollinearity with other columns: its p-value increases under the influence of the others. Let's now see how we can remove this multicollinearity.

Identifying multicollinearity

A model may be created from different independent variables, but some of them might be interrelated, which makes those variables redundant.
  1. Correlation: We can look at the different pairs to check if the variables are correlated.
  2. Variance Inflation Factor (VIF): A variable might depend on a combination of other variables. VIF measures how well all the other independent variables combined describe one independent variable.

Variance Inflation Factor (VIF)

VIF is given by the below formula:

VIFᵢ = 1 / (1 − Rᵢ²)

where 'i' refers to the i-th variable which is being represented as a linear combination of the rest of the independent variables.
The general rules for VIF are:
  • Greater than 10: High VIF value. The variable should be eliminated.
  • Between 5 and 10: The variable should be inspected and, based on the context, a decision taken on whether to remove it.
  • Between 1 and 5: Good VIF value. No elimination required.
  • 1: Not correlated. No elimination required.
In Python, we can use the variance_inflation_factor method from the statsmodels.stats.outliers_influence module to get the VIF values of the feature variables.
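A sketch, assuming X_train is a DataFrame holding the candidate feature columns:

# Compute the VIF of every feature.
vif = pd.DataFrame()
vif["Features"] = X_train.columns
vif["VIF"] = [variance_inflation_factor(X_train.values, i)
              for i in range(X_train.shape[1])]
vif.sort_values(by="VIF", ascending=False)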

Sample VIF values for a few of the columns:


Thus we need to identify the column which has a high VIF value and remove it. If two columns have the same VIF, we can also consider the p-value: the variable with the higher p-value can be eliminated. Once the column is removed, we need to perform the OLS regression again and re-check the result summary. This process is repeated until all the multicollinearity is removed. If we have too many columns, performing VIF checks manually can be cumbersome. There is another technique, RFE, which does this work for us.

Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is an approach to selecting features. It works by recursively removing attributes and building a model on the attributes that remain.
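A sketch using sklearn's RFE on top of LinearRegression; the target of 10 features is an arbitrary illustrative choice:

# Feature matrix and target from the training data.
X_train = df_train.drop(columns="price")
y_train = df_train["price"]

# Recursively eliminate features down to the 10 most important ones.
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)

# Which columns survived, and the ranking of the eliminated ones.
list(zip(X_train.columns, rfe.support_, rfe.ranking_))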


Result of RFE:
rfe.support_ tells us which variables are selected as important for the model.

Let's see the summary of our linear model

Prediction

We can predict the y-value as below
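A sketch, assuming selected_cols holds the final RFE-selected column names and lr is the fitted statsmodels model:

# Build the test design matrix with the same columns as the training one.
X_test_sm = sm.add_constant(df_test[selected_cols])
y_test = df_test["price"]

# Predict the (scaled) prices for the test set.
y_pred = lr.predict(X_test_sm)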


Error Term

It's better to visualize the error terms to see if they follow the normal curve.
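For instance (X_train_sm here is assumed to be the design matrix of the final model):

# Residuals on the training data; ideally a bell curve centred at zero.
residuals = y_train - lr.predict(X_train_sm)
sns.histplot(residuals, kde=True)
plt.xlabel("Errors")
plt.show()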

The error seems to be fine, as it has a normal distribution and is centered at zero.

Model

Now let's see our model and how well it's performing.



The model seems to be performing well; let's dive into the performance metrics now.

R2 score

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean, while 100% indicates that it explains all of it.
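Computing it on the test set with sklearn:

# R-squared between the actual and predicted test prices.
r2_score(y_true=y_test, y_pred=y_pred)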




Mean Squared Error

The mean squared error or mean squared deviation of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.
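Again with sklearn:

# Average squared difference between actual and predicted prices.
mean_squared_error(y_true=y_test, y_pred=y_pred)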



Actual vs Predicted

Now let's see how well the model is predicting the values. The below graph shows that even though we have missed some points (error), overall we have a good prediction model.
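A sketch of such a plot:

# Overlay the actual and predicted prices for the test set.
plt.figure(figsize=(10, 5))
plt.plot(range(len(y_test)), y_test.values, label="Actual")
plt.plot(range(len(y_test)), y_pred, label="Predicted")
plt.legend()
plt.title("Actual vs Predicted")
plt.show()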



Error Terms



The errors do not follow any pattern and are randomly scattered, as expected for multiple linear regression.


Conclusion

Thus, based on the above, we conclude that the below multilinear equation is the best-fitted line for the model:

price = 0.7380 × curbweight + 0.5438 × enginefront − 0.0873 × wagon + 0.2759 × bmw + 0.1981 × buick + 0.1860 × jaguar − 0.0985

