
Basics of Machine Learning - Part II


Photo by Arseny Togulev

This is the second part of the series on the Basics of Machine Learning. Please refer to this link for part 1.

Assumptions of Simple Linear Regression


Below are the main assumptions of linear regression:
  1. There is a linear relationship between X and Y.
  2. Error terms are normally distributed.
  3. Error terms are independent of each other. Change in one error term should not impact the other error terms.
  4. Error terms have constant variance (homoscedasticity), i.e. the variance shouldn’t increase or decrease as the error values change, and the residuals shouldn’t follow any pattern. If the variance is not constant, the inferences made from the model would be unreliable. (A quick residual check for assumptions 2–4 is sketched after this list.)
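
The first assumption can be eyeballed with a scatter plot of X against Y; the others are checked on the residuals of a fitted model. Below is a minimal sketch on synthetic data; the choice of scipy's Shapiro-Wilk test and statsmodels' Durbin-Watson statistic is my own, not something this article prescribes.

import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Synthetic data that satisfies the assumptions by construction
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 100)   # linear relation + Gaussian noise

# Fit a straight line and compute the residuals (error terms)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Assumption 2: residuals normally distributed (Shapiro-Wilk test)
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Assumption 3: residuals independent (Durbin-Watson near 2 means no autocorrelation)
print("Durbin-Watson:", durbin_watson(residuals))

# Assumption 4: plot residuals against fitted values; a patternless, even band suggests constant variance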

Checking the Model Fit:

We should always make sure that the model fit is really good and not just by chance. There are several ways to test this.

T-Test:

The t-distribution is similar in shape to the normal distribution: it is also symmetric and single-peaked, but less concentrated around its peak. A t-distribution is shorter and flatter around the centre than a normal distribution, with heavier tails.

Two simple conditions to determine when to use the t-statistic are:
  • The population standard deviation is unknown
  • The sample size is less than 30
If the above conditions are met, we can use the t-test:

Formula:

t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.

Using this formula we compute the t-value, compare it against the critical value from the t-table, and reject or fail to reject the null hypothesis.

You can use this link to find the t-value for a given confidence level.
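
As a minimal sketch of the recipe above (scipy's ttest_1samp is my choice here; the article itself doesn't prescribe a library), a one-sample t-test on synthetic data looks like this:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=52, scale=5, size=25)   # n < 30 and population sigma unknown

# One-sample t-test: H0 says the population mean is 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # reject H0 at the 5% level if p < 0.05

The library computes the same t = (x̄ − μ) / (s / √n) internally and converts it to a p-value, which saves the manual table lookup.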

Analysis of variance (ANOVA)

An ANOVA test is a way to find out whether survey or experiment results are significant. In other words, it helps you figure out whether you need to reject the null hypothesis or accept the alternative hypothesis.

Basically, you’re testing groups to see if there’s a difference between them.

There are two main types: one-way and two-way. Two-way tests can be with or without replication.
  • One-way ANOVA between groups: used when you want to test two or more groups to see if there’s a difference between them (a minimal one-way example follows this list).
  • Two-way ANOVA without replication: used when you have one group and you’re double-testing that same group. For example, you’re testing one set of individuals before and after they take medication to see if it works or not.
  • Two-way ANOVA with replication: two groups, and the members of those groups are doing more than one thing. For example, two groups of patients from different hospitals trying two different therapies.
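
To make the one-way case concrete, here is a hedged sketch using scipy's f_oneway; the three tariff-plan groups and their values are made up for illustration.

from scipy import stats

# Hypothetical monthly data usage (GB) for customers on three tariff plans
plan_a = [12.1, 14.3, 11.8, 13.5, 12.9]
plan_b = [15.2, 16.1, 14.8, 15.9, 16.4]
plan_c = [12.5, 13.0, 12.2, 13.8, 12.7]

# One-way ANOVA: H0 says all group means are equal
f_stat, p_value = stats.f_oneway(plan_a, plan_b, plan_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")   # small p -> at least one mean differs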

Anscombe's Quartet: Visualize your data

It is a group of four datasets that all share the same summary statistics, yet look completely different when graphed.

[Figure: scatter plots of the four Anscombe datasets]

Before performing linear regression, we should always visualize the data rather than trusting summary statistics alone, because summary statistics are sensitive to outliers and linear regression models only linear relationships.

[Figure: same summary statistics, very different graphs]

SOURCE: J. MATEJKA AND G. FITZMAURICE (2017)

To read more on this, refer to this link.
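
Seaborn happens to ship Anscombe's quartet as a built-in example dataset, so it takes only a few lines to verify the claim yourself (assuming seaborn and matplotlib are installed):

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("anscombe")

# Nearly identical summary statistics across the four datasets...
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "var"]))

# ...but the fitted lines sit on top of completely different point clouds
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
plt.show()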

F-Statistic

We can define the F-statistic as:

F = variation between the sample means / variation within the samples

In Python, using statsmodels.api, we can fit an Ordinary Least Squares (OLS) model as shown below.

import statsmodels.api as sm

# X must include a constant column (e.g. X = sm.add_constant(x)) so the fit has an intercept
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.020e+06
Date:                Fri, 13 Mar 2020   Prob (F-statistic):          2.83e-239
Time:                        13:54:01   Log-Likelihood:                -146.51
No. Observations:                 100   AIC:                             299.0
Df Residuals:                      97   BIC:                             306.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.3423      0.313      4.292      0.000       0.722       1.963
x1            -0.0402      0.145     -0.278      0.781      -0.327       0.247
x2            10.0103      0.014    715.745      0.000       9.982      10.038
==============================================================================
Omnibus:                        2.042   Durbin-Watson:                   2.274
Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.875
Skew:                           0.234   Prob(JB):                        0.392
Kurtosis:                       2.519   Cond. No.                         144.
==============================================================================


Here we can see that the Prob (F-statistic) value is well below 0.05, hence the model is significant and not just a fit by chance.

Here Prob (F-statistic) means the p-value of the F-statistic.

R-Squared:

The value of R-squared varies from 0 to 1, with 1 meaning the line fits the data points perfectly and 0 meaning the model explains none of the variance in the data.

The R-squared value only tells you how much of the variance is explained by the straight line you have fit. It says nothing about the significance of the fit: a model with a large R-squared value might still be insignificant, i.e. just a fit by chance.
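
For intuition, R-squared is just 1 − SS_res / SS_tot, and you can compute it by hand. The r_squared helper below is hypothetical, written for illustration only; statsmodels reports the same quantity as results.rsquared.

import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the share of variance explained by the fit."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

y_true = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
y_pred = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # predictions from some fitted line
print(r_squared(y_true, y_pred))                # close to 1 -> most variance explained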
