This is the second part of the series on the Basics of Machine Learning. Please refer to this
link for part 1.
Assumptions of Simple Linear Regression
Below are the main assumptions of linear regression:
- There is a linear relationship between X and Y.
- Error terms are normally distributed.
- Error terms are independent of each other. Change in one error term should not impact the other error terms.
- Error terms have constant variance, i.e. the variance shouldn't increase or decrease as the error values change, and the residuals shouldn't follow any pattern. If the variance is not constant (i.e. not homoscedastic), inferences made from the model would be unreliable.
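These assumptions can be checked on the residuals of a fitted model. As a minimal sketch (using synthetic data, with a Shapiro-Wilk test standing in for a formal normality check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)  # synthetic data with a true linear relationship

# Fit a simple line and compute the residuals (error terms)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Normality of error terms: Shapiro-Wilk (p > 0.05 => no evidence against normality)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

# Constant variance: plot residuals against fitted values;
# a funnel shape would indicate heteroscedasticity.
```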
Checking the Model Fit:
We should always make sure that the model fit is really good and is not just by chance. There are many ways to test this.
T-Test:
The t-distribution is similar in shape to the normal distribution; it is also symmetric and single-peaked, but less concentrated around its peak. A t-distribution is shorter and flatter around the centre than a normal distribution.
Two simple conditions to determine when to use the t-statistic are:
- Population standard deviation is unknown
- The sample size is less than 30
If the above conditions are met, we can use the t-test.
Formula:
t = (x̄ − μ) / (s / √n)
where x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.
Using this formula we can compute the t-value, compare it against the critical value from the t-table, and reject or fail to reject the null hypothesis.
You can use this link to find the t-value for a given confidence level.
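As a sketch, a one-sample t-test can be computed by hand using the formula above and checked against scipy's implementation (the sample data is made up for illustration):

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])
mu = 12.0  # hypothesized population mean

# t = (x_bar - mu) / (s / sqrt(n)), using the sample standard deviation
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)
t_manual = (x_bar - mu) / (s / np.sqrt(n))

# Compare with scipy's built-in one-sample t-test
t_scipy, p_value = stats.ttest_1samp(sample, mu)
print(t_manual, t_scipy, p_value)
```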
Analysis of variance (ANOVA)
An ANOVA test is a way to find out whether the results of a survey or experiment are significant. In other words, it helps you figure out whether you need to reject the null hypothesis or accept the alternative hypothesis.
Basically, you’re testing groups to see if there’s a difference between them.
There are two main types: one-way and two-way. Two-way tests can be with or without replication.
One-way ANOVA between groups: used when you want to test two or more independent groups to see if there's a difference between them.
Two-way ANOVA without replication: used when you have one group and you're double-testing that same group. For example, you're testing one set of individuals before and after they take medication to see if it works or not.
Two-way ANOVA with replication: two groups, and the members of those groups are doing more than one thing. For example, two groups of patients from different hospitals trying two different therapies.
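A one-way ANOVA between groups can be run with `scipy.stats.f_oneway`; the three groups below are made-up illustrative samples:

```python
from scipy import stats

# Three independent groups (illustrative scores)
group_a = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group_b = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group_c = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# If p < 0.05, reject the null hypothesis that all group means are equal
```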
Anscombe's Quartet: Visualize your data
It is a group of four datasets that all share the same summary statistics, yet look completely different in graphs.
Before performing linear regression, we should always visualize the data and not blindly trust the summary statistics, because summary statistics are sensitive to outliers and capture only linear relationships.
[Figure: datasets with identical summary statistics but very different distributions. Source: J. Matejka and G. Fitzmaurice (2017)]
To read more on this, refer to this link.
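Anscombe's quartet is easy to verify numerically. Sets I and IV below (values from Anscombe's original 1973 paper) have nearly identical means and correlations despite looking completely different when plotted:

```python
import numpy as np

# Anscombe's quartet, sets I and IV
x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

print(np.mean(x1), np.mean(x4))  # both 9.0
print(np.mean(y1), np.mean(y4))  # both ~7.50
print(np.corrcoef(x1, y1)[0, 1], np.corrcoef(x4, y4)[0, 1])  # both ~0.816
```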
F Statistics
We can define F Statistic as:
F = Variation between the sample means/variation within the samples
In Python, using statsmodels.api, we can fit an Ordinary Least Squares (OLS) model as below.
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.020e+06
Date: Fri, 13 Mar 2020 Prob (F-statistic): 2.83e-239
Time: 13:54:01 Log-Likelihood: -146.51
No. Observations: 100 AIC: 299.0
Df Residuals: 97 BIC: 306.8
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.3423 0.313 4.292 0.000 0.722 1.963
x1 -0.0402 0.145 -0.278 0.781 -0.327 0.247
x2 10.0103 0.014 715.745 0.000 9.982 10.038
==============================================================================
Omnibus: 2.042 Durbin-Watson: 2.274
Prob(Omnibus): 0.360 Jarque-Bera (JB): 1.875
Skew: 0.234 Prob(JB): 0.392
Kurtosis: 2.519 Cond. No. 144.
==============================================================================
Here we can see that the Prob (F-statistic) value is less than 0.05, hence the model is significant and is not just a fit by chance.
Here Prob (F-statistic) means the p-value of the F-statistic.
R-Squared:
The value of R-squared varies from 0 to 1, with 1 being a perfect fit to the data points and 0 being the worst fit to the data points.
R-squared value only tells you how much variance is being explained by the straight line that you have fit. It does not tell anything about the significance of fit. It just might happen that a model with a large R-squared value might still be insignificant, i.e. just a fit by chance.
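As a sketch, R-squared is the fraction of variance explained, 1 − SS_res / SS_tot, computed here by hand on made-up observed values and predictions:

```python
import numpy as np

y = np.array([3.1, 4.2, 5.0, 6.1, 6.9])       # observed values
y_pred = np.array([3.0, 4.0, 5.1, 6.0, 7.0])  # predictions from a fitted line

ss_res = np.sum((y - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)          # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```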