

Basics of Machine Learning - Part I


Photo by Franck V.

In this post series, we are going to learn the basics of Machine Learning. Let's start with what Machine Learning is:
“Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.”
 Below are the top practical definitions from different sources:
  • “Machine Learning at its most basic is the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world.” – Nvidia 
  • “Machine learning is the science of getting computers to act without being explicitly programmed.” – Stanford
  • “Machine learning is based on algorithms that can learn from data without relying on rules-based programming.”- McKinsey & Co.
  • “Machine learning algorithms can figure out how to perform important tasks by generalizing from examples.” – University of Washington
  • “The field of Machine Learning seeks to answer the question “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” – Carnegie Mellon University

Types of Machine Learning algorithms we use:

  • Regression:  In regression, we predict the value based on the prior data, like predicting marks of a student based on their marks in the past five years. The value to be predicted is a continuous variable. Continuous variables are numeric variables that have an infinite number of values between any two values e.g. scores of a student, the length of a part or the date and time a payment is received. 
  • Classification: In classification, we assign a label to the output value, e.g. classifying an email as spam or ham.
  • Clustering: There is no pre-defined label in clustering; we just group the data based on some segmentation criteria, e.g. customer segmentation.

Types of Learning Methods in Machine Learning:

  • Supervised:
    • In supervised learning, we have past data with labels on which to train the model.
    • Regression and classification algorithms fall under this category.
  • Unsupervised:
    • No pre-defined labels are assigned to the past data.
    • We derive the information by processing the data, as we don't have any prior labels.
    • Clustering algorithms fall under this category (see the sketch below).
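
To make the distinction concrete, here is a minimal sketch using scikit-learn; the tiny arrays below are made up purely for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5]])       # input (independent variable)
y_cont = np.array([1.2, 1.9, 3.1, 4.0, 5.2])  # continuous target -> regression
y_label = np.array([0, 0, 0, 1, 1])           # discrete labels -> classification

# Supervised: models are trained on inputs paired with known outputs
reg = LinearRegression().fit(X, y_cont)
clf = LogisticRegression().fit(X, y_label)

# Unsupervised: no labels; the algorithm groups the data on its own
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(reg.predict([[6]]), clf.predict([[6]]), km.labels_)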

Regression:

Regression tries to find the output variable based on past data.
We can plot the historical data points on a graph to see the relationship between the output(dependent variable) and the input variable (independent variable).


A linear regression model attempts to explain the relationship between a dependent and an independent variable using a straight line.




The independent variable is also known as the predictor variable, and the dependent variable is also known as the output (or response) variable.

You can refer to this link for the implementation of Linear Regression in Python.
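
Until then, here is a minimal sketch of fitting a straight line with numpy (the data points are made up for illustration):

import numpy as np

x = np.array([1, 2, 3, 4, 5])            # independent variable (predictor)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # dependent variable (output)

b1, b0 = np.polyfit(x, y, deg=1)         # least-squares fit of a degree-1 line
print(f"y = {b0:.2f} + {b1:.2f}x")       # y = 0.15 + 1.95x for this data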

Best Fit Line

Now our goal should be to identify the line, which fits the scatter plot in the best way.




To find the best-fit line, we should make sure the line has the lowest total residual.

We need to calculate the difference between each actual point and the predicted point (the point on the line).

We can calculate the error (or residual) as eᵢ = yᵢ − ŷᵢ, where ŷᵢ is the value predicted by the line.

Hence, to find the best-fit line we use the Ordinary Least Squares (OLS) method, which minimizes the sum of squared errors.

RSS (Residual Sum of Squares) = e₁² + e₂² + e₃² + … + eₙ²

We already know that a line can be represented by Y = β₀ + β₁X.

Substituting Y = β₀ + β₁X into the residuals, the final equation becomes:

RSS = Σᵢ (yᵢ − β₀ − β₁xᵢ)²

So we need to minimize this RSS to get the best fit line.
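
As a quick illustration, the RSS for a candidate line can be computed directly in numpy (reusing the made-up data from the earlier sketch):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

b0, b1 = 0.15, 1.95            # candidate intercept and slope
residuals = y - (b0 + b1 * x)  # e_i = y_i - y_pred_i
rss = np.sum(residuals ** 2)   # RSS = e1^2 + e2^2 + ... + en^2
print(rss)                     # about 0.075 for this data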

Cost Function:

The RSS stated above is a cost function. The cost function varies according to the algorithm being used: here, for fitting a straight line, it is the sum of squared errors, but it will differ from algorithm to algorithm.

To minimize the cost function we need to perform differentiation. 

Let's take an example of a function with one variable.

 

For example, for J(θ) = θ², setting the derivative dJ/dθ = 2θ to zero shows that the function J(θ) is at its minimum when θ = 0.

Similarly, for the two-variable RSS function J(m, c) = Σᵢ [yᵢ − (mxᵢ + c)]², we can get the minimum by setting the partial derivatives ∂J/∂m and ∂J/∂c to zero.

This leaves us with two equations in two variables (m, c), which can be solved as a system of linear equations.
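
Solving those two equations gives the familiar closed-form (normal-equation) solution for simple linear regression, sketched below with the same made-up data:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# From the two derivative equations:
#   m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   c = y_mean - m * x_mean
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()
print(m, c)  # 1.95 and 0.15 for this data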

We can minimize a cost function using the techniques below:

  1. Closed form method: The function to be minimised is simply differentiated and equated to 0 to obtain a solution. The second derivative is also checked to confirm that it is greater than 0, which ensures the solution is a minimum.
  2. Gradient Descent: Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function that minimize a given cost function. It is an iterative minimisation method which reaches the minima step by step. We start with an initial assumed value of the parameter, which can be anything (say x₀), and we choose a learning rate α. For the current value x, we calculate the output of the differentiated function, f′(x), and the new value of the parameter becomes x − α·f′(x). We continue the process until the algorithm reaches an optimum point, i.e. the value of the parameter does not change effectively after this point.

The α parameter is the learning rate, and its magnitude defines the size of each iterative step. The range of α is (0, 1), but large values of α (for example > 0.5) are not favoured, because the algorithm may overshoot and miss the minima, forcing the iterative search to start again.
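
Here is a tiny sketch of this update rule on the one-variable function f(x) = (x − 3)², whose minimum is at x = 3 (the starting point and α are arbitrary choices):

# Gradient descent on f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3)
x = 0.0                # initial assumed value x0
alpha = 0.1            # learning rate
for step in range(50):
    grad = 2 * (x - 3)    # f'(x) at the current x
    x = x - alpha * grad  # update: x <- x - alpha * f'(x)
print(x)                  # approaches 3, the minimum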



Gradient Descent in Python

The same approach can be implemented in Python for the linear regression cost function:

import pandas as pd

# Takes in X, y, current m and c (both initialised to 0), number of iterations and the learning rate
# Returns a DataFrame with m, c and the cost at each iteration
def gradient(X, y, m_current=0, c_current=0, iters=1000, learning_rate=0.01):
    N = float(len(y))
    gd_df = pd.DataFrame(columns=['m_current', 'c_current', 'cost'])
    for i in range(iters):
        y_current = (m_current * X) + c_current                # predictions with current m, c
        cost = sum([data**2 for data in (y - y_current)]) / N  # mean squared error
        m_gradient = -(2/N) * sum(X * (y - y_current))         # partial derivative w.r.t. m
        c_gradient = -(2/N) * sum(y - y_current)               # partial derivative w.r.t. c
        m_current = m_current - (learning_rate * m_gradient)   # update m
        c_current = c_current - (learning_rate * c_gradient)   # update c
        gd_df.loc[i] = [m_current, c_current, cost]
    return gd_df
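
A quick usage sketch (X and y are assumed to be numpy arrays; the data is made up):

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

history = gradient(X, y, iters=2000, learning_rate=0.01)
print(history.tail(1))  # m and c approach the closed-form values (about 1.95 and 0.15)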

TSS, RSS and R Square



TSS (Total Sum of Squares) is the sum of the squared differences of the actual values from their mean: TSS = Σᵢ (yᵢ − ȳ)². We can calculate R² from TSS and RSS as R² = 1 − RSS/TSS. R² explains how much of the variance in the data the model captures, and hence how good the model is: the closer R² is to 1, the better the fit.
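
In numpy, continuing the earlier made-up example:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
y_pred = 0.15 + 1.95 * x           # predictions from the fitted line

rss = np.sum((y - y_pred) ** 2)    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - rss / tss
print(r2)                          # about 0.998: the line explains most of the variance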


Negative R Square

A negative R² arises when we compare the fit of the chosen model with that of a horizontal straight line (the null hypothesis): if the chosen model fits worse than a horizontal line, then R² is negative. Note that R² is not always the square of anything, so it can have a negative value without violating any rules of math. R² is negative only when the chosen model does not follow the trend of the data and so fits worse than a horizontal line.


Example: fit data to a linear regression model constrained so that the Y intercept must equal 1500.


The model makes no sense at all given these data. It is clearly the wrong model, perhaps chosen by accident.

The fit of the model (a straight line constrained to go through the point (0, 1500)) is worse than the fit of a horizontal line. Thus the sum-of-squares from the model (RSS) is larger than the sum-of-squares from the horizontal line (TSS).

R² is computed as R² = 1 − RSS/TSS, where RSS is the residual sum-of-squares and TSS is the total sum-of-squares around the mean (the horizontal-line fit).
When RSS is greater than TSS, that equation computes a negative value for R².

With linear regression with no constraints, R² must be positive (or zero) and equals the square of the correlation coefficient, r. A negative R² is only possible with linear regression when either the intercept or the slope is constrained so that the "best-fit" line (given the constraint) fits worse than a horizontal line. With nonlinear regression, R² can be negative whenever the best-fit model (given the chosen equation, and its constraints, if any) fits the data worse than a horizontal line.
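
A quick way to see a negative R² in Python is scikit-learn's r2_score, fed predictions that fit worse than the horizontal mean line (the numbers are made up):

from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_bad = [5.0, 4.0, 3.0, 2.0, 1.0]  # trends opposite to the data

print(r2_score(y_true, y_bad))     # -3.0: fits worse than predicting the mean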

Next:

The remaining topics are covered in the next blog.
