Photo by Chris Liverani
This is the first post in the series Statistics Terms for Data Science. In this series, we are going to look at the statistics concepts that are used in data science.
Probability:
Probability is a measure of how likely an event is to happen.
Many events cannot be predicted with complete certainty; the best we can do is state the chance of an event occurring, i.e. how likely it is. Probability ranges from 0 to 1, where 0 means the event is impossible and 1 means the event is certain.
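To make this concrete, here is a minimal Python sketch (a hypothetical helper I'm adding, not something from the original post) that estimates a probability by simulation: roll a fair die many times and count how often a six comes up. The resulting fraction always falls between 0 and 1.

```python
import random

def estimate_probability(event, trials=100_000):
    """Estimate the probability of an event by repeating the experiment many times."""
    hits = sum(1 for _ in range(trials) if event())
    return hits / trials

# Event of interest: rolling a six on a fair six-sided die.
roll_is_six = lambda: random.randint(1, 6) == 6

p = estimate_probability(roll_is_six)
print(f"Estimated P(six) = {p:.3f}  (theoretical value {1/6:.3f})")
```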
Random Variables:
A random variable, generally denoted by X, is a mapping from the outcomes of an experiment to something measurable, typically a number.
For example, in a loan-repayment analysis we might observe the following outcomes:
1. Salary < 2 lakhs: fails to repay
2. Salary > 2 lakhs: can repay
So here we have two outcomes, "fails to repay" and "can repay", and we can assign a value to each of them, i.e. we can define the random variable X as, say, X = 0 if the borrower fails to repay and X = 1 if the borrower can repay.
This quantifies the data, so we can then perform statistical analysis on it.
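As a rough sketch of the same idea in code (the 2-lakh threshold and the 0/1 coding are taken from the example above), a random variable is just a function from outcomes to numbers:

```python
# Random variable X for the loan-repayment example:
# X = 0 -> fails to repay (salary below 2 lakhs), X = 1 -> can repay.
def X(salary_in_lakhs):
    return 1 if salary_in_lakhs > 2 else 0

print(X(1.5))  # 0: fails to repay
print(X(3.0))  # 1: can repay
```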
Probability Distributions:
A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of the variable vary based on the underlying probability distribution.
A probability distribution tells us the probability for all possible values of X.
E.g. for a single roll of a fair six-sided die, X takes the values 1 to 6, each with probability 1/6.
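A minimal sketch of such a distribution, using the fair-die example: every possible value of X gets a probability, and the probabilities add up to 1.

```python
# Probability mass function of a fair six-sided die: P(X = x) = 1/6 for x = 1..6.
pmf = {x: 1 / 6 for x in range(1, 7)}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p:.3f}")

print("Total probability:", sum(pmf.values()))  # 1.0 (up to floating-point rounding)
```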
Expected Value:
The expected value of a variable X is the value of X we can "expect" to get, on average, after performing the experiment. It is also called the expectation, mathematical expectation, mean, average, or first moment.
Let $X$ be a random variable with a finite number of finite outcomes $x_1, x_2, \ldots, x_k$ occurring with probabilities $p_1, p_2, \ldots, p_k$, respectively. The expected value of $X$ is defined as

$$E[X] = \sum_{i=1}^{k} x_i p_i = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k$$
E.g.
- Let $X$ represent the outcome of a roll of a fair six-sided die. More specifically, $X$ will be the number of pips showing on the top face of the die after the toss. The possible values for $X$ are 1, 2, 3, 4, 5, and 6, all of which are equally likely with a probability of 1/6. The expectation of $X$ is $E[X] = (1 + 2 + 3 + 4 + 5 + 6) \cdot \tfrac{1}{6} = 3.5$.
The expected value should be interpreted as the long-run average value you would get if the experiment were repeated a very large (in the limit, infinite) number of times.
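The following sketch computes E[X] for the die straight from the definition and then checks it against the long-run average of many simulated rolls (the die itself never shows 3.5; that is only the average):

```python
import random

# Expected value from the definition: E[X] = sum over x of x * P(X = x).
pmf = {x: 1 / 6 for x in range(1, 7)}
expected = sum(x * p for x, p in pmf.items())
print("E[X] from the definition:", expected)  # 3.5

# The average of many rolls approaches the expected value.
rolls = [random.randint(1, 6) for _ in range(100_000)]
print("Average of 100,000 simulated rolls:", sum(rolls) / len(rolls))
```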
Binomial Distribution:
A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution with exactly two possible outcomes on each trial.
Binomial distributions must also meet the following three criteria:
- The number of observations or trials is fixed. In other words, you can only figure out the probability of something happening if you do it a certain number of times. This is common sense: if you toss a coin once, your probability of getting tails is 50%. If you toss a coin 20 times, your probability of getting at least one tails is very, very close to 100%.
- Each observation or trial is independent. In other words, none of your trials have an effect on the probability of the next trial.
- The probability of success (tails, heads, fail or pass) is exactly the same from one trial to another.
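Here is a small sketch of the binomial probability formula, using only the standard library (math.comb needs Python 3.8+); n = 20 tosses and p = 0.5 are just the coin example from the list above:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.5  # 20 coin tosses, P(tails) = 0.5 on each toss

# Probability of at least one tails in 20 tosses: very close to 100%.
print(f"P(at least one tails) = {1 - binomial_pmf(0, n, p):.6f}")

# Probability of exactly 10 tails in 20 tosses.
print(f"P(exactly 10 tails)   = {binomial_pmf(10, n, p):.4f}")
```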
Cumulative Probability:
The cumulative probability of X, denoted by F(x), is defined as the probability of the variable taking a value less than or equal to x.
The cumulative distribution function of a real-valued random variable $X$ is the function given by

$$F_X(x) = P(X \le x) \qquad \text{(Eq. 1)}$$

where the right-hand side represents the probability that the random variable $X$ takes on a value less than or equal to $x$. The probability that $X$ lies in the semi-closed interval $(a, b]$, where $a < b$, is therefore

$$P(a < X \le b) = F_X(b) - F_X(a) \qquad \text{(Eq. 2)}$$
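A minimal sketch of the CDF, again for the fair die: F(x) is a running sum of the probabilities, and Eq. 2 gives the probability of an interval as a difference of two CDF values.

```python
# Cumulative distribution function of a fair die: F(x) = P(X <= x).
pmf = {x: 1 / 6 for x in range(1, 7)}

def F(x):
    return sum(p for value, p in pmf.items() if value <= x)

print("F(3) =", F(3))                   # P(X <= 3) = 0.5
# Eq. 2: P(a < X <= b) = F(b) - F(a)
print("P(2 < X <= 5) =", F(5) - F(2))   # 0.5
```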
The remaining topics will be discussed in the next blog in this series.

