Understanding The Mathematical Concepts of Data Science: Probability and Gaussian distribution.

8 min read · Mar 13, 2021

People often argue about how much math is needed for data science. You do not necessarily need high-level math to solve problems and derive insights from data. However, I believe you need sufficient knowledge of mathematics and statistics to enter the data science field. This article therefore aims to provide an overview of mathematical concepts every data scientist should know.

To understand the mathematical concepts of data science, we first need to answer one question: what is data? After all, data scientists use data to derive insights and solve problems.

Data are units of information, usually numeric, that are collected through observation. Data are measured, collected, and analysed, after which they can be visualised using graphs, images, or other analysis tools. Because data can be analysed, the importance of mathematical concepts in working with data to derive insights and solve problems can’t be over-emphasised. We therefore need to understand the role of statistics in data science.

Statistics

Statistics is the collection, analysis, and interpretation of data. It is a fundamental tool of data scientists, who are expected to gather and analyse large amounts of structured and unstructured data and report on their findings. It should therefore be no surprise that data scientists need to know statistics. Key concepts include probability distributions, statistical significance, and regression. In this article, we will look at probability and the Gaussian distribution.

Probability

Probability is an intuitive concept that we use unconsciously in our daily lives. Probability is a measure of how likely an event is to occur. For example, if there is a 40% chance of rain tomorrow, then the probability of rain is 40%. The outcome of a random event, such as tossing a coin or rolling a die, cannot be determined before it occurs; the actual outcome is determined by chance.

Now that we have laid the groundwork, let’s dive into the concepts of probability:

The empirical (or experimental) probability of an event is an “estimate” of how likely the event is to happen, based on how often it occurs when we collect data or run an experiment over a large number of trials. It is based specifically on direct observations. To find the empirical probability of any event E (like a coin landing heads up), we use the formula:

$$P(E) = \frac{\text{number of times } E \text{ occurred}}{\text{total number of trials}}$$

As a simple example of an experimental probability, let’s estimate the probability of a coin landing on tails. We can take the following steps:

· Toss the coin many times (thus repeating the random experiment).

· Count the number of times the coin landed on tails.

· Divide the number of tails by the total number of times we tossed the coin.

Let’s say we tossed a coin 100 times and got tails 38 times. We find the empirical probability of the coin landing tails up by dividing the number of tails (38) by the total number of tosses (100).

This is 38/100 = 0.38, so the estimated probability of the coin landing tails up is 0.38.

Expressing a probability as a percentage often makes it easier to interpret. For instance, P(tails) = 38% tells us that, based on this experiment, we estimate a 38% chance that any given toss lands tails up.
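The three steps above can be sketched in Python. This is a minimal simulation, assuming a fair coin modelled with the standard library's `random` module; the function name `estimate_tails_probability` and the seed value are illustrative choices, not part of the original experiment.

```python
import random


def estimate_tails_probability(num_tosses, seed=None):
    """Estimate P(tails) by simulating tosses of a fair coin (an assumption)."""
    rng = random.Random(seed)
    # Step 1: toss the coin many times.
    tosses = [rng.choice(["heads", "tails"]) for _ in range(num_tosses)]
    # Step 2: count the number of times the coin landed on tails.
    num_tails = tosses.count("tails")
    # Step 3: divide the number of tails by the total number of tosses.
    return num_tails / num_tosses


for n in (100, 10_000, 100_000):
    print(n, estimate_tails_probability(n, seed=42))
```

With only a few tosses the estimate can land well away from 0.5, as the 0.38 in the example did; it tends to settle closer to 0.5 as the number of tosses grows.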

The example above illustrated experimental probability. However, calculating experimental probabilities requires us to perform a random experiment many times, which may not always be realistic in practice. With theoretical probability, you do not conduct an experiment. Instead, you use what you know about the situation to determine the probability of an event occurring. The theoretical probability of an event is a “predicted” probability based on knowledge of the situation. It is the ratio of the number of favourable outcomes to the number of possible outcomes. There may be only one favourable outcome, as when we ask for a head on a single coin toss.

This allows us to use the following formula to calculate the probability of an event E:

$$P(E) = \frac{\text{number of favourable outcomes}}{\text{total number of possible outcomes}}$$

Note: the formula above works under the assumption that the outcomes have equal chances of occurring.

As an example of theoretical probability, let us find the probability of tossing a head with a fair coin. No experiment is needed: there are 2 possible outcomes when tossing a coin, head and tail, and only one of them is a head. So,

$$P(\text{head}) = \frac{1}{2} = 0.5$$

Under ideal circumstances, we would expect to toss one head out of every 2 coin tosses.
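The ratio of favourable to possible outcomes can be written as a small Python helper. This is only a sketch: the name `theoretical_probability` is hypothetical, and the function is valid only under the assumption that every outcome in the sample space is equally likely.

```python
from fractions import Fraction


def theoretical_probability(favourable, sample_space):
    """P(E) = favourable outcomes / possible outcomes, assuming equally likely outcomes."""
    sample_space = set(sample_space)
    # Only outcomes that actually belong to the sample space can be favourable.
    favourable = set(favourable) & sample_space
    return Fraction(len(favourable), len(sample_space))


p_head = theoretical_probability({"head"}, {"head", "tail"})
print(p_head, float(p_head))  # 1/2 0.5
```

Using `Fraction` keeps the result exact, which makes it easy to compare with hand calculations such as 1/2.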

As another example of theoretical probability, let us look at the sum observed when rolling two standard six-sided dice. Each die has a 1/6 probability of rolling any single number, one through six, but the sum of the two dice follows the probability distribution depicted in the image below.

Graphical illustration of rolling a die twice

Seven is the most common outcome, because six combinations of the two dice add up to seven (1 + 6, 6 + 1, 5 + 2, 2 + 5, 3 + 4, 4 + 3). With 6 × 6 = 36 equally likely combinations in total, the probability of rolling a seven is 6 out of 36.

Theoretical probability = 6/36 = 1/6 ≈ 0.167
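The probability of each sum can be checked by enumerating every combination of two dice. The sketch below assumes two fair six-sided dice and uses only the Python standard library; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) combinations.
outcomes = list(product(range(1, 7), repeat=2))
# How many combinations produce each possible sum (2 through 12).
sum_counts = Counter(a + b for a, b in outcomes)

for total in sorted(sum_counts):
    probability = Fraction(sum_counts[total], len(outcomes))
    print(f"sum={total:2d}  ways={sum_counts[total]}  P={probability}")

p_seven = Fraction(sum_counts[7], len(outcomes))
print(p_seven, round(float(p_seven), 3))  # 1/6 0.167
```

The counts rise from one way to roll a 2 up to six ways to roll a 7, then fall again to one way to roll a 12, which gives the distribution its triangular shape.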

In conclusion, theoretical probability rests on the assumption that outcomes have an equal chance of occurring, while empirical probability is based on the observations of an experiment. There are two other perspectives on probability: axiomatic probability and subjective probability. You can read more about them here.

What Is a Probability Distribution?

A probability distribution is a statistical function that describes all the possible values a random variable can take within a given range, together with their likelihoods. This range is bounded by the minimum and maximum possible values, but precisely where a value is likely to fall on the probability distribution depends on a number of factors. These factors include the distribution’s mean, standard deviation, skewness, and kurtosis.

The most common probability distribution is the Gaussian (or normal) distribution, or “bell curve,” although several other distributions, such as the binomial, geometric, and Poisson, are also commonly used. Now let’s learn about this most common probability distribution in statistics.

Gaussian distribution

The Gaussian or normal random variable is arguably the most popular random variable in all of probability and statistics. It is often used to model variables with unknown distributions in the natural sciences. The graph of the Gaussian distribution is perfectly symmetric: if folded at the middle, the two halves coincide, since one-half (1/2) of the observed data falls on each side of the graph.

Normal distribution graph of birthweight in kg

Samples from an ideal Gaussian (or normal) distribution follow a bell curve, which means that most of the observed data is clustered near the mean, while data become less frequent farther away from the mean. For example, the figure above shows birthweight values in kilograms (kg) generated with a Gaussian distribution; the bell curve peaks at the mean birthweight (3.39 kg).
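The clustering around the mean can be seen by drawing samples in Python. The sketch below uses the 3.39 kg mean quoted for the figure; the standard deviation of 0.55 kg is an assumed value chosen purely for illustration, since the spread is not stated, and `share_within` is a hypothetical helper.

```python
import random
import statistics

MEAN_KG = 3.39   # mean birthweight quoted in the text
STDEV_KG = 0.55  # assumed spread, for illustration only

rng = random.Random(0)
weights = [rng.gauss(MEAN_KG, STDEV_KG) for _ in range(50_000)]


def share_within(samples, centre, distance):
    """Fraction of samples lying within `distance` of `centre`."""
    return sum(abs(x - centre) <= distance for x in samples) / len(samples)


print("sample mean:", round(statistics.fmean(weights), 2))
for k in (1, 2, 3):
    share = share_within(weights, MEAN_KG, k * STDEV_KG)
    print(f"within {k} standard deviation(s): {share:.3f}")
```

For a normal distribution these shares approach roughly 68%, 95%, and 99.7%, so values far from the mean become rare quickly.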

Because the normal distribution approximates many natural events so well, it has become a standard reference for many probability problems. Commonly cited examples include:

· Weights of people in a population

· The sum of many dice rolls (a single roll of a fair die is uniform, not bell-shaped)

· Intelligence quotient (IQ) scores of children

· Income distribution among the poor and rich in a country’s economy

· Men’s shoe sizes

· Weights of newborn babies

The Gaussian distribution has two main parameters: the mean and the standard deviation. The mean is usually represented by μ and the standard deviation by σ. The equation below is the mathematical formula for the Gaussian probability density function.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
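The Gaussian probability density function can be evaluated directly from μ and σ. The sketch below is one straightforward way to write it in Python; `gaussian_pdf` is an illustrative name, and the result is cross-checked against `statistics.NormalDist` from the standard library.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x, mu, sigma):
    """Density at x of a normal distribution with mean mu and standard deviation sigma."""
    # f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Standard normal (mu=0, sigma=1) evaluated at its mean.
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))  # 0.3989
# Cross-check against the standard library's implementation.
print(math.isclose(gaussian_pdf(1.5, 0.0, 1.0), NormalDist(0.0, 1.0).pdf(1.5)))  # True
```

Note that the function returns a density, not a probability: probabilities come from the area under the curve over an interval.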

1. Mean (μ)

The mean is used as a measure of central tendency. It can be used to describe the distribution of values measured on ratio or interval scales. In a normal distribution graph, the mean defines the location of the peak, and most data points are clustered around the mean. Any change to the mean moves the curve to the left or right along the x-axis.

2. Standard Deviation (σ)

The standard deviation measures the dispersion of the data points relative to the mean. It represents the typical distance between the mean and the observations.

In a normal distribution graph, the standard deviation determines the width of the curve: it tightens or expands the distribution along the x-axis. Generally, a small standard deviation produces a tall, narrow curve, while a large standard deviation produces a flatter, wider curve.
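The roles of the two parameters can be checked numerically. The following sketch uses `statistics.NormalDist` from the Python standard library; the particular values of μ and σ are arbitrary choices for illustration.

```python
from statistics import NormalDist

# Same standard deviation, different means: the peak moves along the x-axis
# but keeps the same height.
for mu in (-2.0, 0.0, 2.0):
    dist = NormalDist(mu, 1.0)
    print(f"mu={mu:+.1f}  peak at x={dist.mean:+.1f}  peak height={dist.pdf(mu):.3f}")

# Same mean, different standard deviations: the curve gets wider and flatter.
for sigma in (0.5, 1.0, 2.0):
    dist = NormalDist(0.0, sigma)
    low, high = dist.inv_cdf(0.025), dist.inv_cdf(0.975)
    print(f"sigma={sigma}  peak height={dist.pdf(0.0):.3f}  middle 95% spans {high - low:.2f}")
```

Doubling σ halves the peak height and doubles the width of the interval that holds the middle 95% of values, while changing μ leaves the shape untouched.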

The Gaussian distribution is important in data science for the following reasons:

1. The Gaussian distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores approximately follow the normal distribution.

2. Conclusions and summaries derived from analyses based on the normal distribution are intuitive and easy to explain to audiences with basic knowledge of statistics.

3. The Gaussian is preferred because it makes the math a lot simpler. For example:

- Its mean, median, and mode are all the same.

- The entire distribution can be specified using just two parameters (mean and standard deviation).
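Both simplifications can be seen with `statistics.NormalDist` from the Python standard library. In this sketch the mean of 100 and standard deviation of 15 are assumed values chosen for illustration only.

```python
from statistics import NormalDist

# Two numbers fully describe the distribution (values assumed for illustration).
dist = NormalDist(mu=100.0, sigma=15.0)

# Mean, median, and mode coincide for a normal distribution.
print(dist.mean, dist.median, dist.mode)  # 100.0 100.0 100.0
# Any probability follows from those two parameters, e.g. P(X <= 115).
print(round(dist.cdf(115.0), 4))  # 0.8413
```

No further information about the data is needed once the two parameters are fixed, which is what makes the distribution so convenient to work with.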

Conclusion

In this article, we have familiarised ourselves with the basics of probability and the Gaussian distribution, which are first steps for beginners starting their probability and statistics journey in data science. I hope you found this helpful. Thanks for reading!

References

https://www.dataquest.io/blog/learn-statistics-probability-data-science-course/

https://cims.nyu.edu/~cfgranda/pages/stuff/probability_stats_for_DS.pdf

https://www.kdnuggets.com/2018/06/why-data-scientists-love-gaussian.html