Some Important Probability and Statistical Concepts in Data Science

5 min read · Mar 13, 2021


A Beginner’s Overview

Mathematics is the bedrock of contemporary science disciplines.

Remember how you were forced to derive long formulas and find X in your high school and college days? You wondered: “who all this things help sef?” (who does any of this help, anyway?). Well, now that you want to start coding and building cool stuff to change your world, “e go help you” (it will help you).

At the core of fancy machine learning models, deep learning and data science tools lie mathematical concepts such as statistics, probability, calculus, algebra and regression models. An understanding of the concepts behind these cool algorithms will give you a strong footing in the data science and artificial intelligence space.

In this article, we will walk through some of the concepts a beginner data scientist needs to start their journey into the wonderful world of data.

Probability

Probability is a numerical description of how likely it is that an event will occur, or the chance that something is true.

We often want to know the outcome of recurring uncertainties beforehand, like what the weather will be tomorrow or whether we can win a game of Ludo against our siblings.

Six-sided dice: what will the outcome of a roll be?

Of course, we can’t tell exactly how these situations will turn out; that’s why they are uncertainties.

Hence, the concept of probability aims to build a mathematical framework to represent and analyze these phenomena with numbers. Probability is often denoted as “p(x)”.

p(x) = the likelihood that a particular event(x) will occur.

We can determine the probability of an event occurring either experimentally (empirically) or theoretically:

Empirical probability: This is an estimate of how likely an event is, derived from experimentation. For example, we can roll a die many times and use the relative frequency of each face in the previous rolls to estimate how likely it is to show up on the next roll.
Theoretical probability: This is the likelihood of an event worked out from assumptions alone, without carrying out any experiment, typically by assuming that all outcomes are equally likely. For example, we can assume that the numbers 1, 2, 3, 4, 5 and 6 have equal chances of being obtained from a roll of a 6-sided die (that is, each has a probability of ⅙ ≈ 0.167).
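To see the two ideas side by side, here is a minimal sketch that simulates rolling a fair die and compares the observed frequency of one face with the theoretical value of ⅙. The function name `empirical_probabilities`, the seed and the roll counts are illustrative choices, not part of any standard recipe.

```python
import random
from collections import Counter

def empirical_probabilities(n_rolls, seed=0):
    """Estimate the probability of each face from n_rolls simulated rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    # Relative frequency of each face; a face that never showed up gets 0.
    return {face: counts[face] / n_rolls for face in range(1, 7)}

theoretical = 1 / 6  # assumes a fair die: every face equally likely

for n_rolls in (60, 6_000, 60_000):
    estimate = empirical_probabilities(n_rolls)
    print(n_rolls, "rolls -> P(3) is about", round(estimate[3], 4),
          "| theoretical:", round(theoretical, 4))
```

With more rolls, the empirical estimate should drift closer to the theoretical value, although any single run can still be a little off.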

Understanding probability requires a grasp of the following concepts and terms:

1. Probability space or sample space (Ω)

This is the set of all possible outcomes of the experiment/population under consideration. Our sample space can be either discrete (a finite or countable list of outcomes) or continuous.
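As a small illustration of a discrete sample space, the sketch below lists every outcome of rolling two six-sided dice and counts the outcomes that belong to one event. The two-dice experiment and the "sum to 7" event are examples chosen here, and the calculation assumes every outcome is equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two six-sided dice: all ordered pairs (first, second).
sample_space = list(product(range(1, 7), repeat=2))

# An event is a subset of the sample space; here, "the two faces sum to 7".
event = [outcome for outcome in sample_space if sum(outcome) == 7]

# With equally likely outcomes, P(event) = |event| / |sample space|.
p_event = Fraction(len(event), len(sample_space))
print(len(sample_space), len(event), p_event)  # prints: 36 6 1/6
```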

2. Probability distribution

A probability distribution is the function that describes the likelihood of each possible value that a random variable can assume.

The probabilities of all possible outcomes sum to 1. For example, if x, y and z are the only possible outcomes:

p(x) + p(y) + p(z) = 1

Each probability must lie between 0 and 1.
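These two rules can be checked directly in code. The sketch below builds the distribution of the total shown by two fair dice and verifies both properties; the helper name `is_valid_distribution` is hypothetical, and exact fractions are used so the sum can be compared with 1 without rounding issues.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely outcomes give each total (2 to 12).
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: Fraction(count, 36) for total, count in sorted(totals.items())}

def is_valid_distribution(pmf):
    """Check that every probability is in [0, 1] and that they sum to 1."""
    in_range = all(0 <= p <= 1 for p in pmf.values())
    # Exact comparison is safe with Fraction; with floats, use math.isclose.
    sums_to_one = sum(pmf.values()) == 1
    return in_range and sums_to_one

print(pmf[7], is_valid_distribution(pmf))  # prints: 1/6 True
```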

We can broadly classify probability distributions into two groups:

Discrete probability distributions for discrete variables: Depending on the type of data, some of the models used for this class of distribution are the Binomial distribution, the Poisson distribution and the (discrete) Uniform distribution.
Probability density functions for continuous variables: Here the Normal distribution (also known as the Gaussian distribution or the “bell curve”) is the most frequently used. This is because the normal distribution is symmetric and fits a wide variety of phenomena, such as human height and IQ scores. Other common continuous distributions are the Weibull distribution and the lognormal distribution.
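To make the two families concrete, here is a minimal sketch of one distribution from each, written from the textbook formulas using only the standard library. The function names `binomial_pmf` and `normal_pdf` and the parameter values are illustrative assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Discrete case: the probabilities of all possible counts add up to 1.
print(binomial_pmf(5, 10, 0.5))  # prints: 0.24609375
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))

# Continuous case: the area under the density curve is (approximately) 1.
step = 0.01
area = sum(normal_pdf(-6 + i * step) for i in range(1201)) * step
print(round(area, 4))
```

Note that a density value is not itself a probability: for a continuous variable, probabilities come from the area under the curve over a range of values.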

This is Figure 6A.15 (Pg. 61) from Probabilistic approaches to risk by Aswath Damodaran.

3. Set theory

We can simply define a set as an unordered collection of distinct things/objects. A set can be made up of anything, from numbers or letters to animate or inanimate objects. It can even be empty (the empty set). The elements of a set are enclosed in curly brackets {}. We use set notation to specify compound events in probability. Some expressions and operations on sets that are useful to a data scientist include:

Venn diagrams: This is a pictorial representation of sets. Probability spaces can be visualized with a Venn diagram.

Venn diagram defined in a chart. Obtained from: mychartguide.com

Union (∪) and intersection (∩) of sets

Universal set, superset, strict subset and subsets
Relative and absolute complements
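Python's built-in `set` type supports most of these operations directly, so a short sketch can tie them back to probability. The events below (an even roll, a roll greater than 3) are examples chosen here for a single roll of a fair six-sided die, and `prob` is a hypothetical helper that assumes equally likely outcomes.

```python
from fractions import Fraction

universal = set(range(1, 7))   # universal set: every face of one die roll
even = {2, 4, 6}               # event A: the roll is even
greater_than_3 = {4, 5, 6}     # event B: the roll is greater than 3

union = even | greater_than_3                # A or B: {2, 4, 5, 6}
intersection = even & greater_than_3         # A and B: {4, 6}
relative_complement = even - greater_than_3  # in A but not in B: {2}
absolute_complement = universal - even       # not in A: {1, 3, 5}

# Subset, strict subset and superset checks.
print(even <= universal, even < universal, universal >= even)  # prints: True True True

def prob(event):
    """Probability of an event, assuming all faces are equally likely."""
    return Fraction(len(event), len(universal))

# Compound events: P(A or B) = P(A) + P(B) - P(A and B).
print(prob(union), prob(even) + prob(greater_than_3) - prob(intersection))  # prints: 2/3 2/3
```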

Statistics

As data scientists, it’s often necessary for us to analyze data statistically. Our goal for statistical analysis is to extract information from data by computing statistics, which are deterministic functions of the data.

Statistical analysis can be either descriptive or inferential.

Descriptive statistics:

Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data in a meaningful way so that, for example, patterns can emerge from the data. Descriptive statistics simply describes the data; no conclusion is drawn beyond the data at hand.

Measures of central tendency and measures of spread are some of the ways by which we can descriptively analyze data.

For instance, if we gathered some student scores over a period of time, we could use descriptive statistics to:

Compute the sample mean
Compute the sample standard deviation
Make a bar chart or boxplot
Describe the shape of the sample probability distribution
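As a rough sketch of the first two items, Python's standard `statistics` module can compute common measures of central tendency and spread. The scores below are made up purely for illustration.

```python
import statistics

# Hypothetical student scores, invented for this example.
scores = [55, 62, 68, 70, 70, 74, 81, 90]

# Measures of central tendency.
mean = statistics.mean(scores)      # 71.25
median = statistics.median(scores)  # 70.0
mode = statistics.mode(scores)      # 70

# Measures of spread.
stdev = statistics.stdev(scores)    # sample standard deviation (divides by n - 1)
value_range = max(scores) - min(scores)  # 35

print(mean, median, mode, round(stdev, 2), value_range)
```

Charts such as bar charts and boxplots need a plotting library, which is left out of this sketch.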

Inferential statistics:

Inferential statistics allows the data scientist to make predictions (“inferences”) from gathered data. With inferential statistics, we take data from samples and make generalizations about a population. This type of statistical approach can be used to estimate parameters or test hypotheses.

From the student score example above, we could use inferential statistics to develop a generalized model for students and draw conclusions about them. Some statistical methods used for inferential statistics include regression analysis, ANOVA (analysis of variance) and Student’s t-test.
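To give a flavour of a t-test, the sketch below computes the test statistic for comparing the mean scores of two groups of students. It uses Welch's version of the t-test, which does not assume equal variances; this is a choice made for the example, and both sets of scores are invented. Turning the statistic into a p-value also requires the t distribution, which a statistics library would normally handle and which is not shown here.

```python
import math
import statistics

def welch_t_statistic(sample_a, sample_b):
    """t statistic for the difference between two sample means (Welch's form)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference between the two sample means.
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical scores for two classes, invented for this example.
class_a = [55, 62, 68, 70, 70, 74, 81, 90]
class_b = [48, 57, 60, 63, 66, 69, 72, 77]

t = welch_t_statistic(class_a, class_b)
print(round(t, 2))
```

The further the statistic is from 0, the harder it is to explain the difference between the two sample means by chance alone.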

Does this sound like a lot to take in already? Not to worry, it’s one step at a time. Just don’t stop learning. Here are some of the resources that helped me understand these concepts better. You can check them out too:
