Statistics: The Big Picture
Solving the statistics jigsaw puzzle
This post contains my notes from the Stanford Online Statistics & Probability course and the associated comments:
Unfortunately, this course has been removed from Stanford Online, which moved its free courses to the edX e-learning platform.
Since the course has been removed, I decided to turn some of my notes into an article so that others can read and enjoy the subject.
Enough about the course's story. Let's solve the statistics jigsaw with four pieces on our table: Producing Data, Exploratory Data Analysis (EDA), Probability, and Inference.
So, let's start our statistics journey with a famous quote:
“Most people use statistics like a drunk man uses a lamppost; more for support than illumination” ― Andrew Lang
The Big Picture
In a nutshell, statistics is about converting data into useful information. Statistics is, therefore, a process in which we:
- Collect data
- Summarize data
- Interpret data
The process of statistics starts when we identify the group we want to study or learn something about. We call this group the population. Note that the word population here does not refer only to people; it is used in the broader statistical sense to refer also to animals, objects, and so on. For example, we might be interested in:
- The number of people who have recovered from COVID-19 in a certain country.
- How the population of mice reacts to a certain chemical.
- The opinions of the population of Indian adults about the death penalty.
Population, then, is the entire group that is the target of our interest.
In most cases, the population is so large that, as much as we might want to, there is no way we can study all of it (imagine trying to get the opinions of all Indians about my blog). A more practical approach is to examine and collect data only from a subgroup of the population, which we call a sample.
We call this first step, which involves choosing a sample and collecting data from it, producing data.
Since, for practical reasons, we have to compromise and examine only a subgroup rather than the whole population, we should make an effort to choose the sample so that it represents the population well.
For example, if we choose a sample from the population of Indian adults and ask their opinions about the death penalty, we don't want our sample to consist only of people from a particular state or region.
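To make the point about representative samples concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the four regions, their support rates, and the population size are assumptions, not real survey data. It compares a simple random sample with a sample drawn from a single region.

```python
import random

# Hypothetical population: four invented regions, each with an invented
# share of adults who favour the policy. None of this is real data.
REGION_SUPPORT = {"North": 0.30, "South": 0.70, "East": 0.50, "West": 0.50}

def build_population(per_region=25_000, seed=0):
    """Return a list of (region, favours) tuples for an invented population."""
    rng = random.Random(seed)
    population = []
    for region, support in REGION_SUPPORT.items():
        for _ in range(per_region):
            population.append((region, rng.random() < support))
    return population

def share_in_favour(people):
    """Fraction of the given people who favour the policy."""
    return sum(favours for _, favours in people) / len(people)

population = build_population()
rng = random.Random(1)

# Simple random sample: every adult has the same chance of being chosen.
random_sample = rng.sample(population, 1_000)

# Unrepresentative sample: 1,000 adults drawn from one region only.
south_only = rng.sample([p for p in population if p[0] == "South"], 1_000)

print(f"population:    {share_in_favour(population):.3f}")
print(f"random sample: {share_in_favour(random_sample):.3f}")
print(f"one region:    {share_in_favour(south_only):.3f}")
```

In this toy setup the random sample should land close to the population share, while the single-region sample reflects only that region's opinion.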
Once the data has been collected, what we have is a long list of answers to questions, or numbers, and in order to explore and make sense of the data, we need to summarize that list in a meaningful way. This second step, which consists of summarizing the collected data, is called Exploratory Data Analysis.
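As a small illustration of what "summarizing that list" can look like, the sketch below counts categorical answers and turns them into percentages. The answers and the helper name `summarize` are hypothetical, chosen only for this example.

```python
from collections import Counter

def summarize(answers):
    """Return {answer: (count, percentage)} for a list of categorical answers."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: (n, 100 * n / total) for answer, n in counts.most_common()}

# Hypothetical raw survey answers, as they might come back from respondents.
answers = ["favour", "oppose", "favour", "favour", "unsure",
           "oppose", "favour", "favour", "oppose", "favour"]

for answer, (n, pct) in summarize(answers).items():
    print(f"{answer:<7} {n:>2}  {pct:5.1f}%")
```

A table of counts and percentages like this is one of the simplest EDA summaries; charts and numerical measures are other common choices.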
Now, we've obtained the sample results and summarized them, but we are not done. Remember that our goal is to study the population, so what we want is to be able to draw conclusions about the population based on the sample results. Before we can do so, we need to look at how the sample we're using may differ from the population as a whole, so that we can factor that into our analysis. To examine this difference, we use Probability.
In essence, probability is the "machinery" that allows us to draw conclusions about the population based on the data collected from the sample.
Finally, we can use what we've discovered about our sample to draw conclusions about our population. We call this final step Inference.
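One way to see how a sample may differ from the population is to simulate it. The sketch below assumes, purely for illustration, a population in which 70% are in favour, and repeatedly draws samples of 2,000 respondents; the function name and all the numbers are assumptions of this example.

```python
import random
import statistics

def sample_proportions(true_share, sample_size, repeats, seed=0):
    """Simulate `repeats` samples and return each sample's proportion in favour."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        in_favour = sum(rng.random() < true_share for _ in range(sample_size))
        results.append(in_favour / sample_size)
    return results

# Assumed values for illustration only: 70% of the population in favour,
# 500 simulated polls with 2,000 respondents each.
props = sample_proportions(true_share=0.70, sample_size=2_000, repeats=500)

print(f"mean of sample proportions:  {statistics.mean(props):.3f}")
print(f"spread (standard deviation): {statistics.stdev(props):.4f}")
print(f"smallest / largest:          {min(props):.3f} / {max(props):.3f}")
```

Each simulated poll gives a slightly different answer, but the answers cluster around the true share. Probability describes the size of that scatter, and that is what lets us reason from one sample back to the population.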
And this is the Big Picture of Statistics.
Hopefully, now you know how to solve the statistics jigsaw.
To make the puzzle clearer, let me give you an example.
Example:
At the end of May 2020, a poll was conducted (by ABC News and the Economic Times) for the purpose of learning the opinions of Indian adults about lifting the lockdown.
1. Producing Data: A (representative) sample of 2,000 Indian adults was chosen, and each adult was asked whether they favoured or opposed lifting the lockdown.
2. Exploratory Data Analysis (EDA): The collected data were summarized, and it was found that 70% of the sampled adults favoured keeping the lockdown rather than lifting it.
3. and 4. Probability and Inference: Based on the sample result (of 70% favouring the lockdown) and our knowledge of probability, it was concluded (with 95% confidence) that the percentage of those who favour the lockdown in the population is within 3% of what was obtained in the sample (i.e., between 67% and 73%).
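For readers curious where a "within a few percent" statement comes from, here is a minimal sketch of the textbook normal-approximation (Wald) interval for a proportion. It assumes a simple random sample, which the example does not actually state. Under that assumption, 70% of 2,000 respondents gives a margin of roughly 2 percentage points, a little tighter than the 3% quoted above; published polls often report wider margins to reflect their sampling design. The function name is hypothetical.

```python
import math

def proportion_confidence_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) interval for a population proportion.

    p_hat is the sample proportion, n the sample size, and z=1.96
    corresponds to roughly 95% confidence. Assumes simple random sampling.
    """
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

low, high, margin = proportion_confidence_interval(p_hat=0.70, n=2_000)
print(f"margin of error: {margin:.3f}")
print(f"95% interval:    {low:.3f} to {high:.3f}")
```

The key idea is that the margin shrinks as the sample size grows, which is why a sample of a couple of thousand people can say something useful about a very large population.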
The following figure summarizes the example:
EndNote:
The four-step process that encompasses statistics:
1. Producing Data: Choosing a sample from the population of interest and collecting data.
2. Exploratory Data Analysis (EDA): Summarizing the data we've collected.
3. and 4. Probability and Inference: Drawing conclusions about the entire population based on the data collected from the sample.
If you wish to contact me, you can follow me on LinkedIn:
and you can find all the resources related to my Medium blogs here: