Introduction to Descriptive Statistics and Measures of Central Tendency

January 24, 2022

Consider two unrelated questions:

  • Who was the best Indian cricket team captain in the history of Indian cricket?
  • What is happening to the economic health of India’s middle class?

The first question is trivial, but the second is profoundly important. Cricket enthusiasts can argue about the first one endlessly. The second matters because the middle class is the backbone of the Indian economy.

Using these two questions, I'm going to illustrate the strengths and limitations of descriptive statistics, which are the numbers and calculations we use to summarise raw data.

Let’s go ahead with the trivial question first.

Chat

Above is a screenshot of a short chat I had with the Windows chat box. When I asked it to show me more data, it simply told me to search for the data on my own. Anyway, the point is this: I could go on about Dhoni's success rate, but that would be raw data, which is hard to digest because he played 350 ODIs and 24 World Cup matches. Or I can just say that his winning average was 58.82% at the end of his career. That is a descriptive statistic, or a 'summary statistic'.

dhoni(1)

The winning average is a gross simplification of Dhoni's 350 ODIs. It is easy to understand but limited in what it can tell us.

Now let's move on to the question about the economic health of the Indian middle class. To answer it, we need the economic equivalent of the winning average: a measure that is simple and accurate, and that tells us how the economic well-being of an average Indian has changed over the last few decades. A reasonable, though imperfect, answer is to measure the change in India's per capita income over the course of a generation, which is around 30 years. Per capita income is simply total income divided by total population.

In 1990 the average income in India was INR 5,882, and in 2020 it was INR 1,44,476. Well, we got pretty rich.

economy

But here's the twist: these figures might be misleading. Adjusted for inflation, INR 5,882 in 1990 is equal to about INR 25,950 in current rupees, so the real increase is much smaller than it first appears. Another big problem is that the average income in India is not equal to the income of the average Indian.
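To make the effect of the inflation adjustment concrete, here is a minimal sketch that compares the nominal and inflation-adjusted growth using only the three figures quoted above. The variable and function names are illustrative, not from any library.

```python
# Figures quoted in the article (INR per person).
nominal_1990 = 5_882     # 1990 income in 1990 rupees
real_1990 = 25_950       # the same 1990 income expressed in current rupees
income_2020 = 144_476    # 2020 income in current rupees


def growth_multiple(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    return end / start


nominal_growth = growth_multiple(nominal_1990, income_2020)
real_growth = growth_multiple(real_1990, income_2020)

print(f"Nominal growth: {nominal_growth:.1f}x")
print(f"Inflation-adjusted growth: {real_growth:.1f}x")
```

On these numbers the unadjusted comparison suggests roughly a 25-fold increase, while the adjusted comparison is closer to 6-fold.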

Per capita income tells us nothing about how income is distributed across classes. There are four income groups in India: low, lower-middle, upper-middle and high. A rise in the incomes of the top 1% of the population can raise per capita income without putting any money in the pockets of the other 99%. The average income can go up without helping the average Indian.
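A small sketch can illustrate this. The incomes below are entirely hypothetical (100 people, chosen only to make the arithmetic easy) and are not real Indian income data.

```python
from statistics import mean

# Hypothetical population: 99 people earn 100,000 each, one person earns 5,000,000.
others = [100_000] * 99
before = others + [5_000_000]

# The top earner's income doubles; nobody else's income changes.
after = others + [10_000_000]

mean_before = mean(before)
mean_after = mean(after)

print(mean_before)  # average income before
print(mean_after)   # average income after
print(before[:99] == after[:99])  # the other 99 incomes are unchanged
```

In this made-up example the average rises from 149,000 to 199,000 even though 99 of the 100 people earn exactly what they did before.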

From cricket to income, the most basic task when working with data is to summarise a great deal of information. Descriptive statistics give us a manageable and meaningful summary of the underlying phenomenon. Descriptive statistics can be like online dating profiles: technically accurate and yet pretty darn misleading.

In this article I'll discuss central tendency.

Consider the data below:

table

pollution

This data comes from a report which notes that global warming is largely a result of human activity that produces carbon dioxide (CO2) and other greenhouse gas emissions. CO2 emissions from fossil fuel combustion come from electricity generation, heating, industrial processes, and fuel consumption in automobiles. The International Energy Agency reported the per capita CO2 emissions by country (that is, the total CO2 emissions for the country divided by the population size of that country) for the year 2011. The table above gives the values, in metric tons per person, for the nine largest countries by population, which together make up more than half the world's population.

From the table we can see that the mean per capita emission across these nine countries is about 4.6 metric tons.

average_formula

But the table also shows that only 3 of the 9 countries emit more than 4.6 metric tons of CO2 per person. The mean can be highly influenced by an outlier, which is an observation that falls well above or well below the overall bulk of the data. An outlier in the data calls for more investigation.
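As a quick check, the mean and the number of countries above it can be computed from the nine per capita values in the table. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean

# Per capita CO2 emissions (metric tons per person) for the nine countries.
emissions = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]

# Mean = sum of the values divided by the number of values.
mean_emission = mean(emissions)

# Values that lie above the mean.
above_mean = [x for x in emissions if x > mean_emission]

print(round(mean_emission, 1))
print(above_mean)
```

Only three of the nine values exceed the mean, which is a hint that a few large observations are pulling it upward.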

mean_median

Since the mean does not give us an accurate picture here, we look at the median. The median is the middle value of the data when it is sorted in ascending or descending order. For example, arranging the emissions in ascending order gives: 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9.

There are nine values, so the fifth one is the middle, and the median is 1.8. The median would not change even if the US started emitting 90 metric tons of CO2 per person, whereas the mean would increase.
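The following sketch checks this claim. It assumes that the largest value in the table, 16.9, is the US figure, and replaces it with the hypothetical value of 90 mentioned above.

```python
from statistics import mean, median

# Per capita CO2 emissions (metric tons per person), already in ascending order.
emissions = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]

# Hypothetical scenario: the largest value (assumed to be the US) rises to 90.
inflated = emissions[:-1] + [90.0]

median_original = median(emissions)
median_inflated = median(inflated)
mean_original = mean(emissions)
mean_inflated = mean(inflated)

print(median_original, median_inflated)  # the median is unchanged
print(round(mean_original, 1), round(mean_inflated, 1))  # the mean jumps
```

The middle value stays at 1.8 in both cases, while the mean rises from about 4.6 to about 12.7.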

Whether the mean is greater or less than the median tells us about the shape of the distribution. In the above example the distribution is right skewed, because the mean (4.6) is greater than the median (1.8). Because the mean is the balance point, an extreme value on the right side pulls the mean toward the right tail. Because the median is not affected, it is said to be resistant to the effect of extreme observations. The median is resistant. The mean is not.

skewness

If the mean is less than the median, the distribution is typically left skewed, and if the mean is equal to the median, the distribution is typically symmetric.
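This comparison can be written as a small helper. The function `skew_hint` and its `tolerance` parameter are hypothetical names introduced here for illustration. Comparing the mean with the median is only a rule of thumb, not a formal test of skewness.

```python
from statistics import mean, median


def skew_hint(values, tolerance=0.0):
    """Guess the direction of skew by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m - med > tolerance:
        return "right-skewed"
    if med - m > tolerance:
        return "left-skewed"
    return "roughly symmetric"


# Per capita CO2 emissions from the article (metric tons per person).
emissions = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]

print(skew_hint(emissions))
print(skew_hint([1, 2, 3, 4, 5]))     # made-up symmetric data
print(skew_hint([1, 8, 9, 10, 10]))   # made-up data with a low outlier
```

For the CO2 data the helper reports a right skew, which matches the reasoning above.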

The median is not affected by outliers, because it is determined only by having an equal number of observations above and below it. The mean, in contrast, uses all the numerical values in the data, so it depends on how far observations fall from the middle.

From these properties, you might think that it's always better to use the median rather than the mean. That's not true. If a distribution is highly skewed, the median is usually preferred because it better represents what is typical. If the distribution is close to symmetric or only mildly skewed, the mean is usually preferred because it uses the numerical values of all the observations.

So both the mean and the median are useful: each gives us insight about the centre of the data in a different way.

But what is the mode, then?

Mode is the value that occurs most frequently. It describes a typical observation in terms of the most common outcome. The concept of the mode is most often used to describe the category of a categorical variable that has the highest frequency. With quantitative variables, the mode is most useful with discrete variables taking a small number of possible values.

mode

For the CO2 data there is no mode, as every value occurs exactly once, so consider the above histogram instead. It shows the number of students against the number of hours of TV they watch per day. Here 4, 8, 22, 32, 8 and 6 are the frequencies. The highest frequency is 32, so the mode is the 4-5 hours category: 32 students spend 4-5 hours daily watching TV, more than in any other category. The other frequencies can be read in the same way.
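Here is a minimal sketch of both cases. The six frequencies and the 4-5 hours category for the 32 students come from the text. The other bin labels are an assumption (one-hour bins starting at 1-2 hours), since the histogram itself is not reproduced here.

```python
from collections import Counter

# Case 1: the CO2 data. Every value occurs once, so no value stands out as a mode.
emissions = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]
counts = Counter(emissions)
has_mode = max(counts.values()) > 1
print(has_mode)

# Case 2: the TV histogram. Frequencies are from the article;
# all bin labels except "4-5" are ASSUMED for illustration.
hours_bins = ["1-2", "2-3", "3-4", "4-5", "5-6", "6-7"]
students = [4, 8, 22, 32, 8, 6]
frequency = dict(zip(hours_bins, students))

# The mode is the category with the highest frequency, not the frequency itself.
modal_bin = max(frequency, key=frequency.get)
print(modal_bin, frequency[modal_bin])
```

Note that the mode is the category (4-5 hours), while 32 is just how often that category occurs.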

The mode need not be near the centre of the distribution. It may be the largest or the smallest value. Thus, it is somewhat inaccurate to call the mode a measure of centre, but often it is useful to report the most common outcome.

So we can conclude that every measure of central tendency is important, as each gives us insight about the centre of the data in a different way.
