In the previous article we discussed measures of central tendency and how they are used to gain insight into the centre of the data. But a measure of the centre alone is not enough to describe the distribution of a quantitative variable adequately: it tells us nothing about the variability of the data. As an example, consider this hypothetical data showing the distribution of middle-class income in Delhi (blue) and Mumbai (orange).
As you can see from the diagram, both distributions are symmetric and both have a mean of INR 50,000. However, monthly incomes in Delhi run from INR 30,000 to INR 70,000, whereas those in Mumbai run from INR 10,000 to INR 1,00,000. Incomes in Delhi are more alike, while incomes in Mumbai vary more. A simple way to describe this is with the range, the difference between the largest and the smallest value.
In Delhi the range is INR 70,000 – INR 30,000 = INR 40,000. In Mumbai the range is INR 1,00,000 – INR 10,000 = INR 90,000. The range is a rough-and-ready measure of dispersion. However, we cannot rely on the range alone. It depends only on the two extreme values, so it can be misleading if there are outliers in the data.
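The range calculation above can be sketched in a few lines of Python. This is only an illustration: the income lists are invented, and only their smallest and largest values are taken from the example above. The helper name `value_range` is also just an assumption for this sketch.

```python
def value_range(values):
    """Return the range: the largest value minus the smallest value."""
    return max(values) - min(values)


# Hypothetical monthly incomes in INR. Only the extremes come from the
# article's example; the values in between are made up.
delhi = [30_000, 40_000, 50_000, 60_000, 70_000]
mumbai = [10_000, 25_000, 50_000, 65_000, 100_000]

delhi_range = value_range(delhi)    # 40000
mumbai_range = value_range(mumbai)  # 90000

# One extreme value is enough to change the range completely.
delhi_with_outlier = delhi + [500_000]
outlier_range = value_range(delhi_with_outlier)  # 470000

print(delhi_range, mumbai_range, outlier_range)
```

The last three lines show the weakness discussed above: adding a single unusually high income to the Delhi list changes its range from 40,000 to 470,000, even though every other value is untouched.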
As an example, consider this hypothetical distribution, which represents the marks of students in two different sets. If you calculate the range of each set, you will find that it is 10 for set A and 10 for set B too. But in set B, apart from the two extreme values, only 3 different values were observed, i.e. 21, 13 and 14. In set A, on the other hand, 9 different values were observed. Even so, sets A and B have the same range, thanks to the influence of the outlier in B, which stretches its range to match that of set A.
Standard deviation and variance
One way of getting a fairer measure of dispersion is the standard deviation. The standard deviation of a distribution indicates a kind of average amount by which the values deviate from the mean. The greater the dispersion, the bigger the deviations and the bigger the standard deviation.
The deviation of an observation x from the mean μ is (x – μ), the difference between the observation and the mean.
Consider the data below in the table.
The mean for set X is 30 and for set Y is 33. From the table we can see that the values in set X are more dispersed than the values in set Y, so we should expect the standard deviation of set X to be greater than that of set Y.
Let’s calculate the deviations for set Y.
Now if we add up the deviations, we find that they sum to zero, because the positive and negative deviations cancel out. So taking the plain average of the deviations is a bad idea: it is always zero. To overcome this difficulty, we square each deviation and add the squares together. Squaring gets rid of the negative signs. Dividing this sum of squared deviations by the total number of observations gives the variance.
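The steps above can be sketched in Python as follows. The article's table is not reproduced here, so the marks below are invented for illustration, chosen so that the mean is 33 and the numbers come out round. Note that dividing by the number of observations, as described above, gives what is usually called the population variance; dividing by one less than the number of observations is a different convention used for samples.

```python
# Hypothetical marks, invented for illustration (not the article's table).
values = [24, 30, 33, 36, 42]

# Step 1: the mean.
mean = sum(values) / len(values)              # 33.0

# Step 2: the deviation of each value from the mean.
deviations = [x - mean for x in values]       # [-9.0, -3.0, 0.0, 3.0, 9.0]

# Step 3: the deviations cancel out, so their sum (and average) is zero.
sum_of_deviations = sum(deviations)           # 0.0

# Step 4: square each deviation to remove the negative signs.
squared_deviations = [d ** 2 for d in deviations]  # [81.0, 9.0, 0.0, 9.0, 81.0]

# Step 5: divide the sum of squares by the number of observations.
variance = sum(squared_deviations) / len(values)   # 36.0

print(mean, sum_of_deviations, variance)
```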
Variance has its own disadvantage. If the original values are in some unit, say x, then the variance is in squared units, x².
To get a measure in the same units as the observed values, we take the square root of the variance, and this is what we call the standard deviation.
Standard deviation of Set K = sqrt(36) = 6
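For a quick check, Python's standard `statistics` module offers `pvariance` and `pstdev`, which divide by the number of observations as described above, and `stdev`, which uses the sample convention of dividing by one less. The two lists below are hypothetical and are not the article's sets; they only illustrate that more widely spread values give a bigger standard deviation.

```python
import statistics

# Two hypothetical sets with the same mean (33) but different spread.
tight = [31, 32, 33, 34, 35]
spread = [24, 30, 33, 36, 42]

# Population variance and standard deviation (divide by n).
spread_variance = statistics.pvariance(spread)  # 36
spread_sd = statistics.pstdev(spread)           # 6.0, the square root of 36
tight_sd = statistics.pstdev(tight)             # about 1.41

# Sample standard deviation (divide by n - 1) is slightly larger.
sample_sd = statistics.stdev(spread)            # about 6.71

print(spread_variance, spread_sd, tight_sd, sample_sd)
```

Which of the two conventions is appropriate depends on whether the data is the whole population or a sample drawn from it; the article's description corresponds to the population version.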
Now look back at the second figure. We know that the range of both distributions is the same. It would not have been the same if we had ignored the outliers in set B. This leads us to another measure of dispersion, one that takes a kind of ‘mini-range’ from near the centre of a distribution, thus avoiding outliers.
This range is based on what are called the quartiles of the distribution. Quartiles are the values that cut the observations into four equal lots, just as the median cuts the observations into two equal lots.
As in the diagram, there are three quartiles: Q1, Q2 and Q3. Q1 is the 1st quartile, Q2 is the 2nd quartile (the same as the median) and Q3 is the 3rd quartile. The difference between Q3 and Q1 is the mini-range. It is also called the interquartile range.
Let’s look at this figure:
Since there are 16 observations, we cut off the bottom 4 and the top 4 observations. So Q1 is at 9 and Q3 is at 16, and the interquartile range is 16 – 9 = 7. Because it ignores the extremes, the interquartile range is less affected by outliers than the full range.
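Here is a minimal Python sketch of the same idea, using `statistics.quantiles` (available from Python 3.8). The 16 observations are hypothetical, since the figure's actual values are not listed in the text; they are chosen so that Q1 is 9 and Q3 is 16, as in the example. Different textbooks and libraries use slightly different rules for locating quartiles, so on other data the results may differ a little between methods.

```python
import statistics

# 16 hypothetical observations, chosen so that Q1 = 9 and Q3 = 16.
data = [3, 5, 7, 9, 9, 10, 11, 12, 13, 14, 15, 16, 16, 18, 20, 22]

# quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)   # 9.0, 12.5, 16.0
iqr = q3 - q1                                  # 7.0
full_range = max(data) - min(data)             # 19

# Replace the largest value with an outlier: the range jumps,
# but the interquartile range does not move.
data_with_outlier = data[:-1] + [220]
q1_o, _, q3_o = statistics.quantiles(data_with_outlier, n=4)
iqr_with_outlier = q3_o - q1_o                 # 7.0
range_with_outlier = max(data_with_outlier) - min(data_with_outlier)  # 217

print(iqr, full_range, iqr_with_outlier, range_with_outlier)
```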
To summarise, the different ways to measure the variability of a distribution are the range, the variance, the standard deviation and the interquartile range.