# Shape of Distribution

February 24, 2022

We have already discussed the arithmetic mean and the standard deviation, which are powerful ways of describing most statistical distributions of quantity variables.

Look at the figure below. We often encounter this shape while working with statistical data. Such symmetry is very common in statistical distributions, especially where biological variation is concerned, but it is not universal.

## Skewed Distribution

Look at the two distributions below. Such distributions are called skewed. The skew refers to the long tail of observations on one side of the peak. If the tail is on the right, the distribution is said to be positively skewed; if the tail is on the left, it is said to be negatively skewed.

If we look for the mean of each distribution, it is X in the first case and Y in the second. It is worth noticing the effect skewness has on the relative sizes and positions of the mean, median and mode.

Consider the diagram below, based on hypothetical data. In this diagram the mean, median and mode all lie in the same place, at the centre. But if we look at the positively skewed distribution in the figure above, we find that the mode, the most frequent value, lies under the peak. The mean is closer to the tail, and the median lies between the mode and the mean. The second figure, with negative skewness, shows the same pattern mirrored to the left.

In a skewed distribution with a single peak, the relative positions of the three averages are usually predictable. The mode lies under the peak, the mean is pulled out in the direction of the tail, and the median lies between the mode and the mean. The greater the skew, the greater the distance between the mean and the mode.
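The ordering of the three averages can be checked on a small example. The sketch below uses a made-up, positively skewed list of values (the numbers are hypothetical, chosen only so that a few large values form a tail on the right) and Python's standard `statistics` module.

```python
import statistics

# Hypothetical, positively skewed data: most values are small,
# and a few large ones form a tail on the right.
values = [1, 2, 2, 2, 3, 3, 4, 5, 7, 12]

mode = statistics.mode(values)      # most frequent value
median = statistics.median(values)  # middle value of the sorted data
mean = statistics.mean(values)      # arithmetic mean

print(f"mode = {mode}, median = {median}, mean = {mean}")

# With the tail on the right, the mean is pulled furthest toward it.
print("mode < median < mean:", mode < median < mean)
```

Mirroring the data (for example, replacing each value `v` with `13 - v`) moves the tail to the left and reverses the order, so the mean becomes the smallest of the three.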

Almost every distribution we meet in practice will have some skewness.

If we assume that the two figures above represent the incomes of families in 2010 and 2020 and combine them, we get what is called a bimodal distribution: one with two peaks. A graph like this suggests that two different groups are involved.
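As a rough illustration of how combining two groups produces two peaks, the sketch below merges two small, invented lists of incomes (both lists are hypothetical and are not the data behind the figures) and asks for every value that ties for most frequent.

```python
import statistics

# Hypothetical family incomes (in thousands) for two separate groups.
group_2010 = [20, 30, 30, 30, 40]
group_2020 = [60, 70, 70, 70, 80]

combined = group_2010 + group_2020

# multimode returns every value that ties for the highest frequency.
peaks = statistics.multimode(combined)
print("most frequent values in the combined data:", peaks)
```

Each group on its own has a single most frequent value; the combined data has two, one contributed by each group.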

## Normal Distribution

In school, teachers used to record measurements such as the height and weight of the students in each class every year. Let’s assume that we have one such data set. The first figure shows the distribution of weight (in pounds) for 50 students in a class. The second figure shows the weight (in pounds) of 500 students in the school. In the first figure there are peaks and valleys, but if we draw a rough curve over the histogram, joining the tops of the bars, it looks like a bell curve.

In the second figure the peak-and-valley effect disappears. This is because we now have more data. As the sample size increases, the curve of the distribution becomes smoother and smoother and ends up close to a bell-shaped curve.
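The smoothing effect of a larger sample can be reproduced with simulated data. The sketch below does not use the original measurements: it assumes the weights are drawn from a normal distribution with mean 127.2 and standard deviation 11.9 pounds (the values quoted for the second figure), and the helper `bin_counts` and the bin width of 5 pounds are choices made only for this illustration.

```python
import math
import random
from collections import Counter

def bin_counts(values, bin_width=5):
    """Count how many values fall into each bin; keys are bin start values."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def print_histogram(title, values, bin_width=5, max_bar=40):
    """Print a text histogram, scaled so the tallest bar is max_bar characters."""
    counts = bin_counts(values, bin_width)
    tallest = max(counts.values())
    print(title)
    for start, count in counts.items():
        bar = "#" * round(count / tallest * max_bar)
        print(f"{start:>4} to {start + bin_width:<4} {bar}")
    print()

random.seed(0)  # fixed seed so the run is repeatable
# Simulated weights in pounds (an assumption, not the original data).
small_sample = [random.gauss(127.2, 11.9) for _ in range(50)]
large_sample = [random.gauss(127.2, 11.9) for _ in range(500)]

print_histogram("50 students", small_sample)
print_histogram("500 students", large_sample)
```

With 50 values the bars tend to rise and fall unevenly; with 500 the outline is usually much closer to a smooth bell shape.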

The bell-like shape of the distribution above follows what is called the normal curve of distribution. The curve is perfectly symmetrical, and its mean, median and mode are all at the centre. So if we cut the curve with a vertical line through the centre, we get equal areas on either side. Whether a normal curve is tall and thin or short and flat depends on its standard deviation: the smaller the standard deviation, the taller and narrower the curve.
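How the standard deviation controls the shape can be seen by computing the height of the curve at its centre. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the mean of 127.2 is borrowed from the weight example, and the three standard deviations are arbitrary values picked for comparison.

```python
from statistics import NormalDist

def peak_height(sd, mean=127.2):
    """Height of the normal curve at its mean for a given standard deviation."""
    return NormalDist(mean, sd).pdf(mean)

# Arbitrary standard deviations, chosen only to compare shapes.
for sd in (5.0, 11.9, 25.0):
    print(f"sd = {sd:>4}: height at the mean = {peak_height(sd):.4f}")
```

The total area under every normal curve is 1, so a smaller standard deviation squeezes the same area into a narrower span and the peak has to rise.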

When we call this the ‘normal’ curve, we do not mean that it is the usual curve. Rather, ‘norm’ is being used in the sense of a pattern or standard (ultimate, idealised, ‘perfect’) against which we can compare the distributions we actually find in the real world.

In the real world it is impossible to get a perfectly normal distribution, because no sample contains an infinite number of data points. Still, even a small sample can produce a fairly bell-shaped curve. The distribution can look as if it is trying to be normal. This suggests that such a sample comes from a large population whose distribution could indeed be described by a normal curve.

In this case, we can interpret the sample using certain powerful characteristics of the normal distribution. The normal curve is characterised by the relationship between its mean and its standard deviation. Using the mean and the standard deviation, we can state the proportion of the population that will lie between any two values of the variable. We can then regard any given value in the distribution as being ‘so many standard deviations’ away from the mean. In other words, we use the standard deviation as a unit of measurement.

For example, if we consider the second figure again, the mean is 127.2 pounds and the standard deviation is 11.9 pounds. Students who weigh 139.1 pounds or more are at least 1 standard deviation above the mean, students who weigh 115.3 pounds or less are at least 1 standard deviation below the mean, and so on. Thus any value in a distribution can be re-expressed as so many standard deviations above or below the mean, whether or not the distribution is normal. But if it is normal, we can use our knowledge of the normal distribution to find how many observations lie between any two given points. The standard deviation slices up a normal distribution into standard-sized slices, each containing a known percentage of the total observations.
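Re-expressing a value in standard deviations is a single subtraction and division. The sketch below uses the mean and standard deviation quoted for the second figure; the function name `sd_units` and the list of weights are made up for this illustration.

```python
MEAN_WEIGHT = 127.2  # pounds, as quoted for the second figure
SD_WEIGHT = 11.9     # pounds, as quoted for the second figure

def sd_units(value, mean=MEAN_WEIGHT, sd=SD_WEIGHT):
    """Express a value as a number of standard deviations from the mean."""
    return (value - mean) / sd

# Illustrative weights in pounds.
for weight in (103.4, 115.3, 127.2, 139.1, 151.0):
    print(f"{weight:>6} lb is {sd_units(weight):+.2f} standard deviations from the mean")
```

A negative result means the value is below the mean, a positive result means it is above. The calculation works for any distribution; only the interpretation in terms of percentages relies on the distribution being close to normal.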

## Portion Under the Normal Curve

If we mark standard deviations on the figure above, we find that about 68% of the observations lie within 1 standard deviation of the mean, that is, between 1 standard deviation below it and 1 standard deviation above it. That accounts for roughly two thirds of the area under the curve. The remaining 32% or so lies more than 1 standard deviation from the mean. Similarly, about 95% of the data lies within 2 standard deviations of the mean, and about 99.7% lies within 3 standard deviations.
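These percentages are not special to the weight data; they follow from the normal curve itself. For a normal distribution, the share of observations within k standard deviations of the mean equals erf(k / √2), where erf is the error function. A short sketch with Python's standard `math` module (the function name `share_within` is chosen for this example):

```python
import math

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```

The familiar figures of 68%, 95% and 99.7% are these values rounded for easy recall.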

## Summary

We now know that if the tail is to the right the distribution is positively skewed, and if the tail is to the left it is negatively skewed. The positions of the mean, median and mode relative to the peak can be predicted from the direction of the skew. A perfect normal distribution would require an infinite number of observations. In real life that is impossible, so real distributions can only approximate the ideal normal distribution.

The standard deviation is a useful measure of how far data is dispersed from the mean. We now know that in a normal distribution about 68% of the data lies within 1 standard deviation of the mean. In real life, many distributions are reasonably close to those predicted by the theoretical normal curve.