Basic Statistics For Data Scientists

January 22, 2022

Statistical studies are carried out everywhere today, whether for growing a business, forecasting the weather or developing vaccines. Statistics helps us do things more efficiently.

Consider a scenario where you are the manager of a restaurant and it is your first day on the job. Initially you have no idea how many customers to expect, how many staff you need to serve them, or how much food to prepare without wastage. As a result, you do a poor job in your first few days. But you analyse what happened and start to optimise things from experience.

For example, at first you prepare a lot of food on Monday because you expect the flow of customers to be the same as it was the day before. Instead, the flow of customers on Monday is low because it is the start of the working week. This results in wasted food and manpower. Noting this, the next Monday you ask your staff to prepare less food. As the days go by, you get better ideas and do things more smartly.

You use statistics to analyse and optimise things in order to increase the revenue of the restaurant and minimise its losses. You analyse which age groups your customers belong to, which item on the menu is the best seller so that it is always in stock, and so on. You keep collecting data and analysing it in order to optimise things.

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data. Suppose you go to the doctor and describe your symptoms. Based on those symptoms, the doctor suspects a few possible problems and advises you to take some tests, and based on the results gives you a prescription. This is how doctors use statistics to treat their patients.

We learned about data in our previous article. Statistics is the science of learning from data.

There are two types of variables in statistics:

  • Numerical (age, number of cars, etc.)
  • Categorical (demographic information about a population: gender, disease status, etc.)

Numerical variables are further divided into discrete (e.g. date, number of dogs in a house) and continuous (e.g. weight, height) variables. Categorical variables are further divided into nominal (categories with no natural order) and ordinal (categories with a natural order) variables.

[Figure: types of variables]

Statistics is divided into two parts: descriptive statistics and inferential statistics. In this article we will quickly review what falls under each of them. Let’s start with descriptive statistics.

1) Descriptive statistics

Let’s assume that you own a retail shop. Fortunately, you have collected data on the sales made over the last year. Looking at the data, what is the first thing that comes to mind? Of course, you will look at the profit you made over the year. But along with the profit you will also look for other important things: the average monthly profit, the highest-selling product, the lowest-selling product, seasonal products, etc. Here you are extracting important information that will help you grow your business.

To extract this information from the data you will use the mean, median, mode or various graphs. This is how you use descriptive statistics to study data. Descriptive statistics are used to present quantitative descriptions in a manageable form. They help you understand data effectively and efficiently. The three main types of descriptive statistics are:

  • Distribution, which deals with the frequency of each value
  • Central tendency, which deals with the averages of the values
  • Dispersion, which deals with how spread out the values are

Consider 9 people sitting next to each other in a bar, each earning INR 1 lakh a month. The average salary in the room is INR 1 lakh. If any one of them leaves, the average will not change. Now let’s assume that Mukesh Ambani, who let’s say earns INR 100 crore a month, enters the bar and sits next to them. The average salary in the room now becomes INR 10,00,90,000.

None of the nine people earns more than a lakh rupees, yet the average is about 10 crore. This is quite misleading, because it tells us nothing about the typical salary in the room. This is the problem with the mean: it is affected by extreme values. So instead of the mean, we look at the salary of the person in the middle once everyone is sorted by salary. With 10 people there is no single middle person, so the median is calculated by adding the salaries of the people at positions 5 and 6 and dividing by 2, which gives a median of 1 lakh.

The median is not affected by extreme values. Similarly, if we look at the frequency of each salary, we find that 9 people earn INR 1 lakh a month. This most frequent value, INR 1 lakh, is the mode.
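The bar example can be checked with a few lines of Python. This is only a minimal sketch using the standard library's statistics module; the variable names are made up for illustration, and the salary figures are the ones assumed in the example above.

```python
from statistics import mean, median, mode

LAKH = 100_000
CRORE = 10_000_000

# Nine people in the bar, each earning INR 1 lakh a month
salaries = [1 * LAKH] * 9
mean_before = mean(salaries)

# Mukesh Ambani walks in, earning (let's say) INR 100 crore a month
salaries_with_outlier = salaries + [100 * CRORE]

mean_after = mean(salaries_with_outlier)      # pulled up by the extreme value
median_after = median(salaries_with_outlier)  # average of positions 5 and 6
mode_after = mode(salaries_with_outlier)      # most frequent salary

print(f"Mean before:  {mean_before:,.0f}")
print(f"Mean after:   {mean_after:,.0f}")
print(f"Median after: {median_after:,.0f}")
print(f"Mode after:   {mode_after:,.0f}")
```

The mean jumps from 1,00,000 to 10,00,90,000, while the median and the mode both stay at 1,00,000.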

Now let’s take another example where, for some reason, you are not feeling well. So you go to the doctor, take some tests and find that your XYZ count (a made-up blood chemical) is 200. You instantly rush to the internet and find that the ideal XYZ count for your age is 180. Your count is 20 points higher than the ideal level. If you don’t know statistics, you might break the news to your near and dear ones or take a vacation to enjoy what remains of your life.

None of this would be necessary. When you call the doctor’s office back to arrange for your hospice care, the physician’s assistant informs you that your count is within the normal range. “But how could that be? My count is 20 points higher than average!” you yell repeatedly into the receiver. “The standard deviation for XYZ count is 40,” says the assistant, and this leaves you feeling confused.

The natural variation in XYZ count, measured by the standard deviation, is 40. Your count of 200 is only 20 points above the ideal level of 180, which is half a standard deviation, so it is no cause for concern. Many people have an XYZ count higher than the ideal level. So how do we figure out the upper and lower limits? Standard deviation is a measure of dispersion, meaning that it reflects how tightly the observations cluster around the mean. For many typical distributions of data, a high proportion of the observations lie within one standard deviation of the mean, meaning that they are in the range from one standard deviation below the mean to one standard deviation above it. Here that range runs from 140 to 220.
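The arithmetic behind the assistant's answer can be sketched in Python. The numbers are the made-up XYZ figures from the example, and the variable names are only illustrative; treating "within one standard deviation" as the normal range is an assumption of this sketch, not a medical rule.

```python
ideal_count = 180   # mean (ideal) XYZ count from the example
std_dev = 40        # standard deviation quoted by the assistant
your_count = 200    # your test result

# How many standard deviations away from the mean is your count?
z_score = (your_count - ideal_count) / std_dev

# Range covering one standard deviation on either side of the mean
lower_limit = ideal_count - std_dev
upper_limit = ideal_count + std_dev
within_one_sd = lower_limit <= your_count <= upper_limit

print(f"z-score: {z_score}")
print(f"Range within one standard deviation: {lower_limit} to {upper_limit}")
print(f"Within that range: {within_one_sd}")
```

A z-score of 0.5 means the count is half a standard deviation above the mean, comfortably inside the 140 to 220 range.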

Graphs are an effective way to analyse data. You can stare at tabular data for hours and still get nothing out of it. Plot those points on a graph and you can grasp a huge amount of information in a minute.

Depending on the number of variables involved, you can perform univariate, bivariate or multivariate analysis.

Univariate analysis is used to describe the distribution of a single variable, including its central tendency and dispersion. For visualisation you can use a histogram, and the shape of the distribution can be described with skewness and kurtosis.
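As a rough illustration of univariate analysis, the sketch below prints a text histogram and computes skewness and excess kurtosis from their moment-based definitions. The daily customer counts are hypothetical data invented for this example, and the helper functions are not from any particular library.

```python
from collections import Counter

def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: positive when the right tail is longer."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical number of customers per day, with one unusually busy day
daily_customers = [20, 22, 22, 25, 25, 25, 27, 60]

# Text histogram: one '#' per day on which that count was observed
histogram = Counter(daily_customers)
for value in sorted(histogram):
    print(f"{value:>3} | {'#' * histogram[value]}")

print(f"Skewness: {skewness(daily_customers):.2f}")
print(f"Excess kurtosis: {excess_kurtosis(daily_customers):.2f}")
```

The single busy day stretches the right tail, so the skewness comes out positive.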

Bivariate and multivariate analysis are used when we have more than one variable in our data. Bivariate analysis is used to see whether there is any relationship between two variables. The relationship can be examined using contingency tables, scatter plots, etc.
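Here is a small sketch of two bivariate tools: a contingency table for two categorical variables, and the Pearson correlation coefficient for two numerical variables. All of the data is hypothetical, and pearson_r is a hand-written helper used here only for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical survey rows: (age group, favourite genre)
responses = [
    ("under 25", "comedy"), ("under 25", "drama"), ("under 25", "comedy"),
    ("25 and over", "drama"), ("25 and over", "drama"), ("25 and over", "comedy"),
]

# Contingency table: how often each combination of categories occurs
contingency = Counter(responses)
for (age_group, genre), count in sorted(contingency.items()):
    print(f"{age_group:<12} {genre:<7} {count}")

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)

# Hypothetical restaurant data: opening hours versus customers served
hours_open = [8, 10, 12, 14]
customers = [40, 50, 60, 70]
r = pearson_r(hours_open, customers)
print(f"Pearson r: {r:.2f}")
```

Because the made-up customer counts rise in a perfectly straight line with opening hours, r comes out as 1. Real data would almost never be this tidy.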

2) Inferential statistics

Everyone loves to be entertained, and one of the most popular ways is to watch T.V. Everyone has their favourite T.V. shows. If you watch Indian T.V. serials, you may know that each week BARC (Broadcast Audience Research Council) releases a list of the top 10 shows running on television based on their TRP (Television Rating Point). How do you think they figure out the top T.V. shows?

They don’t call or message every Indian to ask what their favourite T.V. show is this week. Remember, the population of India is around 140 crore. If they did that, they would be dealing with 140 crore data points each week, which would be tedious. Instead, they have installed a device called a people meter in the homes of, let’s say, 2,00,000 randomly selected people in different regions. The show creators and T.V. channels have no information about who has a people meter. BARC observes what the audience is watching on their T.V., which show is being watched the most, etc., and based on that data they release the weekly top 10 shows.

Here 140 crore is the population size and 2,00,000 is the sample size. BARC studies the data collected from 2,00,000 people and infers what the population of 140 crore is watching. This is what inferential statistics is all about: studying a small sample in order to understand the population.

Inferential statistics uses sample data to draw conclusions about the population, which is more cost-effective and less tedious than collecting data from the entire population.
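The idea of learning about a population from a sample can be simulated in a few lines. This is only a sketch: the population is scaled down to 2,00,000 simulated viewers with a sample of 2,000, and the 30% share of viewers watching a given show is a made-up number, not BARC data.

```python
import random

random.seed(42)  # fixed seed so the simulation is repeatable

POPULATION_SIZE = 200_000  # scaled-down stand-in for the real population
SAMPLE_SIZE = 2_000        # scaled-down stand-in for the people-meter homes
TRUE_SHARE = 0.30          # hypothetical share of viewers watching a show

# Simulated population: True means this viewer watches the show
population = [random.random() < TRUE_SHARE for _ in range(POPULATION_SIZE)]

# Draw a simple random sample without replacement
sample = random.sample(population, SAMPLE_SIZE)

population_share = sum(population) / POPULATION_SIZE
sample_share = sum(sample) / SAMPLE_SIZE

print(f"Share in the population: {population_share:.3f}")
print(f"Share in the sample:     {sample_share:.3f}")
```

Even though the sample covers only 1% of the simulated population, its share should land close to the population's share.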

BARC releases this top 10 list with some confidence interval. Roughly, this means that if the same study were conducted many times with a completely new sample each time, most of those studies would produce an estimate that lies within the same range of values.
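One common way to compute such a range for a share of viewers is the normal-approximation confidence interval for a proportion. The sketch below assumes that approach and uses hypothetical counts (600 viewers out of a sample of 2,000); it is not a description of how BARC actually computes its figures.

```python
from math import sqrt

def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    p = successes / n
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical sample: 600 of 2,000 metered viewers watched the show
low, high = proportion_confidence_interval(600, 2000)
print(f"Estimated share: {600 / 2000:.3f}")
print(f"95% confidence interval: {low:.3f} to {high:.3f}")
```

With these numbers the interval is about two percentage points on either side of the 30% estimate. A larger sample would make it narrower.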

Now let’s assume that BARC makes a hypothesis that people above 25 years of age watch “Anupama”. This must be tested. So BARC collects some more data on age groups and analyses it. If the analysis shows that people above 25 years of age do watch “Anupama”, BARC accepts the hypothesis; otherwise it rejects it. This is what hypothesis testing is.
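To make the "Anupama" check concrete, one possible framing is to ask whether the share of viewers is higher among people above 25 than among everyone else, using a one-sided two-proportion z-test. That framing, the 5% significance level and all the counts below are assumptions made for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """One-sided z-test of whether group 1's proportion exceeds group 2's."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # proportion if the groups were identical
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    p_value = 1 - NormalDist().cdf(z)  # chance of a z this large by luck alone
    return z, p_value

# Hypothetical counts: 450 of 1,000 viewers above 25 watch the show,
# compared with 300 of 1,000 viewers aged 25 or below
z, p_value = two_proportion_z_test(450, 1000, 300, 1000)

ALPHA = 0.05  # conventional significance level, assumed here
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Data support the hypothesis" if p_value < ALPHA else "Not enough evidence")
```

A small p-value means a difference this large would be very unlikely if both age groups watched the show equally often.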

Hypothesis testing makes use of inferential statistics. It is used to analyse relationships between variables and to make population comparisons through the use of sample data. It falls under the category of statistical tests. Some other methods of testing are correlation tests and comparison tests. Pearson’s r, Spearman’s rank correlation and the Chi-square test are examples of correlation (association) tests, whereas the t-test and ANOVA are examples of comparison tests. We will explore this topic in the coming article on inferential statistics.
