Picture a scenario where you are relaxing on your couch and your phone rings. The number is unknown to you. Out of curiosity, you pick up the call, and an agent on the other side claims that you have been awarded INR 10 lakh in a random lottery! He asks for your bank account details so that he can credit the amount. Excited by such a huge sum, you send him your bank details. After some time you check your account, only to realize that it has been wiped clean. It dawns on you that you have been defrauded. This could have been avoided had the call been marked as spam. This is where classification comes in. Nowadays many phones have a built-in spam classifier, and apps such as Truecaller do the same job. Such a classifier labels phone numbers as spam or not spam based on the reports users have previously submitted about those numbers. Spam classifiers are also used to classify emails and messages.
Classification algorithms are used in a wide range of business applications, from e-commerce recommendations to detecting fraudulent transactions. For example, Amazon studies your buying patterns to recommend similar products that might interest you. Suppose you want to buy a laptop with the following features: 8 GB RAM, 256 GB SSD, a 10th-generation i5 processor, a 2 GB graphics card and a 14-inch FHD display, at a price of INR 50,000. You are searching for these high-end features at the lowest possible price. Amazon notes this behaviour, and its classifier puts you among the cluster of people who look to buy good products at a low price. So the next time you visit the app, it will show you high-quality products that are available at discounted or lower prices.
The applications of classification algorithms are not limited to business and spam filtering; they are also used in Natural Language Processing (NLP), medical science and other fields.
Why do we need machine learning algorithms to do the classification? Let’s understand this with the following example.
In today’s world, billions of people are active on social media. They share all kinds of things on the internet, such as photos, videos, memes and messages. But certain groups of people post hate speech or controversial photos and videos that may trigger social unrest, which may result in violence. Such content needs to be removed from social media because it is harmful. So how can we accomplish this task? One way is to hire people to remove the harmful material from the platforms, but this is inefficient: it takes a lot of time, and by then the content may already have triggered violence somewhere. Also, a person hired by the company might be biased towards a certain community and might not remove certain posts, which again might lead to violence. To overcome these problems, we use machine learning algorithms, which can remove harmful material from social media far more quickly and apply the same criteria to every post.
So how does classification work exactly?
Consider a box containing geometrical figures such as cubes, pyramids, spheres, cones and cylinders. You are given five more boxes, each marked with the logo of one shape: a cube, a pyramid, a sphere, a cone and a cylinder. Your task is to separate the shapes and put each one in the box with the matching logo. This process is called classification or categorization.
In this article we will be discussing the following topics:
- Two-Dimensional Binary Classification
- 2D Multiclass Classification
- Multiclass Classification
We will be covering the first two items in the first part of this article.
1. Two-Dimensional Binary Classification
Suppose we have thousands of pictures of cats and dogs, with one folder labeled cats containing the pictures of cats and another folder labeled dogs containing the pictures of dogs. We feed these pictures as input and their labels as the expected output to teach our classifier to differentiate between cats and dogs. The set of samples we use to train our machine is called the training set, and the separate set of samples on which we ask it to predict the output is called the testing set. This is an example of supervised learning. In supervised learning we guide our machine, using labeled examples, to distinguish between things.
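To make the idea of a training set and a testing set concrete, here is a minimal sketch in plain Python. The file names, the `train_test_split` helper and the 25% test fraction are all assumptions made up for illustration; they are not part of any particular library.

```python
import random

def train_test_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle labeled samples and split them into a training and a testing set."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test

# Hypothetical stand-ins for the image files and their folder labels.
samples = [f"img_{i:03d}.jpg" for i in range(8)]
labels = ["cat", "dog"] * 4

train_set, test_set = train_test_split(samples, labels)
print(len(train_set), len(test_set))  # 6 2
```

The classifier only ever learns from `train_set`; `test_set` is kept aside so that we can check how well it does on pictures it has never seen.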
In the above example our input data belongs to only two different classes. Since there are two possible classes for every input, it is known as binary classification.
So what are the two dimensions in Two-Dimensional Binary Classification? They are the two numbers that describe each input. We can therefore show each input as a point on a plane, which means that our dataset becomes a bunch of points or dots on the page.
The technique used for classifying this data is called the boundary method: the input data is plotted and a line or curve is drawn separating the two classes. Some boundaries are better at predicting the results than others.
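As a rough sketch of why some boundaries are better than others, the snippet below compares two candidate straight lines on a handful of made-up 2D points. The points, the labels and both lines are assumptions chosen purely for illustration; each line is written as w1*x + w2*y + b = 0, and a point is assigned class 1 when it falls on the positive side.

```python
def predict(point, weights, bias):
    """Return 1 if the point is on the positive side of the line w.x + b = 0, else 0."""
    x, y = point
    return 1 if weights[0] * x + weights[1] * y + bias > 0 else 0

def accuracy(points, labels, weights, bias):
    """Fraction of points whose predicted class matches the true label."""
    correct = sum(predict(p, weights, bias) == t for p, t in zip(points, labels))
    return correct / len(points)

# Made-up points: class 0 sits near the origin, class 1 further out.
points = [(1.0, 1.0), (1.5, 0.5), (2.0, 1.0), (3.0, 3.0), (3.5, 2.5), (4.0, 3.5)]
labels = [0, 0, 0, 1, 1, 1]

line_a = ((1.0, 1.0), -4.5)   # the line x + y = 4.5
line_b = ((1.0, 0.0), -1.75)  # the vertical line x = 1.75

print(accuracy(points, labels, *line_a))  # 1.0: separates all six points
print(accuracy(points, labels, *line_b))  # about 0.83: misclassifies (2.0, 1.0)
```

Training a classifier can be seen as searching for the line (or curve) that scores best on the training data by some such measure.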
For better clarity, let’s take the example of two variants of the iris flower: setosa and versicolor. We have two features of the flower, petal length and petal width. Based on these two features we plot the graph and can draw a line separating the two classes, setosa and versicolor.
In the plot, 0 represents setosa and 1 represents versicolor. We are done with our classifier here: when we feed our machine some unseen data, it will classify the flower as versicolor or setosa. Now let’s suppose that we have enlarged our dataset; plotting it gives us this graph:
In the above graph we can still separate the classes, but this time the line is not straight; it is curved. This isn’t a problem, as we can still classify the flowers. The color blue represents setosa and orange represents versicolor. We call the sections into which we chop up the plane decision regions, or domains, and the lines or curves between them decision boundaries.
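One way to see decision regions form is to label every point of a coarse grid with a simple classifier. The sketch below uses k-nearest neighbors, one of several methods that can produce curved boundaries; it is not necessarily how the plots in this article were made. The (petal length, petal width) values are made up to look plausible and are not taken from the real iris dataset.

```python
import math
from collections import Counter

def knn_predict(query, points, labels, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    nearest = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

def region_map(points, labels, xs, ys, k=3):
    """Classify every grid point; each row is a string of predicted class digits."""
    return ["".join(str(knn_predict((x, y), points, labels, k)) for x in xs) for y in ys]

# Made-up (petal length, petal width) pairs in cm: 0 = setosa, 1 = versicolor.
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (1.7, 0.4),
          (4.7, 1.4), (4.5, 1.5), (4.0, 1.3), (3.5, 1.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(knn_predict((1.6, 0.3), points, labels))  # 0 (setosa)
print(knn_predict((4.2, 1.2), points, labels))  # 1 (versicolor)

# Petal length runs left to right, petal width top to bottom.
for row in region_map(points, labels, xs=[1.0, 2.0, 3.0, 4.0, 5.0], ys=[1.5, 1.0, 0.5]):
    print(row)
```

In the printed grid, the block of 0s is the setosa decision region, the block of 1s is the versicolor region, and the decision boundary runs wherever the digits change.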
Now suppose we receive another batch of flower data from a different region. We plot the input as before, and this time we get the image shown in the picture below:
This time it is difficult to draw a boundary, even though there is a blue region and an orange region. In this case, instead of predicting a single class with absolute certainty, we assign a probability to each possible class.
A point deep in the orange region is versicolor with the highest probability, and that probability decreases as we move from the orange region towards the blue region. Similarly, a point deep in the blue region is setosa with the highest probability. If we can tolerate some false positives or false negatives, we can use a measure such as accuracy or precision to choose the curve that separates the two classes. Taking versicolor as the positive class, a false positive is a point that lies in the orange region and is therefore labeled versicolor although it is actually setosa, and a false negative is a point that lies in the blue region and is therefore labeled setosa although it is actually versicolor.
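The following sketch counts false positives and false negatives and derives accuracy and precision from them, taking versicolor (1) as the positive class. The two label lists are invented for illustration and do not come from a real classifier.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1      # predicted versicolor, really versicolor
        elif p == positive:
            fp += 1      # predicted versicolor, really setosa
        elif a == positive:
            fn += 1      # predicted setosa, really versicolor
        else:
            tn += 1      # predicted setosa, really setosa
    return tp, fp, tn, fn

# Made-up labels: 1 = versicolor, 0 = setosa.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)   # share of all predictions that are right
precision = tp / (tp + fp)           # share of "versicolor" predictions that are right

print(tp, fp, tn, fn)       # 3 1 3 1
print(accuracy, precision)  # 0.75 0.75
```

Accuracy treats both kinds of error alike, while precision only penalizes false positives, so the two can favor different boundaries.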
When the decision boundaries are sharp, the probabilities are simply 1 or 0. When the regions are fuzzy, on the other hand, each class has some non-zero probability. Whatever the case, in practice we need to convert probabilities into decisions.
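A minimal sketch of turning probabilities into decisions is shown below. It assumes a probability produced by the logistic (sigmoid) function applied to a linear score, which is one common choice and not necessarily the one behind the plots above; the weights, bias and thresholds are made-up values.

```python
import math

def versicolor_probability(point, weights, bias):
    """Squash the linear score w.x + b into a probability between 0 and 1."""
    score = weights[0] * point[0] + weights[1] * point[1] + bias
    return 1.0 / (1.0 + math.exp(-score))

def decide(probability, threshold=0.5):
    """Convert a probability into a hard class decision: 1 = versicolor, 0 = setosa."""
    return 1 if probability >= threshold else 0

# Hypothetical model over (petal length, petal width).
weights, bias = (1.0, 1.0), -4.0

for point in [(1.4, 0.2), (3.0, 0.8), (4.5, 1.4)]:
    p = versicolor_probability(point, weights, bias)
    print(point, round(p, 2), decide(p))

# The same probability can lead to different decisions under different thresholds.
p_border = versicolor_probability((3.0, 0.8), weights, bias)
print(decide(p_border, threshold=0.5), decide(p_border, threshold=0.4))  # 0 1
```

Lowering the threshold labels more flowers as versicolor, which trades false negatives for false positives; the right threshold depends on which mistake costs more.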
2. 2D Multiclass Classification
As we learn more about the iris, we find that there is a third variant named virginica. Now we have three classes of iris: setosa, versicolor and virginica. Again we need to train our machine to distinguish between the variants as we did in binary classification, but this time it needs to recognize virginica as well. So the next time we feed new data into the machine, it will classify it as setosa, versicolor or virginica. This task of classifying 2D input into more than two classes is called 2D multiclass classification. We need to find the boundaries that define a region for each class.
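As an illustration of three decision regions, here is a small sketch using a nearest-centroid rule: each class is summarized by the average of its training points, and a new flower gets the label of the closest average. This rule is chosen here only because it is short; the measurements are made up and are not taken from the real iris dataset.

```python
import math

def class_centroids(points, labels):
    """Average the 2D points of each class to get one centroid per class."""
    sums = {}
    for (x, y), label in zip(points, labels):
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(point, centroids):
    """Return the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

# Made-up (petal length, petal width) pairs in cm.
points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2),
          (4.5, 1.4), (4.0, 1.3), (4.7, 1.5),
          (5.8, 2.2), (6.0, 2.5), (5.5, 2.0)]
labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3

centroids = class_centroids(points, labels)
print(predict((1.6, 0.4), centroids))  # setosa
print(predict((4.3, 1.3), centroids))  # versicolor
print(predict((5.9, 2.1), centroids))  # virginica
```

Nothing in the code is specific to three classes: adding a fourth label to the training data would simply add a fourth region.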
Some common algorithms used for 2D binary and 2D multiclass classification are:
- Logistic Regression (for binary classification in its basic form)
- k-Nearest Neighbors
- Decision Trees
- Support Vector Machine (for binary classification in its basic form)
- Naive Bayes
We will discuss these in a separate article.
In the next article we will discuss the remaining classification tasks.