Understanding Multiclass Classification and Clustering

January 15, 2022

In the previous article, where we took the example of Iris flowers, we considered only two features of the flower, but in the real world this is often not enough. We would need more dimensions to classify things efficiently.

Suppose that along with petal length and petal width we also have data on sepal length and sepal width. Now we have 4 dimensions. Our brain can process up to 3 dimensions accurately; beyond that, visualizing becomes difficult.

For example, in binary classification we classified flowers as setosa or versicolor based on 2D data. It is easy to plot the data and see how the classes are separated on the graph. Things look different in higher dimensions, even though the data can still be classified into categories just as in 2D binary classification.

Our algorithms are based on math, and math does not care about the number of dimensions; neither do our algorithms. But that does not mean we can ignore dimensionality: as the number of dimensions grows, both the computation time and the memory the algorithms need increase as well.

In the real world most data has multiple classes: classifying fruits, classifying movies by genre, classifying weather, and so on. Building a single multiclass classifier can become complicated, so instead we can build a set of binary classifiers and combine their results to produce a multiclass prediction. Binary classifiers are simpler and faster than multiclass classifiers.

Popular algorithms that can be used for multi-class classification include:

  • k-Nearest Neighbors
  • Decision Trees
  • Naive Bayes
  • Random Forest
  • Gradient Boosting

In this section we will discuss two popular ways of using binary classification to perform multiclass classification.

  • OvR (One versus Rest)
  • OvO (One versus One)

OvR (One versus Rest)

Let us consider four classes as shown in the following figure-


Now instead of building one classifier which handles all 4 classes, we can build 4 different binary classifiers, one per class. For example, the class_1 classifier decides whether an item belongs to class 1 or not. Similarly, the class_2 classifier decides whether an item belongs to class 2 or not, and so on. After training all these classifiers, we feed the data to an algorithm which builds boundaries around each class.


Now it might happen that some data points appear to belong to 2 or more classes. In that case the data point is run through all the binary classifiers and we pick the class whose classifier reports the highest probability for that point.
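The selection rule can be sketched in a few lines of Python. Note that the "classifiers" below are hypothetical stand-ins (simple distance-based score functions I made up for illustration), not trained models; the point is only the pick-the-highest-probability step.

```python
def ovr_predict(point, binary_classifiers):
    """Run the point through every binary classifier and return the
    class whose classifier reports the highest probability."""
    scores = {label: clf(point) for label, clf in binary_classifiers.items()}
    return max(scores, key=scores.get)

# Toy 1-D example: each stand-in "classifier" scores how close the
# point is to its class centre (higher score = more likely member).
centres = {"class_1": 0.0, "class_2": 5.0, "class_3": 10.0, "class_4": 15.0}
classifiers = {
    label: (lambda p, c=c: 1.0 / (1.0 + abs(p - c)))
    for label, c in centres.items()
}

print(ovr_predict(4.2, classifiers))  # nearest centre is 5.0 -> class_2
```

In a real pipeline each entry in `binary_classifiers` would be a trained model's probability output rather than a hand-written score.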


The good thing about this approach is that it is conceptually simple and fast. The downside is that we have to train one classifier per class and run every input sample through all of them to find out which class it belongs to.

OvO (One versus One)

Again consider the 4 classes shown in the previous diagram. In this case we consider two classes at a time and check which of the two an item belongs to. For example, take class_1 and class_2: if an item belongs to class_1 we put it in class_1, and if it belongs to class_2 we put it in class_2. After doing this we take another pair, class_1 and class_3, and classify the items into their respective classes. Similarly we go on classifying items for class_1 and class_4, class_2 and class_3, class_2 and class_4, and class_3 and class_4.


It might again happen that a data point falls into more than one region. In that case we look for the class with the majority of votes. For example, in the figure below the point landed in class_1 three times, so we say that class_1 has the majority of votes.
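The majority-voting step can be sketched as follows. The pairwise classifier here is again a hypothetical stand-in (it just picks whichever class centre is nearer), used only to show the vote counting over all pairs:

```python
from itertools import combinations
from collections import Counter

def ovo_predict(point, labels, pairwise_winner):
    """For every pair of classes, ask a pairwise classifier which of
    the two the point belongs to; return the class with most votes."""
    votes = Counter()
    for a, b in combinations(labels, 2):
        votes[pairwise_winner(point, a, b)] += 1
    return votes.most_common(1)[0][0]

# Stand-in for trained pairwise classifiers: nearer class centre wins.
centres = {"class_1": 0.0, "class_2": 5.0, "class_3": 10.0, "class_4": 15.0}
nearer = lambda p, a, b: a if abs(p - centres[a]) <= abs(p - centres[b]) else b

print(ovo_predict(1.3, list(centres), nearer))  # class_1 wins its 3 pairings
```

With 4 classes there are 6 pairs to vote over, matching the pairings listed above.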


It is clear that one-versus-one requires far more classifiers than one-versus-rest, but it is sometimes appealing because it shows how the sample was evaluated with respect to each possible pair of classes. The downside is that as the number of classes grows, the number of binary classifiers grows quadratically (k classes need k(k−1)/2 pairwise classifiers), which costs memory and time.


Clustering

Suppose I give you a bucket full of 6 varieties of fruits and 6 baskets. Your task is to put similar fruits together in one basket. You will complete the task in minutes. You will not even care about the names of the fruits: just based on a few dimensions you will cluster the fruits into different baskets. Humans are pretty good at clustering.

Clustering is another method to classify unlabeled data by breaking the data into similar chunks. This method of classification falls under the umbrella of unsupervised learning.

Continuing with the above example, I again give you a bucket full of different varieties of fruits, but this time you don't know in advance how many varieties there are. Still, you complete the task with ease and cluster similar fruits into 6 different baskets. Superb! See how good our brain is?

When we perform a similar task on a computer we need to tell the program how many clusters we need; that number is represented by k. We say that k is a hyperparameter: a value we choose before training the system. Our chosen value of k tells the algorithm how many regions to build. Because the algorithm uses the means, or averages, of groups of points to develop the clusters, it is called k-means clustering.
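The idea can be sketched in plain Python for 1-D data. This is a minimal sketch, not a production implementation; the function name and the toy fruit-weight data are my own:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Tiny 1-D k-means: pick k starting centres, then alternate
    between assigning each point to its nearest centre and moving
    each centre to the mean (average) of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centre.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# Two obvious groups of "fruit weights": around 100 g and around 300 g.
data = [98, 101, 103, 99, 297, 302, 305, 299]
print(kmeans(data, k=2))  # centres settle near 100 and 300
```

Real implementations work in many dimensions and use distance between feature vectors instead of `abs`, but the assign-then-average loop is the same.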

Now the freedom to choose the value of k is a blessing as well as a curse. If we know how many clusters the data should be divided into, simply supplying the right k does the job. Even then, the computer may not put the boundaries exactly where we expect them, though this matters less when the clusters are well separated by large gaps. The real downside of having to specify k is that we may have no idea how many clusters best describe our data.


From the above image we can see how the clusters may get divided if we don't know the right value of k. Here k=5 best describes the data, but if we don't know that, we may end up with one of the other clusterings, which will be wrong for any value other than k=5.

One good way of eliminating these problems is to train our clustering model several times with different values of k and then choose the value which performs best. This task of finding an appropriate value of k is called hyperparameter tuning.

The only disadvantage is that this takes a lot of computational resources and time. One way to reduce the search is to analyse the data with visualization tools first and then feed the model the right value of k, or a short list of candidate values. This can save time and computational resources.
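The try-several-k loop can be sketched directly: run k-means for each candidate k, score each run by its inertia (the within-cluster sum of squared distances), and look for where the improvement flattens out. A minimal 1-D k-means is included here just to keep the sketch self-contained; the data and thresholds are illustrative assumptions:

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    # Tiny 1-D k-means: assign each point to its nearest centre,
    # then move each centre to the mean of its assigned points.
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - centres[i]))].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres

def inertia(points, centres):
    # Within-cluster sum of squared distances: lower = tighter clusters.
    return sum(min((p - c) ** 2 for c in centres) for p in points)

data = [98, 101, 103, 99, 297, 302, 305, 299]  # two obvious groups
for k in range(1, 5):
    print(k, round(inertia(data, kmeans(data, k)), 1))
# The inertia drops sharply from k=1 to k=2, then the improvement
# flattens - a hint that two clusters describe this data well.
```

In practice each k would also be re-run from several random starting centres, since k-means can get stuck in a poor local arrangement.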
