Evaluation metrics of classification models - A beginner-friendly guide

Debadri Sengupta
6 min read · Jun 6, 2020

A hearty welcome to all my readers! In my first story, I discuss various methods used while evaluating a supervised learning model for classification.

Confusion matrix

Source: https://glassboxmedicine.com/2019/02/17/measuring-performance-the-confusion-matrix/

A confusion matrix is a table showing where the model got confused: it breaks down the predictions by actual class and predicted class.

To better understand these terms, let’s use a medical model.

· True positive is when the patient is sick and the model predicts him to be sick

· False positive is when the patient is healthy but the model predicts him to be sick

· False negative is when the patient is sick but the model predicts him to be healthy

· True negative is when the patient is healthy and the model predicts him to be healthy

Confusion matrix example

This is a confusion matrix of a spam classifying model. Essentially, the principal diagonal is the area where the model is not confused and returns accurate results. All other squares are where the model returned inaccurate results.

This is a confusion matrix for three possible illnesses A, B and C. Instead of the number of each type of classification, it indicates the probability of being classified into each type. Each row adds up to 1. Again, squares other than those in the principal diagonal indicate where the model was confused.
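If you want to build one yourself, scikit-learn has this built in. Below is a minimal sketch with made-up spam labels (1 = spam, 0 = not spam); the normalize option produces row-wise probabilities like the three-illness example above.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # actual classes (1 = spam, 0 = not spam)
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]   # model predictions

# In scikit-learn, rows are actual classes and columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
# [[3 2]    -> TN = 3, FP = 2
#  [1 4]]   -> FN = 1, TP = 4

# normalize="true" divides each row by its total, so each row sums to 1,
# like the probability-style matrix for the three illnesses above.
print(confusion_matrix(y_true, y_pred, labels=[0, 1], normalize="true"))
```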

We’ll try to understand the next metrics in terms of the confusion matrix.

Accuracy

It is essentially the (number of data points properly classified) over the (total number of data points), i.e. Accuracy = (TP+TN)/(TP+TN+FP+FN). It is also the number of points on the principal diagonal of the confusion matrix over the total number of points.

Accuracy fails when the classes are heavily imbalanced. For example, in a tax evasion prediction model, suppose there are 300 tax evaders out of 30,000 candidates, and our model predicts all of them to be clean. Even then the model has an accuracy of ((30000-300)/30000)*100 = 99%! But our model is terrible, right? Even a single evader may cost the government millions!
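A quick sketch of that tax-evasion example, assuming a model that simply predicts everyone as clean:

```python
from sklearn.metrics import accuracy_score

y_true = [1] * 300 + [0] * 29700   # 1 = tax evader, 0 = clean
y_pred = [0] * 30000               # the "everyone is clean" model

print(accuracy_score(y_true, y_pred))   # 0.99 -> 99% accuracy, yet a useless model
```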

Precision and Recall

Precision answers the question: out of all the people diagnosed as sick, how many were actually sick?

Recall (or sensitivity) answers the question: out of all the people who were actually sick, how many did we diagnose as sick?

This is the simplest way to put it in terms of a medical model. Let’s now look at the formulae.

[Note: Before proceeding, please revise the abbreviations like FN, TP etc. from the first image]

Precision = TP/(TP+FP)

Recall = TP/(TP+FN)
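As a sketch, here are precision and recall computed both from the raw counts and with scikit-learn, reusing the made-up labels from the confusion matrix snippet above:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]

TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))   # 4
FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))   # 2
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))   # 1

print(TP / (TP + FP), precision_score(y_true, y_pred))   # ~0.67 for both
print(TP / (TP + FN), recall_score(y_true, y_pred))      # 0.8 for both
```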

If we want a smaller number of false negatives, we focus on recall. This is essential for medical models where we do not want to classify a sick patient as healthy!

On the other hand, focusing on precision keeps the number of false positives small. This is important for, let's say, a spam classifier, where we do not want important messages to be classified as spam!

It is interesting to note that precision and recall have an antagonizing effect on one another. Let's say we want to predict whether a passenger on an airplane is a terrorist or not. If we classify all passengers as terrorists, we get a recall of 1 (no false negatives), but we have to put a bar on literally all the passengers! Conversely, if we flag only the passengers we are absolutely certain about, we can push precision towards 1, but we risk letting real threats through. A proper balance is essential.

This is where F-beta score comes in.

F-beta score-

F-beta score = ((1+beta^2)*precision*recall)/((beta^2*precision)+recall)

A larger beta places greater emphasis on recall: as beta increases, the beta^2*precision term dominates the denominator, the precision factors roughly cancel, and the score moves towards recall. Conversely, a smaller beta places greater emphasis on precision (at beta = 0 the score reduces to precision itself).

As an example, we may want a larger beta value in a video recommendation system, as we don't want to miss out on suggesting a video to a user. A smaller beta value may be crucial for a company implementing a model to send sample products to prospective users, as a large number of false positives may lead to spending on the wrong people!

Its most famous form is the F-1 score, which is the harmonic mean of precision and recall, placing equal weightage on both.

F-1 score = (2*precision*recall)/(precision+recall)
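Here is a small sketch of how beta shifts the balance, using scikit-learn and the same illustrative labels as before (precision ≈ 0.67, recall = 0.8):

```python
from sklearn.metrics import fbeta_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]

print(fbeta_score(y_true, y_pred, beta=2))     # ~0.77, pulled towards recall (0.8)
print(fbeta_score(y_true, y_pred, beta=0.5))   # ~0.69, pulled towards precision (~0.67)
print(f1_score(y_true, y_pred))                # ~0.73, equal weightage (beta = 1)
```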

Sensitivity and Specificity

Sensitivity is the exact same thing as recall.

Specificity answers the question: out of all the people who were actually healthy, how many did we diagnose as healthy?

Specificity = TN/(TN+FP)

If we want a smaller number of false positives, we can also focus on specificity.
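scikit-learn has no specificity_score, but since specificity is just the recall of the negative class, both routes below give the same number (again with the illustrative labels used earlier):

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]

# For binary labels, ravel() unpacks the matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))                              # 0.6
print(recall_score(y_true, y_pred, pos_label=0))   # 0.6, recall of the negative class
```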

We plot all the above metrics in confusion matrix terms.

Sensitivity and specificity in a confusion matrix
Precision and Recall in a confusion matrix

Each row and column of the confusion matrix implies some metric. This is why the confusion matrix is so important!

ROC Curve-

ROC Curves for three models. Note that the straight line implies a random model

ROC (Receiver Operating Characteristic) is a curve plotting the True Positive Rate against the False Positive Rate.

True positive rate is simply the same as recall.

TPR = (no. of points correctly classified as positive)/(total no. of points actually positive) = TP/(TP+FN)

Similarly,

False positive rate (FPR) = (no. of points incorrectly classified as positive)/(total no. of points actually negative) = FP/(FP+TN)

Instinctively, we want a model that gives a high TPR at a low FPR. Hence, our curve should hug the top-left corner and cover as much of the graph as possible. In more technical terms, the Area Under the Curve (AUC) should be as high as possible.

Now, how do we know the points to plot this curve?

Imagine our model has classified points like this. Points to the left of the split are classified negative while those to the right are positive. Its FPR is 2/(2+5) ≈ 0.286: 2 red (negative) points are classified as positive out of 7 total red points. Its TPR is 5/(5+2) ≈ 0.714: 5 actually positive points have been classified as positive out of 7 total positive points. Hence the point is plotted as (0.286, 0.714) on the curve. Similarly, we plot the points at various splits, resulting in the ultimate ROC curve.

A split at the extreme left classifies everything as positive, hence FPR = TPR = 1, while a split at the extreme right classifies nothing as positive, hence FPR = TPR = 0.
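A rough sketch of this threshold-sweeping idea, with made-up scores arranged so that a split at 0.5 reproduces the (0.286, 0.714) point above; scikit-learn's roc_curve then does the sweep over every split for us:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0] * 7 + [1] * 7                              # 7 red (negative), 7 positive points
y_score = [0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70,    # scores of the negatives
           0.30, 0.45, 0.55, 0.60, 0.75, 0.80, 0.90]    # scores of the positives

# One split by hand: everything with a score >= 0.5 is called positive.
pred = [1 if s >= 0.5 else 0 for s in y_score]
fp = sum(p == 1 and t == 0 for p, t in zip(pred, y_true))   # 2
tp = sum(p == 1 and t == 1 for p, t in zip(pred, y_true))   # 5
print(fp / 7, tp / 7)   # FPR ~0.286, TPR ~0.714 -> one point on the ROC curve

# Every split at once, plus the area under the resulting curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))
```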

Now, let’s take a look at a perfect split.

All splits to the left of the mid-mark classify every positive point correctly, resulting in TPR = 1, but false positives occur too. All splits to the right of the mid-mark have no false positives, resulting in FPR = 0, but false negatives occur. Plotting all these points gives a perfect square of unit area as the ROC curve.

Curve ‘A’ is the perfect model (unit square), curve ‘B’ is a good model, while curve ‘C’ is a random model

Since this is an unattainable model, our model should aim at maximizing AUC towards 1.

The above diagram shows sensitivity and (1-specificity) instead of TPR and FPR on the axes. Recall that TPR is the same as sensitivity. FPR = FP/(FP+TN) while specificity = TN/(TN+FP); hence, 1-specificity is the same as FPR.
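A quick numeric check of that identity, using the illustrative labels from the earlier snippets:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)   # 0.6
fpr = fp / (fp + tn)           # 0.4
print(1 - specificity, fpr)    # both 0.4
```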

This is another form of ROC curve with specificity and sensitivity on the axes. Rotate the axes anti-clockwise by one quadrant to transform it into our standard ROC curve.

Hope you liked my article and it helped you!

The above content is created using what I learned from Udacity and various other platforms. The rights to some images lie solely with Udacity under its license, and they may not be used commercially.

