April 13, 2021

# Fixing Prediction Confidence in Classification Models

## Jacob Rich

One of the main tasks that modern deep neural networks are used for is classification. Classification datasets are comparatively easy to annotate - typically only one value (a category label) is required per sample of data (like an image or a sentence of text) - and deep networks are known to perform well on these tasks. At Y Meadows, one of the main components of the AI is a set of classification models that sort incoming messages into predetermined categories.

Classification models usually work by mapping the input into a numerical representation with a dimensionality equal to the number of categories. So, for example, if we were using a model trained for sentiment analysis, for which the categories are simply “positive” and “negative”, the model would compute an output of two numbers. The first number would correspond to the “similarity” between this input text and the training data that the model saw for the “positive” category, and the second number would be the same for the “negative” category. We can think of these as similarity scores for the text in question. For a given sentence, there might be a lot of contextual similarity between it and positive sentences in the training set, but very little similarity between it and negative sentences in the training set. In that case, the sentence would receive a large first number and a small second number from the model.

To make these output numbers understandable to the user, they are pushed through what is called a *softmax function*, which normalizes the scores so that they add up to one. Now, the output of the model can be interpreted as a probability distribution - the number corresponding to each category is now seen as the probability that the text belongs to that category. In many applications, the number attributed by the model to a category is called the “probability” or the “confidence”. In reality though, the numbers do not represent any probabilistic entity; they are merely normalized similarity scores.
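As a minimal sketch of that normalization step, here is a softmax written with only the Python standard library. The raw scores are hypothetical values chosen for illustration, not the output of any real model.

```python
import math

def softmax(scores):
    """Normalize raw scores into positive values that sum to one."""
    # Subtracting the maximum is a standard trick to avoid overflow in exp();
    # it does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the ("positive", "negative") categories.
probs = softmax([4.0, 1.0])
print([round(p, 3) for p in probs])  # [0.953, 0.047]
```

The outputs are positive and sum to one, which is why they look like probabilities even though nothing in the computation ties them to how often the model is actually right.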

This presents a problem when the “confidence” of a model’s prediction is provided to the user to affect decision-making. An uninitiated user who sees that the “confidence” of a prediction is 90% will assume that the likelihood of the prediction being correct is 90%, when that is in fact not at all the case. Quite often, because of the nature of the softmax function, wrong categories are predicted with confidences that are much higher than they should be if they really reflected a probabilistic confidence. This is a long-standing problem, and many solutions have been suggested. Here I would like to talk about two quite simple solutions that can be applied without having to train the model any differently.

### Calibration With Temperature Scaling

One approach to solving our problem is “calibrating” the model after it is trained. This approach assumes that we have access to a sizable set of labeled data that the model was not trained on and that we can use for calibration. There is a plethora of calibration methods, but as noted in **this paper**, “Temperature scaling is the simplest, fastest, and most straightforward of the [calibration] methods, and surprisingly is often the most effective.” The idea of temperature scaling is simple: even if the output number of a model corresponding to a given category does not represent the likelihood that its prediction is correct, a prediction that gets a high number is still *more likely* to be correct than a prediction with a smaller number. That means that there is probably a simple mapping that can convert our number into what *is* the probability of that category. And that mapping involves just a scale factor, or a “temperature”, applied to the similarity values that are output by the model *before* they are fed to the softmax function. In other words, we divide the similarity scores for the categories by some number (a temperature greater than 1 makes them smaller and shrinks the absolute difference between them), and then they are normalized by the softmax function to become probabilities.

As an example, suppose that our sentiment model gave a sentence a score of 3 for the positive category and 6 for the negative category. When those numbers go through the softmax, they become positive: 0.047 and negative: 0.953. But suppose we add a “temperature” equal to 2; that means we first divide the original scores by 2 - so now the scores are 1.5 and 3, and after the softmax they become positive: 0.182 and negative: 0.818, which is a big difference from what the probabilities were without the added temperature scaling.
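The arithmetic in this example can be reproduced with a few lines of standard-library Python. The function name `softmax_with_temperature` is just an illustrative choice.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Divide the raw scores by `temperature`, then apply the softmax."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [3.0, 6.0]  # (positive, negative), as in the example above
for t in (1.0, 2.0):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
# 1.0 [0.047, 0.953]
# 2.0 [0.182, 0.818]
```

Note that dividing by a positive temperature never changes which category has the highest score, so the predicted label stays the same; only the confidence attached to it moves.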

We can learn the temperature value that we need by running the model on a bunch of test inputs for which we have the category labels. After running the model on all of our test samples, we bin these samples based on the output “confidence” values. So there may be a bin for 0.95-1.0, a bin for 0.90-0.95, etc. If the confidence were to represent a probabilistic value, we would expect that the accuracy of the predictions in the 0.9-0.95 bin would be approximately 92.5%, but in actuality it tends to be lower. We set the target probability value for this bin to be the actual accuracy of the predictions in that bin. Then the problem becomes to find the temperature value for our model such that the probability values after the temperature scaling are as close to the targets for all the bins as possible, and luckily this is a simple optimization problem that can be solved easily.
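The binning procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the data is synthetic (scores deliberately exaggerated by a factor of 3), the helper names are hypothetical, and the temperature is chosen by a plain grid search over candidates. A common alternative is to pick the temperature that minimizes the negative log-likelihood on the held-out data.

```python
import math
import random

def softmax(scores, temperature=1.0):
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def calibration_gap(logits, labels, temperature, n_bins=10):
    """Sample-weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for scores, label in zip(logits, labels):
        probs = softmax(scores, temperature)
        conf = max(probs)
        correct = 1.0 if probs.index(conf) == label else 0.0
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    gap = 0.0
    for b in bins:
        if b:  # skip empty bins
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(k for _, k in b) / len(b)
            gap += len(b) / len(labels) * abs(accuracy - mean_conf)
    return gap

def fit_temperature(logits, labels, candidates):
    """Return the candidate temperature with the smallest calibration gap."""
    return min(candidates, key=lambda t: calibration_gap(logits, labels, t))

def make_overconfident_data(n=2000, seed=0):
    """Synthetic two-class data whose scores are 3x larger than they should be."""
    rng = random.Random(seed)
    logits, labels = [], []
    for _ in range(n):
        d = rng.uniform(-2.0, 2.0)
        p_class1 = 1.0 / (1.0 + math.exp(-d))  # true probability of class 1
        labels.append(1 if rng.random() < p_class1 else 0)
        logits.append([0.0, 3.0 * d])          # exaggerated scores
    return logits, labels

logits, labels = make_overconfident_data()
candidates = [0.5 + 0.1 * i for i in range(46)]  # roughly 0.5 to 5.0
best_t = fit_temperature(logits, labels, candidates)
print("fitted temperature:", best_t)
print("gap before:", calibration_gap(logits, labels, 1.0))
print("gap after: ", calibration_gap(logits, labels, best_t))
```

Because there is only a single parameter to fit, even a coarse search like this one is cheap. On the synthetic data the fitted temperature comes out above 1, which is what we expect for a model whose raw confidences are too high.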

The benefit of this kind of calibration is that it is simple, intuitive, and easy to implement. However, there are two main downsides. First, it assumes that we have a relatively large set of labeled samples to use for the calibration process, and this is not always the case. Second, calibration only fixes the problem of the scale of the probability scores for data that is within the distribution that the model was trained on. In practice, much of actual prediction uncertainty comes from the fact that incoming data is in some way dissimilar from the data that the model has seen during training. If some text is given to the model that is entirely different from the type of data in the training set *and* in the set used for calibration (whether it be with respect to its style, language, context, or otherwise), then the calibration will not necessarily have solved the problem.

### Using Dropout to Simulate Ensembles

When it comes to text categorization, the main source of uncertainty that would affect a model’s prediction is what is known as **epistemic uncertainty**, meaning uncertainty that comes from not having seen enough data to make an informed decision about the new incoming text input. In 2016, **Gal et al.** showed that this type of uncertainty could be modeled in deep networks using “dropout” layers. What are dropout layers? Well, a deep network is usually made up of stacked layers. Each layer is just a function which takes the output of the previous layer as its input and computes a new output to pass to the next layer. Typically, the output of a layer is just a vector of numbers. By adding dropout to a layer, a specific percentage (say 10%) of the numbers in the output vector are randomly selected and set to zero. This is often done after many of the layers during the training process as a form of regularization - to prevent the model from over-fitting to the training set by not allowing it to overly rely on a small number of features. But usually, after the training is finished, the dropout layers are disabled so as to make the model deterministic without any randomness, and to take advantage of all the features.
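To make the mechanics concrete, here is a minimal standard-library sketch of a dropout step applied to a layer's output vector. Two details are assumptions on my part rather than something described above: each entry is dropped independently with the given probability (so the fraction zeroed is only 10% on average), and the surviving entries are scaled up by 1/(1 - rate), the "inverted dropout" convention that most deep learning frameworks use to keep the expected value of each entry unchanged.

```python
import random

def dropout(values, rate, rng, training=True):
    """Zero each entry independently with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(values)  # dropout disabled: the output passes through unchanged
    keep = 1.0 - rate
    # Surviving entries are scaled by 1/keep so the expected value is preserved.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
layer_output = [0.5, -1.2, 0.8, 2.0, -0.3, 1.1, 0.0, 0.7, -0.9, 1.5]  # hypothetical values

# During training: a random subset of entries is zeroed on every call.
print(dropout(layer_output, rate=0.1, rng=rng))

# After training: the layer is disabled and the output is deterministic.
print(dropout(layer_output, rate=0.1, rng=rng, training=False))
```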

To understand how this is useful to determine the uncertainty of a model, let’s imagine trying to teach a group of children the difference between a horse and a cow. We show the children ten pictures of cows and ten pictures of horses. Now, we want to test their knowledge. If we show them another picture of a cow or a horse, chances are that all of them - or at least most - would guess correctly. But what if we showed them a picture of a donkey, and told them they have to guess cow or horse? Probably some would guess “cow” and some would guess “horse” - it certainly would not be as unanimous as when we showed them an actual cow or horse picture. This is because each child’s set of visual features they’ve determined for horses and cows is slightly different, and therefore when they are presented with something which they’ve never seen before, their comparisons to their known features of cows and horses yield varying responses.

The same is true with our models. If we train a group of models using the same data, then when we present the models with data similar to what they have seen during training (in-distribution data), they are likely to agree in their predictions. However, if we present the models with data that is unlike what they have seen during training (out-of-distribution data), we are likely to get varying responses from the various models. Averaging the predictions given to us by all the models is likely to give us a better representation of the probability that the prediction is correct. In Machine Learning, this approach is called **ensembling**, but when it comes to deep networks, we can simulate the existence of many models using dropout at prediction time. By randomly zeroing out features from each layer of the network, we are ensuring that each time the model is run, it relies on a different subset of its features, which makes it in reality a slightly different model each time we get a prediction. If we predict on the same text many times, each time applying random dropout, then the higher the variance is between all the predictions, the higher the epistemic uncertainty is, and either way, averaging the predictions will provide a refined confidence in the same way as model ensembles. (A **study performed by Google Research** showed that in many cases the dropout approach performs just as well if not better than ensembling, and almost always better than temperature scaling, in terms of the confidence representing a probabilistic reality.)
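The repeated-prediction procedure can be sketched in plain Python. Everything model-related here is a stand-in: `toy_classifier` is a hypothetical one-layer network with dropout on its input features, and the weights and features are made-up numbers. With a real network, the same loop would call the actual model with its dropout layers left active.

```python
import math
import random
from statistics import mean, pvariance

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_classifier(features, weights, rate, rng):
    """Stand-in for a real network: dropout on the inputs, a linear layer, softmax."""
    keep = 1.0 - rate
    dropped = [f / keep if rng.random() < keep else 0.0 for f in features]
    scores = [sum(w * f for w, f in zip(row, dropped)) for row in weights]
    return softmax(scores)

def mc_dropout_predict(predict_fn, n_runs=50):
    """Run a stochastic predictor n_runs times and summarize the results.

    Returns the mean probability per class, the variance per class,
    and how many runs voted for each class.
    """
    runs = [predict_fn() for _ in range(n_runs)]
    n_classes = len(runs[0])
    mean_probs = [mean(r[c] for r in runs) for c in range(n_classes)]
    variances = [pvariance([r[c] for r in runs]) for c in range(n_classes)]
    votes = [0] * n_classes
    for r in runs:
        votes[r.index(max(r))] += 1
    return mean_probs, variances, votes

rng = random.Random(0)
weights = [[2.0, -1.0, 0.5, 1.5], [-1.0, 2.0, 1.0, -0.5]]  # hypothetical weights
features = [1.0, 0.2, 0.1, 0.9]                            # hypothetical input

mean_probs, variances, votes = mc_dropout_predict(
    lambda: toy_classifier(features, weights, rate=0.3, rng=rng)
)
print("mean probabilities:", mean_probs)
print("variances:", variances)
print("votes per class:", votes)
```

The averaged probabilities play the role of the refined confidence, while the variance and the vote split indicate how much the simulated ensemble members disagree.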

The main benefit of this approach is that the resulting confidence score represents the probability that the model is correct *given the training data that it has seen*. Also, unlike the calibration approach, it does not require a set of annotated test data to use for calibrating the scores. The main downside is that it is computationally expensive; the model needs to be run a bunch of times to get the score instead of just once.

### An Example of Using Dropout for Sentiment Analysis

To illustrate the above approach of using dropout in practice, I downloaded **HuggingFace**’s most popular model trained for sentiment analysis: **distilbert-base-uncased-finetuned-sst-2-english**. This is HuggingFace’s own **DistilBERT** model fine-tuned on the **SST-2** dataset. The model is trained to predict a score for two classes, “positive” and “negative”, and the training data is snippets of movie reviews taken from the Rotten Tomatoes website. Here’s one example of a sentence from the training set:

*The film overcomes the regular minefield of coming-of-age cliches with potent doses of honesty and sensitivity .*

Now, let’s try the model on a sentence that is in-distribution to the training data. Something like this:

*I tried to watch the movie in its three-hour entirety, but I was overwhelmed by the desire to turn it off and go to sleep.*

Running the model on the above text returns a prediction of “negative” with a score of 0.9993. Now, I will turn on dropout layers in the model, and run the model on the same text 50 times. After averaging the results, I get the same prediction with a score of 0.9916, and when I compute the variance of the predictions I get 0.0000001, which tells me that all the runs of the model produced almost identical predictions; this means the model is very certain.

Now, let’s try another input: a different type of text, unlike anything the model is likely to have seen during training, i.e. out-of-distribution data. In the world of customer service, here is a typical message one might get from a customer:

*Hello. I am trying to change the email address associated with my account and am having trouble finding the correct way to do this. I would really appreciate some help, and it would be exceptionally great if someone could call me to get this sorted out. - Richard*

When we give this to the model, it predicts “positive” with a score of 0.9518. But after turning on dropout and averaging over 50 runs, we get “positive” with a score of 0.656, significantly lower than the out-of-the-box model. The variance of the predictions is 0.127 - approximately one million times higher than the variance for the previous text we tried! This means that the model is extremely uncertain about its prediction - in fact out of the 50 runs of the model, 16 of those times it predicted “negative”.

In this case, since the DistilBERT model is so efficient (it averaged around 14 milliseconds per prediction on my laptop - without any GPU processing), even running it 50 times still takes less than a full second.

### Conclusion

The two methods discussed above are probably the most popular approaches to obtaining a more realistic “confidence” score from a deep neural network than the one provided by default. Temperature scaling is probably the most appropriate in applications where runtime is important and where data is available for calibration after training. The dropout approach, though, has been shown to be generally more effective, especially in the domain of text classification, and can be useful for other things besides just a refined confidence value. For instance, computing the uncertainty of the model on a set of unlabeled data could show which data would be most useful to annotate and add to the training set.

At Y Meadows, we expect the confidence associated with a message’s intent to be informative to the user’s decision of how to handle it. As an example, they may want to confirm certain actions with a human agent if the intent prediction is below a threshold confidence value. For that, it is important that the confidence given by our model be intuitive and easy to understand.
