Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

Deniz Koyuncu, Rensselaer Polytechnic Institute

Jacob Rich, Y Meadows Inc.

When machine learning models are trained to perform a particular task, we usually collect a set of samples, the “test set”, that is representative of the task at hand in order to measure the model’s performance. If there are any limitations on the samples collected – and there usually are – the measured performance we get from the test set often will not match the performance measured when the model is actually deployed in the field. One common example is when the test set samples come from a different source than the samples the model is being used for in practice; for instance, the test samples for a “sentiment classification” model might have been collected from Instagram comments, but the model is then put into production to analyze Twitter feeds. It’s very common for test sets to be merely a “hold-out” from the samples collected for training, which means that the source and style of the test samples will closely match what the model was trained on, much more so than any samples it will be shown in production.

In particular, with respect to image classification, it has been found that state-of-the-art classifiers trained on the ImageNet dataset can suffer accuracy drops of 11% to 14% when evaluated on a new test set collected in a similar way [1].

One simple way to improve the estimate of the model’s performance on a task is to obtain additional labeled samples, preferably without the limitations of the original test set. However, the collection process can be labour intensive, and the new samples may still share the limitations of the original dataset if those limitations have not been identified. Even if the sample collection process is unbiased, low performance on the new dataset (which can prompt a search for new models) does not by itself pinpoint the limitations of the existing model.

Intuitively, the problem with a model that tests well experimentally but performs poorly in practice is that it has not learned patterns that generalize broadly enough to handle data that looks somewhat different. In an interesting new paper, “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList” [2], the authors instead suggest using a palette of interpretable tests that reveal whether the model has certain linguistic skills, which they refer to as capabilities. In some sense, these capabilities show the generalization skills of the model. Moreover, the proposed approach is more practical than collecting new samples, since the samples used in those tests either do not require labels or can be generated at scale.

A Model’s Capabilities

The authors introduce a list of ten capabilities (which they say is not exhaustive) which are broadly defined and better understood when presented with the tests that measure them. 

As an example, the “temporal” capability, which can be broadly understood as a model’s understanding of time, is measured with samples where understanding temporal cues is crucial for predicting the label accurately, for instance by testing the model’s response to language like “is” versus “used to be”. To give another example, the “robustness” capability can be measured by checking whether the model is robust to typos, the addition of dummy text, or paraphrases. By testing these capabilities, one can measure to some extent whether a model has some of the linguistic skills that humans possess, and thus check (indirectly, at least) how well the model’s performance will generalize to other data.

Measuring the Capabilities

The authors introduce three testing types for measuring any given capability. The simplest type is the Minimum Functionality test (MFT), which is “inspired by unit tests in software engineering”; the MFT just uses simple templates to create testing samples with corresponding labels which can be compared with the model’s prediction. For example, an MFT used for evaluating a model’s Negation capability checks whether the model can accurately classify sentences generated with the template “I {NEGATION} {POS_VERB} the {THING}.” – such as “I can’t say I recommend the food” – as negative [2].
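
To make the MFT idea concrete, here is a minimal sketch in plain Python. It does not use the authors’ tooling: the fill-in word lists, the `toy_model` classifier, and the helper names `expand` and `failure_rate` are all assumptions made up for illustration. It expands the negation template quoted above and reports the fraction of generated sentences that a model fails to label as negative.

```python
from itertools import product

# Hypothetical fill-in values for the template quoted in the text.
TEMPLATE = "I {negation} {pos_verb} the {thing}."
FILLERS = {
    "negation": ["can't say I", "don't", "didn't"],
    "pos_verb": ["recommend", "like", "love"],
    "thing": ["food", "flight", "service"],
}

def expand(template, fillers):
    """Return every sentence obtained by filling the template slots."""
    keys = list(fillers)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in product(*(fillers[k] for k in keys))
    ]

def toy_model(text):
    """Stand-in classifier (an assumption): keyword matching that ignores negation."""
    positive_words = ("recommend", "like", "love")
    return "pos" if any(w in text for w in positive_words) else "neg"

def failure_rate(model, samples, expected_label):
    """Fraction of samples whose prediction differs from the expected label."""
    failures = [s for s in samples if model(s) != expected_label]
    return len(failures) / len(samples)

samples = expand(TEMPLATE, FILLERS)
rate = failure_rate(toy_model, samples, expected_label="neg")
print(f"{len(samples)} samples, failure rate {rate:.0%}")
```

Because the toy classifier only looks for positive keywords, it labels every generated sentence as positive and fails the whole test, which is the kind of blind spot an MFT is meant to expose.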

The second type of test for capabilities is the Invariance Test (INV). Instead of checking whether a sample input is correctly classified (which requires a label), the model’s prediction for the input is compared with its prediction for a slightly changed, or “perturbed”, version of the same input. Because the perturbation is done in such a way that the semantic meaning of the input remains unchanged, the two predictions are expected to be equal. In other words, we expect the model to be invariant to these types of changes. An example INV test for the Robustness capability changes one letter of a sentence to a nearby letter on the keyboard, simulating a typo, and expects the predicted sentiment for the original and perturbed sentences to be equal.
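
The typo-based INV test can be sketched along the following lines. This is an illustrative approximation, not the paper’s implementation: the partial keyboard-neighbour map, the `toy_model` classifier, and the sample texts are assumptions. Note that no labels are needed, only a comparison of predictions before and after the perturbation.

```python
import random

# Partial, hypothetical map of QWERTY neighbours; a real test would cover the full keyboard.
NEIGHBOURS = {
    "a": "qsz", "e": "wrd", "i": "uok", "o": "ipl",
    "r": "etf", "s": "adw", "t": "ryg", "n": "bm",
}

def add_typo(text, rng):
    """Replace one letter that has known neighbours with an adjacent key."""
    positions = [i for i, ch in enumerate(text) if ch in NEIGHBOURS]
    if not positions:
        return text  # nothing we know how to perturb
    i = rng.choice(positions)
    return text[:i] + rng.choice(NEIGHBOURS[text[i]]) + text[i + 1:]

def toy_model(text):
    """Stand-in classifier (an assumption) that relies on an exact keyword match."""
    return "pos" if "great" in text.lower() else "neg"

def invariance_failure_rate(model, texts, perturb, rng):
    """Fraction of inputs whose prediction changes after perturbation."""
    changed = sum(model(t) != model(perturb(t, rng)) for t in texts)
    return changed / len(texts)

rng = random.Random(0)  # fixed seed so runs are repeatable
texts = ["Great flight, thanks!", "The crew was great.", "Lost my bag again."]
print(invariance_failure_rate(toy_model, texts, add_typo, rng))
```

A model that depends on exact keyword matches will flip its prediction whenever the typo lands inside the keyword, so its failure rate on this test is typically above zero.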

The third type of capability test is a Directional Expectation Test (DIR). This is similar to the INV test, except that the perturbations are done such that the model’s prediction of the perturbed query is expected to change in a certain direction. For instance, one DIR test for sentiment analysis is to add a positive sentence at the end of an input sample (like “You are extraordinary.”); the expectation is that the prediction should be not less positive than the prediction for the original text without the added sentence. 
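
A DIR check can be sketched as below. The scoring function `toy_positive_score`, the word lists, and the helper `dir_test` are hypothetical stand-ins; the 0.1 tolerance mirrors the “fails if sent. goes down by > 0.1” rule listed in Table 2, and the example sentences are adapted from that table.

```python
POSITIVE_WORDS = {"great", "extraordinary", "love"}
NEGATIVE_WORDS = {"sucks", "lame", "awful"}

def toy_positive_score(text):
    """Stand-in model (an assumption): positive-sentiment score in [0, 1]."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    return (1 + pos) / (2 + pos + neg)  # smoothed share of positive cue words

def dir_test(score, text, added, direction, tolerance=0.1):
    """Pass unless the score moves against `direction` by more than `tolerance`."""
    before = score(text)
    after = score(f"{text} {added}")
    delta = after - before
    if direction == "up":      # positive phrase added: score must not drop much
        return delta >= -tolerance
    return delta <= tolerance  # negative phrase added: score must not rise much

print(dir_test(toy_positive_score, "Great trip on 2672 yesterday.",
               "You are extraordinary.", "up"))
print(dir_test(toy_positive_score, "Your service sucks.",
               "You are lame.", "down"))
```

The test only constrains the direction of the change, so it needs no gold label for either the original or the perturbed text.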

How does a test change for different NLP tasks?

To measure a specific capability in different NLP tasks, applied tests are adjusted to each particular task. For example, MFT tests for measuring the temporal capability are different for Sentiment Analysis, Quora Question Pair (determining if two questions are the same), and Machine Comprehension tasks (Table 1). 

Table 1: Temporal-capability MFT tests applied to different NLP tasks. “Task” denotes the NLP task, “Description” describes the MFT, “Example” gives an example input, and “Expected Output” denotes the output expected from the model under test. Table data from [2].

| Task | Description | Example | Expected Output |
|---|---|---|---|
| Sentiment | Sentiment change over time, present should prevail | I used to hate this airline, although now I like it. | Pos |
| QQP | Is ≠ used to be | Is Jordan Perry an advisor? / Did Jordan Perry use to be an advisor? | |
| MC | Understanding before/after, last/first | C: Logan became a farmer before Danielle did. Q: Who became a farmer last? | |

A test type may not be applicable to all NLP tasks. For example, with respect to the Vocabulary capability, DIR tests are applicable to sentiment analysis models but not to QQP and MC tasks (Table 2).

Table 2: Vocabulary-capability DIR tests applied to different NLP tasks. Table data from [2].

| Task | Description | Example | Expected Output |
|---|---|---|---|
| Sentiment | Add positive phrases, fails if sent. goes down by > 0.1 | @SouthwestAir Great trip on 2672 yesterday… You are extraordinary. | |
| Sentiment | Add negative phrases, fails if sent. goes up by > 0.1 | @USAirways your service sucks. You are lame. | |
| QQP | Not available. | | |
| MC | Not available. | | |

CheckList Applied to Popular Models

In the sentiment analysis task, the authors compare three commercially available models (from Microsoft, Google, and Amazon) and two publicly available models (BERT and RoBERTa) in terms of their failure rates on different tests.

In the paper, they list the results of the 17 tests that were applied to the models. In only two of the 17 tests did the commercial models reasonably outperform the non-commercial models, and both of those tests involved the concept of “neutral” sentiment, which the non-commercial models were not trained to predict (they only have two categories: positive and negative). Impressively, on 14 of the 17 tests either BERT or RoBERTa (the two non-commercial “research” models tested) achieved the lowest failure rate. This could suggest that the routine quality checks applied to the commercial models do not cover the types of tests introduced with CHECKLIST, and that customizations introduced during routine maintenance may somehow degrade performance on these tests. Of course, an ideally fair comparison would require the underlying architectures of the commercial models to be described.

One interesting result is that the commercial models performed particularly poorly in Sentiment Analysis when presented with tests for the “Negation” capability. For example, a negation at the end of a negative sentence should result in a positive sentiment, such as “I thought I would dislike this product, but I didn’t”. The commercial models fail on these types of samples 90-100% of the time, while the research models tend to perform better – one does just slightly better, but the other significantly so. 

For QQP, only BERT and RoBERTa are evaluated. One of the most problematic areas for RoBERTa was the Semantic Role Labeling capability. It had a 100% failure rate on question pairs created by an MFT that uses “symmetric relations” but changes the order, such as “Are B.B. King and Albert King relatives?” and “Are Albert King and B.B. King relatives?”. One test where BERT had a 100% failure rate was the MFT targeting the temporal capability, which creates non-duplicate question pairs like “What was life on earth like before the discovery of fire?” and “What was life on earth like after the discovery of fire?”. With similar tests, the authors show how state-of-the-art models can stumble on simple rules of language.
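
The two kinds of QQP test pairs described above can be generated from templates, as in the following sketch. The helper names, the second event phrase, and the bag-of-words `toy_duplicate_model` are assumptions for illustration and do not reproduce the behaviour of BERT or RoBERTa; in particular, this toy model happens to pass the symmetric-relation test and fail the before/after test.

```python
from itertools import combinations

def symmetric_pairs(names, relation="relatives"):
    """Pairs that only swap argument order; expected label: duplicate (True)."""
    return [
        (f"Are {a} and {b} {relation}?", f"Are {b} and {a} {relation}?", True)
        for a, b in combinations(names, 2)
    ]

def before_after_pairs(events):
    """Pairs that differ in 'before' vs 'after'; expected label: not duplicate (False)."""
    return [
        (f"What was life on earth like before {e}?",
         f"What was life on earth like after {e}?", False)
        for e in events
    ]

def toy_duplicate_model(q1, q2, threshold=0.7):
    """Stand-in model (an assumption): duplicate if word-set overlap is high."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    return len(w1 & w2) / len(w1 | w2) >= threshold

def failure_rate(model, cases):
    """Fraction of (q1, q2, label) cases where the prediction differs from the label."""
    return sum(model(q1, q2) != label for q1, q2, label in cases) / len(cases)

symmetric = symmetric_pairs(["B.B. King", "Albert King"])
temporal = before_after_pairs(["the discovery of fire", "the invention of writing"])
print("symmetric relations:", failure_rate(toy_duplicate_model, symmetric))
print("before vs after:", failure_rate(toy_duplicate_model, temporal))
```

Since each generated pair carries its expected label by construction, a failure rate can be computed without any manual annotation.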

User Studies

The authors devote a section of the paper to two user studies they performed with CHECKLIST to show its usefulness in the testing process. First, they collaborated with the team at Microsoft responsible for the commercial Sentiment Analysis model to devise many new tests, which helped the team uncover new bugs and areas of poor performance. Second, the authors organized a study in which participants were tasked with creating as many test samples as possible in two hours for evaluating a model on the QQP task. One group received no further instructions, a second group received short descriptions of the listed capabilities and test types, and a third group was given access to the authors’ tools for creating tests at scale with templates. The results show the effectiveness of the CHECKLIST tests in helping with the debugging process, and the power of the tools to create test samples easily and efficiently.


The authors propose an alternative way of evaluating certain aspects of a model’s performance which they call capabilities. Checking those capabilities is, in a way, a method for applying sanity checks to the model, and should correspond well to the model’s performance on real-world data. To test those capabilities they introduce convenient ways of generating test samples by using templates and by perturbing existing unlabeled data.

Evaluating both the commercial models and the non-commercial state-of-the-art models in terms of their capabilities showed some surprising and interesting results, and exposed how high-performing and well-maintained commercial models can still fail significantly on tests that would be simple for humans.

By showing both how CHECKLIST can identify shortcomings of even the best performing models and how users can use it to identify bugs in a more efficient manner than manually creating datasets, the authors give a complete picture of the usefulness of their method.



  1. Recht, B., et al., “Do ImageNet Classifiers Generalize to ImageNet?”, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, Editors. 2019, PMLR: Proceedings of Machine Learning Research. p. 5389–5400.
  2. Ribeiro, M. T., Wu, T., Guestrin, C., and Singh, S., “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList”, Association for Computational Linguistics (ACL), 2020; arXiv:2005.04118.


Reference implementation provided in the paper



© 2020 Y Meadows, Inc.