BERT: Understanding the Null response threshold in QA systems

Iván Corrales Solera
Feb 2, 2022

Building a robust Question Answering (QA) system implies that the system provides an answer only when one can be found in the context. That means that when we build a QA system we must consider two types of questions:

  • Questions with one (or more) valid answers (Positive questions)
  • Questions with no valid answer (Negative questions)

At times it’s better to say nothing than to say nonsense.

Example: Positive & Negative questions

Let’s have a look at the following text

“Maria Salomea Skłodowska, also known as Marie Curie, was the first woman to win a Nobel Prize, the first person and the only woman to win the Nobel Prize twice, and the only person to win the Nobel Prize in two scientific fields”

A) Positive question

  • Question: Who was the first person to win the Nobel Prize twice?
  • Answers: There are two valid answers to the above question: “Maria Salomea Skłodowska” and “Marie Curie”.

B) Negative question

  • Question: Who was the first man to win the Nobel Prize in two scientific fields?
  • Answer: We don’t have enough information to answer the above question.

Make predictions

BERT-based models for QA tasks don’t predict the answer to a question directly; instead, they provide a score for each possible answer.

Our challenge is to decide the prediction (or predictions) for a given question and context. I say a challenge because not every question can be answered.


First of all, we need to tokenize the question (and the context), as this will be the input data for our model. In the following example we will use the tokenizer of deepset/roberta-base-squad2, but feel free to pick the one you like most. Have a look at the Hugging Face model hub to check the available models.

The tokenizer function returns a dictionary. By default, this dictionary contains two keys: attention_mask and input_ids. We could request other information, but that’s out of the scope of this article.

Look closely at the tokenized input. Question and context are separated by the special token </s> (RoBERTa places two of them between the pair). We need to identify which part is the question and which part is the context, because the answer must not be taken from the question itself. We can use the method `sequence_ids` for this: tokens that belong to the question are represented with 0, tokens that belong to the context with 1, and the special tokens (<s> and </s>) with None.
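A minimal sketch of this step, assuming the Hugging Face transformers library and a fast tokenizer. The helper names `tokenize_pair` and `context_token_positions` are illustrative and not part of the library, and the import sits inside the function so that the pure-Python helper can be tried without downloading anything.

```python
def tokenize_pair(question, context, model_name="deepset/roberta-base-squad2"):
    """Tokenize a (question, context) pair; downloads the tokenizer on first use."""
    # Imported lazily so the helper below works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Passing two texts makes the tokenizer build one question/context sequence.
    inputs = tokenizer(question, context, return_tensors="pt")
    return tokenizer, inputs


def context_token_positions(sequence_ids):
    """Return the positions whose sequence id is 1, i.e. the context tokens."""
    return [i for i, seq_id in enumerate(sequence_ids) if seq_id == 1]


# Hand-written ids in the shape sequence_ids() returns for a RoBERTa pair:
# <s> question (2 tokens) </s> </s> context (3 tokens) </s>
example_ids = [None, 0, 0, None, None, 1, 1, 1, None]
print(context_token_positions(example_ids))
```

With real inputs, the call would look like `tokenizer, inputs = tokenize_pair(question, context)` followed by `context_token_positions(inputs.sequence_ids())`.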

Execute the model

We can choose among the available pre-trained models. This time it is important that we take a model that has been trained for Question Answering. In fact, the model that we’re using was trained on the SQuAD v2 dataset.

To execute the model we only need to pass a couple of tensors, input_ids and attention_mask (more parameters can be passed, but they’re not required).

The returned value from the model is a dictionary that contains two tensors: start_logits and end_logits.

These two tensors contain float values, both positive and negative. They are raw scores (logits) rather than true probabilities: the higher the value at a position, the more likely it is that the token at that position is the first (start_logits) or the last (end_logits) token of the answer.
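The sketch below shows one way this step could look, assuming PyTorch and the Hugging Face transformers library. The names `run_qa_model` and `softmax` are illustrative. The softmax helper is an addition for illustration only: it shows how raw scores could be turned into probabilities over positions, which the article itself does not do.

```python
import math


def run_qa_model(inputs, model_name="deepset/roberta-base-squad2"):
    """Return (start_logits, end_logits) as plain lists for one tokenized pair."""
    # Lazy imports: torch and transformers are only needed when this is called.
    import torch
    from transformers import AutoModelForQuestionAnswering

    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    model.eval()
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Batch size is 1, so take row 0 of each tensor.
    return outputs.start_logits[0].tolist(), outputs.end_logits[0].tolist()


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1 over all positions."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits: the raw values may be negative, the softmax output never is.
toy_start_logits = [-2.0, 0.5, 3.0]
print(softmax(toy_start_logits))
```

Here `inputs` is the tokenizer output for one question and context, requested as PyTorch tensors.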

Make the prediction

It’s time to make the predictions, and we already have all the information we need. Let’s list the rules that a prediction must follow:

  • All the tokens in the answer must be extracted from the context, never from the question. They must be represented with the value 1 in the sequence_ids.
  • The answer cannot end before it starts: the index taken from end_logits must be greater than or equal to the index taken from start_logits (they are equal for a single-token answer).
  • The predicted answer should be the one with the highest score. The value of start_logits[start_index] + end_logits[end_index] must be the maximum possible.
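The three rules above can be sketched in plain Python as follows. The function name `rank_answer_spans` and the toy logits are made up for illustration, and allowing a single-token answer (start equal to end) is an assumption of this sketch.

```python
def rank_answer_spans(start_logits, end_logits, sequence_ids):
    """List every valid (start, end) span, best score first."""
    # Rule 1: only positions that belong to the context (sequence id 1).
    context = [i for i, seq_id in enumerate(sequence_ids) if seq_id == 1]
    spans = []
    for start in context:
        for end in context:
            if end < start:  # Rule 2: the answer cannot end before it starts
                continue
            score = start_logits[start] + end_logits[end]
            spans.append({"start": start, "end": end, "score": score})
    # Rule 3: the highest combined score comes first.
    return sorted(spans, key=lambda span: span["score"], reverse=True)


# Toy input: positions 1-2 are the question, 5-7 the context.
sequence_ids = [None, 0, 0, None, None, 1, 1, 1, None]
start_logits = [0.1, 9.0, 0.2, 0.0, 0.0, 4.0, 1.0, -1.0, 0.0]
end_logits = [0.1, 0.3, 9.0, 0.0, 0.0, 0.5, 3.0, 2.0, 0.0]

predictions = rank_answer_spans(start_logits, end_logits, sequence_ids)
print(predictions[0])  # {'start': 5, 'end': 6, 'score': 7.0}
```

Note that the large logits at the question positions are ignored. With a real tokenizer, the text of a span could be recovered with `tokenizer.decode(input_ids[start:end + 1])`, where `input_ids` is the row of token ids for this example.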

The variable predictions is a list that contains all the possible combinations that satisfy the above rules. The predictions are sorted by score, so the prediction in position 0 is the one with the highest score. However, the model may find more than one valid answer. Let’s see the 5 predictions with the highest scores.

Apparently, all the above answers could be correct… well, we could discard the fifth in the ranking. Answers #2 and #3 could be considered the same. In fact, we will normalize them when we compute the metrics.

We can wrap these steps in a Python function to simplify the following examples.

So it looks great: we built a system able to make good predictions. But what about negative questions?

Our system answers any question, even when the question can’t be answered. You may have noticed that the score, in this case, is negative, so we could use that value as a threshold (the null response threshold). However, sometimes our model returns positive scores for questions that have no possible answer, for example, the question below…

To sum up, we know that we need to establish a threshold. If we don’t find any prediction with a score higher than the null response threshold, our prediction is that the question can’t be answered.

We add a new parameter to the function prediction. This value is the threshold, and it’s used to identify when an answer is not valid.
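A minimal sketch of such a function, working on candidates that have already been scored. The candidate format (dictionaries with `text` and `score`), the default threshold of 0.0, and the example scores are all assumptions made for illustration, not values produced by the model.

```python
def prediction(candidates, threshold=0.0):
    """Return the best answer text, or None when no score beats the threshold."""
    above = [c for c in candidates if c["score"] > threshold]
    if not above:
        return None  # null response: better to say nothing
    return max(above, key=lambda c: c["score"])["text"]


# Invented scores for a positive and a negative question.
positive = [
    {"text": "Marie Curie", "score": 12.4},
    {"text": "Maria Salomea Skłodowska", "score": 11.9},
]
negative = [{"text": "Marie Curie", "score": 1.3}]

print(prediction(positive, threshold=2.0))
print(prediction(negative, threshold=2.0))
```

The second call returns None because its only candidate scores below the chosen threshold, even though that score is positive.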

I will publish a new article in which I will explain how to obtain the optimal threshold for the null responses.

The Notebook with the code can be found here