Understanding Semantic Search — (Part 5: Ranking Metrics for Evaluating Question Answering Systems and Recommendation Systems)

Kaushik Shakkari
7 min read · Mar 15, 2023


Picture from the author: A gorgeous photograph near serene Green Park Lake in Seattle (November 2022)

A previous article in this series explained the retriever and reader architecture used for extractive question-answering modeling. In short, for a given user query, the retriever model predicts the best passages, and the reader model takes the retriever's output to predict relevant answers.

The last article explained different answer quality metrics for reader models. Those metrics examine whether the answers predicted by the model are accurate compared to the labels provided by human annotators. In general, annotators provide multiple gold standard annotations/labels for a question, and the model predicts multiple answers. However, displaying answers in order of relevance is also essential, as it affects the user experience: users typically check only the top answers and expect them to be ordered by relevance for a given question, since they may not have the patience to read all of them. Therefore, it is essential to evaluate the ranking quality of the predicted answers to assess model performance accurately.

In this article, I will review different ranking metrics for evaluating the order of relevance of the answers for the question-answer pairs below (the answers are ordered by relevance). The previous article used Exact Match (EM), F1-Score, or SAS to compare a predicted answer against the annotator labels. In this article, I use EM while computing the ranking metrics.

Table from the author: Consider the above examples to explain different metrics

1. Top-N Accuracy:

Top-N accuracy based on EM takes the first N predictions from the model (sorted in descending order of answer confidence) and checks whether any of them matches one of the gold standard annotations/labels under EM.

For the user query — What character did Robert Downey Jr. play?:
- The first prediction of my model is Elon Musk. It does not match any of the provided golden labels, so the Top1 accuracy is 0.
- The second prediction is Tony. It does not match any of the provided golden labels either, so the Top2 accuracy is also 0.
- The third prediction is Stark. It matches one of the labels provided by the annotators, so the Top3 accuracy is 1.

The table below shows whether each of the model's predicted answers is correct or incorrect with respect to any of the labels provided by the annotator, along with the associated Top1, Top2, and Top3 accuracy.

Table by the author: Calculated Top N for three examples introduced previously
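
To make the calculation concrete, below is a minimal Python sketch of Top-N accuracy with exact match. The function name top_n_accuracy is my own, and the golden labels are hypothetical except that Stark is one of them, as stated above.

```python
def top_n_accuracy(predictions, golden_labels, n):
    """Return 1 if any of the first n predictions exactly matches a golden label, else 0."""
    return int(any(pred in golden_labels for pred in predictions[:n]))

# Example from the article: "What character did Robert Downey Jr. play?"
predictions = ["Elon Musk", "Tony", "Stark"]         # model answers, sorted by confidence
golden_labels = ["Tony Stark", "Iron Man", "Stark"]  # hypothetical labels; only "Stark" is confirmed above
print([top_n_accuracy(predictions, golden_labels, n) for n in (1, 2, 3)])  # [0, 0, 1]
```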

The limitations of Top N accuracy:

  1. The ranking across multiple answers is ignored. In the second example, the golden labels are Brooklyn, New York City, and New York. The model predicted the answers correctly but ranked them in a different order. Even though the answers were ranked incorrectly, Top1, Top2, and Top3 are all 1.
  2. Top-N metrics can be biased towards the single most relevant answer at the expense of other relevant answers. If Top n is correct, Top n+k is always correct as well, for any n > 0 and k > 0. The metric therefore cannot evaluate multiple relevant answers and can hide a lack of diversity in the answers.

2. Precision over k:

Precision over k measures the ratio of correctly identified answers among the first k answers predicted by the model.

For the user query — What character did Robert Downey Jr. play?:
- The top two answers predicted by the model are Elon Musk and Tony. Both answers are incorrect, so the precision over 2 is 0.
For the user query — Where is Captain America from?:
- The top two answers predicted by the model are Brooklyn and New York. Only one of them counts as a correct match, so the precision over 2 is 0.5.
For the user query — What are Thor’s three favorite weapons?:
- The top two answers predicted by the model are Infinity Gauntlet and Mjolnir. Only Mjolnir is relevant, so the precision over 2 is 0.5.

Table by the author: Calculated Precision over k for three examples introduced previously
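
As a quick sanity check, here is a small sketch of precision over k under exact match; the function name precision_at_k is my own.

```python
def precision_at_k(predictions, golden_labels, k):
    """Fraction of the first k predicted answers that exactly match a golden label."""
    top_k = predictions[:k]
    return sum(1 for pred in top_k if pred in golden_labels) / k

# Example from the article: "What are Thor's three favorite weapons?"
predictions = ["Infinity Gauntlet", "Mjolnir", "Stormbreaker"]
golden_labels = ["Mjolnir", "Stormbreaker", "Jarnbjorn"]
print(precision_at_k(predictions, golden_labels, 2))  # 0.5 — only Mjolnir is relevant
```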

However, the limitations of precision over k remain the same as those of Top N.

3. Mean Reciprocal Rank (MRR):

MRR looks at where the model's top predicted answer ranks within the labels provided by the annotator. In doing so, MRR addresses a limitation of Top-N accuracy and precision over k: the ranking across multiple answers is no longer completely ignored.

For the user query — What character did Robert Downey Jr. play?:
- Elon Musk is the first answer predicted by the model. However, none of the labels provided by my annotator matches this top answer. Hence the rank is infinity, and the reciprocal rank is 1/inf = 0.
In contrast, for the user query — Where is Captain America from?:
- The top answer predicted by the model is Brooklyn, the first relevant answer ranked by the annotator. Hence the rank is 1, and the reciprocal rank is 1/1 = 1.

Table by the author: Calculated MRR for three examples introduced previously
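
Below is a sketch that follows the formulation used in this article: take the model's top answer, find its rank within the annotator's ordered labels using exact match, and average the reciprocal ranks over queries. The function names are mine, and the golden labels for the first query are hypothetical.

```python
def reciprocal_rank(top_prediction, ordered_golden_labels):
    """Reciprocal of the position of the model's top answer among the ordered labels; 0 if it matches none."""
    for rank, label in enumerate(ordered_golden_labels, start=1):
        if top_prediction == label:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(examples):
    """Mean of the reciprocal ranks over (top_prediction, ordered_golden_labels) pairs."""
    return sum(reciprocal_rank(p, labels) for p, labels in examples) / len(examples)

print(reciprocal_rank("Elon Musk", ["Tony Stark", "Iron Man", "Stark"]))       # 0.0
print(reciprocal_rank("Brooklyn", ["Brooklyn", "New York City", "New York"]))  # 1.0
```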

The limitations of Mean Reciprocal Rank (MRR):

The MRR metric only looks at how well the single top answer from the list of predictions is doing and does not consider the rest of the predicted answers. Hence, the ranking across multiple answers is still not completely captured.

4. Mean Average Precision (MAP):

Mean Average Precision (MAP) is calculated by first computing, for each query, the average of the “precision over k” values at the ranks where the predicted answer is correct (the Average Precision), and then taking the mean of these values across queries.

For the user query — What character did Robert Downey Jr. play?:
- For k=1, precision over k is 0, as Elon Musk is an incorrect answer
- For k=2, precision over k is 0, as both Elon Musk and Tony are incorrect answers
- For k=3, precision over k is 1/3, as only the third answer, Stark, is correct
The Average Precision for this query averages the precision over k at the ranks of the correct answers (here, only k=3), giving 0.33.

Table by the author: Calculated MAP for three examples introduced previously
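
Here is a minimal sketch of this computation, assuming the standard convention of averaging precision over k only at the ranks where a prediction is correct (which reproduces the 0.33 above); the function names and the exact set of golden labels are my own illustration.

```python
def average_precision(predictions, golden_labels):
    """Average of precision over k taken at the ranks where a prediction matches a golden label."""
    precisions, num_correct = [], 0
    for k, pred in enumerate(predictions, start=1):
        if pred in golden_labels:
            num_correct += 1
            precisions.append(num_correct / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(examples):
    """Mean of the per-query average precision values."""
    return sum(average_precision(p, labels) for p, labels in examples) / len(examples)

print(round(average_precision(["Elon Musk", "Tony", "Stark"],
                              ["Tony Stark", "Iron Man", "Stark"]), 2))  # 0.33
```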

The limitations of Mean Average Precision:

Even though MAP is a better ranking metric than the previous ones, it treats relevance as binary: it cannot capture that some correct answers are more relevant than others.

5. Normalized Discounted Cumulative Gain (NDCG):

NDCG is calculated by assigning a relevance score to each answer in the set and then computing a discounted cumulative gain (DCG) over the set. The DCG is then normalized to get NDCG. The higher the NDCG score, the better the model has ranked the answers.

Let’s understand how the NDCG score is calculated for the example below.

For the user query — What are Thor’s three favorite weapons?:
Ranked golden labels or annotations: Mjolnir, Stormbreaker, Jarnbjorn
Consider Mjolnir the favorite weapon, Stormbreaker the second favorite, and Jarnbjorn the third favorite.
Answers predicted by the model: Infinity Gauntlet, Mjolnir, Stormbreaker

Cumulative Gain: the sum of the relevance scores of the answers (or recommendations) in the predicted set.
Based on the labels, the relevance scores for model predictions are the following:
Relevance Score (Infinity Gauntlet): 0 (non-relevant)
Relevance Score (Mjolnir): 3 (highly relevant)
Relevance Score (Stormbreaker): 2 (relevant)
Cumulative Gain = 0 + 3 + 2 = 5
However, the cumulative gain ignores the order of the answers, so it cannot tell how well they are ranked.

Discounted Cumulative Gain (DCG): the summation of relevance(i) / log2(i + 1) over the predicted answers, where i is the position of the answer and the logarithm is in base 2.
DCG = (0 / log2(1+1)) + (3 / log2(2+1)) + (2 / log2(3+1)) = 0 + 1.89 + 1 = 2.89
However, DCG on its own is hard to interpret.

Normalized Discounted Cumulative Gain (NDCG): DCG / DCG(i)
DCG(i) is the DCG score of the ideal ranking.
The ideal ranking is given by the golden labels provided by the annotator. Hence, in the above example, DCG(i) is the DCG score for Mjolnir, Stormbreaker, and Jarnbjorn.
Let's first calculate the cumulative gain for the ideal ranking:
Relevance Score (Mjolnir): 3
Relevance Score (Stormbreaker): 2
Relevance Score (Jarnbjorn): 1
Cumulative Gain(i) = 3 + 2 + 1 = 6
DCG(i) = (3 / log2(1+1)) + (2 / log2(2+1)) + (1 / log2(3+1)) = 3 + 1.26 + 0.5 = 4.76
NDCG = DCG / DCG(i) = 2.89 / 4.76 = 0.61

NDCG is always between 0 and 1.

From the above example, we can see that NDCG accounts for some answers being more relevant than others: it rewards rankings that place the most relevant answers first, then the less relevant ones, and finally the least relevant answers.
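
The whole calculation fits in a few lines of Python; here is a sketch that reproduces the 0.61 above, using the relevance scores assigned in the example (the function names are mine).

```python
import math

def dcg(relevance_scores):
    """Discounted cumulative gain: sum of relevance(i) / log2(i + 1), with positions i starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance_scores, start=1))

def ndcg(predicted_relevance, ideal_relevance):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking; 0 if the ideal DCG is 0."""
    ideal = dcg(sorted(ideal_relevance, reverse=True))
    return dcg(predicted_relevance) / ideal if ideal > 0 else 0.0

predicted_relevance = [0, 3, 2]  # Infinity Gauntlet, Mjolnir, Stormbreaker
ideal_relevance = [3, 2, 1]      # Mjolnir, Stormbreaker, Jarnbjorn
print(round(ndcg(predicted_relevance, ideal_relevance), 2))  # 0.61
```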

The limitations of Normalized Discounted Cumulative Gain:

  1. If there are no labels for a question, meaning there is no relevant answer, then the ideal DCG score, DCG(i), is 0 and NDCG is undefined (a division by zero). In this case, we set the NDCG score to 0.
  2. Having annotators assign a relevance score to every label can be a tedious task.

Conclusion:
In conclusion, this article covered different ranking metrics for evaluating the order of relevance of the answers produced by question-answering systems: Top-N accuracy, precision over k, mean reciprocal rank, mean average precision, and normalized discounted cumulative gain. Each metric has its limitations, and choosing the right metric for the application is essential; when graded relevance labels are available, Normalized Discounted Cumulative Gain (NDCG) is generally a strong choice.

Stay tuned for more articles in the Understanding Semantic Search Series! (Learn more about other articles in the series here)

Add me on LinkedIn. Thank you!
