Understanding Semantic Search — (Part 8: Unlock the Power of Extractive Question Answering Modelling — Learn How to Tune Language Models for Semantic Search)
Why, when, and how to train language models for extractive question-answering modeling?
TABLE OF CONTENTS:
- Introduction (summary of retriever and reader architecture)
- Different ways to train NLP models
- Annotations and Best Practices
- Training Retriever Models
- Training Reader Models
- Conclusion
Introduction:
In part 0 of the series, I introduced different forms of semantic search applications, including extractive question-answering (EQA) modeling. The goal of EQA is to automatically extract the most relevant answer to a question from a given corpus of text.
In part 1, I briefly discussed how language models like BERT can be used for question-answering or reading comprehension tasks. In part 2, I introduced the retriever and reader architecture and models for building scalable EQA-based semantic search applications.
Summary of retriever and reader architecture for EQA:
- Documents and webpages are broken down into chunks of text.
- The retriever’s job is to predict the top chunks of text or passages relevant to the question.
- The reader’s job is to provide the exact granular answer by predicting the start and end indexes in the relevant chunks of text or passage.
The pre-trained models for the retriever and reader are available and ready to use on platforms like HuggingFace. Even though we can often find task-specific (question-answering), domain-specific (legal), and language-specific (Spanish) models like “roberta-base-spanish-sqac,” they won’t always give high accuracy on the specific dataset a user has. Hence, base pre-trained models are fine-tuned on that exact dataset for higher accuracy.
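For instance, here is a minimal sketch (not taken from the article) of loading a ready-to-use extractive QA model from the HuggingFace Hub with the transformers library; the checkpoint name “deepset/roberta-base-squad2” and the example text are illustrative and can be swapped for any task-, domain-, or language-specific model.

```python
from transformers import pipeline

# Load a publicly available extractive QA checkpoint (illustrative choice).
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Ask a question against a short context; the pipeline returns the answer span.
result = qa(
    question="How many parameters does the BERT large model have?",
    context="BERT large has 340M parameters, 24 layers, and 16 attention heads.",
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': '340M'}
```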
Different ways to train NLP models:
- Supervised training is the traditional approach to language modeling, where labeled data is used to train a model on a given task. This approach requires a large amount of labeled data, which can be expensive and time-consuming.
- Unsupervised training uses unlabeled data, such as raw documents, to train a model. This approach can be more efficient because it requires no labeled data.
In part 6 of the series, I explained how to train dense retriever models without human annotations using an unsupervised training technique called generative pseudo-labeling.
- Weak supervision is a training technique that uses weakly labeled data. In weak supervision, the labels assigned to the data are only approximately correct. Domain experts often write rules to create these weak labels, which are then used to train a model (a minimal sketch follows this list). A model trained with weak supervision usually achieves higher accuracy than fully unsupervised methods but lower accuracy than supervised methods.
- Active learning methods reduce the amount of data humans must label and improve the model's accuracy by focusing on the most informative data points, which are iteratively selected and labeled. This is often done through exploration and exploitation, where the system interacts with annotators and uses their feedback to identify the most useful data points to label next.
Weak supervision and active learning can often be combined. This allows for quicker training times and more efficient use of data, and it can also reduce the human effort required to create a labeled dataset.
Learn more about Weak Supervision and Active Learning here.
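To make the idea of weak supervision concrete, here is a minimal, purely illustrative sketch of a keyword-overlap rule that assigns weak relevance labels to (question, passage) pairs; real systems use richer rules written by domain experts.

```python
# Illustrative weak-labeling rule: a passage is weakly labeled "relevant" (1)
# if it shares enough non-trivial keywords with the question, else 0.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "how", "what", "does"}

def keywords(text: str) -> set:
    # Lowercase, strip punctuation, and drop stopwords.
    return {w.lower().strip("?.,") for w in text.split()} - STOPWORDS

def weak_label(question: str, passage: str, min_overlap: int = 2) -> int:
    # Partially correct by design: keyword overlap only approximates relevance.
    return int(len(keywords(question) & keywords(passage)) >= min_overlap)

print(weak_label("How many parameters does BERT large have?",
                 "BERT large has 340M parameters and 24 layers."))  # -> 1
```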
Annotations and Best Practices:
Collecting labels for training can be challenging because it requires significant time, effort, and resources to do correctly. Labeling is often performed by multiple annotators with different levels of experience, and a lack of process and standardization can lead to confusion and errors. Moreover, data becomes obsolete over time. Hence, data may need to be collected iteratively and models retrained to maintain performance.
What data to collect?
We need question and relevant passage (chunks of text) pairs to train the retriever. We also need question and relevant answer pairs to train the reader.
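As a concrete (illustrative) example, a single SQuAD-style record can capture both needs at once: the (question, passage) pair for the retriever and the (question, answer span) pair for the reader. The field names below follow the common SQuAD convention and should be adapted to whatever your annotation tool exports.

```python
# One illustrative annotation record in SQuAD-style format.
annotation = {
    "question": "How many parameters does the BERT large model have?",
    "context": "BERT large has 340M parameters, 24 layers, and 16 attention heads.",
    "answers": {
        "text": ["340M"],       # exact answer span as it appears in the context
        "answer_start": [15],   # character offset of the span inside the context
    },
    "document_id": "bert_overview",  # optional metadata: page number, file name, etc.
}
```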
How to annotate and collect (question, relevant answer, chunk of text containing the answer) triples for training?
Multiple tools like Label Studio, Haystack Annotations, Prodigy, Doccano, and Brat can help collect the above annotations. You can also create a custom annotation application using Streamlit.
Annotation process:
- Standard annotation:
In this process, we perform a basic version of active learning. For the static question-answering use case, we collect a set of questions that are a good representation of user questions. For the dynamic question-answering use case, we ask annotators to manually create questions that are a good representation of user questions.
(Learn more about Dynamic vs. Static Question Answering here.)
Once the questions are collected, annotators must manually find relevant answers and associated metadata like the context or passage, page number, document name or id, etc.
- Easy and fuzzy annotation:
Finding answers to questions that are a good representation of user questions can be tedious, as annotators must manually search and review the entire dataset to find relevant answers. If there is a resource crunch or annotators cannot complete the annotation process in the time available, we can create something called fuzzy annotations. Fuzzy annotations are easy to collect and often give good accuracies. In this process, annotators interact with the system, reviewing documents, highlighting an answer, and creating questions on the fly. The system captures the created question and associated metadata.
- Semi-automated annotation:
This is another hacky way to get annotations. The idea is to host a service that uses the most relevant pre-trained models to predict answers to questions. Annotators use this service instead of searching for answers entirely by hand. This process takes much less time than the standard approach. However, the annotations might be biased because annotators rely on model predictions.
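A minimal sketch of this semi-automated idea, assuming a pre-trained reader from the HuggingFace Hub (the model name and data are illustrative): the model proposes candidate answers, and annotators only review and correct them.

```python
from transformers import pipeline

# Pre-trained reader used only to suggest answers for annotators to review.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

questions = ["How many parameters does the BERT large model have?"]
passages = ["BERT large has 340M parameters, 24 layers, and 16 attention heads."]

suggestions = []
for q in questions:
    best = None
    for p in passages:
        pred = qa(question=q, context=p)   # predicted answer span and confidence
        if best is None or pred["score"] > best["score"]:
            best = {"question": q, "context": p,
                    "suggested_answer": pred["answer"], "score": pred["score"]}
    suggestions.append(best)

print(suggestions)  # annotators accept or correct these suggestions
```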
The annotation process must be designed carefully, and annotators need to be provided with guidelines. It is important to monitor the annotation process iteratively. Some of the best practices include the following:
Generic Process Best Practices:
- Provide training and support to annotators to ensure they understand the annotation process and techniques.
- Create a dummy annotation session: assign multiple annotators to annotate the same data, evaluate their agreement, and provide feedback to educate annotators and reduce the likelihood of errors.
- Iteratively collect annotations in feedback loops and monitor them regularly to ensure consistency, quality, and accuracy over time.
- Assign an annotation manager to oversee the process.
Guidelines or Best Practices for Collecting or Creating Questions:
- Avoid ambiguous questions.
- Avoid inferential or opinion-oriented questions.
- Collect factual questions.
- Ensure that the dataset is balanced by including a variety of questions with different levels of difficulty.
- Include questions for which no answer exists. (This helps train models to predict no answer.)
Guidelines or Best Practices for Selecting Answers:
- Avoid partially correct answers.
- Short answers are often clear and better than long answers.
- The answer needs to be a continuous span of text in a document. Avoid answers that are spread across different locations of a document or across documents.
Training Retriever Models:
In this article, I explained two types of retrievers with examples. Both sparse and dense retrievers represent text as vectors so that machines can work with it. Sparse retrievers look for keyword matches between user query vectors and passage vectors. In contrast, dense retrievers focus on semantic meaning and look for contextual similarity between the user query and passage vectors.
Why train retrievers?
It is crucial to train retrievers to improve search performance. If the retriever does a bad job in the first place, there is no way the reader can predict a relevant answer, because the reader depends on the retriever's predictions.
Sparse Retrievers:
Sparse retrievers like TF-IDF and BM25 are static algorithms and are not considered trainable. They are shallow, rule-based models whose weights remain constant and are derived from word frequencies.
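For illustration, here is a minimal sketch of keyword-based sparse retrieval using the rank_bm25 package (the corpus and query are made up); note that there are no learned weights to update.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "BERT large has 340M parameters and 24 layers.",
    "DPR uses two encoders, one for questions and one for passages.",
    "TF-IDF and BM25 score passages by keyword overlap.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)  # weights come from fixed frequency formulas

query = "how many parameters does bert have".split()
print(bm25.get_scores(query))              # one keyword-match score per passage
print(bm25.get_top_n(query, corpus, n=1))  # the best-matching passage
```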
Dense Retrievers:
Dense retrievers like Dense Passage Retriever (DPR) and SBert can be trained to understand the context of a query and use that information to accurately select the most relevant passages for a specific set of documents in a domain.
As discussed in my previous article, DPR uses two different models to generate embeddings (one for queries and one for passages). You can easily access pre-trained models from HuggingFace. For example, “facebook/dpr-question_encoder-single-nq-base” is a pre-trained query encoder model, and “facebook/dpr-ctx_encoder-single-nq-base” is a pre-trained passage or context encoder model. These pre-trained models can be fine-tuned on the data collected through the annotation process described above.
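As a hedged sketch (the example text is illustrative), this is how the two pre-trained DPR checkpoints named above can be loaded with the transformers library to score a (question, passage) pair by their dot product:

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Load the pre-trained query and passage (context) encoders.
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
p_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
p_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "How many parameters does the BERT large model have?"
passage = "BERT large has 340M parameters, 24 layers, and 16 attention heads."

with torch.no_grad():
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output  # (1, 768)
    p_emb = p_enc(**p_tok(passage, return_tensors="pt")).pooler_output   # (1, 768)

print(torch.matmul(q_emb, p_emb.T).item())  # higher dot product = more relevant
```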
Training Steps:
- Take a query and relevant passage pair from annotations.
- Pre-trained models generate respective embedding vectors for the query and passage pair.
- The dot product is calculated between vectors.
- Use an optimization function to maximize the dot product for relevant pairs (while keeping it low for irrelevant pairs) and update the weights of both pre-trained models.
- Repeat steps 1 to 4 until performance on the validation data stops improving.
In short, the above training process will tune both models such that the embedding vectors of relevant queries and passages are similar, as sketched in the example below.
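Here is a simplified sketch of that objective, assuming a batch of annotated pairs has already been encoded into q_embs and p_embs by the two encoders. It uses in-batch negatives (every other passage in the batch serves as an irrelevant example), which is how DPR-style training is commonly implemented; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def retriever_loss(q_embs: torch.Tensor, p_embs: torch.Tensor) -> torch.Tensor:
    """q_embs, p_embs: (batch_size, dim); row i of p_embs is the positive passage for row i of q_embs."""
    scores = torch.matmul(q_embs, p_embs.T)      # all pairwise dot products, shape (B, B)
    targets = torch.arange(scores.size(0))       # positives sit on the diagonal
    return F.cross_entropy(scores, targets)      # pushes positive dot products up, negatives down

# Inside the training loop (the optimizer covers both encoders' parameters):
#   loss = retriever_loss(q_embs, p_embs)
#   loss.backward()
#   optimizer.step()
```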
Even though sparse retrievers are not trainable, they are generally preferred over dense retrievers when users search using keywords instead of complete queries or statements. Can you think why? :) (Add your answer in comments!)
Training Reader Models:
In this article, I explained the machine reading comprehension or reader model. A reader takes a user question and a relevant passage (predicted by the retriever) to predict the exact answer. In order to do that, it predicts the start and end index in the relevant passage.
Often the reader for extractive question answering is a language model based on an encoder architecture. A classic example is Google's BERT. BERT predicts answers by predicting which token marks the start of the answer and which token marks the end.
We feed the final embedding of each token in the text into the start token classifier. The classifier calculates the dot product of each output embedding with the ‘start’ weight vector and applies a softmax activation to get a probability distribution over the tokens. The token with the highest probability of being the start token is chosen. We repeat the same process for the end token, but with a separate weight vector.
The weights of both the start and end classifiers are updated based on training data collected through the annotation process.
From the above two pictures, we can see that the “340” token has the highest probability of being the start token, and the “M” token has the highest probability of being the end token. Hence, for a sample question like “How many parameters does the BERT large model have?”, the reader model predicts 340M, as in the sketch below.
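Here is a hedged sketch of the start/end prediction described above, using a BERT-style checkpoint already fine-tuned for extractive QA (the model name and text are illustrative):

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "deepset/bert-base-cased-squad2"  # assumed publicly available QA checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "How many parameters does the BERT large model have?"
context = "BERT large has 340M parameters, 24 layers, and 16 attention heads."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Choose the most probable start and end token indexes and decode the span.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))  # expected: "340M"
```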
Conclusion:
To sum up, there are various methods, such as supervised, unsupervised, weak supervision, and active learning to train extractive question-answering models. Supervised training requires considerable data and resources, yet the results are worth it. Annotations and best practices should be employed for collecting and constructing datasets for tuning models. Both the Retriever and Reader pre-trained models can be tuned using annotated data. By means of this training, machines can comprehend user queries and precisely extract the most pertinent answer to a question from a specified text corpus.
Stay tuned for more articles in the Understanding Semantic Search Series! (Learn more about other articles in the series here)
Add me on LinkedIn. Thank you!