Understanding Semantic Search — (Part 6: Tune Dense Retriever Models without any Human Annotations using Generative Pseudo Labeling)

Kaushik Shakkari
6 min readSep 13, 2022
Photo by author: A super hot day in Miami, Florida (July 2022)

Tuning language models requires a vast amount of training data. Generally, AI professionals use supervised fine-tuning approaches to tune models for high performance. However, it is hard to find tens of thousands of labeled data. For example, I recently started working in the Spanish legal domain and could not find relevant supervised training data. Creating annotations for tuning language models is complex and requires a reasonable amount of time and resources. In fact, I have seen some customers and domain experts getting a little frustrated and even giving incorrect annotations when asked to provide labels for tuning language models.

Photo by Ben White on Unsplash

In Part 2: Machine Reading Comprehension at Scale of the series, I introduced Sparse Retrieval Algorithms and Dense Retrieval Algorithms. In this article, we will explore an unsupervised domain adaptation technique for fine-tuning Dense Retrieval Algorithms called Generative Pseudo Labeling(GPL). The idea of GPL is to generate training data and use it for tuning language models.

All you need for performing unsupervised fine-tuning are the following:

  1. A Dense Retrieval Algorithm like Facebook AI’s Dense Passage Retriever algorithm.
  2. Unlabeled target domain data (documents/crawled website text data). For example, In my case, I crawled PDFs from Spanish legal websites like Cámara de Senadores (Senate of the Republic).
  3. Pre-trained query generation model to generate queries for passages from unlabeled data.
  4. An optional post-processing step on queries and passages.
  5. Model training against loss function to fine-tune Dense Retrieval Algorithm using generated synthetic queries.

Let us now discuss steps 3,4,5 from above in detail:

Step 3: Query Generation

In Natural Language Processing (NLP), query generation is the task of generating synthetic queries for a given piece of text. Many language models can be used for the query generation task. GPL uses T5, the Text-to-Text Transfer Transformer language model.

T5 introduction by Google AI

Google AI introduced T5, an 11-billion parameter model trained on the Colossal Clean Crawled Corpus (C4) corpus. T5 has achieved state-of-the-art results on many NLP downstream tasks like question answering, machine translation, sentiment analysis, document summarization, etc.

T5 is also used for the query generation task. You can find the pre-trained T5 model on Microsofts’ MS MARCO Passage Dataset (a dataset with 500k actual search queries from Bing together with the relevant passage) here.

T5 model generated “who is tony stark” for the above passage — from HuggingFace.

Let us name extracted passages from domain data as original passages (OP). We use the T5 model to generate a query (GQ) for each original passage. (OP, GQ) pairs are used in the negative mining and pseudo labeling step described below.

Step 4: Negative Mining and Pseudo Labeling

The quality of some queries generated by T5 might be poor because T5 is trained on generic data, whereas we are using this model on passages for a specific domain. For example, a legal passage might not have the answer for the query generated from the same passage (false positive queries). We get false positive queries because T5 was not explicitly trained on legal data. Unlike other unsupervised fine-tuning techniques like GenQ, GPL performs additional post-processing steps, called negative mining and pseudo labeling, to identify noisy data before tuning our dense passage retriever model.

Negative mining aims to find passages similar to the original passages. Later a passage is randomly selected from similar passages as a negative passage (NP) for each OP to create (GQ, OP, NP) pairs.

The assumption is that a negative passage won’t answer the query generated by the original passage, and the DPR model needs to learn to separate the original passage and the negative passage for a given generated query from the original passage.

Pseudo Labeling aims to calculate the margin by taking the difference between cross-encoder model similarities of (GQ, OP) and (GQ, NP) to create (GQ, OP, NP, margin) pairs for training DPR. In Part 4 of this series, I introduced Semantic Answer Similarity metrics, which use a cross-encoder to calculate the similarity between answer pairs. We can use the same cross-encoder model to calculate the similarity between (GQ, OP) and (GQ, NP) pairs.

For a generated query (GQ) in a (GQ, OP, NP, margin) pair,

margin > 0 indicates that a negative passage (NP) is more relevant than the original passage (OP) for its generated query (GQ). In this case, GQ might be a bad query (noisy), and OP might not answer GQ

margin < 0 indicates that the original passage (OP) is more relevant than the negative passage (NP). In this case, GQ might be a good query, and OP might answer GQ

margin ~= 0 indicates that both original passage (OP) and negative passage (NP) might be equally relevant

Pseudo Labeling enables fine-tuning to consider original and negative passages differently (relevant, partially relevant, and irrelevant) based on the margin score.

Step 5: Fine-tuning

The Dense Passage Retriever (DPR) takes (GQ, OP, NP) pairs and generates dense embeddings for GQ, OP, and NP, respectively. A new margin is calculated by taking the difference of similarities between (GQ, OP) and (GQ, NP) dense embedding pairs. Mean Square Error (MSE) loss is used to tune the DPR model.

Mean Square Error (MSE) formula from Wikipedia

To apply the above MSE loss formula for our case, we consider n as the number of (GQ, OP, NP, margin, new margin) pairs, Y as the margin (label), and Ŷ as the new margin (prediction).

This training against MSE loss makes the DPR model’s predicted new margin mimic the margin between original and negative passage pairs. This fine-tuning enables DPR to adapt embeddings to the respective user domain as original passages used for tuning are from crawled or collected domain documents.

The pros of using unsupervised tuning like GPL are unlike supervised tuning models, we need not worry about training data at all, like quality and quantity of human annotations and overfitting with training data, etc.

Conclusion:

Some might still wonder how a technique like GPL is effective for domain adaptation. However, the GPL paper stated, “On six representative domain-specialized datasets, we find the proposed GPL can outperform an out-of-the-box state-of-the-art dense retrieval approach by up to 9.3 points nDCG@10.” I applied GPL in the legal domain, and the predictions from the tuned model are highly relevant compared to non-tuned DPR for given user questions.

Kudos, you completed domain adaptation by Generative Pseudo Labeling (GPL). In future articles, I will explain supervised fine-tuning as, generally, supervised fine-tuning is more effective than unsupervised fine-tuning.

Stay tuned for more articles in the Understanding Semantic Search Series! (Learn more about other articles in the series here)

Add me on LinkedIn. Thank you!

--

--

Kaushik Shakkari

Senior Data Scientist | My biggest success till day was converting my passion to my profession (Data Science 🚀)