Understanding Semantic Search (Part 7: The Rise of Vector Databases in the World of Semantic Search)
Advances in digitalization have led to a rapid increase in the creation, storage, and processing of data. MongoDB stated that “about 80–90% of data collected by organizations is unstructured”. It also mentioned that unstructured data is growing much faster than structured data. Moreover, Forbes stated that “95% of businesses cite the need to manage unstructured data as a problem for their business”.
Unstructured data comes in different forms, like text, images, and videos. In this article, we will focus on textual data and discuss how vector databases are becoming critical for managing text.
Machines cannot process raw text directly; they operate on numbers. Since 2018, organizations have released many powerful second-generation pre-trained language models, starting with Google's BERT, that can generate high-dimensional embeddings (vectors of real numbers) for a chunk of unstructured data (typically a sentence or passage in a document or webpage).
These embeddings represent the input text and can be used to build models for many downstream NLP tasks such as question answering, semantic similarity, summarization, classification, and clustering.
With the latest advancements from our research team in the science of language understanding, made possible by machine learning, we’re making a significant improvement to how we understand queries, representing the biggest leap forward in the past five years and one of the biggest leaps forward in the history of Search. — GOOGLE (October 2019)
In Part 2 of this series, I introduced the Retriever and Reader architecture to scale the Machine Reading Comprehension task to long documents. A retriever takes a user query as input and retrieves relevant passages or documents by computing the similarity between the vector embedding of the query and the vector embeddings of the documents indexed in the database. Semantic search applications of this kind need a dedicated database to store and process these vector embeddings.
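The similarity step described above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity as the measure and tiny made-up 3-dimensional embeddings; the names `cosine_similarity`, `retrieve`, and the document ids are hypothetical, and real models produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, doc_vecs, top_k=2):
    """Score every document against the query and return the best top_k (doc_id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, invented for illustration only.
doc_vecs = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

results = retrieve(query_vec, doc_vecs)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Here the two documents whose vectors point in roughly the same direction as the query rank first, while the unrelated one is left out.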
The relational database concept was introduced in 1970 to model and store data efficiently. However, relational databases have strict logical constraints and store data only in a tabular format, which makes them unsuitable for unstructured data. Later, in the 2000s, NoSQL databases became popular for handling unstructured data. However, traditional NoSQL databases need additional overhead to process unstructured data and to perform analytical operations like similarity search. Traditional databases also cannot run compute-intensive dense retriever algorithms against frequently changing data at scale. Hence, a specialized class of databases for embeddings, called vector databases, came into existence!
Vector databases are partially or fully managed services explicitly designed for vector embeddings. Let us explore some of the main features of vector databases.
Consider a document retriever application that takes the user’s question as input and queries a vector database for relevant documents. Irrespective of the industry, customers want high performance from retriever applications. Performance is not limited to accuracy; it also includes the following metrics:
- latency: time to retrieve relevant documents
- throughput: the number of queries the retriever application can serve per second
- index time: the time needed to ingest new documents and make them ready for users to query
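As a rough sketch of how the first two metrics could be measured, the helper below times a search function over a batch of queries. Everything here is an assumption made for illustration: `measure` and `dummy_search` are hypothetical names, the queries run one after another rather than concurrently, and `dummy_search` only stands in for a real call to a vector database.

```python
import statistics
import time


def measure(search_fn, queries):
    """Run search_fn over the queries sequentially and report latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        search_fn(query)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "mean_latency_s": statistics.mean(latencies),
        "max_latency_s": max(latencies),
        "throughput_qps": len(queries) / elapsed,
    }


def dummy_search(query):
    """Placeholder workload standing in for a real retriever call."""
    return sum(i * i for i in range(1000))


stats = measure(dummy_search, list(range(200)))
print(stats)
```

The same timer pattern can be wrapped around an ingestion call to estimate index time.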
Approximate Nearest Neighbors (ANN):
Latency is often as important as accuracy for retriever models in search applications. To find relevant documents exactly, the retriever system must compare the query embedding with the embedding of every document indexed in the vector database. This brute-force process is time-consuming and can lead to latency that companies will not accept. Approximate Nearest Neighbor algorithms address this problem: they search only a small part of the index, which greatly reduces latency at the cost of a little accuracy.
Some examples of ANN algorithms and libraries are Google’s ScaNN, Meta’s FAISS, Spotify’s Approximate Nearest Neighbors Oh Yeah (Annoy), and Hierarchical Navigable Small World graphs (HNSW). Eyal Trabelsi has written an amazing article that provides details of these algorithms.
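To make the accuracy-for-latency trade concrete, here is a toy sketch of one ANN idea: bucket the vectors under a handful of centroids and scan only the buckets closest to the query. It is a simplified inverted-file scheme written for this article, not the implementation used by any of the libraries above; the function names, the random choice of centroids, and the parameters `n_cells` and `n_probe` are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force(query, vectors, k):
    """Exact search: rank every vector by its distance to the query."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))
    return order[:k]


def build_index(vectors, n_cells, rng):
    """Pick n_cells random vectors as centroids and bucket each vector under its nearest one."""
    centroids = rng.sample(vectors, n_cells)
    cells = [[] for _ in range(n_cells)]
    for i, vec in enumerate(vectors):
        nearest = min(range(n_cells), key=lambda c: sq_dist(vec, centroids[c]))
        cells[nearest].append(i)
    return centroids, cells


def ann_search(query, vectors, centroids, cells, k, n_probe):
    """Approximate search: scan only the n_probe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:n_probe]
    candidates = [i for c in probe for i in cells[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k], len(candidates)


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
centroids, cells = build_index(vectors, n_cells=20, rng=rng)
query = [rng.gauss(0, 1) for _ in range(8)]

exact = brute_force(query, vectors, k=5)
approx, scanned = ann_search(query, vectors, centroids, cells, k=5, n_probe=4)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"scanned {scanned} of {len(vectors)} vectors, recall@5 = {recall:.2f}")
```

Raising `n_probe` scans more vectors and pushes recall toward the exact result; lowering it makes the search faster but more likely to miss a true neighbor.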
Metadata Filtering:
Adding metadata is optional for documents indexed in a vector database. Metadata can include the document category, sub-category, time of ingestion, document source, and so on. A common approach is pre-filtering: first filter the data on metadata, such as document category, to reduce the search space and retrieve results faster. For instance, users might be interested only in documents for the state of California, so we would filter documents on the “state” metadata before applying any model. Moreover, production-ready managed vector databases like Pinecone claim that their single-stage filtering technique reduces latency further than ordinary pre-filtering or post-filtering.
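The difference between filtering before and after the similarity search can be shown with a small sketch. The documents, the `state` field, and the function names below are made up for illustration, and a dot product stands in for whatever similarity measure the retriever actually uses; this says nothing about how any particular vendor implements filtering.

```python
def dot(a, b):
    """Dot product used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, docs, k):
    """Return the k documents most similar to the query."""
    return sorted(docs, key=lambda d: dot(query, d["vector"]), reverse=True)[:k]


def pre_filter_search(query, docs, k, state):
    """Filter on metadata first, then rank only the surviving documents."""
    candidates = [d for d in docs if d["metadata"]["state"] == state]
    return top_k(query, candidates, k)


def post_filter_search(query, docs, k, state):
    """Rank everything first, then drop hits whose metadata does not match."""
    return [d for d in top_k(query, docs, k) if d["metadata"]["state"] == state]


docs = [
    {"id": "a", "vector": [0.9, 0.1], "metadata": {"state": "Texas"}},
    {"id": "b", "vector": [0.8, 0.2], "metadata": {"state": "Texas"}},
    {"id": "c", "vector": [0.7, 0.3], "metadata": {"state": "California"}},
    {"id": "d", "vector": [0.1, 0.9], "metadata": {"state": "California"}},
]
query = [1.0, 0.0]

pre = [d["id"] for d in pre_filter_search(query, docs, k=2, state="California")]
post = [d["id"] for d in post_filter_search(query, docs, k=2, state="California")]
print("pre-filter:", pre)    # ['c', 'd']
print("post-filter:", post)  # []
```

In this toy data the two best overall matches are both from Texas, so post-filtering returns nothing for California, while pre-filtering searches a smaller set and still returns two results.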
Scaling:
Approximate Nearest Neighbors algorithms and metadata filtering help reduce the latency of a single user query. However, production systems might have tens of thousands of users querying concurrently and expecting results within a few seconds. Hence, vector databases must scale to deliver high throughput and low latency.
Vector databases use replication and sharding to scale horizontally. Documents are divided into shards, and replicas of each shard are distributed across multiple machines so that user queries can be processed in parallel. A good vector database manages this entire workflow.
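A sharded search usually follows a scatter-gather pattern: send the query to every shard, collect each shard's best hits, and merge them. The sketch below illustrates that idea under several assumptions: shards are plain dictionaries in one process, documents are routed by a CRC32 hash of their id, a thread pool stands in for network calls to separate machines, and replication is left out. All names are hypothetical.

```python
import heapq
import zlib
from concurrent.futures import ThreadPoolExecutor


def dot(a, b):
    """Dot product used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


def build_shards(docs, n_shards):
    """Route each document to a shard using a stable hash of its id."""
    shards = [{} for _ in range(n_shards)]
    for doc_id, vec in docs.items():
        shards[zlib.crc32(doc_id.encode()) % n_shards][doc_id] = vec
    return shards


def search_shard(shard, query, k):
    """Return the k best (score, doc_id) pairs held by one shard."""
    scored = ((dot(query, vec), doc_id) for doc_id, vec in shard.items())
    return heapq.nlargest(k, scored)


def search_all(shards, query, k):
    """Scatter the query to every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, query, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


docs = {
    "d1": [0.9, 0.1],
    "d2": [0.2, 0.8],
    "d3": [0.7, 0.3],
    "d4": [0.1, 0.9],
    "d5": [0.6, 0.4],
}
shards = build_shards(docs, n_shards=3)
top = search_all(shards, [1.0, 0.0], k=2)
print(top)  # ['d1', 'd3']
```

Because every shard returns its own top k, the merged result matches what a single unsharded search would return. In a replicated setup, a router would additionally pick one healthy copy of each shard per query.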
Conclusion:
Vector databases play a significant role in search and recommendation systems. They are not limited to text; they are also used to manage other unstructured data types. Spotify uses them for music embeddings, Instagram for image embeddings, YouTube for video embeddings, and so on. Soon, they will be everywhere!
I highly recommend this Medium article from Dmitry Kan, which compares the vector databases widely used in the industry.
Stay tuned for more articles in the Understanding Semantic Search Series! (Learn more about other articles in the series here)