Hybrid Search with Vald: A Prospective Future for Search

vald.vdaas.org · Published in ITNEXT · 4 min read · Apr 1, 2024

This article presents an experimental evaluation of Hybrid Search, which combines Vald (NGT) for semantic search and OpenSearch BM25 for lexical search.

What is Hybrid Search?

Hybrid Search combines different search approaches, such as full-text and vector search (that is, lexical and semantic search), to improve search accuracy. With the recent rise of vector search technology, Hybrid Search has become a hot topic in the search technology community. The approach pairs modern machine learning models such as LLMs, and the vectorization and vector search technologies that have evolved alongside them, with established classical search techniques. This fusion has led to a surge in research and practical applications.

Several search libraries and engines, including LangChain and OpenSearch, already offer Hybrid Search functionality. Numerous studies have also demonstrated the effectiveness of this approach.

The Vald Project’s Exploration of Hybrid Search

At Vald, we are focusing on the development of vector search technologies. However, we also keep a close eye on the latest trends in search technology as a whole, and Hybrid Search is one of the topics that has particularly caught our attention. Hybrid Search has the potential to significantly contribute to the future of search technology and is worth exploring further. Therefore, inspired by research from Pinecone and others, we conducted an experimental evaluation to assess the extent to which Hybrid Search can improve search accuracy using Vald.

Experiment: Evaluating the Impact of Hybrid Search on Recall and NDCG

The main objectives of this experiment were to:

  • Evaluate the accuracy improvement achieved by Hybrid Search compared to individual search methods.
  • Analyze the impact of mixing different proportions of search results on the overall accuracy.

In this experiment, we implemented Hybrid Search using Vald (NGT) for Semantic Search and OpenSearch’s BM25 for Lexical Search. We used Recall and NDCG as evaluation metrics to assess the search accuracy.
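For readers less familiar with the two metrics, here is a minimal sketch of how Recall and NDCG can be computed for a single query. It assumes binary relevance labels and a cutoff `k`; the function names and the binary-gain formulation are illustrative assumptions, not necessarily the exact evaluation code used in the experiment.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    # A relevant document at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # The ideal ranking places all relevant documents first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical document ids, for illustration only.
ranked = ["d3", "d1", "d7"]
relevant = ["d1", "d9"]
print(recall_at_k(ranked, relevant, k=3))
print(ndcg_at_k(ranked, relevant, k=3))
```

Recall only asks whether relevant documents were retrieved at all, while NDCG also rewards placing them near the top, which is why the two metrics are reported side by side.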

Experimental Conditions

The experiment was conducted with the following settings:

Dataset

  • MS MARCO passage (indexed passages: 8.8M, validation queries: 6,980)

Model

  • SentenceBERT

Vectorization

The model accepts at most 512 tokens per input. For documents longer than this limit, we also vectorized the remaining tokens and averaged the resulting vectors to obtain the overall document vector.
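The averaging step can be sketched as follows. This is one plausible reading of the procedure: the token sequence is split into consecutive windows of at most 512 tokens, each window is embedded separately, and the window vectors are averaged with equal weight. The `encode_chunk` callable is a hypothetical stand-in for the SentenceBERT encoder, and the toy encoder exists only so the example runs without a model.

```python
MAX_TOKENS = 512  # token limit mentioned in the article


def embed_long_document(token_ids, encode_chunk, max_tokens=MAX_TOKENS):
    """Embed each window of at most max_tokens tokens and average the vectors."""
    if not token_ids:
        raise ValueError("document has no tokens")
    chunks = [
        token_ids[i:i + max_tokens]
        for i in range(0, len(token_ids), max_tokens)
    ]
    vectors = [encode_chunk(chunk) for chunk in chunks]
    dim = len(vectors[0])
    # Unweighted mean over chunks (an assumption; a length-weighted mean
    # would be another reasonable choice).
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


def toy_encoder(chunk):
    """Hypothetical 2-dimensional 'embedding', used only for illustration."""
    return [float(len(chunk)), float(sum(chunk))]


# A 1000-token document is split into windows of 512 and 488 tokens.
doc_vector = embed_long_document(list(range(1000)), toy_encoder)
print(doc_vector)
```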

Vald (v1.7.8)

  • Search parameters: radius: -1.0, epsilon: 0.05, timeout: 3s, k: 1000
  • Edge size pairs: (creation_edge_size: 50, search_edge_size: 50) (default); (creation_edge_size: 100, search_edge_size: 100)

OpenSearch (v2.11.0)

  • BM25 parameter sets: (k1: 1.2, b: 0.75, k=size: 1000) (default); (k1: 0.9, b: 0.4, k=size: 1000)
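As an illustration of where `k1` and `b` live, the sketch below builds an index-creation request body that overrides the default BM25 similarity. It assumes the parameters are set through the index-level `similarity` settings; the article does not state how they were actually configured, and the helper name is hypothetical.

```python
import json


def bm25_index_settings(k1, b):
    """Build an index-creation body that sets the default similarity to BM25."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    "default": {"type": "BM25", "k1": k1, "b": b}
                }
            }
        }
    }


# The two parameter sets listed above.
DEFAULT_BM25 = bm25_index_settings(k1=1.2, b=0.75)
TUNED_BM25 = bm25_index_settings(k1=0.9, b=0.4)

print(json.dumps(TUNED_BM25, indent=2))
```

A body like this would be sent when creating the index, and `size: 1000` would then be passed on each search request to retrieve the top 1000 hits.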

Mixing Search Results

To mix the search results, we defined a parameter α that represents the proportion of the final result list contributed by each engine. The final top-1000 search results were a combination of Vald and OpenSearch results, weighted by α.
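One simple way to realize this kind of mixing is sketched below. It assumes that α is the share of the final list taken from Vald (so 1 - α comes from OpenSearch), that duplicates are dropped, and that any shortfall is backfilled from whichever list still has unseen results. All of these are illustrative assumptions; the article does not specify the exact procedure, and `mix_results` is a hypothetical name.

```python
def mix_results(vald_ids, opensearch_ids, alpha, k=1000):
    """Combine two ranked id lists into at most k ids, alpha being the Vald share."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    n_vald = round(alpha * k)
    mixed, seen = [], set()

    def take(source, limit):
        # Append unseen ids from `source` until `mixed` holds `limit` items.
        for doc_id in source:
            if len(mixed) >= limit:
                break
            if doc_id not in seen:
                seen.add(doc_id)
                mixed.append(doc_id)

    take(vald_ids, n_vald)      # Vald's share first
    take(opensearch_ids, k)     # then fill the rest from OpenSearch
    take(vald_ids, k)           # backfill from Vald if OpenSearch ran short
    return mixed


# Hypothetical ids, for illustration only.
vald = ["a", "b", "c", "d"]
opensearch = ["b", "e", "f", "g"]
print(mix_results(vald, opensearch, alpha=0.5, k=4))
```

With α = 0 the list comes entirely from OpenSearch, and with α = 1 entirely from Vald, so the two endpoints of the curves correspond to the individual engines. Note that this sketch simply places Vald's results ahead of OpenSearch's, which matters for rank-sensitive metrics such as NDCG.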

Results and Analysis

The graphs below show the Recall/α and NDCG/α curves.

[Figure: Recall/α curve]
[Figure: NDCG/α curve]

Accuracy Improvement

As the graphs show, Hybrid Search outperformed both Vald alone and OpenSearch alone in terms of Recall and NDCG. This supports the potential of Hybrid Search to deliver more relevant search results.

We also evaluated the impact of tuning the parameters of Vald and OpenSearch on the accuracy. Regardless of these parameters, the shape of the curves for both metrics is similar, indicating that the accuracy changes consistently with the value of α.

Impact of Mixing Search Results

The accuracy of Hybrid Search was sensitive to the value of α. In this experiment, the best accuracy was consistently achieved at α=0.5, indicating that mixing equal proportions of results from Vald and OpenSearch yielded the best outcome. Note that the optimal α can vary depending on the dataset and the search queries.

Conclusion

In this article, we explored the potential of Hybrid Search through an experimental evaluation using Vald and OpenSearch. While we focused on a specific dataset and model combination, we believe that further investigation with different text datasets, as well as image and audio data from other domains, can yield even more valuable insights.

We also performed some tuning on Vald during the experiments. Details about the tuning process and its effects will be covered in a future article, so stay tuned if you’re interested!

If you found this article helpful, please consider leaving a reaction and giving Vald a GitHub Star. Your feedback is greatly appreciated!

If you have any questions or requests, please feel free to contact us!

We’re waiting for your contributions!

See you in the next post :)
