Vald. A highly scalable distributed fast approximate nearest neighbor dense vector search engine.

vald.vdaas.org
Geek Culture
Published in
6 min readMay 31, 2021

--

The State of Data Retrieval

Nowadays, the need for information retrieval is increasingly based on text and various types of data, including text, images, video, and audio. These technologies are getting more and more advanced every year. The types of data and ways to interpret them are also diversifying. Currently, these technologies are helpful in a wide range of areas, such as recommendation search, which includes the user’s behaviour in context, and image search, which uses image feature vector to perform similar searches. As these technologies and contents develop, the scale of these technologies is increasing every year. It is not easy to search these data efficiently with traditional rule-based search engines. Text-only search has already been outmoded. In addition to the image search, text search methods are also becoming popular, where queries are processed with natural language processing that accurately interprets the meaning of the text and returns near-meaningful search results. In another area, a multi-modal search by image and text becomes available by generating an image feature and text feature combination. As mentioned above, a wide variety of data can be represented as vectors using deep learning techniques these days. As a result, the vector search engine as a model-independent search technology has become an important technology. In this blog, I would like to introduce a product that uses Approximate Nearest Neighbor (ANN) Search used in Vector search technology.

What is Approximate Nearest Neighbor (ANN)

Before explaining ANN, let us take a look at kNN (k-Nearest Neighbor). Many experts have already explained kNN, so that I will give a simplified explanation here. kNN is one of the methods to find the most similar vector as well as ANN. It is a method to search k nearest neighbor data to the query. However, this method has concerns about the amount of data and performance. The trade-off of returning exact neighbors in kNN is that the amount of data to search increases as the number of data increases, which takes a lot of computation time. This is why ANN has appeared as another method. ANN is a method to search for data that is approximately in the neighbor or approximate neighbor. There are many ways to implement ANN, but the most common are Graph or Tree-based implementations. Graph-based ANN does not search all Graph Edges. Instead, it searches a portion of them by setting a threshold to prevent a proportional increase in computation time as the number of data increases.

The ANN method is an efficient way to calculate the similarity of vectors for recent big data.

Current Status of Approximate Nearest Neighbor Search Using Vectors

Many companies are working on vector approximately nearest neighbor search, such as Yahoo Japan’s NGT, Facebook’s Faiss, Spotify’s Annoy, and Microsoft’s SPTAG. These libraries are open source projects. You can check them on GitHub. So far, the most balanced and best-performing engine is Yahoo Japan’s NGT. The results of each benchmark can be found at ANN-Benchmarks.

About NGT

NGT has been developed by @masajiro, a researcher at Yahoo Japan. The NGT uses a “Graph And Tree” design for fast Approximate k-Nearest Neighbor Search. NGT has a server implementation called NGTD. It provides NGT’s vector search functionality through a REST and gRPC interface.

The NGT is a well-balanced engine that enables high performance and high precision vector approximately nearest neighbor search. Due to the nature of keeping indexes in memory, it requires much memory. Especially in multi-modal data, the number of dimensions of each vector data tends to be high. Therefore, the number of indexes that can be used is limited to the maximum memory capacity of the server.

There is another problem. Suppose the NGTD server holds a large number of indexes. In that case, the user must accurately control the complicated vector indexing workflow such as Insert, Search, Delete, Update, Commit command of Graph index, and the system locks during the commit operation. So the amount of data as the processing time of commit increases with the increase, it becomes more frequent to lock actions via the user’s API.

From these things, we realized that NGT/NGTD has its limits, and the design of the clusterized NGTD project began. However, it cannot avoid many of these problems and inconveniences because the operation cost cannot be reduced by just distributing. We reviewed and redesigned the entire existing system to improve it.

That is why we developed a distributed dense vector approximate nearest neighbor search engine called “Vald”.

What is Vald?

Vald is a highly scalable distributed fast approximate nearest neighbor dense vector search engine with a horizontal scaling feature designed based on microservices architecture with a high affinity with Kubernetes. Vald is an OSS project that anyone can easily contribute to using GitHub. Vald was initially developed based on NGTD to support the ANN search using big data on a Yahoo Japan scale. It has been designed and developed to meet many requirements, such as stability, disaster recovery capabilities, and performance requirements, which is why it is often tested by Yahoo Japan, a substantial Japanese IT company.

1. Cloud-Native Vector Similarity Search Engine

2. Highly Scalable

  • Vald is designed for the Kubernetes environment. The scalability is Kubernetes cluster size. Users can quickly scale their Kubernetes if cluster scales Vald can use more and more memory (vector index) capacity.
  • Vald has distributed vector store algorithms and compute-resource based balancing algorithms, and it enables cluster-wide optimized memory control

3. Highly customizable

  • All Vald component is configurable using a YAML file and Environment Variable.
  • Users can customize many params such as grpc MsgBufferSize and others
  • The Vald architecture is well divided based on domain feature, and users can easy to customize Vald component and easy to add new DB support
  • We have a Filter component both of Ingress and Egress
  • Using the Ingress-Filter component, users can easy to manipulate incoming data. e.g. Convert blob data to vector using TensorFlow Vector dimension padding
  • Using the Egress-Filter component, users can easy to manipulate outgoing search results. ANN Search result is approximate. Sometimes it makes problems for recommendation accuracy. Vald has Egress-Filter, users can easy to modify ANN search results.

4. Auto Vector Index Management

  • Automatically backup index data to multiple locations such as AWS S3, Google Cloud Storage, Cassandra, MySQL, Local Volume.
  • Automatically recover the index from the backup location.
  • Automatically commit index for non-blocking vector search experience.
  • Automatically rebalance index replication for stability.

5. Multiple Language Support

  • Supported languages are Go, Python, Java, Node, Clojure.
  • We also have proto files that many languages can connect to Vald using their gRPC libraries.

Finally

Thanks for taking the time to read this blog post. You can try Vald right away from Get-Started here.

We will post more about Vald in the future, including how to use it and improve your search engine, so please look forward to the next post.

--

--

vald.vdaas.org
Geek Culture

A highly scalable distributed fast approximate nearest neighbor dense vector search engine.