Vald. A highly scalable distributed fast approximate nearest neighbor dense vector search engine.

The State of Data Retrieval

Nowadays, the need for information retrieval is increasingly based on text and various types of data, including text, images, video, and audio. These technologies are getting more and more advanced every year. The types of data and ways to interpret them are also diversifying. Currently, these technologies are helpful in a wide range of areas, such as recommendation search, which includes the user’s behaviour in context, and image search, which uses image feature vector to perform similar searches. As these technologies and contents develop, the scale of these technologies is increasing every year. It is not easy to search these data efficiently with traditional rule-based search engines. Text-only search has already been outmoded. In addition to the image search, text search methods are also becoming popular, where queries are processed with natural language processing that accurately interprets the meaning of the text and returns near-meaningful search results. In another area, a multi-modal search by image and text becomes available by generating an image feature and text feature combination. As mentioned above, a wide variety of data can be represented as vectors using deep learning techniques these days. As a result, the vector search engine as a model-independent search technology has become an important technology. In this blog, I would like to introduce a product that uses Approximate Nearest Neighbor (ANN) Search used in Vector search technology.

What is Approximate Nearest Neighbor (ANN)

Before explaining ANN, let us take a look at kNN (k-Nearest Neighbor). Many experts have already explained kNN, so that I will give a simplified explanation here. kNN is one of the methods to find the most similar vector as well as ANN. It is a method to search k nearest neighbor data to the query. However, this method has concerns about the amount of data and performance. The trade-off of returning exact neighbors in kNN is that the amount of data to search increases as the number of data increases, which takes a lot of computation time. This is why ANN has appeared as another method. ANN is a method to search for data that is approximately in the neighbor or approximate neighbor. There are many ways to implement ANN, but the most common are Graph or Tree-based implementations. Graph-based ANN does not search all Graph Edges. Instead, it searches a portion of them by setting a threshold to prevent a proportional increase in computation time as the number of data increases.

Current Status of Approximate Nearest Neighbor Search Using Vectors

Many companies are working on vector approximately nearest neighbor search, such as Yahoo Japan’s NGT, Facebook’s Faiss, Spotify’s Annoy, and Microsoft’s SPTAG. These libraries are open source projects. You can check them on GitHub. So far, the most balanced and best-performing engine is Yahoo Japan’s NGT. The results of each benchmark can be found at ANN-Benchmarks.

About NGT

What is Vald?

1. Cloud-Native Vector Similarity Search Engine

2. Highly Scalable

  • Vald is designed for the Kubernetes environment. The scalability is Kubernetes cluster size. Users can quickly scale their Kubernetes if cluster scales Vald can use more and more memory (vector index) capacity.
  • Vald has distributed vector store algorithms and compute-resource based balancing algorithms, and it enables cluster-wide optimized memory control

3. Highly customizable

  • All Vald component is configurable using a YAML file and Environment Variable.
  • Users can customize many params such as grpc MsgBufferSize and others
  • The Vald architecture is well divided based on domain feature, and users can easy to customize Vald component and easy to add new DB support
  • We have a Filter component both of Ingress and Egress
  • Using the Ingress-Filter component, users can easy to manipulate incoming data. e.g. Convert blob data to vector using TensorFlow Vector dimension padding
  • Using the Egress-Filter component, users can easy to manipulate outgoing search results. ANN Search result is approximate. Sometimes it makes problems for recommendation accuracy. Vald has Egress-Filter, users can easy to modify ANN search results.

4. Auto Vector Index Management

  • Automatically backup index data to multiple locations such as AWS S3, Google Cloud Storage, Cassandra, MySQL, Local Volume.
  • Automatically recover the index from the backup location.
  • Automatically commit index for non-blocking vector search experience.
  • Automatically rebalance index replication for stability.

5. Multiple Language Support

  • Supported languages are Go, Python, Java, Node, Clojure.
  • We also have proto files that many languages can connect to Vald using their gRPC libraries.

Finally

Thanks for taking the time to read this blog post. You can try Vald right away from Get-Started here.

A highly scalable distributed fast approximate nearest neighbor dense vector search engine.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store