A Super Easy Way to Try Similarity Search using Vald

vald.vdaas.org
5 min read · Jun 21, 2021

How to deploy Vald on k3d within 5 minutes

Photo by Sebastian Unrau on Unsplash

In this post, we’ll show a quick and easy way to perform a similarity search using Vald.

Prerequisites

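Before we start, we need to install four tools: Helm, jq, k3d, and kubectl. k3d runs its cluster nodes as Docker containers, so Docker also needs to be running.

If you have Homebrew installed, you can install the four tools with the following commands.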

$ brew install helm
$ brew install jq
$ brew install k3d
$ brew install kubectl

Setting up a k3d cluster

Create a k3d cluster by executing the following command. It creates a cluster with 4 nodes (1 server + 3 agents).

$ k3d cluster create vald --api-port 6550 -p "8080:80@loadbalancer" --agents 3 --image=docker.io/rancher/k3s:v1.20.9-k3s1

Please ensure that kubectl cluster-info returns a proper result.

$ kubectl config use-context k3d-vald
$ kubectl cluster-info
Kubernetes master is running at https://0.0.0.0:6550
CoreDNS is running at https://0.0.0.0:6550/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://0.0.0.0:6550/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

Deploy Vald using Helm

Use Helm to deploy Vald to your k3d cluster.

$ helm repo add vald https://vald.vdaas.org/charts
$ helm install vald-cluster \
--version v1.1.0 \
--set defaults.image.tag=v1.1.0 \
--set gateway.backup.enabled=false \
--set gateway.meta.enabled=false \
--set gateway.lb.minReplicas=1 \
--set gateway.lb.hpa.enabled=false \
--set gateway.lb.ingress.enabled=true \
--set gateway.lb.ingress.host=localhost \
--set gateway.lb.ingress.annotations."ingress\.kubernetes\.io/protocol"=h2c \
--set gateway.lb.ingress.annotations."ingress\.kubernetes\.io/ssl-passthrough"="\"true\"" \
--set gateway.lb.ingress.annotations."traefik\.protocol"=h2c \
--set gateway.lb.ingress.annotations."kubernetes\.io/ingress\.class"=traefik \
--set gateway.lb.gateway_config.index_replica=3 \
--set agent.minReplicas=3 \
--set agent.resources.requests.cpu=100m \
--set agent.resources.requests.memory=100Mi \
--set agent.ngt.auto_index_check_duration=1m \
--set agent.ngt.auto_index_duration_limit=30m \
--set agent.ngt.dimension=300 \
--set agent.ngt.distance_type=l2 \
--set agent.ngt.object_type=float \
--set manager.compressor.enabled=false \
--set manager.backup.enabled=false \
--set meta.enabled=false \
vald/vald

Wait until all components are ready on the k3d cluster. The deployment is ready when kubectl get pods reports the Running status for every pod.

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
vald-agent-ngt-0 1/1 Running 0 70s
vald-agent-ngt-1 1/1 Running 0 54s
vald-agent-ngt-2 1/1 Running 0 41s
vald-discoverer-7f9d49dc54-zfl5s 1/1 Running 0 70s
vald-manager-index-5dd6d78d7c-qxvkq 1/1 Running 0 70s
vald-lb-gateway-6cf4548687-nqccd 1/1 Running 0 70s
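
If you would rather script this check than read the table by eye, the following is a minimal sketch. The function name all_pods_running is our own invention, and it assumes the default kubectl get pods column layout shown above, with STATUS in the third column.

```shell
# Hypothetical helper: reads `kubectl get pods` table output on stdin and
# succeeds only if at least one pod is listed and every STATUS is "Running".
# Assumes the default column layout (NAME READY STATUS RESTARTS AGE).
all_pods_running() {
  awk 'NR > 1 { n++; if ($3 != "Running") bad = 1 }
       END { exit (bad || n == 0) }'
}

# Usage against a live cluster (polls every 5 seconds):
#   until kubectl get pods | all_pods_running; do sleep 5; done
```

kubectl also ships a built-in alternative, kubectl wait --for=condition=Ready pod --all --timeout=300s, which blocks until every pod reports the Ready condition or the timeout expires.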

Similarity Search

In this section, we’re going to try a similarity search over common English words.

Install valdcli

valdcli is a handy command-line application for sending requests to a Vald cluster. You can install it from the vdaas/vald-client-clj repository. Binaries are available for macOS and Linux.

For macOS,

$ curl -LO https://github.com/vdaas/vald-client-clj/releases/download/v1.1.0/valdcli-macos.zip
$ unzip valdcli-macos.zip

For Linux,

$ curl -LO https://github.com/vdaas/vald-client-clj/releases/download/v1.1.0/valdcli-linux-static.zip
$ unzip valdcli-linux-static.zip

Put the valdcli executable in a directory on your PATH and confirm that it prints its help text.

$ valdcli --help
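
As a sketch of the "put it on your PATH" step, the helper below copies an executable into a directory and reports whether that directory is currently on PATH. The name install_bin and the destination directory in the usage line are assumptions for illustration, not something valdcli requires.

```shell
# Hypothetical helper: copy an executable into a directory and report
# whether that directory is already on PATH.
install_bin() {
  src=$1
  dest_dir=$2
  mkdir -p "$dest_dir" || return 1
  cp "$src" "$dest_dir/" || return 1
  chmod +x "$dest_dir/$(basename "$src")" || return 1
  case ":$PATH:" in
    *":$dest_dir:"*) echo "on PATH" ;;
    *) echo "not on PATH" ;;
  esac
}

# Usage (the destination is an assumption; any directory on PATH works):
#   install_bin ./valdcli "$HOME/.local/bin"
```

If it prints "not on PATH", add the directory to PATH in your shell profile before running valdcli --help.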

Insert word vector data

Download an example JSON dataset from GitHub.

$ curl -O https://raw.githubusercontent.com/rinx/word2vecjson/master/data/wordvecs10000.json

It contains word vectors for the 10000 most common English words, generated using word2vec. The same repository also provides 1000-, 5000-, and 25000-word datasets.

Let’s insert the word vectors using valdcli stream-insert. wordvecs10000.json is a JSON file that contains 10000 vectors, in a format like [{"id": "mac", "vector": [0.000001, 0.000002, ...]}]. Print the file with the cat command and pipe it to valdcli. valdcli stream-insert --json reads JSON data from stdin and inserts it into the Vald cluster.

$ cat wordvecs10000.json | valdcli -h localhost -p 8080 stream-insert --json
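
Because the agents were deployed with agent.ngt.dimension=300, every vector in the file should have 300 elements. The following is a small sketch for confirming that with jq; the function name vector_dims is hypothetical, and it assumes the file has the [{"id": ..., "vector": [...]}] shape described above.

```shell
# Hypothetical helper: prints the distinct vector lengths found in a JSON
# file shaped like [{"id": "...", "vector": [...]}, ...], comma-separated.
vector_dims() {
  jq -r '[.[] | .vector | length] | unique | map(tostring) | join(",")' "$1"
}

# Usage: a file whose vectors all have 300 elements prints a single "300".
#   vector_dims wordvecs10000.json
```

More than one number in the output would mean the file mixes vector lengths, which would not match the single dimension the cluster was configured with.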

After insertion, the index manager detects the uncommitted vectors and asks the agents to build their indexes. This may take a few minutes, depending on the value of auto_index_check_duration (set to 1m in the Helm command above).
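
One way to wait for indexing without guessing is to retry a cheap request until it succeeds. The sketch below is a generic retry loop; retry_until_ok is a hypothetical name, and the usage line assumes that valdcli exits with a non-zero status while the index is not yet searchable, which we have not verified.

```shell
# Hypothetical helper: run a command up to MAX times, sleeping DELAY seconds
# between attempts; succeeds as soon as the command exits with status 0.
retry_until_ok() {
  max=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" >/dev/null 2>&1 && return 0
    [ "$i" -ge "$max" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage (relies on the unverified exit-status assumption noted above):
#   retry_until_ok 30 10 valdcli -h localhost -p 8080 search-by-id -n 1 -e 0.1 mac
```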

Search similar words

After creating indexes, the agents will be ready to receive search requests. Now we’ll try to use two types of methods to search for similar words.

Search by vector

With valdcli, you can build requests from JSON-formatted vector data. Let’s send a search request using a vector from the wordvecs10000.json file. Using the jq command, you can extract the vector for the word "mac".

$ cat wordvecs10000.json | jq -c '.[] | select(.id == "mac") | .vector'
[-0.017271,-0.012576,-0.052985,0.078137,-0.05701,0.047284,-0.03823,0.048291,0.140177,0.030182,-0.111336,0.122068,0.079813,-0.109324,0.042757,0.025487,0.104629,0.035044,0.019199,-0.106642,0.003563,0.017941,0.044937,0.059357,0.015175,0.037392,0.043931,0.130116,-0.022636,0.025151,-0.028505,-0.000647,-0.041416,-0.053656,-0.023642,0.000312,0.030349,0.021295,0.017354,0.04326,0.051644,0.070759,0.022133,-0.008342,-4e-06,0.01115,-0.015845,0.083167,0.00851,0.089874,-0.033535,-0.034373,0.041583,-0.029511,-0.073777,0.056004,0.066064,-0.030685,-0.067741,0.035212,0.043931,0.04108,-0.035044,0.021965,0.054327,-0.025487,-0.044266,-0.018109,-0.128104,-0.025654,0.010564,-0.064387,-0.004485,-0.064723,-0.005911,-0.055668,-0.027163,-0.007671,0.016265,-0.005743,-0.066735,0.031355,0.01442,0.05265,0.105971,-0.158286,-0.018863,0.154261,0.040745,-0.058016,-0.078807,-0.015929,-0.002211,-0.037559,-0.012408,-0.004087,0.031355,0.093228,0.036553,-0.100605,-0.070759,0.057345,-0.01027,-0.016516,0.034876,-0.016768,-0.044937,0.07646,0.09591,-0.059357,-0.116702,-0.021379,-0.018612,0.068412,0.022972,-0.010983,-0.081826,-0.089874,-0.016516,-0.007839,-0.024984,-0.066399,-0.084508,-0.105971,0.030852,-0.1053,-0.034541,0.084844,-0.023978,-0.040577,-0.004318,0.022804,-0.118043,0.084508,-0.000922,-0.019367,0.027666,0.056004,-0.045608,0.056674,-0.017522,-0.049967,-0.002641,-0.123409,0.068076,0.010899,0.118714,-0.073442,-0.048961,-0.01551,0.004318,-0.024984,-0.063717,-0.012576,-0.009013,0.053991,-0.018109,-0.024481,-0.049297,0.012492,-0.042757,0.01442,-0.054998,0.054998,0.005198,-0.015342,0.01836,0.101276,-0.078807,-0.058016,-0.012073,0.072436,0.000482,-0.060698,-0.037056,0.079478,0.065393,0.007462,-0.012659,-0.00371,-0.006581,-0.126092,0.054662,-0.061034,-0.029008,-0.039068,0.046278,0.0332,-0.036721,-0.007084,-0.012911,-0.030182,-0.05768,-0.006204,-0.016851,0.038062,0.013414,0.032864,-0.045943,0.022133,-0.056004,-0.034038,0.002285,0.113349,0.177065,0.012492,-0.043931,-0.06204,-0.029511,0.026493,-0.024481,0.057345,-0.086521,-0.00436,-0.062375,-0.070759,-0.022469,0.020959,-0.080149,-0.121397,-0.02666,0.055668,0.067406,0.00918,-0.016265,0.037056,-0.107312,0.017019,0.02666,-0.022301,-0.043596,0.051644,-0.015845,-0.087862,-0.047955,0.025822,0.084844,-0.08149,0.001383,0.146213,-0.010731,0.013079,-0.034038,-0.103288,0.048626,0.013079,-0.08585,0.058016,0.018863,-0.024648,-0.138165,0.136152,0.059692,0.036386,0.050973,0.045943,-0.040745,0.045943,-0.061705,-0.019953,0.040577,0.021462,0.033032,0.036386,0.005994,0.072436,-0.112007,0.034709,-0.042087,-7.4e-05,0.041583,-0.034709,-0.099935,0.042757,0.027834,-0.051979,-0.021043,-0.101947,-0.046278,-0.015845,-0.019115,-0.053656,-0.048291,-0.002588,-0.031523,-0.05701,-0.082161,-0.006497,0.009138,0.070759,-0.058016,0.012408,0.031355,-0.029343,0.138835,-0.088533,-0.063717,-0.005408,-0.016768,0.040913]

This is a 300-dimensional vector that represents the features of the word “mac”.

Using this vector, a search request can be created like the following.

$ cat wordvecs10000.json | jq -c '.[] | select(.id == "mac") | .vector' | valdcli -h localhost -p 8080 search -n 5 -e 0.1 --json | jq
[
{
"id": "mac",
"distance": 0
},
{
"id": "windows",
"distance": 0.9622495
},
{
"id": "apple",
"distance": 0.9877319
},
{
"id": "leopard",
"distance": 1.0445662
},
{
"id": "computer",
"distance": 1.1087651
}
]

It returns the 5 words most similar to “mac”. The first hit is “mac” itself, at distance 0.
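
The cluster was deployed with agent.ngt.distance_type=l2, so the distance values are presumably Euclidean (L2) distances between the query vector and each result. As a rough illustration of that calculation, here is a sketch using awk; the function name l2_distance and the comma-separated input format are our own choices, not part of Vald or valdcli.

```shell
# Hypothetical helper: Euclidean (L2) distance between two equal-length
# vectors given as comma-separated strings, e.g. "0,0" and "3,4".
l2_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ",")
    m = split(b, y, ",")
    if (n != m) { print "dimension mismatch" > "/dev/stderr"; exit 1 }
    for (i = 1; i <= n; i++) { d = x[i] - y[i]; sum += d * d }
    print sqrt(sum)
  }'
}

# Usage with two words from the dataset (vec is a small local wrapper):
#   vec() { jq -r --arg w "$1" '.[] | select(.id == $w) | .vector | map(tostring) | join(",")' wordvecs10000.json; }
#   l2_distance "$(vec mac)" "$(vec windows)"
```

Since Vald performs approximate nearest neighbor search, a brute-force calculation like this is mainly useful for spot-checking individual results.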

Search by ID

You can also search using the ID of a word that is already in the index, without supplying its vector.

$ valdcli -h localhost -p 8080 search-by-id -n 10 -e 0.1 snow
[{:id "snow", :distance 0.0}
{:id "snowing", :distance 0.838368}
{:id "rain", :distance 0.86527264}
{:id "icy", :distance 0.93737966}
{:id "fog", :distance 0.95259696}
{:id "ice", :distance 0.9600457}
{:id "weather", :distance 0.98229766}
{:id "rains", :distance 0.9843779}
{:id "winter", :distance 1.0100114}
{:id "wet", :distance 1.0111399}]

It returns the 10 words most similar to “snow”, again including “snow” itself.

(valdcli returns EDN-formatted results by default. If you set the --json option, it returns JSON-formatted results.)
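
JSON output is convenient when you want to post-process results with jq. As a small sketch, the helper below reduces a result array to just the matching IDs; result_ids is a hypothetical name, and it assumes the [{"id": ..., "distance": ...}] shape shown in the search-by-vector output.

```shell
# Hypothetical helper: reads a JSON array of {"id": ..., "distance": ...}
# objects on stdin and prints one id per line.
result_ids() {
  jq -r '.[].id'
}

# Usage with the search-by-vector pipeline shown earlier:
#   cat wordvecs10000.json | jq -c '.[] | select(.id == "mac") | .vector' \
#     | valdcli -h localhost -p 8080 search -n 5 -e 0.1 --json | result_ids
```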

Closing

When you are finished, delete the cluster with the following command.

$ k3d cluster delete vald

In this article, we walked through an example of performing a similarity search using Vald.
For more information, please visit our website.

vald.vdaas.org

A highly scalable distributed fast approximate nearest neighbor dense vector search engine.