Publishing a Notebook with chiVe dataset

vald.vdaas.org
Sep 16, 2021

This post introduces the background and contents of the recently released vald-demo repository.

Why did we create this repository?

Vald is a highly scalable, distributed, fast approximate nearest neighbor dense vector search engine. It is a new OSS project and is not yet widely known. We are now working on growing the user community and attracting more contributions. This post and the publication of the notebook are part of those efforts.

Vald is designed with many new technologies to allow greater flexibility and to meet the various demands of users. However, high flexibility often comes with a complex software structure. This makes it hard to imagine how to use Vald, and it gives the impression that Vald is difficult to use and that it is hard to get a concrete picture of what it can do.

Vald now aims to be simple, easy to use, fast, and high-performance. As a first step, we created this notebook as a simple, concrete use case. By providing one concrete example of what can be done, we would like to make Vald easier to understand and use. We also created the notebook to help people use Vald easily in various environments and use cases.

This is the reason why we created this repository.

We would like to release more demos using other datasets, such as English text or images.

Contents of vald-demo/chive

Next, let’s take a look at the contents of the repository we have published.

The repository contains a chive directory that holds the notebook and the other files we created. The contents of the directory are as follows.

chive\
- README.md
- sample-values.yaml: Sample YAML for deploying Vald with Helm to run the notebook.
- tutorial.ipynb: Example of using Vald with chiVe.
- tutorial.md: Example of using Vald with chiVe (with output cells).

The released notebook is intended for those who have completed Get Started; completing it is required to run the notebook.

If you haven’t done it yet, please try it first. Below, we introduce the usage and contents of the notebook.

How to use

In this section, we show how to use Jupyter Notebook in a Docker environment as one way to run the notebook.

First, execute the following commands.

git clone https://github.com/vdaas/vald-demo.git
docker run -it -v $(pwd)/vald-demo:/home/jovyan/work -p 8888:8888 jupyter/datascience-notebook

If the commands succeed, you can open the notebook in Jupyter Notebook.

This notebook lets users experience the basic Vald interfaces, such as Insert, Search, Update, and Remove, using chiVe. You can also try applied uses such as word analogies through similarity search.
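
As a rough illustration of the word-analogy idea, the sketch below does the vector arithmetic locally in plain Python. The three-dimensional vectors and the `analogy` helper are made up for illustration and are not taken from the notebook; there, the vectors are 300-dimensional and the nearest-neighbor lookup is done through Vald's Search interface rather than the brute-force loop assumed here.

```python
import math

# Toy word vectors (hypothetical values, chosen only to keep the example readable).
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))


# "man" is to "king" as "woman" is to ...?
result = analogy("man", "king", "woman")
print(result)
```

With real word embeddings, the brute-force `max` over all candidates would be replaced by a similarity search for the query vector.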

Next, we introduce the basic Vald interfaces used in this notebook.

Insert

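Insert is the interface for registering vectors in Vald. Here, a 300-dimensional vector randomly generated by np.random.rand(300) is inserted into Vald. (NOTE: The imports shown here assume the vald-client-python package and cover every example in this post.)

code:

# imports (assumed: vald-client-python, grpcio, and numpy are installed)
import grpc
import numpy as np
from vald.v1.payload import payload_pb2
from vald.v1.vald import insert_pb2_grpc, remove_pb2_grpc, search_pb2_grpc, update_pb2_grpc

# create gRPC channel
channel = grpc.insecure_channel("localhost:8081")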
# create stub
istub = insert_pb2_grpc.InsertStub(channel)

# Insert
sample = np.random.rand(300)
ivec = payload_pb2.Object.Vector(id="test", vector=sample)
icfg = payload_pb2.Insert.Config(skip_strict_exist_check=True)
ireq = payload_pb2.Insert.Request(vector=ivec, config=icfg)

istub.Insert(ireq)

output:

name: "vald-agent-ngt-0"
uuid: "test"
ips: "127.0.0.1"
ips: "127.0.0.1"
ips: "127.0.0.1"
ips: "127.0.0.1"
ips: "127.0.0.1"

Search

Search is the interface for performing similarity search on the vectors indexed in Vald. In this example, we use a randomly generated 300-dimensional vector as the query. Since only one vector was indexed in the previous Insert step, one result is returned. The distance value varies depending on the random vectors and on the distance function in use, such as l2 (L2 norm) or cos (cosine distance).

By inserting another vector with an id other than test, you can also search across multiple vectors in Vald. (Vald does not allow inserting vectors with the same id.) Please try it!

code:

# create stub
sstub = search_pb2_grpc.SearchStub(channel)

# Search
svec = np.random.rand(300)
# timeout is given in nanoseconds (3000000000 ns = 3 s)
scfg = payload_pb2.Search.Config(num=10, radius=-1.0, epsilon=0.1, timeout=3000000000)
sreq = payload_pb2.Search.Request(vector=svec, config=scfg)

sstub.Search(sreq)

output:

results {
id: "test"
distance: 0.22659634053707123
}
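
To make the dependence on the distance function concrete, here is a small plain-Python sketch that computes both an L2 distance and a cosine distance for the same pair of vectors. The helper names are hypothetical and the formulas are the textbook definitions; how Vald's underlying index scales or normalizes its reported distances may differ, so treat the numbers as illustrative only.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


indexed = [1.0, 0.0]
query = [2.0, 0.0]  # same direction as `indexed`, twice the length

l2 = l2_distance(indexed, query)
cos = cosine_distance(indexed, query)
print(l2, cos)
```

The same pair of vectors is 1.0 apart under L2 but 0.0 apart under cosine distance, which is why the distance in a search result depends on the configured distance function.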

Update

Update is the interface for updating vectors that are already inserted in Vald. Here, we replace the vector whose id is test with another random vector.

code:

# create stub
ustub = update_pb2_grpc.UpdateStub(channel)

# Update
sample = np.random.rand(300)
uvec = payload_pb2.Object.Vector(id="test", vector=sample)
ucfg = payload_pb2.Update.Config(skip_strict_exist_check=True)
ureq = payload_pb2.Update.Request(vector=uvec, config=ucfg)

ustub.Update(ureq)

output:

name: "vald-agent-ngt-0"
uuid: "test"
ips: "127.0.0.1"
ips: "127.0.0.1"
ips: "127.0.0.1"
ips: "127.0.0.1"
ips: "127.0.0.1"

Remove

Remove is the interface for deleting vectors that are already inserted in Vald. In this example, the vector with id test is deleted from Vald.

code:

# create stub
rstub = remove_pb2_grpc.RemoveStub(channel)

# Remove
rid = payload_pb2.Object.ID(id="test")
rcfg = payload_pb2.Remove.Config(skip_strict_exist_check=True)
rreq = payload_pb2.Remove.Request(id=rid, config=rcfg)

rstub.Remove(rreq)

output:

name: "vald-agent-ngt-0"
uuid: "test"
ips: "127.0.0.1"
ips: "127.0.0.1"
ips: "127.0.0.1"
ips: "127.0.0.1"
ips: "127.0.0.1"
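
To recap how the four interfaces relate, the following is a minimal in-memory sketch in plain Python. `ToyIndex` is a hypothetical class written for illustration, not part of Vald or its client: it uses exact brute-force L2 search in a single process, whereas Vald performs approximate search in a distributed setup. It only mirrors the behavior described in this post, including the rejection of duplicate ids on insert.

```python
import math


class ToyIndex:
    """Hypothetical in-memory stand-in for Insert/Search/Update/Remove."""

    def __init__(self):
        self._vectors = {}

    def insert(self, vec_id, vector):
        # Mirror the rule that the same id cannot be inserted twice.
        if vec_id in self._vectors:
            raise KeyError(f"id already exists: {vec_id}")
        self._vectors[vec_id] = list(vector)

    def update(self, vec_id, vector):
        # Only an already inserted id can be updated.
        if vec_id not in self._vectors:
            raise KeyError(f"id not found: {vec_id}")
        self._vectors[vec_id] = list(vector)

    def remove(self, vec_id):
        del self._vectors[vec_id]  # raises KeyError if the id is unknown

    def search(self, query, num=10):
        """Return up to `num` (id, L2 distance) pairs, nearest first."""
        scored = [(i, math.dist(query, v)) for i, v in self._vectors.items()]
        return sorted(scored, key=lambda pair: pair[1])[:num]


index = ToyIndex()
index.insert("test", [0.0, 0.0])
index.insert("other", [3.0, 4.0])
before = index.search([0.0, 1.0], num=1)  # nearest is "test"
index.update("test", [10.0, 10.0])
after = index.search([0.0, 1.0], num=1)   # nearest is now "other"
index.remove("test")
remaining = index.search([0.0, 1.0])      # only "other" is left
print(before, after, remaining)
```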

Closing

We introduced the background and contents of the vald-demo repository. Thank you for your interest in this post and in Vald. If this post made you interested in approximate nearest neighbor search, please try running our notebook.

We hope to grow the community around Vald and approximate nearest neighbor search. Let’s work together to improve the community!

If you want to know more about Vald, please visit our website (vald.vdaas.org) or join our Slack.

See you again :)
