Release Announcement: Vald v1.5.0

vald.vdaas.org
4 min read · Apr 5, 2022

Vald v1.5.0, a minor release, has been available since last week. The main change is a new feature, CoW (Copy on Write), which makes index backups to a Persistent Volume (PV) safer.

This post describes the background and mechanism of CoW.

For details of the other changes, please refer to the CHANGELOG.

What is CoW?

CoW is a safe-backup function that uses a PV to prevent data loss. If a vald-agent-ngt pod fails during the backup phase, the backup file may be corrupted because Kubernetes terminates the pod in the middle of the write. Restoring the index after the pod restarts is then difficult, because the restore function cannot work correctly with a corrupted backup. CoW prevents this issue and keeps the backup process safe.

CoW mechanism

CoW uses a simple mechanism built on three directories: tmp, index path, and old index path.

The backup process with CoW consists of the following steps.

Step1:

Vald Agent writes the indexes to the tmp directory when the SaveIndex process starts.

CoW step1: create new backup file when SaveIndex process starts

Step2:

Vald Agent moves the backup data in the index path directory, which is the current backup data, to the old index path. Any data already in the old index path is overwritten, so only one previous generation is kept; multi-generation management is not performed.

CoW step2: move current backup file to old index path directory

Step3:

Vald Agent moves the data in the tmp directory to the index path directory and creates a new tmp directory for the next backup.

CoW step3: move new backup file to index path directory and create new tmp dir for next SaveIndex

If a pod failure occurs during this process, the backup data is rolled back, and Vald Agent restores the indexes from the old backup data.
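The three steps can be sketched in a few lines of Python. This is only an illustration of the directory rotation described above, not Vald's actual implementation (Vald Agent is written in Go). The directory names, the function names `save_index_cow`, `pick_restore_dir`, and `payload_writer`, the cleanup of a leftover tmp directory, and the rule "restore from index path if it has data, otherwise from old index path" are all assumptions made for this sketch.

```python
import os
import shutil
import tempfile


def save_index_cow(base_dir, write_index):
    """Hypothetical sketch of one CoW backup cycle under base_dir."""
    tmp_dir = os.path.join(base_dir, "tmp")
    index_dir = os.path.join(base_dir, "index")
    old_dir = os.path.join(base_dir, "old")

    # Assumption: leftovers from an interrupted save are discarded first.
    shutil.rmtree(tmp_dir, ignore_errors=True)
    os.makedirs(tmp_dir)

    # Step 1: write the new backup into tmp only; the current backup is untouched.
    write_index(tmp_dir)

    # Step 2: move the current backup to the old index path,
    # overwriting the single previous generation.
    if os.path.isdir(index_dir):
        shutil.rmtree(old_dir, ignore_errors=True)
        os.rename(index_dir, old_dir)

    # Step 3: promote tmp to the index path and create a fresh tmp.
    os.rename(tmp_dir, index_dir)
    os.makedirs(tmp_dir)


def pick_restore_dir(base_dir):
    """Assumed restore rule: prefer the index path, fall back to the old one."""
    for name in ("index", "old"):
        path = os.path.join(base_dir, name)
        if os.path.isdir(path) and os.listdir(path):
            return path
    return None


def payload_writer(payload):
    """Build a stand-in for the real index writer (illustration only)."""
    def write(directory):
        with open(os.path.join(directory, "index.bin"), "w") as f:
            f.write(payload)
    return write


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    save_index_cow(base, payload_writer("generation-1"))
    save_index_cow(base, payload_writer("generation-2"))
    print(sorted(os.listdir(base)))  # ['index', 'old', 'tmp']
```

Because the new data is written only to tmp in step 1, a failure at that point leaves the current backup intact, and a failure between steps 2 and 3 still leaves a complete copy in the old index path.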

Cautionary point

As you can see, CoW keeps the old backup data, which means each PV-mounted Vald Agent pod requires more than double the storage capacity compared to running without CoW.

How to turn on CoW
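A rough way to sanity-check a PV size against this requirement is sketched below. The helper names `parse_binary_quantity` and `fits_with_cow` are hypothetical, the parser handles only the binary suffixes Ki, Mi, Gi, and Ti, and the default of three copies is an assumption of this sketch (one full copy each in tmp, index path, and old index path while a save is in progress), not a figure published by Vald.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_binary_quantity(quantity):
    """Parse a Kubernetes-style binary quantity such as "5Gi" into bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: treat as a plain byte count


def fits_with_cow(pv_size, index_bytes, copies=3):
    """Return True if `copies` full copies of the index fit in the PV.

    copies=3 is an assumption for the worst case during a save.
    """
    return index_bytes * copies <= parse_binary_quantity(pv_size)


if __name__ == "__main__":
    one_gib = 2**30
    print(fits_with_cow("5Gi", 1 * one_gib))  # True: 3 GiB needed, 5 GiB available
    print(fits_with_cow("5Gi", 2 * one_gib))  # False: 6 GiB needed, 5 GiB available
```

Measure the real size of your saved index before relying on an estimate like this.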

You can deploy a Vald cluster using Helm or vald-helm-operator.

An example values YAML that enables CoW is shown below.

defaults:
  image:
    tag: v1.5.0

gateway:
  lb:
    resources:
      requests:
        cpu: 100m
        memory: 50Mi
    gateway_config:
      # a number of index replicas.
      index_replica: 2

agent:
  minReplicas: 6
  maxReplicas: 6
  podManagementPolicy: Parallel
  resources:
    requests:
      cpu: 100m
      memory: 50Mi
  # We recommend setting this value long enough to ensure the backup speed of PV since the Index is backed up at the end of the pod.
  terminationGracePeriodSeconds: 600
  # This is the persistent volume setting.
  # Please change it according to your environment.
  persistentVolume:
    enabled: true
    accessMode: ReadWriteOnce
    storageClass: local-path
    size: 5Gi
  ngt:
    dimension: 784
    index_path: "/var/ngt/index"
    enable_in_memory_mode: false
    # limit duration of automatic indexing.
    auto_index_duration_limit: 730h
    # check duration of automatic indexing.
    auto_index_check_duration: 24h
    # number of cache to trigger automatic indexing.
    auto_index_length: 1000
    # duration of automatic save index.
    auto_save_index_duration: 365h
    # batch process pool size of automatic create index operation.
    auto_create_index_pool_size: 1000
    # the flag of using CoW or not.
    enable_copy_on_write: true

discoverer:
  resources:
    requests:
      cpu: 100m
      memory: 50Mi

manager:
  index:
    resources:
      requests:
        cpu: 100m
        memory: 30Mi
    indexer:
      # concurrency for indexing operation.
      concurrency: 1
      # limit duration of automatic indexing.
      auto_index_duration_limit: 10m
      # check duration of automatic indexing.
      auto_index_check_duration: 1m
      # limit duration of automatic index saving.
      auto_save_index_duration_limit: 1h
      # duration of automatic index saving wait duration for next saving.
      auto_save_index_wait_duration: 10m
      # number of cache to trigger automatic indexing.
      auto_index_length: 100
      # number of pool size of create index processing.
      creation_pool_size: 10000

The important points are:

  • Set the agent.persistentVolume config, because CoW requires a PV
  • Set agent.ngt.enable_in_memory_mode to false to use the PV for backup
  • Set agent.ngt.enable_copy_on_write to true

Once this sample YAML is applied successfully, CoW is available on your Vald cluster.

For deployment instructions, please refer to Get Started.

vald.vdaas.org

A highly scalable distributed fast approximate nearest neighbor dense vector search engine.