Veri — Probabilistically Scaling Vector Spaces

Berk Gökden
Berk Gökden’s adventures
May 23, 2018

--

Introducing Veri, a database for vector spaces. The idea had been in my head for a long time, but only now have I managed to finish the first working version. It is a statistically distributed database that allows k-nearest-neighbour (knn) queries on vector spaces for machine learning applications.

First of all, a necessary premise: knn is a simple but useful machine learning algorithm, a very old trick in the machine learning world, and I developed Veri to help with developing deep learning networks. I would like knn to be used as a way of searching rather than classification, mostly combined with word2vec or doc2vec. This particular application of Veri will, however, be discussed in more depth in another blog post.

Veri means data in Turkish.

I developed Veri because there is no good database dedicated to machine learning. Most of the available tools are not production ready: they are only useful for research purposes, require large amounts of memory, or are single-instance solutions. This makes it difficult to run even simple operations on a vector space.

Veri supports 3 main services:

  • Insert Data (Both Rest API and gRPC)
  • Search (knn) (Both Rest API and gRPC)
  • Get Sub Sample of Vector Space (not available in the first version; it will only be supported over gRPC, since it streams the result)

Each Veri instance in a cluster has the same responsibilities and services. It is totally decentralised: there is no master, no nameserver, no controller.

Each Veri instance holds a sub-sample of the vector space and keeps the average of its data and a histogram of the distances of the data to that center. Veri instances are responsible for finding and communicating with their siblings in order to keep their averages and histograms aligned and as close as possible, so that the data across all instances stays statistically identical.

What does statistically identical mean?

I wanted every Veri instance to hold data with a distribution similar to the others'. Calculating a full distribution is an expensive operation, so I decided to focus on four important stats: the average of the data (the center), a histogram of distances to that average, the number of entries and the maximum distance. Veri instances share their stats with each other in order to negotiate data exchange until they are statistically close enough.
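The four stats above can be sketched in a few lines. This is an illustration of the idea, not Veri's actual implementation; the function name and the fixed-width binning are my own choices:

```python
import math

def vector_space_stats(vectors, bins=4):
    """Compute the four summary stats a Veri instance keeps for its
    local sub-sample: the center (average vector), a histogram of
    distances to the center, the entry count and the maximum distance."""
    n = len(vectors)
    dim = len(vectors[0])
    # Center: component-wise average of all vectors.
    center = [sum(v[i] for v in vectors) / n for i in range(dim)]
    # Euclidean distance of every vector to the center.
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(v, center)))
             for v in vectors]
    max_dist = max(dists)
    # Fixed-width histogram of distances over [0, max_dist].
    width = max_dist / bins if max_dist > 0 else 1.0
    hist = [0] * bins
    for d in dists:
        hist[min(int(d / width), bins - 1)] += 1
    return {"center": center, "histogram": hist,
            "count": n, "max_distance": max_dist}
```

Two instances can then compare these small summaries instead of exchanging raw data, and only trade vectors when the summaries drift apart.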

To achieve what is described above, each Veri instance relies on an internal kd-tree. A kd-tree is a very simple but effective approach, so I used one internally.

A 3-dimensional k-d tree. The first split (the red vertical plane) cuts the root cell (white) into two subcells, each of which is then split (by the green horizontal planes) into two subcells. Finally, those four cells are split (by the four blue vertical planes) into two subcells. Since there is no more splitting, the final eight are called leaf cells.

“In computer science, a k-d tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbor searches). k-d trees are a special case of binary space partitioning trees.”

Source: https://en.wikipedia.org/wiki/K-d_tree
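To make the quote concrete, here is a minimal kd-tree with a nearest-neighbour search. It is a textbook sketch (dict-based nodes, recursion, no balancing on updates), not the tree Veri uses internally:

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree: at each level, split the points
    on one coordinate axis, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, depth=0, best=None):
    """Nearest-neighbour search with branch pruning."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    axis = depth % len(target)
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, depth + 1, best)
    # Only descend into the far subtree if the splitting plane is
    # closer than the current best distance.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, depth + 1, best)
    return best
```

The pruning step is what makes kd-trees fast in low dimensions, and also what degrades as dimensionality grows, which is relevant to the curse-of-dimensionality discussion later on.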

So what was wrong with other knn search applications, and why didn't I use a kd-tree directly?

There are three main reasons, which I will explain one by one.

  • Memory
  • High Availability
  • Tree (Index) Build Time

Memory:

If you have a large vector space with high dimensionality, it is hard to fit it in memory. We usually use 16 GB or 32 GB nodes in a typical Kubernetes cluster, but I recently used Gensim to build a doc2vec vector space and the kd-tree required a 104 GB instance. A machine with that much memory can be very expensive and, on top of that, since IO is limited, a production environment would need multiple such instances. This is obviously not a feasible solution.

Veri solves the memory issue by sharing the data between the instances while keeping the data distribution similar. This way each instance has a general view of the vector space and can respond to queries, although, of course, to get an accurate answer to a given query, all data should be merged. Each Veri instance tracks its memory state. If there is enough memory, it tries to replicate all the data it can. If memory is getting low, it keeps only as much data as preserves a statistically identical sample. If memory is scarce, the instance stops accepting data. When a new Veri instance joins the cluster, the other instances share data with the new node, freeing themselves to accept new entries. So Veri automatically optimises between memory and high availability on the fly.
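The memory policy above amounts to a small decision function. The following sketch is my reading of it; the thresholds and names are assumptions for illustration, not values from Veri:

```python
def insert_decision(mem_used_frac, keeps_sample_identical):
    """Sketch of the three memory regimes described above:
    plenty of memory  -> replicate everything it can,
    low memory        -> keep data only if the local sample stays
                         statistically identical,
    scarce memory     -> stop accepting data.
    Thresholds (0.5 and 0.9) are illustrative assumptions."""
    if mem_used_frac < 0.5:
        return "replicate"
    if mem_used_frac < 0.9:
        return "keep" if keeps_sample_identical else "drop"
    return "reject"
```

A new node joining the cluster effectively moves its peers back up this ladder, since handing off data lowers their memory usage.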

High Availability:

When employing a large knn solution, achieving high availability requires not only keeping multiple running instances, but also that all of them can reply to queries and that it is possible to load balance between them. For most applications, these requirements cannot coexist.

As mentioned earlier, Veri shares responsibility between the nodes. If one node goes down, another node probably still has a copy of its data. Just add a node back and data will be replicated to it. Hence, if you run enough nodes, you won't lose data.

Each Veri instance can answer queries, so you can load balance across all nodes without having to pick the correct one, since all nodes are equal.

Tree (Index) Build Time:

I have looked into other approximate knn solutions. They all share the same problem: they need time to build a tree or an index, and while the index or tree is being built, data insertion and removal cannot happen. Additionally, they all rely on a map of label to feature, while it is better to have a feature-to-label map.

Veri solves this by splitting the problem into smaller pieces.

Veri keeps a data map of feature to label alongside a kd-tree.

Insertions and deletions occur on the map, and the map is regularly synchronised with the kd-tree. There is a short lock while the new kd-tree is swapped with the old one. Since queries are distributed among nodes, this is not noticeable from outside. You can keep inserting data while querying, and you can also add or remove nodes; the quality of the responses can drop during a sync, but they will not stop.
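The map-plus-swapped-index scheme can be sketched as follows. This is a toy version under my own naming, with a sorted list standing in for the kd-tree and a brute-force search standing in for the tree query:

```python
import threading

class SyncedIndex:
    """Writes go to a plain feature -> label dict; sync() rebuilds a
    fresh index outside the lock and swaps it in under a short lock,
    so queries are only briefly blocked."""
    def __init__(self):
        self.data = {}            # feature (tuple) -> label
        self.index = []           # stand-in for the kd-tree
        self.lock = threading.Lock()

    def insert(self, feature, label):
        self.data[tuple(feature)] = label   # no index rebuild here

    def sync(self):
        # Build the new "tree" outside the lock...
        new_index = sorted(self.data)       # placeholder for a kd-tree build
        # ...then swap it in; only this swap is locked.
        with self.lock:
            self.index = new_index

    def query(self, feature):
        with self.lock:
            index = self.index              # grab a consistent snapshot
        if not index:
            return None
        # Brute-force nearest neighbour over the snapshot.
        best = min(index, key=lambda p: sum((a - b) ** 2
                                            for a, b in zip(p, feature)))
        return self.data[best]
```

Note that an insert only becomes visible to queries after the next sync(), which matches the "quality can be low during the sync" behaviour described above.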

Distributed knn Query

This is the main feature of Veri: a distributed knn query across the cluster.

As explained in the previous sections, Veri uses a process very similar to map-reduce, performing on-the-fly queries to its neighbours and merging the resulting data.

When a knn query is started, Veri creates a unique id and starts a timer, then does a local kd-tree search and asks its peers to do the same with a smaller timeout. When the peers respond, the requesting node merges the results into a map, waits for the timeout, and finally runs a refining step on the result map before returning.
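The scatter-gather flow above can be sketched like this. It is a single-process toy under assumptions of mine: each "peer" is just a list of points queried in a thread, standing in for a real gRPC call, and brute force stands in for the local kd-tree search:

```python
import concurrent.futures
import heapq
import uuid

def local_knn(points, query, k):
    """Brute-force stand-in for each node's local kd-tree search."""
    return heapq.nsmallest(
        k, points, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))

def distributed_knn(peers, query, k, timeout=1.0):
    """Give the query an id, run the local search on every peer in
    parallel, merge whatever arrives before the timeout, then refine
    the merged candidates down to the best k."""
    query_id = uuid.uuid4().hex        # would be used to reject query loops
    merged = set()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(local_knn, pts, query, k) for pts in peers]
        done, _ = concurrent.futures.wait(futures, timeout=timeout)
        for f in done:                 # late responders are simply ignored
            merged.update(f.result())
    # Refining step: best k over all merged candidates.
    return local_knn(list(merged), query, k)
```

Because only responses that arrive before the timeout are merged, the same timeout knob controls both latency and precision, which is the trade-off discussed next.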

If a search with the same id is received, the query is rejected to avoid infinite recursion. In future versions, this behaviour will be replaced with cached results and timeout checking.

As mentioned, every knn query has a timeout, and this timeout also defines the precision. That means a user can trade precision for time. In production, for example, users usually want a predictable response time, and it then makes sense to sacrifice some precision for it. Veri can achieve this because every instance keeps a statistically identical data sample, which yields the same results in most classification cases.

Veri applies speculative execution to every query by default. It is not possible to know which node will respond best, so instead of calculating the best one, Veri asks all known peers and merges the results of the first responders. Most of the time the results are the same or similar, so the responses received before the timeout are usually good enough. Other approaches are possible, like clustering the data on each node and relying on sharding, but this approach has more advantages. It also efficiently handles the scenario in which most users only query a specific subset of the data. This means the data-skewness issues that are commonplace with Apache Spark or Hadoop are solved by default with Veri.

So what about the curse of dimensionality?

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings

https://en.wikipedia.org/wiki/Curse_of_dimensionality

It is still a problem, but Veri makes it easier to handle by splitting the problem into smaller pieces. It may be a good idea to add a dimension-reduction feature, or to search using only certain dimensions.
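As an illustration of what such a dimension-reduction feature could look like (this is my sketch, not something Veri ships), a random projection in the spirit of the Johnson-Lindenstrauss lemma approximately preserves pairwise distances while shrinking the dimensionality:

```python
import math
import random

def random_projection(vectors, out_dim, seed=0):
    """Project vectors onto out_dim random Gaussian directions,
    scaled so norms are preserved in expectation."""
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    scale = 1.0 / math.sqrt(out_dim)
    proj = [[rng.gauss(0, 1) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]
    return [[sum(row[i] * v[i] for i in range(in_dim)) for row in proj]
            for v in vectors]
```

Running knn on the projected vectors would keep queries approximate but much cheaper in high dimensions.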

So is it all rainbows and butterflies?

No, unfortunately it is not complete yet, and there are some problems:

There is no guarantee that the responses are the best possible ones. Since everything depends on being probabilistically right, results might change over time.

Overall it uses more memory than a single instance would, so if you don't run queries regularly it may be more expensive in some cases.

There's no backup system. Veri instead relies on data being refreshed regularly, since vector spaces usually change over time. This may feel illogical to many DevOps engineers.

This is all for now. In the next blog post I will present some practical examples to showcase Veri’s capabilities.

In the meantime, feel free to take a look at the GitHub repository below.

Till the next time!

Here is the link for the github repository:
