Vector Database Basics¶

Vector databases help us store, manage, and query the embeddings we created for generative AI, recommenders, and search engines.

Across many of the common use cases, users often find that they need to manage more than just vectors. To make it easier for practitioners, vector databases should store and manage all of the data they need:

embedding vectors
categorical metadata
numerical metadata
timeseries metadata
text / pdf / images / video / point clouds

And support a wide range of query workloads:

Vector search (may require ANN-index)
Keyword search (requires full text search index)
SQL (for filtering)

For this exercise we'll use LanceDB since it's open source and easy to setup

In [ ]:

  Copied!     
 
# pip install -U --quiet lancedb pandas pydantic
# pip install -U --quiet lancedb pandas pydantic

Creating tables and adding data¶

Let's create a LanceDB table called cats_and_dogs under the local database directory ~/.lancedb. This table should have 4 fields:

the embedding vector
a string field indicating the species (either "cat" or "dog")
the breed
average weight in pounds

We're going to use pydantic to make this easier. First let's create a pydantic model with those fields

In [ ]:

  Copied!     
 
from lancedb.pydantic import vector, LanceModel

class CatsAndDogs(LanceModel):
    vector: vector(2)
    species: str
    breed: str
    weight: float
from lancedb.pydantic import vector, LanceModel class CatsAndDogs(LanceModel): vector: vector(2) species: str breed: str weight: float

Now connect to a local db at ~/.lancedb and create an empty LanceDB table called "cats_and_dogs"

In [7]:

  Copied!     
 
import lancedb

uri = "data/.lancedb/"
db = lancedb.connect(uri)
import lancedb uri = "data/.lancedb/" db = lancedb.connect(uri)

In [8]:

  Copied!     
 
tbl = db.create_table("cats_and_dogs", schema=CatsAndDogs, exist_ok=True)
tbl = db.create_table("cats_and_dogs", schema=CatsAndDogs, exist_ok=True)

Let's add some data

First some cats

In [ ]:

  Copied!     
 
data = [
    CatsAndDogs(
        vector=[1., 0.],
        species="cat",
        breed="shorthair",
        weight=12.,
    ),
    CatsAndDogs(
        vector=[-1., 0.],
        species="cat",
        breed="himalayan",
        weight=9.5,
    ),
]
data = [ CatsAndDogs( vector=[1., 0.], species="cat", breed="shorthair", weight=12., ), CatsAndDogs( vector=[-1., 0.], species="cat", breed="himalayan", weight=9.5, ), ]

Now call the LanceTable.add API to insert these two records into the table

In [ ]:

  Copied!     
 
tbl.add(data)
tbl.add(data)

Let's preview the data

In [9]:

  Copied!     
 
tbl.head().to_pandas()
tbl.head().to_pandas()

Out[9]:

	vector	species	breed	weight
0	[1.0, 0.0]	cat	shorthair	12.0
1	[-1.0, 0.0]	cat	himalayan	9.5
2	[0.0, 10.0]	dog	samoyed	47.5
3	[0.0, -1.0]	dog	corgi	26.0

Now let's add some dogs

In [ ]:

  Copied!     
 
data = [
    CatsAndDogs(
        vector=[0., 10.],
        species="dog",
        breed="samoyed",
        weight=47.5,
    ),
    CatsAndDogs(
        vector=[0, -1.],
        species="dog",
        breed="corgi",
        weight=26.,
    )
]
data = [ CatsAndDogs( vector=[0., 10.], species="dog", breed="samoyed", weight=47.5, ), CatsAndDogs( vector=[0, -1.], species="dog", breed="corgi", weight=26., ) ]

In [ ]:

  Copied!     
 
tbl.add(data)
tbl.add(data)

In [10]:

  Copied!     
 
tbl.head().to_pandas()
tbl.head().to_pandas()

Out[10]:

	vector	species	breed	weight
0	[1.0, 0.0]	cat	shorthair	12.0
1	[-1.0, 0.0]	cat	himalayan	9.5
2	[0.0, 10.0]	dog	samoyed	47.5
3	[0.0, -1.0]	dog	corgi	26.0

Querying tables¶

Vector databases allow us to retrieve data for generative AI applications. Let's see how that's done.

Let's say we have a new animal that has embedding [10.5, 10.], what would you expect the most similar animal will be? Can you use the table we created above to answer the question?

HINT you'll need to use the search API for LanceTable and limit / to_df APIs. For examples you can refer to LanceDB documentation.

In [11]:

  Copied!     
 
tbl.search([10.5, 10.]).limit(1).to_pandas()
tbl.search([10.5, 10.]).limit(1).to_pandas()

Out[11]:

	vector	species	breed	weight	_distance
0	[0.0, 10.0]	dog	samoyed	47.5	110.25

Now what if we use cosine distance instead? Would you expect that we get the same answer? Why or why not?

HINT you can add a call to metric in the call chain

In [12]:

  Copied!     
 
tbl.search([10.5, 10.0]).metric("cosine").limit(1).to_pandas()
tbl.search([10.5, 10.0]).metric("cosine").limit(1).to_pandas()

Out[12]:

	vector	species	breed	weight	_distance
0	[1.0, 0.0]	cat	shorthair	12.0	0.275862

Filtering tables¶

In practice, we often need to specify more than just a search vector for good quality retrieval. Oftentimes we need to filter the metadata as well.

Please write code to retrieve two most similar examples to the embedding [10.5, 10.] but only show the results that is a cat.

In [13]:

  Copied!     
 
tbl.search([10.5, 10.0]).where("species = 'cat'", prefilter=True).limit(2).to_pandas()
tbl.search([10.5, 10.0]).where("species = 'cat'", prefilter=True).limit(2).to_pandas()

Out[13]:

	vector	species	breed	weight	_distance
0	[1.0, 0.0]	cat	shorthair	12.0	190.25
1	[-1.0, 0.0]	cat	himalayan	9.5	232.25

Creating ANN indices¶

For larger tables (e.g., >1M rows), searching through all of the vectors becomes quite slow. Here is where the Approximate Nearest Neighbor (ANN) index comes into play. While there are many different ANN indexing algorithms, they all have the same purpose - to drastically limit the search space as much as possible while losing as little accuracy as possible

For this problem we will create an ANN index on a LanceDB table and see how that impacts performance

First let's create some data¶

Given the constraints of the classroom workspace, we'll complete this exercise by creating 100,000 vectors with 16D in a new table. Here the embedding values don't matter, so we simply generate random embeddings as a 2D numpy array. We then use the vec_to_table function to convert that in to an Arrow table, which can then be added to the table.

In [14]:

  Copied!     
 
from lance.vector import vec_to_table
import numpy as np

mat = np.random.randn(100_000, 16)
table_name = "exercise3_ann"
db.drop_table(table_name, ignore_missing=True)
table = db.create_table(table_name, vec_to_table(mat))
from lance.vector import vec_to_table import numpy as np mat = np.random.randn(100_000, 16) table_name = "exercise3_ann" db.drop_table(table_name, ignore_missing=True) table = db.create_table(table_name, vec_to_table(mat))

Let's establish a baseline without an index¶

Before we create the index, let's make sure know what we need to compare against.

We'll generate a random query vector and record it's value in the query variable so we can use the same query vector with and without the ANN index.

In [15]:

  Copied!     
 
query = np.random.randn(16)
table.search(query).limit(10).to_df()
query = np.random.randn(16) table.search(query).limit(10).to_df()

/var/folders/gb/wf3d1d_d2bz689q8kdbw02880000gn/T/ipykernel_1489/229240492.py:2: UnsupportedWarning: to_df is unsupported as of 0.4.0. Use to_pandas() instead
  table.search(query).limit(10).to_df()

Out[15]:

	vector	_distance
0	[-1.2724329, 0.084512636, 0.28315237, 1.621330...	3.393116
1	[-1.0316224, -0.23577532, 0.39665398, 1.171307...	3.831863
2	[-0.49312374, -0.50828266, 0.046579137, 0.1497...	3.904232
3	[-1.0809894, -0.5009589, 0.6601867, 0.07009579...	4.459040
4	[-0.7098237, 0.50592774, 0.5868473, 1.3845882,...	4.534908
5	[-0.75004256, -0.7544854, 0.26175323, 1.517701...	4.555618
6	[-0.74581164, 0.6364289, 0.6266087, 0.8722173,...	4.572597
7	[-0.8690996, -0.51964694, 0.9082809, 0.2187988...	4.628890
8	[-1.6587613, 0.36453688, 0.14851123, -0.566129...	4.668116
9	[-1.5693077, -0.026916353, -0.25843197, 0.4409...	4.731431

Please write code to compute the average latency of this query

In [16]:

  Copied!     
 
import functools
import time


def timer(func):
    """Print the runtime of the decorated function"""

    @functools.wraps(func)
    def wrapper_timer(*args, **kwargs):
        start_time = time.perf_counter()
        value = func(*args, **kwargs)
        end_time = time.perf_counter()
        run_time = end_time - start_time
        print(f"Finished {func.__name__}() in {run_time:.4f} secs")
        return value

    return wrapper_timer


@timer
def get_top_n(table, query, n=10):
    return table.search(query).limit(n).to_df()
import functools import time def timer(func): """Print the runtime of the decorated function""" @functools.wraps(func) def wrapper_timer(*args, **kwargs): start_time = time.perf_counter() value = func(*args, **kwargs) end_time = time.perf_counter() run_time = end_time - start_time print(f"Finished {func.__name__}() in {run_time:.4f} secs") return value return wrapper_timer @timer def get_top_n(table, query, n=10): return table.search(query).limit(n).to_df()

In [17]:

  Copied!     
 
get_top_n(table, query)
get_top_n(table, query)

Finished get_top_n() in 0.0149 secs

/var/folders/gb/wf3d1d_d2bz689q8kdbw02880000gn/T/ipykernel_1489/3894619790.py:22: UnsupportedWarning: to_df is unsupported as of 0.4.0. Use to_pandas() instead
  return table.search(query).limit(n).to_df()

Out[17]:

	vector	_distance
0	[-1.2724329, 0.084512636, 0.28315237, 1.621330...	3.393116
1	[-1.0316224, -0.23577532, 0.39665398, 1.171307...	3.831863
2	[-0.49312374, -0.50828266, 0.046579137, 0.1497...	3.904232
3	[-1.0809894, -0.5009589, 0.6601867, 0.07009579...	4.459040
4	[-0.7098237, 0.50592774, 0.5868473, 1.3845882,...	4.534908
5	[-0.75004256, -0.7544854, 0.26175323, 1.517701...	4.555618
6	[-0.74581164, 0.6364289, 0.6266087, 0.8722173,...	4.572597
7	[-0.8690996, -0.51964694, 0.9082809, 0.2187988...	4.628890
8	[-1.6587613, 0.36453688, 0.14851123, -0.566129...	4.668116
9	[-1.5693077, -0.026916353, -0.25843197, 0.4409...	4.731431

Now let's create an index¶

There are many possible index types ranging from hash based to tree based to partition based to graph based. For this task, we'll create an IVFPQ index (partition-based index with product quantization compression) using LanceDB.

Please create an IVFPQ index on the LanceDB table such that each partition is 4000 rows and each PQ subvector is 8D.

HINT

Total vectors / number of partitions = number of vectors in each partition
Total dimensions / number of subvectors = number of dimensions in each subvector
This step can take about 7-10 minutes to process and execute in the classroom workspace.

In [18]:

  Copied!     
 
table.create_index(metric="L2", num_partitions=4000, num_sub_vectors=8)
table.create_index(metric="L2", num_partitions=4000, num_sub_vectors=8)

Now let's search through the data again. Notice how the answers now appear different. This is because an ANN index is always a tradeoff between latency and accuracy.

In [19]:

  Copied!     
 
table.search(query).limit(10).to_df()
table.search(query).limit(10).to_df()

/var/folders/gb/wf3d1d_d2bz689q8kdbw02880000gn/T/ipykernel_1489/4174343316.py:1: UnsupportedWarning: to_df is unsupported as of 0.4.0. Use to_pandas() instead
  table.search(query).limit(10).to_df()

Out[19]:

	vector	_distance
0	[-0.74581164, 0.6364289, 0.6266087, 0.8722173,...	4.730085
1	[-1.0502441, 0.2866477, 0.89400625, 0.0577777,...	4.771339
2	[-1.2426008, 0.29611367, 0.50209624, 0.1570571...	4.774913
3	[-1.5693077, -0.026916353, -0.25843197, 0.4409...	4.787258
4	[-0.8690996, -0.51964694, 0.9082809, 0.2187988...	4.876680
5	[-0.8164513, 0.071738884, 0.37751395, 1.258959...	4.885183
6	[-1.6587613, 0.36453688, 0.14851123, -0.566129...	4.906681
7	[-0.50475115, -0.763558, -0.5470999, 0.0168607...	4.952706
8	[-0.85710555, 0.97174263, 0.027284576, 0.84639...	5.101480
9	[-1.3415798, 0.61720556, -0.44235563, 0.281207...	5.250690

Now write code to compute the average latency for querying the same table using the ANN index.

SOLUTION The index is implementation detail, so it should just be running the same code as above. You should see almost an order of magnitude speed-up. On larger datasets, this performance difference should be even more pronounced.

In [20]:

  Copied!     
 
get_top_n(table, query)
get_top_n(table, query)

Finished get_top_n() in 0.0106 secs

/var/folders/gb/wf3d1d_d2bz689q8kdbw02880000gn/T/ipykernel_1489/3894619790.py:22: UnsupportedWarning: to_df is unsupported as of 0.4.0. Use to_pandas() instead
  return table.search(query).limit(n).to_df()

Out[20]:

	vector	_distance
0	[-0.74581164, 0.6364289, 0.6266087, 0.8722173,...	4.730085
1	[-1.0502441, 0.2866477, 0.89400625, 0.0577777,...	4.771339
2	[-1.2426008, 0.29611367, 0.50209624, 0.1570571...	4.774913
3	[-1.5693077, -0.026916353, -0.25843197, 0.4409...	4.787258
4	[-0.8690996, -0.51964694, 0.9082809, 0.2187988...	4.876680
5	[-0.8164513, 0.071738884, 0.37751395, 1.258959...	4.885183
6	[-1.6587613, 0.36453688, 0.14851123, -0.566129...	4.906681
7	[-0.50475115, -0.763558, -0.5470999, 0.0168607...	4.952706
8	[-0.85710555, 0.97174263, 0.027284576, 0.84639...	5.101480
9	[-1.3415798, 0.61720556, -0.44235563, 0.281207...	5.250690

Deleting rows¶

Like with other kinds of databases, you should be able to remove rows from the table. Let's go back to our tables of cats and dogs

In [21]:

  Copied!     
 
table = db["cats_and_dogs"]
table = db["cats_and_dogs"]

In [22]:

  Copied!     
 
len(table)
len(table)

Out[22]:

Can you use the delete API to remove all of the cats from the table?

HINT use a SQL like filter string to specify which rows to delete from the table

In [23]:

  Copied!     
 
table.delete("species = 'cat'")
table.delete("species = 'cat'")

In [24]:

  Copied!     
 
len(table)
len(table)

Out[24]:

What if I messed up?¶

Errors is a common occurrence in AI. What's hard about errors in vector search is that oftentimes a bad vector doesn't cause a crash but just creates non-sensical answers. So to be able to rollback the state of the database is very important for debugging and reproducibility

So far we've accumulated 4 actions on the table:

creation of the table
added cats
added dogs
deleted cats

What if you realized that you should have deleted the dogs instead of the cats?

Here we can see the 4 versions that correspond to the 4 actions we've done

In [25]:

  Copied!     
 
table.list_versions()
table.list_versions()

Out[25]:

[{'version': 1,
  'timestamp': datetime.datetime(2024, 3, 1, 10, 13, 40, 146383),
  'metadata': {}},
 {'version': 2,
  'timestamp': datetime.datetime(2024, 3, 1, 10, 14, 58, 727480),
  'metadata': {}},
 {'version': 3,
  'timestamp': datetime.datetime(2024, 3, 1, 10, 15, 48, 798191),
  'metadata': {}},
 {'version': 4,
  'timestamp': datetime.datetime(2024, 3, 1, 11, 0, 33, 856836),
  'metadata': {}}]

Please write code to restore the version still containing the whole dataset

In [26]:

  Copied!     
 
table = db["cats_and_dogs"]
table = db["cats_and_dogs"]

In [27]:

  Copied!     
 
len(table)
len(table)

Out[27]:

In [28]:

  Copied!     
 
# restore to version 3
table.restore(3)
# restore to version 3 table.restore(3)

In [29]:

  Copied!     
 
# delete the dogs instead
table.delete("species = 'dog'")
# delete the dogs instead table.delete("species = 'dog'")

In [30]:

  Copied!     
 
table.list_versions()
table.list_versions()

Out[30]:

[{'version': 1,
  'timestamp': datetime.datetime(2024, 3, 1, 10, 13, 40, 146383),
  'metadata': {}},
 {'version': 2,
  'timestamp': datetime.datetime(2024, 3, 1, 10, 14, 58, 727480),
  'metadata': {}},
 {'version': 3,
  'timestamp': datetime.datetime(2024, 3, 1, 10, 15, 48, 798191),
  'metadata': {}},
 {'version': 4,
  'timestamp': datetime.datetime(2024, 3, 1, 11, 0, 33, 856836),
  'metadata': {}},
 {'version': 5,
  'timestamp': datetime.datetime(2024, 3, 1, 11, 3, 7, 390330),
  'metadata': {}},
 {'version': 6,
  'timestamp': datetime.datetime(2024, 3, 1, 11, 3, 23, 457686),
  'metadata': {}}]

In [31]:

  Copied!     
 
table.to_pandas()
table.to_pandas()

Out[31]:

	vector	species	breed	weight
0	[1.0, 0.0]	cat	shorthair	12.0
1	[-1.0, 0.0]	cat	himalayan	9.5

Dropping a table¶

You can also choose to drop a table, which also completely removes the data. Note that this operation is not reversible.

In [32]:

  Copied!     
 
"cats_and_dogs" in db
"cats_and_dogs" in db

Out[32]:

True

Write code to irrevocably remove the table "cats_and_dogs" from the database

In [33]:

  Copied!     
 
db.drop_table("cats_and_dogs")
db.drop_table("cats_and_dogs")

How would you verify that the table has indeed been deleted?

In [34]:

  Copied!     
 
table.name in db
table.name in db

Out[34]:

False

Summary¶

Congrats, in this exercise you've learned the basic operations of vector databases from creating tables, to adding data, and to querying the data. You've learned how to create indices and you saw first hand how it changes the performance and the accuracy. Lastly, you've learned how to debug and rollback when errors happen.

Last update: 2024-10-23
Created: 2024-10-23