Vector Database BasicsĀ¶
Vector databases help us store, manage, and query the embeddings we created for generative AI, recommenders, and search engines.
Across many of the common use cases, users often find that they need to manage more than just vectors. To make it easier for practitioners, vector databases should store and manage all of the data they need:
- embedding vectors
- categorical metadata
- numerical metadata
- timeseries metadata
- text / pdf / images / video / point clouds
And support a wide range of query workloads:
- Vector search (may require ANN-index)
- Keyword search (requires full text search index)
- SQL (for filtering)
For this exercise we'll use LanceDB since it's open source and easy to setup
# pip install -U --quiet lancedb pandas pydantic
Creating tables and adding dataĀ¶
Let's create a LanceDB table called cats_and_dogs
under the local database directory ~/.lancedb
. This table should have 4 fields:
- the embedding vector
- a string field indicating the species (either "cat" or "dog")
- the breed
- average weight in pounds
We're going to use pydantic to make this easier. First let's create a pydantic model with those fields
from lancedb.pydantic import vector, LanceModel
class CatsAndDogs(LanceModel):
vector: vector(2)
species: str
breed: str
weight: float
Now connect to a local db at ~/.lancedb and create an empty LanceDB table called "cats_and_dogs"
import lancedb
uri = "data/.lancedb/"
db = lancedb.connect(uri)
tbl = db.create_table("cats_and_dogs", schema=CatsAndDogs, exist_ok=True)
Let's add some data
First some cats
data = [
CatsAndDogs(
vector=[1., 0.],
species="cat",
breed="shorthair",
weight=12.,
),
CatsAndDogs(
vector=[-1., 0.],
species="cat",
breed="himalayan",
weight=9.5,
),
]
Now call the LanceTable.add
API to insert these two records into the table
tbl.add(data)
Let's preview the data
tbl.head().to_pandas()
vector | species | breed | weight | |
---|---|---|---|---|
0 | [1.0, 0.0] | cat | shorthair | 12.0 |
1 | [-1.0, 0.0] | cat | himalayan | 9.5 |
2 | [0.0, 10.0] | dog | samoyed | 47.5 |
3 | [0.0, -1.0] | dog | corgi | 26.0 |
Now let's add some dogs
data = [
CatsAndDogs(
vector=[0., 10.],
species="dog",
breed="samoyed",
weight=47.5,
),
CatsAndDogs(
vector=[0, -1.],
species="dog",
breed="corgi",
weight=26.,
)
]
tbl.add(data)
tbl.head().to_pandas()
vector | species | breed | weight | |
---|---|---|---|---|
0 | [1.0, 0.0] | cat | shorthair | 12.0 |
1 | [-1.0, 0.0] | cat | himalayan | 9.5 |
2 | [0.0, 10.0] | dog | samoyed | 47.5 |
3 | [0.0, -1.0] | dog | corgi | 26.0 |
Querying tablesĀ¶
Vector databases allow us to retrieve data for generative AI applications. Let's see how that's done.
Let's say we have a new animal that has embedding [10.5, 10.], what would you expect the most similar animal will be? Can you use the table we created above to answer the question?
HINT you'll need to use the search
API for LanceTable and limit
/ to_df
APIs. For examples you can refer to LanceDB documentation.
tbl.search([10.5, 10.]).limit(1).to_pandas()
vector | species | breed | weight | _distance | |
---|---|---|---|---|---|
0 | [0.0, 10.0] | dog | samoyed | 47.5 | 110.25 |
Now what if we use cosine distance instead? Would you expect that we get the same answer? Why or why not?
HINT you can add a call to metric
in the call chain
tbl.search([10.5, 10.0]).metric("cosine").limit(1).to_pandas()
vector | species | breed | weight | _distance | |
---|---|---|---|---|---|
0 | [1.0, 0.0] | cat | shorthair | 12.0 | 0.275862 |
Filtering tablesĀ¶
In practice, we often need to specify more than just a search vector for good quality retrieval. Oftentimes we need to filter the metadata as well.
Please write code to retrieve two most similar examples to the embedding [10.5, 10.] but only show the results that is a cat.
tbl.search([10.5, 10.0]).where("species = 'cat'", prefilter=True).limit(2).to_pandas()
vector | species | breed | weight | _distance | |
---|---|---|---|---|---|
0 | [1.0, 0.0] | cat | shorthair | 12.0 | 190.25 |
1 | [-1.0, 0.0] | cat | himalayan | 9.5 | 232.25 |
Creating ANN indicesĀ¶
For larger tables (e.g., >1M rows), searching through all of the vectors becomes quite slow. Here is where the Approximate Nearest Neighbor (ANN) index comes into play. While there are many different ANN indexing algorithms, they all have the same purpose - to drastically limit the search space as much as possible while losing as little accuracy as possible
For this problem we will create an ANN index on a LanceDB table and see how that impacts performance
First let's create some dataĀ¶
Given the constraints of the classroom workspace, we'll complete this exercise by creating 100,000 vectors with 16D in a new table. Here the embedding values don't matter, so we simply generate random embeddings as a 2D numpy array. We then use the vec_to_table function to convert that in to an Arrow table, which can then be added to the table.
from lance.vector import vec_to_table
import numpy as np
mat = np.random.randn(100_000, 16)
table_name = "exercise3_ann"
db.drop_table(table_name, ignore_missing=True)
table = db.create_table(table_name, vec_to_table(mat))
Let's establish a baseline without an indexĀ¶
Before we create the index, let's make sure know what we need to compare against.
We'll generate a random query vector and record it's value in the query
variable so we can use the same query vector with and without the ANN index.
query = np.random.randn(16)
table.search(query).limit(10).to_df()
/var/folders/gb/wf3d1d_d2bz689q8kdbw02880000gn/T/ipykernel_1489/229240492.py:2: UnsupportedWarning: to_df is unsupported as of 0.4.0. Use to_pandas() instead table.search(query).limit(10).to_df()
vector | _distance | |
---|---|---|
0 | [-1.2724329, 0.084512636, 0.28315237, 1.621330... | 3.393116 |
1 | [-1.0316224, -0.23577532, 0.39665398, 1.171307... | 3.831863 |
2 | [-0.49312374, -0.50828266, 0.046579137, 0.1497... | 3.904232 |
3 | [-1.0809894, -0.5009589, 0.6601867, 0.07009579... | 4.459040 |
4 | [-0.7098237, 0.50592774, 0.5868473, 1.3845882,... | 4.534908 |
5 | [-0.75004256, -0.7544854, 0.26175323, 1.517701... | 4.555618 |
6 | [-0.74581164, 0.6364289, 0.6266087, 0.8722173,... | 4.572597 |
7 | [-0.8690996, -0.51964694, 0.9082809, 0.2187988... | 4.628890 |
8 | [-1.6587613, 0.36453688, 0.14851123, -0.566129... | 4.668116 |
9 | [-1.5693077, -0.026916353, -0.25843197, 0.4409... | 4.731431 |
Please write code to compute the average latency of this query
import functools
import time
def timer(func):
"""Print the runtime of the decorated function"""
@functools.wraps(func)
def wrapper_timer(*args, **kwargs):
start_time = time.perf_counter()
value = func(*args, **kwargs)
end_time = time.perf_counter()
run_time = end_time - start_time
print(f"Finished {func.__name__}() in {run_time:.4f} secs")
return value
return wrapper_timer
@timer
def get_top_n(table, query, n=10):
return table.search(query).limit(n).to_df()
get_top_n(table, query)
Finished get_top_n() in 0.0149 secs
/var/folders/gb/wf3d1d_d2bz689q8kdbw02880000gn/T/ipykernel_1489/3894619790.py:22: UnsupportedWarning: to_df is unsupported as of 0.4.0. Use to_pandas() instead return table.search(query).limit(n).to_df()
vector | _distance | |
---|---|---|
0 | [-1.2724329, 0.084512636, 0.28315237, 1.621330... | 3.393116 |
1 | [-1.0316224, -0.23577532, 0.39665398, 1.171307... | 3.831863 |
2 | [-0.49312374, -0.50828266, 0.046579137, 0.1497... | 3.904232 |
3 | [-1.0809894, -0.5009589, 0.6601867, 0.07009579... | 4.459040 |
4 | [-0.7098237, 0.50592774, 0.5868473, 1.3845882,... | 4.534908 |
5 | [-0.75004256, -0.7544854, 0.26175323, 1.517701... | 4.555618 |
6 | [-0.74581164, 0.6364289, 0.6266087, 0.8722173,... | 4.572597 |
7 | [-0.8690996, -0.51964694, 0.9082809, 0.2187988... | 4.628890 |
8 | [-1.6587613, 0.36453688, 0.14851123, -0.566129... | 4.668116 |
9 | [-1.5693077, -0.026916353, -0.25843197, 0.4409... | 4.731431 |
Now let's create an indexĀ¶
There are many possible index types ranging from hash based to tree based to partition based to graph based. For this task, we'll create an IVFPQ index (partition-based index with product quantization compression) using LanceDB.
Please create an IVFPQ index on the LanceDB table such that each partition is 4000 rows and each PQ subvector is 8D.
HINT
- Total vectors / number of partitions = number of vectors in each partition
- Total dimensions / number of subvectors = number of dimensions in each subvector
- This step can take about 7-10 minutes to process and execute in the classroom workspace.
table.create_index(metric="L2", num_partitions=4000, num_sub_vectors=8)
Now let's search through the data again. Notice how the answers now appear different. This is because an ANN index is always a tradeoff between latency and accuracy.
table.search(query).limit(10).to_df()
/var/folders/gb/wf3d1d_d2bz689q8kdbw02880000gn/T/ipykernel_1489/4174343316.py:1: UnsupportedWarning: to_df is unsupported as of 0.4.0. Use to_pandas() instead table.search(query).limit(10).to_df()
vector | _distance | |
---|---|---|
0 | [-0.74581164, 0.6364289, 0.6266087, 0.8722173,... | 4.730085 |
1 | [-1.0502441, 0.2866477, 0.89400625, 0.0577777,... | 4.771339 |
2 | [-1.2426008, 0.29611367, 0.50209624, 0.1570571... | 4.774913 |
3 | [-1.5693077, -0.026916353, -0.25843197, 0.4409... | 4.787258 |
4 | [-0.8690996, -0.51964694, 0.9082809, 0.2187988... | 4.876680 |
5 | [-0.8164513, 0.071738884, 0.37751395, 1.258959... | 4.885183 |
6 | [-1.6587613, 0.36453688, 0.14851123, -0.566129... | 4.906681 |
7 | [-0.50475115, -0.763558, -0.5470999, 0.0168607... | 4.952706 |
8 | [-0.85710555, 0.97174263, 0.027284576, 0.84639... | 5.101480 |
9 | [-1.3415798, 0.61720556, -0.44235563, 0.281207... | 5.250690 |
Now write code to compute the average latency for querying the same table using the ANN index.
SOLUTION The index is implementation detail, so it should just be running the same code as above. You should see almost an order of magnitude speed-up. On larger datasets, this performance difference should be even more pronounced.
get_top_n(table, query)
Finished get_top_n() in 0.0106 secs
/var/folders/gb/wf3d1d_d2bz689q8kdbw02880000gn/T/ipykernel_1489/3894619790.py:22: UnsupportedWarning: to_df is unsupported as of 0.4.0. Use to_pandas() instead return table.search(query).limit(n).to_df()
vector | _distance | |
---|---|---|
0 | [-0.74581164, 0.6364289, 0.6266087, 0.8722173,... | 4.730085 |
1 | [-1.0502441, 0.2866477, 0.89400625, 0.0577777,... | 4.771339 |
2 | [-1.2426008, 0.29611367, 0.50209624, 0.1570571... | 4.774913 |
3 | [-1.5693077, -0.026916353, -0.25843197, 0.4409... | 4.787258 |
4 | [-0.8690996, -0.51964694, 0.9082809, 0.2187988... | 4.876680 |
5 | [-0.8164513, 0.071738884, 0.37751395, 1.258959... | 4.885183 |
6 | [-1.6587613, 0.36453688, 0.14851123, -0.566129... | 4.906681 |
7 | [-0.50475115, -0.763558, -0.5470999, 0.0168607... | 4.952706 |
8 | [-0.85710555, 0.97174263, 0.027284576, 0.84639... | 5.101480 |
9 | [-1.3415798, 0.61720556, -0.44235563, 0.281207... | 5.250690 |
Deleting rowsĀ¶
Like with other kinds of databases, you should be able to remove rows from the table. Let's go back to our tables of cats and dogs
table = db["cats_and_dogs"]
len(table)
4
Can you use the delete
API to remove all of the cats from the table?
HINT use a SQL like filter string to specify which rows to delete from the table
table.delete("species = 'cat'")
len(table)
2
What if I messed up?Ā¶
Errors is a common occurrence in AI. What's hard about errors in vector search is that oftentimes a bad vector doesn't cause a crash but just creates non-sensical answers. So to be able to rollback the state of the database is very important for debugging and reproducibility
So far we've accumulated 4 actions on the table:
- creation of the table
- added cats
- added dogs
- deleted cats
What if you realized that you should have deleted the dogs instead of the cats?
Here we can see the 4 versions that correspond to the 4 actions we've done
table.list_versions()
[{'version': 1, 'timestamp': datetime.datetime(2024, 3, 1, 10, 13, 40, 146383), 'metadata': {}}, {'version': 2, 'timestamp': datetime.datetime(2024, 3, 1, 10, 14, 58, 727480), 'metadata': {}}, {'version': 3, 'timestamp': datetime.datetime(2024, 3, 1, 10, 15, 48, 798191), 'metadata': {}}, {'version': 4, 'timestamp': datetime.datetime(2024, 3, 1, 11, 0, 33, 856836), 'metadata': {}}]
Please write code to restore the version still containing the whole dataset
table = db["cats_and_dogs"]
len(table)
2
# restore to version 3
table.restore(3)
# delete the dogs instead
table.delete("species = 'dog'")
table.list_versions()
[{'version': 1, 'timestamp': datetime.datetime(2024, 3, 1, 10, 13, 40, 146383), 'metadata': {}}, {'version': 2, 'timestamp': datetime.datetime(2024, 3, 1, 10, 14, 58, 727480), 'metadata': {}}, {'version': 3, 'timestamp': datetime.datetime(2024, 3, 1, 10, 15, 48, 798191), 'metadata': {}}, {'version': 4, 'timestamp': datetime.datetime(2024, 3, 1, 11, 0, 33, 856836), 'metadata': {}}, {'version': 5, 'timestamp': datetime.datetime(2024, 3, 1, 11, 3, 7, 390330), 'metadata': {}}, {'version': 6, 'timestamp': datetime.datetime(2024, 3, 1, 11, 3, 23, 457686), 'metadata': {}}]
table.to_pandas()
vector | species | breed | weight | |
---|---|---|---|---|
0 | [1.0, 0.0] | cat | shorthair | 12.0 |
1 | [-1.0, 0.0] | cat | himalayan | 9.5 |
Dropping a tableĀ¶
You can also choose to drop a table, which also completely removes the data. Note that this operation is not reversible.
"cats_and_dogs" in db
True
Write code to irrevocably remove the table "cats_and_dogs" from the database
db.drop_table("cats_and_dogs")
How would you verify that the table has indeed been deleted?
table.name in db
False
SummaryĀ¶
Congrats, in this exercise you've learned the basic operations of vector databases from creating tables, to adding data, and to querying the data. You've learned how to create indices and you saw first hand how it changes the performance and the accuracy. Lastly, you've learned how to debug and rollback when errors happen.
Created: 2024-10-23