Rig-LanceDB Integration Overview

Introduction

The rig-lancedb crate provides a vector store implementation using LanceDB, a serverless vector database built on Apache Arrow. This integration enables efficient vector similarity search with support for both local and cloud storage (S3, Google Cloud, Azure).

Key Features

  • Columnar Storage: Native support for Apache Arrow’s columnar format
  • Multiple Index Types:
    • IVF-PQ (Inverted File with Product Quantization)
    • Exact Nearest Neighbors (ENN)
  • Flexible Deployment:
    • Local storage
    • Cloud storage (S3, GCP, Azure)
  • Advanced Query Capabilities: Supports both vector similarity search and metadata filtering

Implementation Details

Vector Index Structure

The core implementation revolves around the LanceDbVectorIndex struct:

/// ```

Schema Definition

LanceDB requires a specific schema for storing embeddings:

    Schema::new(Fields::from(vec![
        Field::new("id", DataType::Utf8, false),
        Field::new("definition", DataType::Utf8, false),
        Field::new(
            "embedding",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::Float64, true)),
                dims as i32,
            ),
            false,
        ),
    ]))
}

Search Parameters

Configurable search parameters include:

  • Distance type (Cosine, L2)
  • Number of candidates
  • Query vector dimension validation

Usage Examples

1. Local Storage with ANN (Approximate Nearest Neighbors)

use rig_lancedb::{LanceDbVectorIndex, SearchParams};
 
// Initialize local database
let db = lancedb::connect("data/lancedb-store").execute().await?;
 
// Create vector index
let vector_store = LanceDbVectorIndex::new(
    table,
    model,
    "id",
    SearchParams::default()
).await?;
 
// Perform search
let results = vector_store
    .top_n::<Document>("search query", 5)
    .await?;

2. S3 Storage with IVF-PQ Index

    let model = openai_client.embedding_model(TEXT_EMBEDDING_ADA_002);

    // Initialize LanceDB on S3.
    // Note: see below docs for more options and IAM permission required to read/write to S3.
    // https://lancedb.github.io/lancedb/guides/storage/#aws-s3
    let db = lancedb::connect("s3://lancedb-test-829666124233")
        .execute()
        .await?;

    // Generate embeddings for the test data.
    let embeddings = EmbeddingsBuilder::new(model.clone())
        .documents(words())?
        // Note: need at least 256 rows in order to create an index so copy the definition 256 times for testing purposes.
        .documents(
            (0..256)
                .map(|i| Word {
                    id: format!("doc{}", i),
                    definition: "Definition of *flumbuzzle (noun)*: A sudden, inexplicable urge to rearrange or reorganize small objects, such as desk items or books, for no apparent reason.".to_string()
                })

LanceDB-Specific Features

1. Record Batch Processing

LanceDB uses Arrow’s RecordBatch for efficient data handling:

}

impl QueryToJson for lancedb::query::VectorQuery {
    async fn execute_query(&self) -> Result<Vec<serde_json::Value>, VectorStoreError> {
        let record_batches = self
            .execute()
            .await
            .map_err(lancedb_to_rig_error)?
            .try_collect::<Vec<_>>()
            .await
            .map_err(lancedb_to_rig_error)?;

        record_batches.deserialize()

2. Column Filtering

Automatic filtering of embedding columns:

}

/// Filter out the columns from a table that do not include embeddings. Return the vector of column names.
pub(crate) trait FilterTableColumns {
    fn filter_embeddings(self) -> Vec<String>;
}

impl FilterTableColumns for Arc<Schema> {
    fn filter_embeddings(self) -> Vec<String> {
        self.fields()
            .iter()
            .filter_map(|field| match field.data_type() {
                DataType::FixedSizeList(inner, ..) => match inner.data_type() {
                    DataType::Float64 => None,

3. Automatic Deserialization

Built-in support for converting Arrow types to Rust types:

use std::sync::Arc;

use arrow_array::{
    cast::AsArray,
    types::{
        ArrowDictionaryKeyType, BinaryType, ByteArrayType, Date32Type, Date64Type, Decimal128Type,
        DurationMicrosecondType, DurationMillisecondType, DurationNanosecondType,
        DurationSecondType, Float32Type, Float64Type, Int16Type, Int32Type, Int64Type, Int8Type,
        IntervalDayTime, IntervalDayTimeType, IntervalMonthDayNano, IntervalMonthDayNanoType,
        IntervalYearMonthType, LargeBinaryType, LargeUtf8Type, RunEndIndexType,
        Time32MillisecondType, Time32SecondType, Time64MicrosecondType, Time64NanosecondType,
        TimestampMicrosecondType, TimestampMillisecondType, TimestampNanosecondType,
        TimestampSecondType, UInt16Type, UInt32Type, UInt64Type, UInt8Type, Utf8Type,
    },
    Array, ArrowPrimitiveType, OffsetSizeTrait, RecordBatch, RunArray, StructArray, UnionArray,
};
use lancedb::arrow::arrow_schema::{ArrowError, DataType, IntervalUnit, TimeUnit};
use rig::vector_store::VectorStoreError;
use serde::Serialize;
use serde_json::{json, Value};

use crate::serde_to_rig_error;

fn arrow_to_rig_error(e: ArrowError) -> VectorStoreError {
    VectorStoreError::DatastoreError(Box::new(e))
}

/// Trait used to deserialize data returned from LanceDB queries into a serde_json::Value vector.
/// Data returned by LanceDB is a vector of `RecordBatch` items.
pub(crate) trait RecordBatchDeserializer {
    fn deserialize(&self) -> Result<Vec<serde_json::Value>, VectorStoreError>;
}

impl RecordBatchDeserializer for Vec<RecordBatch> {
    fn deserialize(&self) -> Result<Vec<serde_json::Value>, VectorStoreError> {
        Ok(self
            .iter()
            .map(|record_batch| record_batch.deserialize())
            .collect::<Result<Vec<_>, _>>()?
            .into_iter()
            .flatten()
            .collect())
    }
}

Best Practices

  1. Index Creation:

    • Minimum of 256 rows required for IVF-PQ indexing
    • Choose appropriate distance metrics based on your use case
  2. Schema Design:

    • Use appropriate data types for columns
    • Consider embedding dimension requirements
  3. Query Optimization:

    • Use column filtering to reduce data transfer
    • Leverage metadata filtering when possible

Limitations and Considerations

  1. Data Size:

    • Local storage is suitable for smaller datasets
    • Use cloud storage for large-scale deployments
  2. Index Requirements:

    • IVF-PQ index requires minimum dataset size
    • Consider memory requirements for large indices
  3. Performance:

    • ANN provides better performance but lower accuracy
    • ENN provides exact results but slower performance

Integration Example

A complete example showing document embedding and search:

#[path = "./fixtures/lib.rs"]
mod fixture;

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Initialize OpenAI client. Use this to generate embeddings (and generate test data for RAG demo).
    let openai_client = Client::from_env();

    // Select an embedding model.
    let model = openai_client.embedding_model(TEXT_EMBEDDING_ADA_002);

    // Initialize LanceDB locally.
    let db = lancedb::connect("data/lancedb-store").execute().await?;

    // Generate embeddings for the test data.
    let embeddings = EmbeddingsBuilder::new(model.clone())
        .documents(words())?
        // Note: need at least 256 rows in order to create an index so copy the definition 256 times for testing purposes.
        .documents(
            (0..256)

For detailed API reference and additional features, see the LanceDB documentation and Rig’s API documentation.