Embeddings in Rig
Rig provides a comprehensive embeddings system for converting text and other data types into numerical vector representations that can be used for semantic search, similarity comparisons, and other NLP tasks.
Core Concepts
Embeddings
An embedding is a vector representation of data (usually text) where semantically similar items are mapped to nearby points in the vector space. In Rig, embeddings are represented by the Embedding struct, which contains:
- The original document text
- The vector representation as Vec<f64>
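Conceptually, the struct simply pairs the source text with its vector. The sketch below is illustrative only; the field names document and vec are assumptions about the current API rather than a guaranteed contract:
// Illustrative shape of the Embedding struct described above (field names assumed)
pub struct Embedding {
    pub document: String, // the original text that was embedded
    pub vec: Vec<f64>,    // the vector representation
}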
The Embedding Process
- Document Preparation
  - Documents implement the Embed trait
  - The TextEmbedder accumulates text to be embedded
  - Built-in implementations for common types (strings, numbers, JSON)
- Batch Processing
  - The EmbeddingsBuilder collects multiple documents
  - Documents are batched for efficient API calls
  - Handles concurrent embedding generation
- Vector Generation
  - An EmbeddingModel converts text to vectors
  - Providers like OpenAI implement the model interface
  - Results include both document text and vectors
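The Basic Usage example below walks through all three stages via the builder. If you only need the Vector Generation stage, an EmbeddingModel can also be called directly; the following is a minimal sketch that assumes an async context and an embed_text convenience method on the model:
// Embed a single string directly with an EmbeddingModel
// (embed_text and the vec field are assumed; check your rig version)
let embedding = model.embed_text("A single piece of text").await?;
println!("embedded into {} dimensions", embedding.vec.len());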
Working with Embeddings
Basic Usage
use rig::{embeddings::EmbeddingsBuilder, providers::openai};

// Create an OpenAI client (reads OPENAI_API_KEY from the environment) and an embedding model
let openai_client = openai::Client::from_env();
let model = openai_client.embedding_model("text-embedding-ada-002");

// Build embeddings for a batch of documents
let embeddings = EmbeddingsBuilder::new(model)
    .document("Some text")?
    .document("More text")?
    .build()
    .await?;
Vector Operations
Rig provides several distance metrics for comparing embeddings:
- Cosine similarity
- Angular distance
- Euclidean distance
- Manhattan distance
- Chebyshev distance
Example:
let similarity = embedding1.cosine_similarity(&embedding2, false);
let distance = embedding1.euclidean_distance(&embedding2);
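For reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The standalone function below only illustrates the metric over raw Vec<f64> data; it is not rig's implementation:
// Textbook cosine similarity between two raw vectors (illustration only)
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (norm_a * norm_b)
}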
Custom Types
To make your own types embeddable, implement the Embed trait:
use rig::embeddings::embed::{Embed, EmbedError, TextEmbedder};

struct Document {
    title: String,
    content: String,
}

impl Embed for Document {
    fn embed(&self, embedder: &mut TextEmbedder) -> Result<(), EmbedError> {
        // Each call queues another piece of text to be embedded for this document
        embedder.embed(self.title.clone());
        embedder.embed(self.content.clone());
        Ok(())
    }
}
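Instances of an Embed type can then be handed to the builder. The sketch below assumes a documents() helper on EmbeddingsBuilder as the batch counterpart of document(); treat the method name as an assumption to verify against your rig version:
let docs = vec![
    Document { title: "First".into(), content: "Alpha".into() },
    Document { title: "Second".into(), content: "Beta".into() },
];

// documents() (assumed) adds every item in the iterator to the batch
let embeddings = EmbeddingsBuilder::new(model)
    .documents(docs)?
    .build()
    .await?;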
Best Practices
- Document Preparation
  - Clean and normalize text before embedding
  - Consider chunking large documents (see the sketch after this list)
  - Remove irrelevant content before embedding
- Error Handling
  - Handle provider API errors gracefully
  - Validate vector dimensions
  - Check for empty or invalid input
- Batching
  - Use EmbeddingsBuilder for multiple documents
  - Respects the provider's max batch size
  - Automatically handles concurrent processing
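For the chunking point under Document Preparation, a minimal word-based splitter is sketched below. It is an illustration only; token-aware splitters are usually preferable in practice:
// Split long text into fixed-size word chunks before embedding (illustration only)
fn chunk_text(text: &str, max_words: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(max_words)
        .map(|chunk| chunk.join(" "))
        .collect()
}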
See Also
API Reference (Embeddings)