Embeddings in Rig

Rig provides a comprehensive embeddings system for converting text and other data types into numerical vector representations that can be used for semantic search, similarity comparisons, and other NLP tasks.

Core Concepts

Embeddings

An embedding is a vector representation of data (usually text) where semantically similar items are mapped to nearby points in the vector space. In Rig, an embedding is represented by the Embedding struct (sketched below), which contains:

  • The original document text
  • The vector representation as Vec<f64>
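
For illustration, here is a minimal sketch of reading those two fields back out of an Embedding value. The field names document and vec follow the description above; check the rig-core API reference for the exact definition:

let embedding = rig::embeddings::Embedding {
    document: "Rig".to_string(),
    vec: vec![0.1, 0.2, 0.3], // normally produced by an embedding model
};

println!("embedded text: {}", embedding.document);
println!("dimensions: {}", embedding.vec.len());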

The Embedding Process

In order for a type (or piece of data) to be considered a valid embeddable document, it needs to implement the Embed trait. While you can write the impl Embed block by hand, there is also a derive macro that implements it for you quickly and without hassle (note that the Embed derive macro requires the derive feature of rig-core to be enabled).

The following type is a valid Embed type:

use rig::Embed;

#[derive(Embed)]
struct Foo {
    id: i32,
    #[embed]
    name: String,
}

The following type is also a valid Embed type:

use rig::embeddings::embed::{Embed, EmbedError, TextEmbedder};

struct Foo {
    id: i32,
    name: String,
}

impl Embed for Foo {
    fn embed(&self, embedder: &mut TextEmbedder) -> Result<(), EmbedError> {
        // An embedding only needs to be generated for the `name` field;
        // `id` is never passed to the embedder, so it does not influence
        // the resulting vectors.
        embedder.embed(self.name.to_owned());

        Ok(())
    }
}
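
Because embedder.embed can be called more than once, a single document can produce several embeddings, which is why results come back as OneOrMany<Embedding>. The sketch below illustrates that pattern with a hypothetical WordDefinition type (not part of Rig) whose comma-separated definitions are each embedded separately:

struct WordDefinition {
    word: String,
    definitions: String, // e.g. "a tool, a framework, a library"
}

impl Embed for WordDefinition {
    fn embed(&self, embedder: &mut TextEmbedder) -> Result<(), EmbedError> {
        // Split the definitions on commas so that each definition gets its
        // own embedding; the `word` field itself is not embedded.
        for definition in self.definitions.split(',') {
            embedder.embed(definition.trim().to_string());
        }

        Ok(())
    }
}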

To use this in a workflow, the EmbeddingsBuilder struct is the ideal way to process multiple documents at once:

// required trait import for the embedding_model fn to exist
use rig::client::embeddings::EmbeddingsClient;
use rig::embeddings::EmbeddingsBuilder;
let documents = vec![
    Foo {
        id: 1,
        name: "Rig".to_string()
    },
    Foo {
        id: 2,
        name: "Playgrounds".to_string()
    }
];
 
let model = rig::providers::openai::Client::from_env().embedding_model("text-embedding-ada-002");
 
let embeddings = EmbeddingsBuilder::new(model)
    .documents(documents)?
    .build()
    .await?;

The returned value is a collection of (T, OneOrMany<Embedding>) pairs, where T is your type implementing Embed. You can either keep these results and iterate over them elsewhere, or dump them directly into a vector store:

use rig::vector_store::InsertDocuments;
// note: this function is pseudo-code
// look into the specific vector store integration crate for more
// in-depth usage explanations
let qdrant = create_qdrant_vector_store();
 
qdrant.insert_documents(embeddings).await?;
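
If you would rather keep the embeddings in memory instead, here is a sketch of iterating over the returned pairs, reusing the Foo type from above and assuming OneOrMany's first() accessor:

for (document, embeddings) in &embeddings {
    // `embeddings` holds one Embedding per `embedder.embed` call made
    // while processing this document.
    let first = embeddings.first();
    println!("document {}: {} dimensions", document.id, first.vec.len());
}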

Best Practices

  1. Document Preparation

    • Clean and normalize text before embedding
    • Consider chunking large documents (see the sketch after this list)
    • Remove content that is irrelevant to the embedding
  2. Error Handling

    • Handle provider API errors gracefully
    • Validate vector dimensions
    • Check for empty or invalid input
  3. Batching

    • Use EmbeddingsBuilder for multiple documents
    • It respects the provider's max batch size
    • It handles concurrent processing automatically
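
As a starting point for the chunking suggestion above, here is a naive sketch that splits a long text into fixed-size word chunks before embedding; the 3-word chunk size is purely illustrative, and real pipelines usually split on sentence or paragraph boundaries instead:

fn chunk_text(text: &str, chunk_size: usize) -> Vec<String> {
    // Group whitespace-separated words into chunks of at most `chunk_size`
    // words, then rejoin each chunk into a single string.
    text.split_whitespace()
        .collect::<Vec<_>>()
        .chunks(chunk_size)
        .map(|words| words.join(" "))
        .collect()
}

let chunks = chunk_text("a very long document that will not fit in one request", 3);
assert_eq!(chunks[0], "a very long");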

See Also

API Reference (Embeddings)