Rig Extractors: Structured Data Extraction

The Extractor system in Rig provides a high-level abstraction for extracting structured data from unstructured text using LLMs. It enables automatic parsing of text into strongly-typed Rust structures with minimal boilerplate.

Core Concepts

Extractor Structure

The Extractor combines:

An LLM Agent
A target data structure
A submission tool
Type-safe deserialization

Reference:

rig-core/src/extractor.rs [55:59]

/// Extractor for structured data from text
pub struct Extractor<M: CompletionModel, T: JsonSchema + for<'a> Deserialize<'a> + Send + Sync> {
    agent: Agent<M>,
    _t: PhantomData<T>,
}
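The _t: PhantomData<T> field lets the struct carry the target type at compile time without storing a value of that type. A minimal std-only sketch of the pattern follows; TypedHandle and its methods are hypothetical names used for illustration and are not part of Rig:

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for a type that, like Extractor, is generic
/// over a target type T but never stores a T.
struct TypedHandle<T> {
    label: String,
    _t: PhantomData<T>,
}

impl<T> TypedHandle<T> {
    fn new(label: &str) -> Self {
        Self {
            label: label.to_string(),
            _t: PhantomData,
        }
    }

    /// Combines the stored label with the name of the target type,
    /// which is known at compile time through the type parameter alone.
    fn describe(&self) -> String {
        format!("{} -> {}", self.label, std::any::type_name::<T>())
    }
}
```

Because PhantomData is zero-sized, the handle occupies no more memory than its String field, yet TypedHandle<u8> and TypedHandle<String> remain distinct types.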

Target Data Requirements

Target structures must implement the following traits, typically through #[derive]:

serde::Deserialize
serde::Serialize
schemars::JsonSchema

Usage

Basic Example

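use rig::providers::openai;

// Define the target structure. Option fields let the model omit
// details that are missing from the text.
#[derive(Debug, serde::Deserialize, serde::Serialize, schemars::JsonSchema)]
struct Person {
    name: Option<String>,
    age: Option<u8>,
    profession: Option<String>,
}

// Assumption: tokio is the async runtime.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumption: the API key is supplied through OPENAI_API_KEY.
    let api_key = std::env::var("OPENAI_API_KEY")?;

    // Create the client and build an extractor for Person
    let client = openai::Client::new(&api_key);
    let extractor = client.extractor::<Person>(openai::GPT_4O).build();

    // Run the extraction; the result is a typed Person
    let person = extractor.extract("John Doe is a 30 year old doctor.").await?;
    println!("{person:?}");
    Ok(())
}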

Error Handling

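Extraction failures are reported through ExtractionError, which distinguishes three cases: no data was extracted, the extracted data could not be deserialized into the target type, or the underlying prompt request failed: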

rig-core/src/extractor.rs [43:53]

#[derive(Debug, thiserror::Error)]
pub enum ExtractionError {
    #[error("No data extracted")]
    NoData,
 
    #[error("Failed to deserialize the extracted data: {0}")]
    DeserializationError(#[from] serde_json::Error),
 
    #[error("PromptError: {0}")]
    PromptError(#[from] PromptError),
}
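One way to act on these variants is to map each one to a recovery step. The sketch below is std-only so that it compiles without Rig: ExtractionFailure is a local stand-in that mirrors the three variants with String payloads, and Recovery, recovery_for, and log_line are hypothetical names. The policy itself is only an illustration, not a recommendation from the library:

```rust
/// Local stand-in for the three ExtractionError variants, with String
/// payloads so the sketch compiles without Rig.
#[derive(Debug)]
enum ExtractionFailure {
    NoData,
    Deserialization(String),
    Prompt(String),
}

/// Hypothetical recovery choices; not part of Rig.
#[derive(Debug, PartialEq)]
enum Recovery {
    UseFallback,
    RetryWithClearerPrompt,
    Abort,
}

/// One possible policy: map each failure to a next step.
fn recovery_for(failure: &ExtractionFailure) -> Recovery {
    match failure {
        // No data came back from the model.
        ExtractionFailure::NoData => Recovery::UseFallback,
        // The returned data did not match the target structure.
        ExtractionFailure::Deserialization(_) => Recovery::RetryWithClearerPrompt,
        // The request to the model itself failed.
        ExtractionFailure::Prompt(_) => Recovery::Abort,
    }
}

/// Builds a log line that keeps the variant's detail for debugging.
fn log_line(failure: &ExtractionFailure) -> String {
    match failure {
        ExtractionFailure::NoData => "no data extracted".to_string(),
        ExtractionFailure::Deserialization(detail) => format!("bad payload: {detail}"),
        ExtractionFailure::Prompt(detail) => format!("prompt failed: {detail}"),
    }
}
```

With the real enum, the same match would be written against ExtractionError::NoData, ExtractionError::DeserializationError, and ExtractionError::PromptError.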

Key Features

1. Type Safety

Compile-time type checking
Automatic schema generation
Structured error handling

2. Flexible Extraction

The extractor can be customized with:

Additional context
Custom preamble
Model configuration

let extractor = openai.extractor::<Person>(model)
    .preamble("Extract person details with high precision")
    .context("Additional context about person formats")
    .build();

3. Submit Tool Integration

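The extractor receives data from the model through a dedicated submit tool. The tool's parameters are the JSON schema generated from T, so the model returns its result by calling submit with arguments that match that schema, and call hands the deserialized value back unchanged: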

rig-core/src/extractor.rs [134:152]

impl<T: JsonSchema + for<'a> Deserialize<'a> + Serialize + Send + Sync> Tool for SubmitTool<T> {
    const NAME: &'static str = "submit";
    type Error = SubmitError;
    type Args = T;
    type Output = T;
 
    async fn definition(&self, _prompt: String) -> ToolDefinition {
        ToolDefinition {
            name: Self::NAME.to_string(),
            description: "Submit the structured data you extracted from the provided text."
                .to_string(),
            parameters: json!(schema_for!(T)),
        }
    }
 
    async fn call(&self, data: Self::Args) -> Result<Self::Output, Self::Error> {
        Ok(data)
    }
}

Best Practices

Structure Design

- Use Option<T> for optional fields
- Keep structures focused and minimal
- Document field requirements

Error Handling

- Handle both extraction and deserialization errors
- Provide fallback values where appropriate
- Log extraction failures for debugging

Context Management

- Provide clear extraction instructions
- Include relevant domain context
- Set appropriate model parameters
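The advice on Option<T> fields and fallback values can be sketched without Rig. The Person structure below mirrors the one from the basic example (without the serde and schemars derives), and summarize is a hypothetical helper that substitutes explicit fallbacks for missing fields:

```rust
/// Mirrors the Person structure from the basic example, without derives
/// for serde or schemars so the sketch needs only the standard library.
#[derive(Debug, Default)]
struct Person {
    name: Option<String>,
    age: Option<u8>,
    profession: Option<String>,
}

/// Hypothetical display helper: fills missing fields with explicit
/// fallbacks instead of failing.
fn summarize(person: &Person) -> String {
    let name = person.name.as_deref().unwrap_or("unknown");
    let profession = person.profession.as_deref().unwrap_or("unspecified");
    match person.age {
        Some(age) => format!("{name}, {age}, {profession}"),
        None => format!("{name}, age unknown, {profession}"),
    }
}
```

Keeping the fallback logic outside the structure means the extracted data still records which fields the model could not find.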

Common Patterns

Basic Extraction

let extractor = client.extractor::<SimpleType>(model).build();
let data = extractor.extract("raw text").await?;

Contextual Extraction

let extractor = client.extractor::<ComplexType>(model)
    .preamble("Extract with following rules...")
    .context("Domain-specific information...")
    .build();

Batch Processing

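// Returns one Result per document so a single failure does not abort the batch.
async fn process_documents<M: CompletionModel, T>(
    extractor: &Extractor<M, T>,
    docs: Vec<String>,
) -> Vec<Result<T, ExtractionError>>
where
    T: JsonSchema + for<'a> Deserialize<'a> + Send + Sync,
{
    let mut results = Vec::with_capacity(docs.len());
    for doc in docs {
        results.push(extractor.extract(&doc).await);
    }
    results
}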
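Since the batch function returns one Result per document, callers usually need to separate successes from failures afterwards. A minimal std-only sketch of that step, using a hypothetical helper named split_results that keeps the index of each failed document:

```rust
/// Splits per-document results into extracted values and failures.
/// Each failure keeps the index of its document in the input batch.
fn split_results<T, E>(results: Vec<Result<T, E>>) -> (Vec<T>, Vec<(usize, E)>) {
    let mut extracted = Vec::new();
    let mut failed = Vec::new();
    for (index, result) in results.into_iter().enumerate() {
        match result {
            Ok(value) => extracted.push(value),
            Err(err) => failed.push((index, err)),
        }
    }
    (extracted, failed)
}
```

The helper is generic over both the value and the error type, so it should work with a Vec of Result<DataType, ExtractionError> as well as with the plain values used to test it.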

Integration Examples

With File Loaders

let docs = FileLoader::with_glob("*.txt")?
    .read()
    .ignore_errors();
 
let extractor = client.extractor::<DocumentData>(model).build();
 
for doc in docs {
    let structured_data = extractor.extract(&doc).await?;
    // Process structured data
}

With Agents

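The extractor can be used as part of a larger agent system. The snippet below assumes that Extractor can be registered through .tool(), which requires a Tool implementation; the extractor.rs excerpts above only show Tool implemented for SubmitTool, so a thin wrapper type that implements Tool may be needed: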

let data_extractor = client.extractor::<StructuredData>(model).build();
let agent = client.agent(model)
    .tool(data_extractor)
    .build();