Rig Extractors: Structured Data Extraction

The Extractor system in Rig provides a high-level abstraction for extracting structured data from unstructured text using LLMs. It enables automatic parsing of text into strongly-typed Rust structures with minimal boilerplate.

Core Concepts

Extractor Structure

The Extractor combines:

  1. An LLM Agent
  2. A target data structure
  3. A submission tool
  4. Type-safe deserialization

Reference:

rig-core/src/extractor.rs [55:59]
/// Extractor for structured data from text
pub struct Extractor<M: CompletionModel, T: JsonSchema + for<'a> Deserialize<'a> + Send + Sync> {
    agent: Agent<M>,
    _t: PhantomData<T>,
}
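
The _t: PhantomData<T> field lets the struct carry its target type at compile time without storing a value of that type. As a rough illustration of the same pattern, the sketch below uses a hypothetical TypedParser<T> (not part of Rig) and substitutes the standard library's FromStr for serde deserialization so that it builds without external crates:

```rust
use std::marker::PhantomData;
use std::str::FromStr;

/// Hypothetical stand-in for the extractor pattern: the struct remembers
/// its target type `T` at compile time but stores no `T` value.
pub struct TypedParser<T: FromStr> {
    _t: PhantomData<T>,
}

impl<T: FromStr> TypedParser<T> {
    pub fn new() -> Self {
        Self { _t: PhantomData }
    }

    /// Parse trimmed text into the target type chosen when the parser was built.
    pub fn parse(&self, text: &str) -> Result<T, T::Err> {
        text.trim().parse::<T>()
    }
}
```

The caller picks the target type once, for example TypedParser::<u8>::new(), and every later parse call returns that type or a typed error.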

Target Data Requirements

Target structures must implement the following traits and, per the bounds on Extractor shown above, be Send + Sync:

  • serde::Deserialize
  • serde::Serialize
  • schemars::JsonSchema

Usage

Basic Example

use rig::providers::openai;
 
// Define target structure
#[derive(serde::Deserialize, serde::Serialize, schemars::JsonSchema)]
struct Person {
    name: Option<String>,
    age: Option<u8>,
    profession: Option<String>,
}
 
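// Create the extractor; `api_key` holds your OpenAI API key
let openai = openai::Client::new(api_key);
let extractor = openai.extractor::<Person>(openai::GPT_4O).build();
 
// `extract` is async: call it inside an async function that returns a Result
let person = extractor.extract("John Doe is a 30 year old doctor.").await?;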
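
Because every field of Person is an Option, the caller decides what to do when the model could not find a value. The sketch below is one way to consume the result; it redeclares Person without the serde and schemars derives so it builds with the standard library alone, and describe is a hypothetical helper rather than part of Rig:

```rust
/// Same shape as the `Person` target above; the serde and schemars derives
/// are omitted so this sketch builds with the standard library alone.
#[derive(Debug, Default, Clone, PartialEq)]
struct Person {
    name: Option<String>,
    age: Option<u8>,
    profession: Option<String>,
}

/// Hypothetical helper: render a one-line summary, substituting a
/// placeholder for any field that is `None`.
fn describe(person: &Person) -> String {
    let name = person.name.as_deref().unwrap_or("unknown");
    let age = person
        .age
        .map(|a| a.to_string())
        .unwrap_or_else(|| "unknown".to_string());
    let profession = person.profession.as_deref().unwrap_or("unknown");
    format!("name={name}, age={age}, profession={profession}")
}
```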

Error Handling

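Extraction failures are reported through ExtractionError, which distinguishes three cases: the model submitted no data, the submitted data did not deserialize into the target type, or the underlying prompt failed: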

rig-core/src/extractor.rs [43:53]
#[derive(Debug, thiserror::Error)]
pub enum ExtractionError {
    #[error("No data extracted")]
    NoData,
 
    #[error("Failed to deserialize the extracted data: {0}")]
    DeserializationError(#[from] serde_json::Error),
 
    #[error("PromptError: {0}")]
    PromptError(#[from] PromptError),
}
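
Callers will usually want to react differently to each variant. The sketch below shows one possible policy. To stay buildable with the standard library alone it uses a stand-in enum with the same variant names but String payloads; NextStep and next_step are hypothetical names introduced only for illustration:

```rust
/// Stand-in with the same variant names as `ExtractionError` above. The real
/// payloads (`serde_json::Error`, `PromptError`) are replaced by `String`
/// so this sketch builds with the standard library alone.
#[derive(Debug)]
enum ExtractionError {
    NoData,
    DeserializationError(String),
    PromptError(String),
}

/// Hypothetical policy: what a caller might do next for each failure.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// No data was submitted; another attempt or a stronger model may help.
    RetryOrSwitchModel,
    /// Data was submitted but does not fit the target type.
    ReviewSchema,
    /// The prompt to the model failed.
    ReportUpstream,
}

fn next_step(err: &ExtractionError) -> NextStep {
    match err {
        ExtractionError::NoData => NextStep::RetryOrSwitchModel,
        ExtractionError::DeserializationError(_) => NextStep::ReviewSchema,
        ExtractionError::PromptError(_) => NextStep::ReportUpstream,
    }
}
```

With the real enum the match arms keep the same shape; only the payload types differ.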

Key Features

1. Type Safety

  • Compile-time type checking
  • Automatic schema generation
  • Structured error handling

2. Flexible Extraction

The extractor can be customized with:

  • Additional context
  • Custom preamble
  • Model configuration

let extractor = openai.extractor::<Person>(model)
    .preamble("Extract person details with high precision")
    .context("Additional context about person formats")
    .build();

3. Submit Tool Integration

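The extractor relies on a dedicated submit tool. Its arguments are the target type T, so the JSON schema generated for T becomes the tool's parameter schema, and call simply returns the data it receives: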

rig-core/src/extractor.rs [134:152]
impl<T: JsonSchema + for<'a> Deserialize<'a> + Serialize + Send + Sync> Tool for SubmitTool<T> {
    const NAME: &'static str = "submit";
    type Error = SubmitError;
    type Args = T;
    type Output = T;
 
    async fn definition(&self, _prompt: String) -> ToolDefinition {
        ToolDefinition {
            name: Self::NAME.to_string(),
            description: "Submit the structured data you extracted from the provided text."
                .to_string(),
            parameters: json!(schema_for!(T)),
        }
    }
 
    async fn call(&self, data: Self::Args) -> Result<Self::Output, Self::Error> {
        Ok(data)
    }
}

Best Practices

  1. Structure Design

    • Use Option<T> for optional fields
    • Keep structures focused and minimal
    • Document field requirements
  2. Error Handling

    • Handle both extraction and deserialization errors
    • Provide fallback values where appropriate
    • Log extraction failures for debugging
  3. Context Management

    • Provide clear extraction instructions
    • Include relevant domain context
    • Set appropriate model parameters
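
The error-handling advice above (fall back where appropriate, log failures) can be sketched as a small generic helper. or_default_logged is a hypothetical name, and logging to stderr with eprintln! is an assumption; a real application would likely use its own logging setup:

```rust
use std::fmt::Display;

/// Hypothetical helper: return the extracted value, or log the failure to
/// stderr and fall back to `T::default()`.
fn or_default_logged<T: Default, E: Display>(source: &str, result: Result<T, E>) -> T {
    match result {
        Ok(value) => value,
        Err(err) => {
            eprintln!("extraction failed for {source}: {err}");
            T::default()
        }
    }
}
```

If the target structure consists of Option fields and derives Default, the fallback is a value with every field set to None, which downstream code already has to handle.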

Common Patterns

Basic Extraction

let extractor = client.extractor::<SimpleType>(model).build();
let data = extractor.extract("raw text").await?;

Contextual Extraction

let extractor = client.extractor::<ComplexType>(model)
    .preamble("Extract with following rules...")
    .context("Domain-specific information...")
    .build();

Batch Processing

// `Model` and `DataType` are placeholders for your completion model and target type
async fn process_documents(
    extractor: &Extractor<Model, DataType>,
    docs: Vec<String>,
) -> Vec<Result<DataType, ExtractionError>> {
    let mut results = Vec::new();
    for doc in docs {
        results.push(extractor.extract(&doc).await);
    }
    results
}
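
Since process_documents returns one Result per document, a natural next step is to separate successes from failures. A minimal sketch, with split_results as a hypothetical helper that works for any value and error type:

```rust
/// Split per-document results into successes and failures, keeping each
/// document's index so failures can be traced back to their input.
fn split_results<T, E>(results: Vec<Result<T, E>>) -> (Vec<(usize, T)>, Vec<(usize, E)>) {
    let mut extracted = Vec::new();
    let mut failed = Vec::new();
    for (index, result) in results.into_iter().enumerate() {
        match result {
            Ok(value) => extracted.push((index, value)),
            Err(err) => failed.push((index, err)),
        }
    }
    (extracted, failed)
}
```

Keeping the index means a failed entry can be matched to its position in the original docs vector for logging or a later retry.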

Integration Examples

With File Loaders

let docs = FileLoader::with_glob("*.txt")?
    .read()
    .ignore_errors();
 
let extractor = client.extractor::<DocumentData>(model).build();
 
for doc in docs {
    let structured_data = extractor.extract(&doc).await?;
    // Process structured data
}

With Agents

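The extractor can be used as part of a larger agent system. The snippet below passes the extractor to the agent builder's tool method, which assumes that Extractor implements the Tool trait; the excerpts above do not show such an implementation, so verify this against your Rig version: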

let data_extractor = client.extractor::<StructuredData>(model).build();
let agent = client.agent(model)
    .tool(data_extractor)
    .build();

Troubleshooting

If you receive the NoData error, the most likely cause is that the model you are using is too weak to call the submit tool reliably. If the tool is never called, no JSON is generated and there is nothing to deserialize.
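
To make the failure mode concrete, the sketch below models a response as a list of items and looks for a call to the tool named submit (the NAME constant shown earlier). ResponseItem and submitted_json are hypothetical, simplified stand-ins and do not reflect Rig's actual response types; the assumption is only that a response without a submit call leaves the extractor with no data:

```rust
/// Hypothetical, simplified model of one item in a model response.
#[derive(Debug)]
enum ResponseItem {
    Text(String),
    ToolCall { name: String, arguments: String },
}

/// Return the raw JSON arguments of the first `submit` tool call, if any.
/// `None` corresponds to the situation reported as `NoData`.
fn submitted_json(items: &[ResponseItem]) -> Option<&str> {
    items.iter().find_map(|item| match item {
        ResponseItem::ToolCall { name, arguments } if name.as_str() == "submit" => {
            Some(arguments.as_str())
        }
        _ => None,
    })
}
```

A model that answers in plain text instead of calling the tool produces only Text items, so the lookup finds nothing regardless of how good the text answer is.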

See Also


API Reference (Loaders)