Rig Extractors: Structured Data Extraction
The Extractor system in Rig provides a high-level abstraction for extracting structured data from unstructured text using LLMs. It enables automatic parsing of text into strongly-typed Rust structures with minimal boilerplate.
Core Concepts
Extractor Structure
The Extractor combines:
- An LLM Agent
- A target data structure
- A submission tool
- Type-safe deserialization
Reference:
rig-core/src/extractor.rs [55:59]
/// Extractor for structured data from text
pub struct Extractor<M: CompletionModel, T: JsonSchema + for<'a> Deserialize<'a> + Send + Sync> {
agent: Agent<M>,
_t: PhantomData<T>,
}
Target Data Requirements
Structures must implement:
serde::Deserialize
serde::Serialize
schemars::JsonSchema
Usage
Basic Example
use rig::providers::openai;
// Define target structure
#[derive(serde::Deserialize, serde::Serialize, schemars::JsonSchema)]
struct Person {
name: Option<String>,
age: Option<u8>,
profession: Option<String>,
}
// Create and use extractor
let openai = openai::Client::new(api_key);
let extractor = openai.extractor::<Person>(openai::GPT_4O).build();
let person = extractor.extract("John Doe is a 30 year old doctor.").await?;
Error Handling
The system provides comprehensive error handling through ExtractionError
:
rig-core/src/extractor.rs [43:53]
#[derive(Debug, thiserror::Error)]
pub enum ExtractionError {
#[error("No data extracted")]
NoData,
#[error("Failed to deserialize the extracted data: {0}")]
DeserializationError(#[from] serde_json::Error),
#[error("PromptError: {0}")]
PromptError(#[from] PromptError),
}
Key Features
1. Type Safety
- Compile-time type checking
- Automatic schema generation
- Structured error handling
2. Flexible Extraction
The extractor can be customized with:
- Additional context
- Custom preamble
- Model configuration
let extractor = openai.extractor::<Person>(model)
.preamble("Extract person details with high precision")
.context("Additional context about person formats")
.build();
3. Submit Tool Integration
The system uses a specialized tool for data submission:
rig-core/src/extractor.rs [134:152]
impl<T: JsonSchema + for<'a> Deserialize<'a> + Serialize + Send + Sync> Tool for SubmitTool<T> {
const NAME: &'static str = "submit";
type Error = SubmitError;
type Args = T;
type Output = T;
async fn definition(&self, _prompt: String) -> ToolDefinition {
ToolDefinition {
name: Self::NAME.to_string(),
description: "Submit the structured data you extracted from the provided text."
.to_string(),
parameters: json!(schema_for!(T)),
}
}
async fn call(&self, data: Self::Args) -> Result<Self::Output, Self::Error> {
Ok(data)
}
}
Best Practices
-
Structure Design
- Use
Option<T>
for optional fields - Keep structures focused and minimal
- Document field requirements
- Use
-
Error Handling
- Handle both extraction and deserialization errors
- Provide fallback values where appropriate
- Log extraction failures for debugging
-
Context Management
- Provide clear extraction instructions
- Include relevant domain context
- Set appropriate model parameters
Common Patterns
Basic Extraction
let extractor = client.extractor::<SimpleType>(model).build();
let data = extractor.extract("raw text").await?;
Contextual Extraction
let extractor = client.extractor::<ComplexType>(model)
.preamble("Extract with following rules...")
.context("Domain-specific information...")
.build();
Batch Processing
async fn process_documents(extractor: &Extractor<Model, DataType>, docs: Vec<String>) -> Vec<Result<DataType, ExtractionError>> {
let mut results = Vec::new();
for doc in docs {
results.push(extractor.extract(&doc).await);
}
results
}
Integration Examples
With File Loaders
let docs = FileLoader::with_glob("*.txt")?
.read()
.ignore_errors();
let extractor = client.extractor::<DocumentData>(model).build();
for doc in docs {
let structured_data = extractor.extract(&doc).await?;
// Process structured data
}
With Agents
The extractor can be used as part of a larger agent system:
let data_extractor = client.extractor::<StructuredData>(model).build();
let agent = client.agent(model)
.tool(data_extractor)
.build();
See Also
API Reference (Loaders)