Rig Extractors: Structured Data Extraction
The Extractor system in Rig provides a high-level abstraction for extracting structured data from unstructured text using LLMs. It enables automatic parsing of text into strongly-typed Rust structures with minimal boilerplate.
Core Concepts
Extractor Structure
The Extractor combines:
- An LLM Agent
- A target data structure
- A submission tool
- Type-safe deserialization
Reference:
/// Extractor for structured data from text
pub struct Extractor<M: CompletionModel, T: JsonSchema + for<'a> Deserialize<'a> + Send + Sync> {
    agent: Agent<M>,
    _t: PhantomData<T>,
}
Target Data Requirements
Structures must implement:
- serde::Deserialize
- serde::Serialize
- schemars::JsonSchema
Usage
Basic Example
use rig::providers::openai;

// Define target structure
#[derive(serde::Deserialize, serde::Serialize, schemars::JsonSchema)]
struct Person {
    name: Option<String>,
    age: Option<u8>,
    profession: Option<String>,
}

// Create and use extractor (run inside an async function that returns a Result;
// `api_key` is assumed to hold your OpenAI API key)
let openai = openai::Client::new(api_key);
let extractor = openai.extractor::<Person>(openai::GPT_4O).build();
let person = extractor.extract("John Doe is a 30 year old doctor.").await?;
Error Handling
Extraction failures are reported through the ExtractionError enum, which distinguishes a missing result, a deserialization failure, and a failed prompt:
#[derive(Debug, thiserror::Error)]
pub enum ExtractionError {
    #[error("No data extracted")]
    NoData,
    #[error("Failed to deserialize the extracted data: {0}")]
    DeserializationError(#[from] serde_json::Error),
    #[error("PromptError: {0}")]
    PromptError(#[from] PromptError),
}
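The three variants usually call for different responses, so it can help to match on them explicitly rather than bubbling everything up with `?`. The following is a minimal sketch, not rig's own code: the enum is a simplified stand-in (with `String` payloads in place of `serde_json::Error` and `PromptError`) so that it compiles without external crates, and `describe_failure` is a hypothetical helper name.

```rust
// Stand-in for rig's `ExtractionError`, simplified so the sketch compiles
// without external crates. The real variants wrap `serde_json::Error` and
// `PromptError`; the `String` payloads here are an assumption for illustration.
#[derive(Debug)]
enum ExtractionError {
    NoData,
    DeserializationError(String),
    PromptError(String),
}

/// Hypothetical helper: turn each failure mode into a log-friendly message.
fn describe_failure(err: &ExtractionError) -> String {
    match err {
        // The model finished without calling the submit tool.
        ExtractionError::NoData => {
            "no data: the model did not call the submit tool".to_string()
        }
        // The tool was called, but its arguments did not fit the target type.
        ExtractionError::DeserializationError(detail) => {
            format!("bad payload: output did not match the target type ({})", detail)
        }
        // The request to the model itself failed.
        ExtractionError::PromptError(detail) => {
            format!("prompt failed: the request to the model errored ({})", detail)
        }
    }
}
```

With the real enum, the same `match` shape applies to the `Err` value returned by `extractor.extract(...).await`.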
Key Features
1. Type Safety
- Compile-time type checking
- Automatic schema generation
- Structured error handling
2. Flexible Extraction
The extractor can be customized with:
- Additional context
- Custom preamble
- Model configuration
let extractor = openai.extractor::<Person>(model)
    .preamble("Extract person details with high precision")
    .context("Additional context about person formats")
    .build();
3. Submit Tool Integration
The system uses a specialized tool for data submission:
impl<T: JsonSchema + for<'a> Deserialize<'a> + Serialize + Send + Sync> Tool for SubmitTool<T> {
    const NAME: &'static str = "submit";
    type Error = SubmitError;
    type Args = T;
    type Output = T;

    async fn definition(&self, _prompt: String) -> ToolDefinition {
        ToolDefinition {
            name: Self::NAME.to_string(),
            description: "Submit the structured data you extracted from the provided text."
                .to_string(),
            parameters: json!(schema_for!(T)),
        }
    }

    async fn call(&self, data: Self::Args) -> Result<Self::Output, Self::Error> {
        Ok(data)
    }
}
Best Practices
- Structure Design
  - Use Option<T> for optional fields
  - Keep structures focused and minimal
  - Document field requirements
- Error Handling
  - Handle both extraction and deserialization errors
  - Provide fallback values where appropriate
  - Log extraction failures for debugging
- Context Management
  - Provide clear extraction instructions
  - Include relevant domain context
  - Set appropriate model parameters
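As an illustration of combining Option<T> fields with fallback values, the sketch below converts an extracted `Person` into a record with defaults filled in. It is an assumption-laden example rather than part of rig: `PersonRecord`, `with_fallbacks`, and the placeholder strings are hypothetical, and the serde/schemars derives are omitted so the block compiles with the standard library alone.

```rust
// Mirrors the `Person` target structure from the basic example, without the
// serde/schemars derives so that this sketch has no external dependencies.
#[derive(Debug, Default, Clone, PartialEq)]
struct Person {
    name: Option<String>,
    age: Option<u8>,
    profession: Option<String>,
}

// Hypothetical downstream type in which text fields are always present.
#[derive(Debug, PartialEq)]
struct PersonRecord {
    name: String,
    age: Option<u8>,
    profession: String,
}

/// Hypothetical helper: replace missing text fields with explicit placeholders.
/// `age` stays optional because no numeric default would be meaningful.
fn with_fallbacks(person: Person) -> PersonRecord {
    PersonRecord {
        name: person.name.unwrap_or_else(|| "unknown".to_string()),
        age: person.age,
        profession: person.profession.unwrap_or_else(|| "unspecified".to_string()),
    }
}
```

Keeping the extracted type fully optional and applying defaults in a separate step means the model is never pushed to invent a value just to satisfy the schema.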
Common Patterns
Basic Extraction
let extractor = client.extractor::<SimpleType>(model).build();
let data = extractor.extract("raw text").await?;
Contextual Extraction
let extractor = client.extractor::<ComplexType>(model)
    .preamble("Extract with following rules...")
    .context("Domain-specific information...")
    .build();
Batch Processing
async fn process_documents(
    extractor: &Extractor<Model, DataType>,
    docs: Vec<String>,
) -> Vec<Result<DataType, ExtractionError>> {
    let mut results = Vec::new();
    for doc in docs {
        results.push(extractor.extract(&doc).await);
    }
    results
}
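Because the batch function returns one Result per document, a caller typically wants to separate the successes from the failures while remembering which document each failure belongs to. A minimal sketch of that step follows; `split_results` is a hypothetical helper, written generically so it does not depend on rig's types.

```rust
/// Hypothetical helper: split per-document results into successful values and
/// failures. Each failure keeps the index of the document that produced it,
/// so it can be logged or retried later.
fn split_results<T, E>(results: Vec<Result<T, E>>) -> (Vec<T>, Vec<(usize, E)>) {
    let mut extracted = Vec::new();
    let mut failed = Vec::new();
    for (index, result) in results.into_iter().enumerate() {
        match result {
            Ok(value) => extracted.push(value),
            Err(err) => failed.push((index, err)),
        }
    }
    (extracted, failed)
}
```

Applied to the output of `process_documents`, `T` would be the target type and `E` would be `ExtractionError`.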
Integration Examples
With File Loaders
let docs = FileLoader::with_glob("*.txt")?
    .read()
    .ignore_errors();
let extractor = client.extractor::<DocumentData>(model).build();
for doc in docs {
    let structured_data = extractor.extract(&doc).await?;
    // Process structured data
}
With Agents
The extractor can be used as part of a larger agent system:
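let data_extractor = client.extractor::<StructuredData>(model).build();
// Note: `.tool(...)` expects a value that implements rig's Tool trait. This
// snippet assumes that holds for the extractor in your rig version; if it does
// not, wrap the extractor in a custom tool.
let agent = client.agent(model)
    .tool(data_extractor)
    .build();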
Troubleshooting
If you receive the NoData error, the most likely cause is that the model you are using is too weak to call the submit tool reliably. If the tool is never called, no JSON is generated and the extractor has nothing to deserialize.
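When the tool call is only intermittently skipped, one possible mitigation (a suggestion, not something rig does for you) is to retry the extraction a bounded number of times on NoData. The sketch below is synchronous and uses a simplified stand-in enum so that it compiles without external crates; `retry_on_no_data` is a hypothetical helper, and with the real async `extract` the same loop would await each attempt.

```rust
// Simplified stand-in for rig's `ExtractionError`; only `NoData` matters here.
#[derive(Debug, PartialEq)]
enum ExtractionError {
    NoData,
    Other(String),
}

/// Hypothetical helper: call `attempt` up to `max_attempts` times, retrying
/// only when the failure is `NoData`. Any other outcome is returned as-is.
fn retry_on_no_data<T>(
    max_attempts: usize,
    mut attempt: impl FnMut() -> Result<T, ExtractionError>,
) -> Result<T, ExtractionError> {
    for _ in 0..max_attempts {
        match attempt() {
            // The submit tool was not called; try again.
            Err(ExtractionError::NoData) => continue,
            // Success or a different error: stop immediately.
            outcome => return outcome,
        }
    }
    // Every attempt ended in `NoData` (or no attempts were allowed).
    Err(ExtractionError::NoData)
}
```

If retries keep failing, switching to a stronger model is the more reliable fix.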
See Also
API Reference (Loaders)