Image, Audio & Transcription
As of v0.31.0, Rig provides unified abstractions for image generation, audio generation (text-to-speech), and audio transcription (speech-to-text) alongside its core text completion and embedding capabilities.
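Both media capabilities sit behind Cargo feature flags (covered in the sections below). As a sketch, a dependency entry enabling both alongside the core crate might look like this (the exact version pin is illustrative):

```toml
[dependencies]
rig-core = { version = "0.31.0", features = ["image", "audio"] }
```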
Image Generation
Requires the `image` feature flag: `cargo add rig-core -F image`
The `rig::image_generation` module provides the `ImageGenerationModel` trait for generating images from text prompts.
Core Types
```rust
pub trait ImageGenerationModel: Clone + Send + Sync {
    type Response: Send + Sync;

    async fn image_generation(
        &self,
        request: ImageGenerationRequest,
    ) -> Result<ImageGenerationResponse, ImageGenerationError>;
}
```
Request Building
Use `ImageGenerationRequestBuilder` to construct requests:
```rust
use rig::image_generation::{ImageGeneration, ImageGenerationRequest};

let response = model
    .image_generation_request("A futuristic city at sunset")
    .size("1024x1024")
    .send()
    .await?;

// Access the generated image
let image_data = response.image;
```
ImageGenerationResponse
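The `image` field carries raw bytes (e.g. PNG data), so persisting a result is ordinary file I/O. A minimal std-only sketch, where `image_bytes` stands in for `response.image` and the path is illustrative:

```rust
use std::fs;

/// Write raw image bytes straight to disk.
fn save_image(image_bytes: &[u8], path: &str) -> std::io::Result<()> {
    fs::write(path, image_bytes)
}

fn main() -> std::io::Result<()> {
    // Stand-in for `response.image` from the request above.
    let image_bytes: Vec<u8> = vec![0x89, b'P', b'N', b'G'];
    save_image(&image_bytes, "city.png")?;
    println!("wrote {} bytes", image_bytes.len());
    Ok(())
}
```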
```rust
pub struct ImageGenerationResponse {
    /// The generated image data
    pub image: Vec<u8>,
    /// The raw provider response
    pub raw_response: serde_json::Value,
}
```
Using with Agents
Image generation models can be accessed through providers that support them:
```rust
let openai = openai::Client::from_env();
let dalle = openai.image_generation_model("dall-e-3");

let response = dalle
    .image_generation_request("A robot painting a landscape")
    .send()
    .await?;
```
Audio Generation (Text-to-Speech)
Requires the `audio` feature flag: `cargo add rig-core -F audio`
The `rig::audio_generation` module provides the `AudioGenerationModel` trait for converting text to speech.
Core Types
```rust
pub trait AudioGenerationModel:
    Sized
    + Clone
    + WasmCompatSend
    + WasmCompatSync
{
    type Response: Send + Sync;
    type Client;

    // Required methods
    fn make(client: &Self::Client, model: impl Into<String>) -> Self;
    fn audio_generation(
        &self,
        request: AudioGenerationRequest,
    ) -> impl Future<Output = Result<AudioGenerationResponse<Self::Response>, AudioGenerationError>> + Send;

    // Provided method
    fn audio_generation_request(&self) -> AudioGenerationRequestBuilder<Self> { ... }
}
```
Request Building
```rust
use rig::audio_generation::AudioGeneration;

let response = model
    .audio_generation_request("Hello, how can I help you today?")
    .voice("alloy")
    .send()
    .await?;

// Access the generated audio
let audio_bytes = response.audio;
```
AudioGenerationResponse
```rust
pub struct AudioGenerationResponse {
    /// The generated audio data
    pub audio: Vec<u8>,
    /// The raw provider response
    pub raw_response: serde_json::Value,
}
```
Audio Transcription (Speech-to-Text)
The `rig::transcription` module provides the `TranscriptionModel` trait for transcribing audio to text.
Core Trait
```rust
pub trait TranscriptionModel:
    Clone
    + WasmCompatSend
    + WasmCompatSync
{
    type Response: WasmCompatSend + WasmCompatSync;
    type Client;

    // Required methods
    fn make(client: &Self::Client, model: impl Into<String>) -> Self;
    fn transcription(
        &self,
        request: TranscriptionRequest,
    ) -> impl Future<Output = Result<TranscriptionResponse<Self::Response>, TranscriptionError>> + WasmCompatSend;

    // Provided method
    fn transcription_request(&self) -> TranscriptionRequestBuilder<Self> { ... }
}
```
Request Building
```rust
use rig::transcription::Transcription;

let audio_data: Vec<u8> = std::fs::read("audio.mp3")?;

let response = model
    .transcription_request(audio_data)
    .language("en")
    .send()
    .await?;

println!("Transcription: {}", response.text);
```
TranscriptionResponse
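The `text` field is a plain UTF-8 `String`, so ordinary string handling applies downstream. A std-only sketch of simple post-processing, where `text` stands in for `response.text`:

```rust
/// Count words in a transcript; `split_whitespace` handles
/// repeated spaces and newlines gracefully.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

fn main() {
    // Stand-in for `response.text`.
    let text = "Hello, how can I help you today?";
    println!("{} words", word_count(text)); // 7 words
}
```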
```rust
pub struct TranscriptionResponse {
    /// The transcribed text
    pub text: String,
    /// The raw provider response
    pub raw_response: serde_json::Value,
}
```
Provider Support
Not all providers support all media types. Here is a summary of current support:
| Provider | Image Generation | Audio Generation | Transcription |
|---|---|---|---|
| OpenAI | Yes (DALL-E) | Yes (TTS) | Yes (Whisper) |
| Other providers | Varies | Varies | Varies |
Check the individual provider documentation for specific model support.
Client Trait Integration
These capabilities integrate with the provider client system. Use the corresponding client traits to create models:
```rust
use rig::client::CompletionClient;

// Image generation, audio generation, and transcription are accessed
// through provider-specific methods on the client.
let openai = openai::Client::from_env();

// Completion model
let gpt4 = openai.completion_model("gpt-4o");

// Embedding model
let embed = openai.embedding_model("text-embedding-3-small");

// Image generation model (requires `image` feature)
let dalle = openai.image_generation_model("dall-e-3");

// Audio generation model (requires `audio` feature)
let tts = openai.audio_generation_model("tts-1");

// Transcription model
let whisper = openai.transcription_model("whisper-1");
```
Best Practices
- **Feature Flags:** Only enable the feature flags you need (`image`, `audio`) to minimize compile times and binary size.
- **Error Handling:** Each media type has its own error type (`ImageGenerationError`, `AudioGenerationError`, `TranscriptionError`). Handle them appropriately.
- **Large Payloads:** Audio and image data can be large. Consider streaming where possible and be mindful of memory usage.
- **Model Selection:** Different models within the same provider may have different capabilities, pricing, and quality. Refer to provider documentation for guidance.
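On the large-payloads point: instead of accumulating an entire media payload into one `Vec<u8>` before saving it, bytes can be pushed through a buffered writer as they arrive. A std-only sketch (the chunk size, path, and `write_chunks` helper are illustrative, not part of Rig's API):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

/// Stream chunks of bytes to disk, returning the total written.
fn write_chunks(path: &str, chunks: &[Vec<u8>]) -> std::io::Result<u64> {
    let mut out = BufWriter::new(File::create(path)?);
    let mut total = 0u64;
    for chunk in chunks {
        // Each chunk goes through the buffer instead of being
        // concatenated into one large in-memory allocation first.
        out.write_all(chunk)?;
        total += chunk.len() as u64;
    }
    out.flush()?;
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let chunks = vec![vec![0u8; 1024]; 4]; // 4 KiB of dummy audio data
    let written = write_chunks("speech.bin", &chunks)?;
    println!("wrote {written} bytes");
    Ok(())
}
```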
See Also
- Completion — Text completion
- Provider Clients — Provider client capabilities