Overview
CARTER uses Cartesia’s Sonic model for ultra-realistic voice generation with emotional expression. This reference covers the key endpoints and parameters.
Text-to-Speech Endpoints
Generate Audio (Bytes)
Generate complete audio files:
POST /tts/bytes
{
  "model_id": "sonic",
  "transcript": "Your text here",
  "voice": {
    "mode": "id",
    "id": "voice-id"
  },
  "output_format": {
    "container": "mp3",
    "encoding": "mp3",
    "sample_rate": 44100
  }
}
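As a rough illustration of the request above, the sketch below assembles the arguments for a `fetch` call. The base URL (borrowed from the WebSocket host used elsewhere in this reference) and the `X-API-Key` header name are assumptions rather than details this reference states, so verify them before relying on the snippet.

```javascript
// Assumption: REST calls share the host shown for the WebSocket endpoint.
const BASE_URL = "https://api.cartesia.ai";

// Builds the URL and fetch options for POST /tts/bytes without sending anything.
function buildBytesRequest(apiKey, transcript, voiceId) {
  const body = {
    model_id: "sonic",
    transcript,
    voice: { mode: "id", id: voiceId },
    output_format: { container: "mp3", encoding: "mp3", sample_rate: 44100 },
  };
  return {
    url: `${BASE_URL}/tts/bytes`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-API-Key": apiKey, // assumed header name
      },
      body: JSON.stringify(body),
    },
  };
}

// Usage (makes a network call, so it is left commented out):
// const { url, init } = buildBytesRequest("your-key", "Hello", "voice-id");
// const response = await fetch(url, init);
```

Keeping request construction separate from the network call makes the payload easy to inspect and test.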
Stream Audio (SSE)
Server-Sent Events for streaming:
POST /tts/sse
{
  "model_id": "sonic",
  "transcript": "Streaming text",
  "voice": {
    "mode": "id",
    "id": "voice-id"
  },
  "output_format": {
    "container": "raw",
    "encoding": "pcm_s16le",
    "sample_rate": 16000
  },
  "stream": true
}
WebSocket Connection
For lowest latency:
WSS wss://api.cartesia.ai/tts/websocket
// Send message
{
  "model_id": "sonic",
  "voice": {
    "mode": "id",
    "id": "voice-id"
  },
  "transcript": "Real-time text",
  "context_id": "conversation-1"
}
Parameters
Model ID
The model to use for generation. Use "sonic" for the latest features. Options: "sonic", "sonic-2", "sonic-turbo".
Transcript
The text to convert to speech. Supports SSML tags for advanced control.
Voice
Voice configuration object:
{
  "mode": "id",    // or "embedding"
  "id": "voice-id" // Cartesia voice ID
}
Output Format
Audio output configuration:
{
  "container": "mp3",   // mp3, raw, wav
  "encoding": "mp3",    // mp3, pcm_s16le, pcm_f32le
  "sample_rate": 44100  // 16000, 22050, 44100, 48000
}
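The allowed values in the comments above can be checked before a request is sent. The following is a minimal sketch; `validateOutputFormat` is a hypothetical helper, and it only checks membership in the lists given here, not which container and encoding combinations the API accepts together.

```javascript
// Allowed values copied from the output format comments in this reference.
const ALLOWED_OUTPUT_VALUES = {
  container: ["mp3", "raw", "wav"],
  encoding: ["mp3", "pcm_s16le", "pcm_f32le"],
  sample_rate: [16000, 22050, 44100, 48000],
};

// Returns a list of human-readable problems; an empty list means the
// format passed these basic checks.
function validateOutputFormat(format) {
  const problems = [];
  for (const [field, allowed] of Object.entries(ALLOWED_OUTPUT_VALUES)) {
    if (!allowed.includes(format[field])) {
      problems.push(`${field} must be one of: ${allowed.join(", ")}`);
    }
  }
  return problems;
}

// Usage:
// validateOutputFormat({ container: "raw", encoding: "pcm_s16le", sample_rate: 16000 });
```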
Experimental Voice Controls
_experimental_voice_controls
Control emotions and speech characteristics:
{
  "emotion": ["positivity:highest", "surprise"],
  "speed": "fast",  // slow, normal, fast
  "volume": "high"  // low, normal, high
}
Voice Emotions
Available emotional controls:
Emotion      Levels
Positivity   lowest, low, high, highest
Anger        Use tag directly
Sadness      Use tag directly
Surprise     Use tag directly
Curiosity    Use tag directly
Example:
{
  "_experimental_voice_controls": {
    "emotion": ["positivity:highest", "curiosity"]
  }
}
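Emotion tags are plain strings, so a typo fails silently unless it is caught on the client. The sketch below builds tags while following the table as written: only positivity takes a level, and the other emotions are used directly. `emotionTag` is a hypothetical helper; relax the level check if the API turns out to accept levels on other emotions.

```javascript
const EMOTIONS = ["positivity", "anger", "sadness", "surprise", "curiosity"];
const EMOTIONS_WITH_LEVELS = ["positivity"]; // per the table in this reference
const LEVELS = ["lowest", "low", "high", "highest"];

// Returns "name" or "name:level", throwing on values the table does not list.
function emotionTag(name, level) {
  if (!EMOTIONS.includes(name)) {
    throw new RangeError(`unknown emotion: ${name}`);
  }
  if (level === undefined) return name;
  if (!EMOTIONS_WITH_LEVELS.includes(name)) {
    throw new RangeError(`${name} does not take a level`);
  }
  if (!LEVELS.includes(level)) {
    throw new RangeError(`unknown level: ${level}`);
  }
  return `${name}:${level}`;
}

// Usage:
// const controls = {
//   _experimental_voice_controls: {
//     emotion: [emotionTag("positivity", "highest"), emotionTag("curiosity")],
//   },
// };
```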
Bytes Response
{
  "audio": "<base64-encoded-audio>",
  "context_id": "conversation-1"
}
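Since the `audio` field is base64 text, it has to be decoded before it can be played or saved. A minimal Node.js sketch, assuming the response body has already been parsed from JSON (`decodeBytesResponse` is a hypothetical helper name):

```javascript
// Converts the parsed JSON body into raw audio bytes plus its context ID.
function decodeBytesResponse(responseBody) {
  if (typeof responseBody.audio !== "string") {
    throw new TypeError("response has no audio field");
  }
  return {
    contextId: responseBody.context_id,
    audio: Buffer.from(responseBody.audio, "base64"),
  };
}

// Usage (writes a file, so it is left commented out):
// const { audio } = decodeBytesResponse(await response.json());
// require("node:fs").writeFileSync("speech.mp3", audio);
```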
Stream Response (SSE)
data: {"audio": "<base64-chunk>", "context_id": "conv-1"}
data: {"audio": "<base64-chunk>", "context_id": "conv-1"}
data: {"done": true}
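To show how the event stream above might be consumed, the sketch below parses already-received SSE text, decodes each chunk, and stops at the `done` event. It assumes complete lines; a real streaming client would also need to buffer partial lines across network reads. `parseSseAudio` is a hypothetical helper name.

```javascript
// Parses SSE text in the format shown above and concatenates the audio chunks.
function parseSseAudio(sseText) {
  const chunks = [];
  let done = false;
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank lines and other fields
    const event = JSON.parse(line.slice("data:".length).trim());
    if (event.done) {
      done = true;
      break;
    }
    if (typeof event.audio === "string") {
      chunks.push(Buffer.from(event.audio, "base64"));
    }
  }
  return { audio: Buffer.concat(chunks), done };
}
```

Checking the returned `done` flag lets a caller distinguish a finished stream from one that was cut off early.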
WebSocket Messages
// Audio chunk
{
  "type": "chunk",
  "data": "<base64-audio>",
  "context_id": "conv-1"
}
// Done
{
  "type": "done",
  "context_id": "conv-1"
}
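Because chunks from several contexts can share one connection, a client typically groups them by `context_id` until the matching `done` message arrives. The following is one possible sketch of that bookkeeping; `createChunkCollector` is a hypothetical helper, and message types other than the two shown above are ignored.

```javascript
// Returns a handler that accumulates chunks per context_id and yields the
// assembled audio when the matching "done" message arrives.
function createChunkCollector() {
  const pending = new Map(); // context_id -> array of Buffers
  return function handleMessage(rawMessage) {
    const msg = JSON.parse(rawMessage);
    if (msg.type === "chunk") {
      if (!pending.has(msg.context_id)) pending.set(msg.context_id, []);
      pending.get(msg.context_id).push(Buffer.from(msg.data, "base64"));
      return null;
    }
    if (msg.type === "done") {
      const audio = Buffer.concat(pending.get(msg.context_id) ?? []);
      pending.delete(msg.context_id);
      return { contextId: msg.context_id, audio };
    }
    return null; // unknown message types are ignored
  };
}

// Usage with a standard WebSocket (commented out because it needs a live connection):
// const handle = createChunkCollector();
// ws.onmessage = (event) => {
//   const finished = handle(event.data);
//   if (finished) console.log(finished.contextId, finished.audio.length);
// };
```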
Voice Management
List Voices
GET /voices
Response:
[
  {
    "id": "voice-id",
    "name": "Voice Name",
    "description": "Voice description",
    "language": "en"
  }
]
Get Voice
GET /voices/{voice_id}
Response:
{
  "id": "voice-id",
  "name": "Voice Name",
  "description": "Description",
  "language": "en",
  "created_at": "2024-01-01T00:00:00Z"
}
Clone Voice
POST /voices/clone
{
  "name": "Custom Voice",
  "description": "My custom voice",
  "audio_files": ["base64-audio-1", "base64-audio-2"]
}
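The `audio_files` entries are base64 strings, so raw sample recordings need encoding first. A small sketch of building the payload from in-memory audio data follows; `buildClonePayload` is a hypothetical helper, and rejecting an empty sample list is an assumption rather than a documented rule.

```javascript
// Builds the POST /voices/clone body from raw audio samples
// (Buffers, Uint8Arrays, or arrays of byte values).
function buildClonePayload(name, description, samples) {
  if (samples.length === 0) {
    // Assumption: cloning needs at least one sample.
    throw new Error("at least one audio sample is required");
  }
  return {
    name,
    description,
    audio_files: samples.map((sample) => Buffer.from(sample).toString("base64")),
  };
}

// Usage (reads files, so it is left commented out):
// const fs = require("node:fs");
// const payload = buildClonePayload("Custom Voice", "My custom voice", [
//   fs.readFileSync("sample-1.wav"),
//   fs.readFileSync("sample-2.wav"),
// ]);
```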
Rate Limits
Plan         Requests/min   Concurrent Streams
Free         20             2
Pro          100            10
Enterprise   Custom         Custom
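One simple way to stay under a per-minute limit is to space requests evenly. The sketch below converts the limits above into a minimum gap between requests; even pacing is a design choice made here for illustration, not something the API requires, and `minIntervalMs` is a hypothetical helper.

```javascript
// Requests per minute for plans with fixed limits (Enterprise is custom).
const REQUESTS_PER_MINUTE = new Map([
  ["free", 20],
  ["pro", 100],
]);

// Minimum gap between requests, in milliseconds, to stay within the limit.
function minIntervalMs(plan) {
  const perMinute = REQUESTS_PER_MINUTE.get(plan);
  if (perMinute === undefined) {
    throw new RangeError(`no fixed limit known for plan: ${plan}`);
  }
  return 60000 / perMinute;
}

// Usage:
// minIntervalMs("free"); // 3000
// minIntervalMs("pro");  // 600
```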
Error Codes
Code   Description
400    Bad Request - Invalid parameters
401    Unauthorized - Invalid API key
429    Rate Limit Exceeded
500    Server Error
Example error response:
{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": 429
  }
}
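Client code usually needs to decide whether an error is worth retrying. The sketch below reads the error shape shown above and treats 429 and 5xx codes as retryable; retrying 5xx responses is a common convention assumed here, not a rule stated in this reference, and `classifyError` is a hypothetical helper.

```javascript
// Summarizes an error body and flags whether a retry is reasonable.
function classifyError(body) {
  const err = body && body.error;
  if (!err) {
    return { retryable: false, reason: "unrecognized error shape" };
  }
  // Assumption: rate limits and server errors are transient; 4xx are not.
  const retryable = err.code === 429 || err.code >= 500;
  return {
    retryable,
    reason: `${err.type ?? "error"} (${err.code}): ${err.message}`,
  };
}
```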
SDK Methods
import Cartesia from '@cartesia/cartesia-js';

const cartesia = new Cartesia({ apiKey: 'your-key' });
const voiceId = 'voice-id'; // placeholder: replace with a real Cartesia voice ID

// Generate audio bytes
const audio = await cartesia.tts.bytes({
  model_id: "sonic",
  transcript: "Hello",
  voice: { mode: "id", id: voiceId }
});

// Stream audio
const stream = await cartesia.tts.sse({
  model_id: "sonic",
  transcript: "Streaming",
  voice: { mode: "id", id: voiceId }
});

// WebSocket
const ws = await cartesia.tts.websocket({
  model_id: "sonic",
  voice: { mode: "id", id: voiceId }
});
Best Practices
Choose the Right Endpoint
Bytes: For pre-generated audio files
SSE: For streaming in web applications
WebSocket: For lowest latency in real-time apps
Choose the Right Sample Rate
16000 Hz: Voice applications (lowest bandwidth)
22050 Hz: Balanced quality/size
44100 Hz: High-quality music/effects
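To make the bandwidth trade-off concrete for uncompressed output, the byte rate of raw PCM is the sample rate times the bytes per sample times the channel count. The sketch below assumes mono audio unless told otherwise, which this reference does not specify; `rawBytesPerSecond` is a hypothetical helper.

```javascript
// Bytes per sample for the raw PCM encodings listed in this reference.
const BYTES_PER_SAMPLE = new Map([
  ["pcm_s16le", 2], // 16-bit signed integers
  ["pcm_f32le", 4], // 32-bit floats
]);

// Uncompressed data rate in bytes per second (mono assumed by default).
function rawBytesPerSecond(sampleRate, encoding, channels = 1) {
  const bytesPerSample = BYTES_PER_SAMPLE.get(encoding);
  if (bytesPerSample === undefined) {
    throw new RangeError(`not a raw PCM encoding: ${encoding}`);
  }
  return sampleRate * bytesPerSample * channels;
}

// Usage:
// rawBytesPerSecond(16000, "pcm_s16le"); // 32000
// rawBytesPerSecond(44100, "pcm_s16le"); // 88200
```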
Maintain context_id across related requests for better latency and coherence.
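As an illustration of reusing a context, the sketch below builds a series of WebSocket messages that all carry the same `context_id`, following the message shape from the WebSocket Connection section. `buildContextMessages` is a hypothetical helper, and any extra fields the API may expect for continued contexts are not covered here.

```javascript
// Builds one JSON message per transcript, all tagged with the same context_id.
function buildContextMessages(contextId, voiceId, transcripts) {
  return transcripts.map((transcript) =>
    JSON.stringify({
      model_id: "sonic",
      voice: { mode: "id", id: voiceId },
      transcript,
      context_id: contextId,
    })
  );
}

// Usage with an open WebSocket (commented out because it needs a live connection):
// for (const message of buildContextMessages("conversation-1", "voice-id", [
//   "Hello there.",
//   "How can I help?",
// ])) {
//   ws.send(message);
// }
```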
Implement exponential backoff and retry logic for 429 errors.
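A minimal sketch of that retry pattern follows. The base delay, cap, attempt count, and jitter range are illustrative defaults rather than values recommended by the API, and `sendRequest` is a hypothetical caller-supplied function that resolves to an object with a numeric `status` (for example a `fetch` response).

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls sendRequest until it returns a non-429 status or attempts run out.
async function withRetry(sendRequest, maxAttempts = 5, sleep = defaultSleep) {
  for (let attempt = 0; ; attempt++) {
    const response = await sendRequest();
    const isLastAttempt = attempt + 1 >= maxAttempts;
    if (response.status !== 429 || isLastAttempt) return response;
    // Random jitter between 50% and 100% of the delay spreads out retries.
    const jitter = 0.5 + Math.random() / 2;
    await sleep(backoffDelayMs(attempt) * jitter);
  }
}

// Usage (makes network calls, so it is left commented out):
// const response = await withRetry(() => fetch(url, init));
```

Passing `sleep` as a parameter keeps the retry loop testable without real delays.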
Resources