
Overview

CARTER uses Cartesia’s Sonic model for ultra-realistic voice generation with emotional expression. This reference covers the key endpoints and parameters.

Text-to-Speech Endpoints

Generate Audio (Bytes)

Generate complete audio files:
POST /tts/bytes

{
  "model_id": "sonic",
  "transcript": "Your text here",
  "voice": {
    "mode": "id",
    "id": "voice-id"
  },
  "output_format": {
    "container": "mp3",
    "encoding": "mp3",
    "sample_rate": 44100
  }
}
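
As a rough illustration, the request above could be sent with `fetch` along the following lines. This is a sketch under assumptions, not the documented client: the HTTPS base URL is inferred from the WebSocket URL in this reference, the `X-API-Key` and `Cartesia-Version` header names are not documented on this page and should be checked against the official Cartesia documentation, and `buildBytesRequest` and `generateAudio` are hypothetical helper names.

```javascript
// Hypothetical helper: builds the URL and fetch options for POST /tts/bytes.
// Assumptions (not stated on this page): the HTTPS base URL and both header names.
const BASE_URL = "https://api.cartesia.ai";

function buildBytesRequest({ apiKey, apiVersion, transcript, voiceId }) {
  return {
    url: `${BASE_URL}/tts/bytes`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-API-Key": apiKey,            // assumed authentication header
        "Cartesia-Version": apiVersion, // assumed API version header
      },
      body: JSON.stringify({
        model_id: "sonic",
        transcript,
        voice: { mode: "id", id: voiceId },
        output_format: { container: "mp3", encoding: "mp3", sample_rate: 44100 },
      }),
    },
  };
}

// Sends the request; needs a runtime with a global fetch (Node 18+ or a browser).
async function generateAudio(options) {
  const { url, init } = buildBytesRequest(options);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`TTS request failed with status ${response.status}`);
  }
  return response.json(); // body shape is described under "Bytes Response"
}
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.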

Stream Audio (SSE)

Stream audio chunks as Server-Sent Events:
POST /tts/sse

{
  "model_id": "sonic",
  "transcript": "Streaming text",
  "voice": {
    "mode": "id",
    "id": "voice-id"
  },
  "output_format": {
    "container": "raw",
    "encoding": "pcm_s16le",
    "sample_rate": 16000
  },
  "stream": true
}

WebSocket Connection

Use a WebSocket connection for the lowest latency:
WSS wss://api.cartesia.ai/tts/websocket

// Send message
{
  "model_id": "sonic",
  "voice": {
    "mode": "id",
    "id": "voice-id"
  },
  "transcript": "Real-time text",
  "context_id": "conversation-1"
}
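
The message above might be sent from client code roughly as follows. This is a minimal sketch: `sendTranscript` is a hypothetical helper, and it assumes the socket is already open and authenticated, which this page does not cover.

```javascript
// Hypothetical helper: sends one generation request over an already-open socket.
// `socket` is any object with a send(string) method, such as a WebSocket instance.
function sendTranscript(socket, { voiceId, transcript, contextId }) {
  const message = {
    model_id: "sonic",
    voice: { mode: "id", id: voiceId },
    transcript,
    context_id: contextId,
  };
  socket.send(JSON.stringify(message));
  return message; // returned so callers can log or inspect what was sent
}
```

Because the helper only needs a `send` method, it can be exercised with a stub object before wiring it to a real connection.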

Parameters

Model ID

model_id (string, required)
The model to use for generation. Use "sonic" for the latest features. Options: "sonic", "sonic-2", "sonic-turbo".

Transcript

transcript (string, required)
The text to convert to speech. Supports SSML tags for advanced control.

Voice

voice (object, required)
Voice configuration object:
{
  "mode": "id",  // or "embedding"
  "id": "voice-id"  // Cartesia voice ID
}

Output Format

output_format (object, required)
Audio output configuration:
{
  "container": "mp3",  // mp3, raw, wav
  "encoding": "mp3",   // mp3, pcm_s16le, pcm_f32le
  "sample_rate": 44100 // 16000, 22050, 44100, 48000
}
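
For the raw PCM encodings, the choice of encoding and sample rate determines the uncompressed data rate. The sketch below works through that arithmetic. `pcmBytesPerSecond` is a hypothetical helper, and the single-channel (mono) default is an assumption, since this page does not state the channel count.

```javascript
// Bytes per sample for the raw PCM encodings listed above:
// pcm_s16le is signed 16-bit (2 bytes), pcm_f32le is 32-bit float (4 bytes).
const BYTES_PER_SAMPLE = { pcm_s16le: 2, pcm_f32le: 4 };

// Returns the uncompressed data rate in bytes per second.
// Assumption: mono audio unless `channels` says otherwise.
function pcmBytesPerSecond({ encoding, sample_rate }, channels = 1) {
  const width = BYTES_PER_SAMPLE[encoding];
  if (width === undefined) {
    throw new Error(`Not a raw PCM encoding: ${encoding}`);
  }
  return sample_rate * width * channels;
}
```

Under these assumptions, `pcm_s16le` at 16000 Hz comes to 32000 bytes per second, while `pcm_f32le` at 44100 Hz comes to 176400 bytes per second.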

Experimental Voice Controls

_experimental_voice_controls (object)
Controls emotions and speech characteristics:
{
  "emotion": ["positivity:highest", "excitement"],
  "speed": "fast",  // slow, normal, fast
  "volume": "high"  // low, normal, high
}

Voice Emotions

Available emotional controls:
Emotion      Levels
Positivity   lowest, low, high, highest
Anger        Use tag directly
Sadness      Use tag directly
Surprise     Use tag directly
Curiosity    Use tag directly
Example:
{
  "_experimental_voice_controls": {
    "emotion": ["positivity:highest", "curiosity"]
  }
}
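
Emotion entries are plain strings of the form `name` or `name:level`, so a small helper can catch misspelled levels before a request is sent. The sketch below is illustrative only: `emotionTag` is a hypothetical name, and it validates against the four levels listed for Positivity above.

```javascript
// Levels listed for Positivity in the table above.
const LEVELS = ["lowest", "low", "high", "highest"];

// Builds one entry for the "emotion" array: "name" or "name:level".
function emotionTag(name, level) {
  if (level === undefined) return name;
  if (!LEVELS.includes(level)) {
    throw new Error(`Unknown level "${level}"; expected one of ${LEVELS.join(", ")}`);
  }
  return `${name}:${level}`;
}

// Reproduces the example payload shown above.
const controls = {
  _experimental_voice_controls: {
    emotion: [emotionTag("positivity", "highest"), emotionTag("curiosity")],
  },
};
```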

Response Format

Bytes Response

{
  "audio": "<base64-encoded-audio>",
  "context_id": "conversation-1"
}
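
Since the `audio` field is base64-encoded, it has to be decoded before it can be played or saved. A minimal sketch for Node.js follows, where `decodeAudio` is a hypothetical helper name.

```javascript
// Decodes the base64 "audio" field of a bytes response into a Node.js Buffer.
function decodeAudio(response) {
  if (typeof response.audio !== "string") {
    throw new Error("Response has no audio field");
  }
  return Buffer.from(response.audio, "base64");
}
```

The resulting Buffer can then be written to disk, for example with `fs.writeFileSync("out.mp3", decodeAudio(response))`.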

Stream Response (SSE)

data: {"audio": "<base64-chunk>", "context_id": "conv-1"}

data: {"audio": "<base64-chunk>", "context_id": "conv-1"}

data: {"done": true}
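
The event stream above can be consumed by extracting the JSON payload of each `data:` line and collecting chunks until the `done` event arrives. The following is a simplified sketch with hypothetical helper names. It assumes the full text of one or more complete events is already available; a real client would buffer partial network reads until each event is complete.

```javascript
// Parses SSE text into the JSON payloads of its "data:" lines.
function parseSseData(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.startsWith("data:"))
    .map((line) => JSON.parse(line.slice("data:".length).trim()));
}

// Collects base64 audio chunks, stopping at the first event with done: true.
function collectChunks(events) {
  const chunks = [];
  for (const event of events) {
    if (event.done) break;
    if (event.audio) chunks.push(event.audio);
  }
  return chunks;
}
```

Each collected chunk should be base64-decoded on its own and the decoded bytes concatenated, since joining the base64 strings first can produce invalid input when a chunk ends with padding.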

WebSocket Messages

// Audio chunk
{
  "type": "chunk",
  "data": "<base64-audio>",
  "context_id": "conv-1"
}

// Done
{
  "type": "done",
  "context_id": "conv-1"
}
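
On the receiving side, these two message types suggest a simple handler: append each decoded chunk to a per-context buffer, then hand over the assembled audio when `done` arrives. The sketch below is one possible shape for Node.js. `createAudioCollector` and its callback are hypothetical, and message types other than `chunk` and `done` are ignored.

```javascript
// Creates a message handler that accumulates decoded audio per context_id
// and calls onDone(contextId, audioBuffer) when a "done" message arrives.
function createAudioCollector(onDone) {
  const buffers = new Map(); // context_id -> array of Buffers

  return function handleMessage(raw) {
    const message = JSON.parse(raw);
    const parts = buffers.get(message.context_id) ?? [];
    if (message.type === "chunk") {
      parts.push(Buffer.from(message.data, "base64"));
      buffers.set(message.context_id, parts);
    } else if (message.type === "done") {
      buffers.delete(message.context_id);
      onDone(message.context_id, Buffer.concat(parts));
    }
  };
}
```

Keying the buffers by `context_id` lets one connection carry several generations at once without mixing their audio.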

Voice Management

List Voices

GET /voices

Response:
[
  {
    "id": "voice-id",
    "name": "Voice Name",
    "description": "Voice description",
    "language": "en"
  }
]

Get Voice

GET /voices/{voice_id}

Response:
{
  "id": "voice-id",
  "name": "Voice Name",
  "description": "Description",
  "language": "en",
  "created_at": "2024-01-01T00:00:00Z"
}

Clone Voice

POST /voices/clone

{
  "name": "Custom Voice",
  "description": "My custom voice",
  "audio_files": ["base64-audio-1", "base64-audio-2"]
}
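
The `audio_files` entries are base64 strings, so raw audio has to be encoded before building the request body. A minimal Node.js sketch, with `buildCloneRequest` as a hypothetical helper name:

```javascript
// Builds the JSON body for POST /voices/clone from raw audio Buffers.
function buildCloneRequest(name, description, audioBuffers) {
  return {
    name,
    description,
    audio_files: audioBuffers.map((buffer) => buffer.toString("base64")),
  };
}
```

In practice the Buffers would typically come from reading sample recordings, for example with `fs.readFileSync(path)`.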

Rate Limits

Plan         Requests/min   Concurrent Streams
Free         20             2
Pro          100            10
Enterprise   Custom         Custom
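
A client can stay under the per-minute limit by tracking its own request timestamps. The sketch below uses a sliding-window counter, chosen here only for illustration; how the server actually counts requests is not documented on this page. `createRateLimiter` is a hypothetical name.

```javascript
// Sliding-window limiter: allows at most `limit` calls per `windowMs`.
// `now` is injectable so the limiter can be tested with a fake clock.
function createRateLimiter(limit, windowMs = 60_000, now = Date.now) {
  const timestamps = []; // times of accepted calls, oldest first

  return function tryAcquire() {
    const cutoff = now() - windowMs;
    // Drop calls that have aged out of the window.
    while (timestamps.length > 0 && timestamps[0] <= cutoff) {
      timestamps.shift();
    }
    if (timestamps.length >= limit) return false;
    timestamps.push(now());
    return true;
  };
}
```

For the Free plan, `createRateLimiter(20)` returns a function that yields `false` once 20 requests have been accepted within the last minute, signalling that the caller should wait.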

Error Codes

Code   Description
400    Bad Request - Invalid parameters
401    Unauthorized - Invalid API key
429    Rate Limit Exceeded
500    Server Error
Example error response:
{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": 429
  }
}
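
Client code can turn this payload into a regular `Error` so that callers handle API failures like any other exception. In the sketch below, `toApiError` is a hypothetical helper, and treating 429 and 500 as retryable is an assumption rather than documented behavior.

```javascript
// Converts the documented error payload into an Error with extra fields.
function toApiError(payload) {
  const { message, type, code } = payload.error ?? {};
  const error = new Error(message ?? "Unknown API error");
  error.type = type;
  error.code = code;
  // Assumed policy: rate-limit and server errors are worth retrying.
  error.retryable = code === 429 || code === 500;
  return error;
}
```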

SDK Methods

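import Cartesia from '@cartesia/cartesia-js';

// Top-level await requires an ES module context.
const cartesia = new Cartesia({ apiKey: 'your-key' });
const voiceId = 'voice-id'; // replace with a Cartesia voice ID

// Generate audio bytes (a complete file).
// output_format is listed as required in the Parameters section.
const audio = await cartesia.tts.bytes({
  model_id: "sonic",
  transcript: "Hello",
  voice: { mode: "id", id: voiceId },
  output_format: { container: "mp3", encoding: "mp3", sample_rate: 44100 }
});

// Stream audio over Server-Sent Events
const stream = await cartesia.tts.sse({
  model_id: "sonic",
  transcript: "Streaming",
  voice: { mode: "id", id: voiceId },
  output_format: { container: "raw", encoding: "pcm_s16le", sample_rate: 16000 }
});

// Open a WebSocket session for the lowest latency
const ws = await cartesia.tts.websocket({
  model_id: "sonic",
  voice: { mode: "id", id: voiceId }
});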

Best Practices

Choose the endpoint by use case:
  • Bytes: For pre-generated audio files
  • SSE: For streaming in web applications
  • WebSocket: For lowest latency in real-time apps
Choose the sample rate by application:
  • 16000 Hz: Voice applications (lowest bandwidth)
  • 22050 Hz: Balanced quality/size
  • 44100 Hz: High-quality music/effects
Reuse the same context_id across related requests for better latency and coherence.
Implement retry logic with exponential backoff for 429 errors.
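
A minimal sketch of exponential backoff for 429 responses follows. The base delay, cap, and attempt count are illustrative values rather than documented recommendations, and `backoffDelay` and `withRetry` are hypothetical helper names.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 30_000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `request` (a function returning a promise of a fetch-style response)
// and retries while the status is 429, up to `maxAttempts` calls in total.
// `wait` is injectable so tests can skip real delays.
async function withRetry(request, maxAttempts = 5, wait = sleep) {
  for (let attempt = 0; ; attempt++) {
    const response = await request();
    if (response.status !== 429 || attempt >= maxAttempts - 1) {
      return response;
    }
    await wait(backoffDelay(attempt));
  }
}
```

With the defaults, the waits between attempts are 500 ms, 1 s, 2 s, and 4 s. Adding random jitter to each delay is a common refinement when many clients may retry at the same moment.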

Resources

For the most up-to-date API reference, always check the official Cartesia documentation.