Voice API Reference

Overview

CARTER uses Cartesia’s Sonic model for ultra-realistic voice generation with emotional expression. This reference covers the key endpoints and parameters.

Text-to-Speech Endpoints

Generate Audio (Bytes)

Generate complete audio files:

POST /tts/bytes

{
  "model_id": "sonic",
  "transcript": "Your text here",
  "voice": {
    "mode": "id",
    "id": "voice-id"
  },
  "output_format": {
    "container": "mp3",
    "encoding": "mp3",
    "sample_rate": 44100
  }
}
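
The request above could be assembled as in the following minimal sketch. The base URL `https://api.cartesia.ai` is inferred from the WebSocket host used in this reference, and the `X-API-Key` header name is an assumption; check both against the official Cartesia documentation. `buildBytesRequest` is a hypothetical helper, not part of any SDK.

```javascript
// Hypothetical helper: builds the URL and fetch options for POST /tts/bytes.
// BASE_URL and the X-API-Key header are assumptions, not confirmed by this reference.
const BASE_URL = "https://api.cartesia.ai";

function buildBytesRequest(apiKey, voiceId, transcript) {
  const body = {
    model_id: "sonic",
    transcript,
    voice: { mode: "id", id: voiceId },
    output_format: { container: "mp3", encoding: "mp3", sample_rate: 44100 },
  };
  return {
    url: `${BASE_URL}/tts/bytes`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "X-API-Key": apiKey },
      body: JSON.stringify(body),
    },
  };
}

// Usage (not executed here):
// const { url, init } = buildBytesRequest("your-key", "voice-id", "Your text here");
// const response = await fetch(url, init);
```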

Stream Audio (SSE)

Stream audio chunks as Server-Sent Events:

POST /tts/sse

{
  "model_id": "sonic",
  "transcript": "Streaming text",
  "voice": {
    "mode": "id",
    "id": "voice-id"
  },
  "output_format": {
    "container": "raw",
    "encoding": "pcm_s16le",
    "sample_rate": 16000
  },
  "stream": true
}

WebSocket Connection

Use a WebSocket connection for the lowest latency:

WSS wss://api.cartesia.ai/tts/websocket

// Send message
{
  "model_id": "sonic",
  "voice": {
    "mode": "id",
    "id": "voice-id"
  },
  "transcript": "Real-time text",
  "context_id": "conversation-1"
}

Parameters

Model ID

model_id (string, required)

The model to use for generation. Use "sonic" for the latest features. Options: "sonic", "sonic-2", "sonic-turbo".

Transcript

transcript (string, required)

The text to convert to speech. Supports SSML tags for advanced control.

Voice

voice (object, required)

Voice configuration object:

{
  "mode": "id",  // or "embedding"
  "id": "voice-id"  // Cartesia voice ID
}

Output Format

output_format (object, required)

Audio output configuration:

{
  "container": "mp3",  // mp3, raw, wav
  "encoding": "mp3",   // mp3, pcm_s16le, pcm_f32le
  "sample_rate": 44100 // 16000, 22050, 44100, 48000
}
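
As a sketch of how a client might catch a bad `output_format` before sending a request, the check below tests each field against the values listed in the comments above. `validateOutputFormat` is a hypothetical helper, and it does not verify which container and encoding combinations the API actually accepts.

```javascript
// Allowed values copied from the comments in the output_format sample.
const ALLOWED = {
  container: ["mp3", "raw", "wav"],
  encoding: ["mp3", "pcm_s16le", "pcm_f32le"],
  sample_rate: [16000, 22050, 44100, 48000],
};

// Returns a list of problems; an empty list means every field uses a listed value.
function validateOutputFormat(format) {
  const problems = [];
  for (const [field, allowed] of Object.entries(ALLOWED)) {
    if (!allowed.includes(format[field])) {
      problems.push(`${field} must be one of: ${allowed.join(", ")}`);
    }
  }
  return problems;
}
```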

Experimental Voice Controls

_experimental_voice_controls (object)

Control emotions and speech characteristics:

{
  "emotion": ["positivity:highest", "excitement"],
  "speed": "fast",  // slow, normal, fast
  "volume": "high"  // low, normal, high
}

Voice Emotions

Available emotional controls:

Emotion	Levels
Positivity	lowest, low, high, highest
Anger	Use tag directly
Sadness	Use tag directly
Surprise	Use tag directly
Curiosity	Use tag directly

Example:

{
  "_experimental_voice_controls": {
    "emotion": ["positivity:highest", "curiosity"]
  }
}
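
The tag strings in the example follow a `name` or `name:level` pattern. A minimal sketch of building them is shown below; `emotionTag` and `voiceControls` are hypothetical helpers, and the sketch does not enforce which emotions accept a level (the table lists levels only for positivity).

```javascript
// Levels taken from the emotions table above.
const LEVELS = ["lowest", "low", "high", "highest"];

// Builds "name" or "name:level"; throws on a level that is not in the table.
function emotionTag(name, level) {
  const tag = name.toLowerCase();
  if (level === undefined) return tag;
  if (!LEVELS.includes(level)) throw new Error(`Unknown level: ${level}`);
  return `${tag}:${level}`;
}

// Wraps a list of tags in the _experimental_voice_controls object.
function voiceControls(tags) {
  return { _experimental_voice_controls: { emotion: tags } };
}
```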

Response Format

Bytes Response

{
  "audio": "<base64-encoded-audio>",
  "context_id": "conversation-1"
}
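
Assuming the response body has the shape shown above, the base64 `audio` field can be decoded into raw bytes as in this sketch. `decodeAudio` is a hypothetical helper, and `Buffer` assumes a Node.js runtime.

```javascript
// Decodes the base64 "audio" field of a bytes response into a Node.js Buffer.
function decodeAudio(responseBody) {
  if (typeof responseBody.audio !== "string") {
    throw new Error("response has no audio field");
  }
  return Buffer.from(responseBody.audio, "base64");
}
```

The resulting Buffer can then be written to disk or handed to an audio player.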

Stream Response (SSE)

data: {"audio": "<base64-chunk>", "context_id": "conv-1"}

data: {"audio": "<base64-chunk>", "context_id": "conv-1"}

data: {"done": true}
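
A minimal sketch of consuming events in this shape: it extracts the JSON payload of each `data:` line and concatenates the decoded audio until the `done` marker. `parseSseEvents` and `collectAudio` are hypothetical helpers; the sketch assumes a Node.js runtime and that the text passed in contains only complete lines, so a real stream reader would need to buffer partial lines between network chunks.

```javascript
// Parses a block of SSE text into the JSON payloads of its "data:" lines.
function parseSseEvents(text) {
  const events = [];
  for (const line of text.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank lines and other fields
    events.push(JSON.parse(line.slice(5).trim()));
  }
  return events;
}

// Concatenates decoded audio chunks, stopping at the {"done": true} event.
function collectAudio(events) {
  const chunks = [];
  for (const event of events) {
    if (event.done) break;
    if (event.audio) chunks.push(Buffer.from(event.audio, "base64"));
  }
  return Buffer.concat(chunks);
}
```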

WebSocket Messages

// Audio chunk
{
  "type": "chunk",
  "data": "<base64-audio>",
  "context_id": "conv-1"
}

// Done
{
  "type": "done",
  "context_id": "conv-1"
}
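
One way to handle these messages is to accumulate chunks per `context_id` and emit the combined audio when the matching `done` message arrives. The sketch below assumes the message shapes shown above and a Node.js runtime; `createChunkCollector` is a hypothetical helper.

```javascript
// Returns a message handler that groups audio chunks by context_id
// and calls onDone(contextId, audioBuffer) when a "done" message arrives.
function createChunkCollector(onDone) {
  const buffers = new Map(); // context_id -> array of Buffers
  return function handleMessage(raw) {
    const msg = JSON.parse(raw);
    if (msg.type === "chunk") {
      const list = buffers.get(msg.context_id) ?? [];
      list.push(Buffer.from(msg.data, "base64"));
      buffers.set(msg.context_id, list);
    } else if (msg.type === "done") {
      const audio = Buffer.concat(buffers.get(msg.context_id) ?? []);
      buffers.delete(msg.context_id);
      onDone(msg.context_id, audio);
    }
  };
}
```

The returned handler can be attached to a WebSocket's message event, passing each message's text payload as `raw`.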

Voice Management

List Voices

GET /voices

Response:
[
  {
    "id": "voice-id",
    "name": "Voice Name",
    "description": "Voice description",
    "language": "en"
  }
]
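
Given a response in this shape, a client might narrow the list to one language before choosing a voice ID for a text-to-speech request. `voicesForLanguage` below is a hypothetical helper, shown only as a sketch.

```javascript
// Returns the id and name of every voice whose language matches.
function voicesForLanguage(voices, language) {
  return voices
    .filter((voice) => voice.language === language)
    .map((voice) => ({ id: voice.id, name: voice.name }));
}
```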

Get Voice

GET /voices/{voice_id}

Response:
{
  "id": "voice-id",
  "name": "Voice Name",
  "description": "Description",
  "language": "en",
  "created_at": "2024-01-01T00:00:00Z"
}

Clone Voice

POST /voices/clone

{
  "name": "Custom Voice",
  "description": "My custom voice",
  "audio_files": ["base64-audio-1", "base64-audio-2"]
}
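
A sketch of preparing this payload from raw audio bytes, assuming `audio_files` takes plain base64 strings as the placeholders above suggest. `buildClonePayload` is a hypothetical helper and relies on Node.js `Buffer`.

```javascript
// Builds the POST /voices/clone body; `samples` is an array of Buffers or Uint8Arrays.
function buildClonePayload(name, description, samples) {
  return {
    name,
    description,
    audio_files: samples.map((bytes) => Buffer.from(bytes).toString("base64")),
  };
}
```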

Rate Limits

Plan	Requests/min	Concurrent Streams
Free	20	2
Pro	100	10
Enterprise	Custom	Custom

Error Codes

Code	Description
400	Bad Request - Invalid parameters
401	Unauthorized - Invalid API key
429	Rate Limit Exceeded
500	Server Error

Example error response:

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": 429
  }
}
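
Client code can turn a body like this into a typed error so callers can branch on `code` or `type`. The following is a minimal sketch; `VoiceApiError` and `toApiError` are hypothetical names, not part of the SDK.

```javascript
// Error subclass carrying the "type" and "code" fields from the error body.
class VoiceApiError extends Error {
  constructor(message, type, code) {
    super(message);
    this.name = "VoiceApiError";
    this.type = type;
    this.code = code;
  }
}

// Converts a parsed error response body into a VoiceApiError.
function toApiError(body) {
  const err = body && body.error;
  if (!err) return new VoiceApiError("Unknown error", "unknown", 0);
  return new VoiceApiError(err.message, err.type, err.code);
}
```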

SDK Methods

import Cartesia from '@cartesia/cartesia-js';

const cartesia = new Cartesia({ apiKey: 'your-key' });
const voiceId = 'voice-id'; // Replace with a Cartesia voice ID

// Generate audio bytes
const audio = await cartesia.tts.bytes({
  model_id: "sonic",
  transcript: "Hello",
  voice: { mode: "id", id: voiceId }
});

// Stream audio
const stream = await cartesia.tts.sse({
  model_id: "sonic",
  transcript: "Streaming",
  voice: { mode: "id", id: voiceId }
});

// WebSocket
const ws = await cartesia.tts.websocket({
  model_id: "sonic",
  voice: { mode: "id", id: voiceId }
});

Best Practices

Choose the Right Endpoint

Bytes: For pre-generated audio files
SSE: For streaming in web applications
WebSocket: For lowest latency in real-time apps

Optimize Sample Rate

16000 Hz: Voice applications (lowest bandwidth)
22050 Hz: Balanced quality/size
44100 Hz: High quality music/effects
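
To make the bandwidth difference concrete, the sketch below computes the data rate of uncompressed PCM output, assuming mono audio (an assumption; this reference does not state the channel count). It does not apply to compressed output such as mp3. `pcmBytesPerSecond` is a hypothetical helper.

```javascript
// Bytes per sample for the two PCM encodings listed under Output Format.
const BYTES_PER_SAMPLE = { pcm_s16le: 2, pcm_f32le: 4 };

// Data rate of raw PCM audio in bytes per second (mono by default).
function pcmBytesPerSecond(sampleRate, encoding, channels = 1) {
  return sampleRate * BYTES_PER_SAMPLE[encoding] * channels;
}

// pcmBytesPerSecond(16000, "pcm_s16le") is 32000
// pcmBytesPerSecond(44100, "pcm_s16le") is 88200
```

Under these assumptions, 44100 Hz output carries roughly 2.76 times the data of 16000 Hz output at the same encoding.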

Use Context IDs

Maintain context_id across related requests for better latency and coherence.
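
As a sketch, the helper below builds one WebSocket message per piece of text while reusing a single `context_id`, following the message shape from the WebSocket Connection section. `buildContextMessages` is a hypothetical helper.

```javascript
// Builds one message per transcript piece, all sharing the same context_id.
function buildContextMessages(contextId, voiceId, pieces) {
  return pieces.map((transcript) => ({
    model_id: "sonic",
    voice: { mode: "id", id: voiceId },
    transcript,
    context_id: contextId,
  }));
}

// Usage (not executed here), with `socket` being an open WebSocket:
// for (const message of buildContextMessages("conversation-1", "voice-id", ["Hello.", "How are you?"])) {
//   socket.send(JSON.stringify(message));
// }
```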

Handle Rate Limits

Implement exponential backoff and retry logic for 429 errors.
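
A minimal sketch of that retry logic is shown below. The base delay, cap, and retry count are illustrative defaults, not values from this reference, and `request` is assumed to be any function that returns a promise of a fetch-style response with a `status` field.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at capMs.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Default sleep implementation; can be replaced in tests.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `request` until it returns a non-429 status or retries run out.
async function withRetry(request, maxRetries = 5, sleep = defaultSleep) {
  for (let attempt = 0; ; attempt++) {
    const response = await request();
    if (response.status !== 429 || attempt >= maxRetries) return response;
    await sleep(backoffDelayMs(attempt));
  }
}
```

With the defaults above, the waits are 500 ms, 1 s, 2 s, 4 s, and 8 s before the final attempt.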

Resources

For the most up-to-date API reference, always check the official Cartesia documentation.
