tts

Run a local TTS (Text-to-Speech) server with two backend options:

  • Kokoro - High-quality neural TTS with GPU acceleration (CUDA/MPS/CPU)
  • Piper - Fast, CPU-friendly ONNX-based synthesis

Note

Quick Start with Kokoro (GPU-accelerated, auto-downloads from HuggingFace):

pip install "agent-cli[kokoro]"
agent-cli server tts --backend kokoro

Quick Start with Piper (CPU-friendly):

pip install "agent-cli[piper]"
agent-cli server tts --backend piper

Server runs at http://localhost:10201. Verify with curl http://localhost:10201/health.

Features

  • OpenAI-compatible API at /v1/audio/speech - drop-in replacement for OpenAI's TTS API
  • Wyoming protocol for Home Assistant voice integration
  • TTL-based memory management - models unload after an idle period
  • Multiple backends - Kokoro (GPU) or Piper (CPU)
  • Auto-download - Models and voices download automatically on first use
  • Multiple voices - Run different voices with independent TTLs

Usage

agent-cli server tts [OPTIONS]

Examples

# Run with Kokoro (auto-downloads model and voice from HuggingFace)
agent-cli server tts --backend kokoro

# Run with Piper (CPU-friendly)
agent-cli server tts --backend piper

# Piper with specific voice and 10-minute TTL
agent-cli server tts --backend piper --model en_US-ryan-high --ttl 600

# Run multiple Piper voices
agent-cli server tts --backend piper --model en_US-lessac-medium --model en_GB-alan-medium

# Preload models at startup
agent-cli server tts --preload

Options

| Option | Default | Description |
| --- | --- | --- |
| --model, -m | - | Model/voice(s) to load. Piper: en_US-lessac-medium, en_GB-alan-medium. Kokoro: af_heart, af_bella, am_adam. Auto-downloads on first use |
| --default-model | - | Voice to use when the client doesn't specify one. Must be in the --model list |
| --device, -d | auto | Compute device: auto, cpu, cuda, mps. Piper is CPU-only; Kokoro supports GPU acceleration |
| --cache-dir | - | Custom directory for downloaded models (default: ~/.cache/agent-cli/tts/) |
| --ttl | 300 | Seconds of inactivity before a model is unloaded from memory. Set to 0 to keep it loaded indefinitely |
| --preload | false | Load model(s) immediately at startup instead of on the first request. Reduces first-request latency |
| --host | 0.0.0.0 | Network interface to bind. Use 0.0.0.0 for all interfaces |
| --port, -p | 10201 | Port for the OpenAI-compatible HTTP API (/v1/audio/speech) |
| --wyoming-port | 10200 | Port for the Wyoming protocol (Home Assistant integration) |
| --no-wyoming | false | Disable the Wyoming protocol server (run only the HTTP API) |
| --download-only | false | Download model(s)/voice(s) to the cache and exit. Useful for Docker builds |
| --backend, -b | auto | TTS engine: auto (prefer Kokoro if available), piper (CPU, many languages), kokoro (GPU, high quality) |

General Options

| Option | Default | Description |
| --- | --- | --- |
| --log-level | info | Set the logging level |

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/audio/speech | POST | OpenAI-compatible speech synthesis (supports stream_format=audio for Kokoro) |
| /v1/audio/speech/json | POST | Alternative endpoint accepting a JSON body |
| /v1/voices | GET | List available voices (models) |
| /v1/model/unload | POST | Manually unload a model from memory |
| /health | GET | Health check with model status |
| /docs | GET | Interactive API documentation |
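
A quick, scriptable way to exercise these endpoints is sketched below with the requests library: it prints the /health and /v1/voices responses, then unloads a model. The unload payload shape is an assumption here (the table above only names the endpoint); check /docs for the exact schema.

import json

import requests

BASE = "http://localhost:10201"

# Health check - the response includes model status
print(json.dumps(requests.get(f"{BASE}/health", timeout=5).json(), indent=2))

# List the voices/models the server can serve
print(json.dumps(requests.get(f"{BASE}/v1/voices", timeout=5).json(), indent=2))

# Manually unload a model (assumed payload; see /docs for the real schema)
requests.post(f"{BASE}/v1/model/unload", json={"model": "af_heart"}, timeout=5)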

Using the API

curl Example

# Synthesize speech (JSON body, OpenAI-compatible)
curl -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, world!", "model": "tts-1", "voice": "alloy", "response_format": "wav"}' \
  --output speech.wav

# With speed adjustment
curl -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "This is faster speech", "voice": "echo", "speed": 1.5, "response_format": "wav"}' \
  --output fast.wav

Python Example (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10201/v1", api_key="not-needed")

# Synthesize speech (request WAV explicitly so it matches the .wav filename)
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello, this is a test of the local TTS server.",
    response_format="wav",
)
response.write_to_file("output.wav")

Streaming Synthesis (Kokoro)

The Kokoro backend supports streaming synthesis following OpenAI's API convention:

  • stream_format=audio enables streaming
  • response_format=pcm is required (this is the default)

This enables lower latency for real-time playback.

Streaming Example

# Stream audio directly to speaker (pcm is the default format)
curl -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world. This is a streaming test.", "voice": "af_heart", "stream_format": "audio"}' \
  --output - | aplay -r 24000 -f S16_LE -c 1

Python Streaming Example (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10201/v1", api_key="not-needed")

# Stream audio chunks as they're generated and save the raw PCM
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="af_heart",
    input="Hello, this is a streaming test.",
    extra_body={"stream_format": "audio"},
) as response:
    with open("stream.pcm", "wb") as f:
        for chunk in response.iter_bytes():
            # Each chunk is 24 kHz, 16-bit signed PCM, mono
            f.write(chunk)

Response Format

  • Content-Type: audio/pcm
  • Headers: X-Sample-Rate: 24000, X-Sample-Width: 2, X-Channels: 1
  • Body: Raw 16-bit signed PCM audio chunks (same as OpenAI's PCM format)
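
Raw PCM has no header, so most players won't open it directly. The sketch below wraps saved stream bytes in a WAV container with Python's standard wave module, using the parameters documented above; the stream.pcm filename simply matches the streaming example earlier.

import wave

# Values from the X-Sample-Rate / X-Sample-Width / X-Channels headers
SAMPLE_RATE = 24000
SAMPLE_WIDTH = 2  # bytes per sample (16-bit)
CHANNELS = 1

with open("stream.pcm", "rb") as src, wave.open("stream.wav", "wb") as dst:
    dst.setnchannels(CHANNELS)
    dst.setsampwidth(SAMPLE_WIDTH)
    dst.setframerate(SAMPLE_RATE)
    dst.writeframes(src.read())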

Wyoming Protocol Streaming

When using the Kokoro backend via Wyoming protocol (port 10200), streaming is automatic - audio chunks are sent as they're generated via AudioChunk messages.
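
For illustration, here is a minimal client sketch using the wyoming Python package (pip install wyoming). The client and event API shown (AsyncTcpClient, Synthesize, AudioChunk, AudioStop) is assumed from that package, so treat this as a sketch rather than a verified recipe:

import asyncio

from wyoming.audio import AudioChunk, AudioStop
from wyoming.client import AsyncTcpClient
from wyoming.tts import Synthesize

async def main() -> None:
    client = AsyncTcpClient("localhost", 10200)
    await client.connect()
    try:
        # Ask the server to synthesize a sentence
        await client.write_event(Synthesize(text="Hello from Wyoming").event())
        with open("wyoming.pcm", "wb") as f:
            while True:
                event = await client.read_event()
                if event is None or AudioStop.is_type(event.type):
                    break  # stream finished (or connection closed)
                if AudioChunk.is_type(event.type):
                    # Chunks arrive as Kokoro generates them
                    f.write(AudioChunk.from_event(event).audio)
    finally:
        await client.disconnect()

asyncio.run(main())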

Architecture

The Kokoro backend runs synthesis in an isolated subprocess (the general pattern is sketched after the list below). This design provides:

  • Memory cleanup: When the model is unloaded (via TTL or /v1/model/unload), the subprocess terminates and all GPU/CPU memory is immediately released
  • Low latency: Streaming delivers audio chunks as Kokoro generates them, reducing time-to-first-audio
  • Stability: Subprocess isolation prevents memory leaks from affecting the main server process
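
A toy sketch of that subprocess pattern with Python's multiprocessing - a stand-in illustration of why terminating the worker frees memory, not the project's actual implementation (the model and chunk values are placeholders):

import multiprocessing as mp

def worker(conn) -> None:
    model = object()  # placeholder for a loaded Kokoro model
    while True:
        text = conn.recv()
        if text is None:  # unload requested (TTL expiry or /v1/model/unload)
            break
        conn.send(b"\x00\x00")  # placeholder for a synthesized PCM chunk

if __name__ == "__main__":
    parent, child = mp.Pipe()
    proc = mp.Process(target=worker, args=(child,))
    proc.start()
    parent.send("Hello")
    print(parent.recv())
    parent.send(None)  # tell the worker to exit
    proc.join()  # the OS reclaims all of the process's memory at once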

Voice Selection

Kokoro Voices

Kokoro voices are specified per-request via the API voice parameter. Voices auto-download from HuggingFace on first use.

The voice name prefix indicates accent: af_ = American Female, am_ = American Male, bf_ = British Female, bm_ = British Male.

| Voice | Accent | Gender | Notes |
| --- | --- | --- | --- |
| af_heart | American | Female | Default voice |
| af_bella | American | Female | |
| af_nova | American | Female | |
| af_sky | American | Female | |
| am_adam | American | Male | |
| am_michael | American | Male | |
| bf_emma | British | Female | |
| bm_george | British | Male | |

Browse all 30+ voices at hexgrad/Kokoro-82M.
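
For example, selecting a British voice for a single request with the OpenAI SDK (reusing the client from the earlier Python example):

response = client.audio.speech.create(
    model="tts-1",
    voice="bf_emma",  # per-request Kokoro voice; auto-downloads on first use
    input="Good evening.",
    response_format="wav",
)
response.write_to_file("emma.wav")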

Piper Voices

With Piper, the model name IS the voice; select one with --model.

| Voice | Language | Quality |
| --- | --- | --- |
| en_US-lessac-medium | English (US) | Medium |
| en_US-ryan-high | English (US) | High |
| en_GB-alan-medium | English (UK) | Medium |
| de_DE-thorsten-high | German | High |
| fr_FR-upmc-medium | French | Medium |

Browse all voices at rhasspy/piper.

Installation

Kokoro (GPU-accelerated)

pip install "agent-cli[kokoro]"
# or
uv sync --extra kokoro

Kokoro requires PyTorch. For GPU acceleration:

  • CUDA: install PyTorch with CUDA support
  • Apple Silicon: PyTorch uses MPS automatically

Piper (CPU-friendly)

pip install "agent-cli[piper]"
# or
uv sync --extra piper