tts

Run a local TTS (Text-to-Speech) server with two backend options:

  • Kokoro - High-quality neural TTS with GPU acceleration (CUDA/MPS/CPU)
  • Piper - Fast, CPU-friendly ONNX-based synthesis

Note

Quick Start with Kokoro (GPU-accelerated, auto-downloads from HuggingFace):

pip install "agent-cli[kokoro]"
agent-cli server tts --backend kokoro

Quick Start with Piper (CPU-friendly):

pip install "agent-cli[piper]"
agent-cli server tts --backend piper

Server runs at http://localhost:10201. Verify with curl http://localhost:10201/health.

Features

  • OpenAI-compatible API at /v1/audio/speech - drop-in replacement for OpenAI's TTS API
  • Wyoming protocol for Home Assistant voice integration
  • TTL-based memory management - models unload after an idle period
  • Multiple backends - Kokoro (GPU) or Piper (CPU)
  • Auto-download - Models and voices download automatically on first use
  • Multiple voices - Run different voices with independent TTLs

Usage

agent-cli server tts [OPTIONS]

Examples

# Run with Kokoro (auto-downloads model and voice from HuggingFace)
agent-cli server tts --backend kokoro

# Run with Piper (CPU-friendly)
agent-cli server tts --backend piper

# Piper with specific voice and 10-minute TTL
agent-cli server tts --backend piper --model en_US-ryan-high --ttl 600

# Run multiple Piper voices
agent-cli server tts --backend piper --model en_US-lessac-medium --model en_GB-alan-medium

# Preload models at startup
agent-cli server tts --preload

Options

| Option | Default | Description |
| --- | --- | --- |
| --model, -m | - | Model/voice(s) to load. Piper: en_US-lessac-medium, en_GB-alan-medium. Kokoro: af_heart, af_bella, am_adam. Auto-downloads on first use |
| --default-model | - | Voice to use when the client doesn't specify one. Must be in the --model list |
| --device, -d | auto | Compute device: auto, cpu, cuda, mps. Piper is CPU-only; Kokoro supports GPU acceleration |
| --cache-dir | - | Custom directory for downloaded models (default: ~/.cache/agent-cli/tts/) |
| --ttl | 300 | Seconds of inactivity before a model is unloaded from memory. Set to 0 to keep it loaded indefinitely |
| --preload | false | Load model(s) immediately at startup instead of on the first request. Reduces first-request latency |
| --host | 0.0.0.0 | Network interface to bind. Use 0.0.0.0 for all interfaces |
| --port, -p | 10201 | Port for the OpenAI-compatible HTTP API (/v1/audio/speech) |
| --wyoming-port | 10200 | Port for the Wyoming protocol (Home Assistant integration) |
| --no-wyoming | false | Disable the Wyoming protocol server (run only the HTTP API) |
| --download-only | false | Download model(s)/voice(s) to the cache and exit. Useful for Docker builds |
| --backend, -b | auto | TTS engine: auto (prefer Kokoro if available), piper (CPU, many languages), kokoro (GPU, high quality) |

General Options

| Option | Default | Description |
| --- | --- | --- |
| --log-level | info | Set the logging level |

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/audio/speech | POST | OpenAI-compatible speech synthesis (supports stream_format=audio for Kokoro) |
| /v1/audio/speech/json | POST | Alternative endpoint accepting a JSON body |
| /v1/voices | GET | List available voices (models) |
| /v1/model/unload | POST | Manually unload a model from memory |
| /health | GET | Health check with model status |
| /docs | GET | Interactive API documentation |
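
A quick, scriptable way to exercise these endpoints is sketched below with the requests library: it prints the /health and /v1/voices responses, then unloads a model. The unload payload shape is an assumption here (the table above only names the endpoint); check /docs for the exact schema.

import json

import requests

BASE = "http://localhost:10201"

# Health check - the response includes model status
print(json.dumps(requests.get(f"{BASE}/health", timeout=5).json(), indent=2))

# List the voices/models the server can serve
print(json.dumps(requests.get(f"{BASE}/v1/voices", timeout=5).json(), indent=2))

# Manually unload a model (assumed payload; see /docs for the real schema)
requests.post(f"{BASE}/v1/model/unload", json={"model": "af_heart"}, timeout=5)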

Using the API

curl Example

# Synthesize speech (JSON body, OpenAI-compatible)
curl -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, world!", "model": "tts-1", "voice": "alloy", "response_format": "wav"}' \
  --output speech.wav

# With speed adjustment
curl -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "This is faster speech", "voice": "echo", "speed": 1.5, "response_format": "wav"}' \
  --output fast.wav

Python Example (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10201/v1", api_key="not-needed")

# Synthesize speech (request WAV explicitly so it matches the .wav filename)
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello, this is a test of the local TTS server.",
    response_format="wav",
)
response.write_to_file("output.wav")

Streaming Synthesis (Kokoro)

The Kokoro backend supports streaming synthesis following OpenAI's API convention:

  • stream_format=audio enables streaming
  • response_format=pcm is required (this is the default)

This enables lower latency for real-time playback.

Streaming Example

# Stream audio directly to speaker (pcm is the default format)
curl -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world. This is a streaming test.", "voice": "af_heart", "stream_format": "audio"}' \
  --output - | aplay -r 24000 -f S16_LE -c 1

Python Streaming Example (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10201/v1", api_key="not-needed")

# Stream audio chunks as they're generated and save the raw PCM
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="af_heart",
    input="Hello, this is a streaming test.",
    extra_body={"stream_format": "audio"},
) as response:
    with open("stream.pcm", "wb") as f:
        for chunk in response.iter_bytes():
            # Each chunk is 24 kHz, 16-bit signed PCM, mono
            f.write(chunk)

Response Format

  • Content-Type: audio/pcm
  • Headers: X-Sample-Rate: 24000, X-Sample-Width: 2, X-Channels: 1
  • Body: Raw 16-bit signed PCM audio chunks (same as OpenAI's PCM format)
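
Raw PCM has no header, so most players won't open it directly. The sketch below wraps saved stream bytes in a WAV container with Python's standard wave module, using the parameters documented above; the stream.pcm filename simply matches the streaming example earlier.

import wave

# Values from the X-Sample-Rate / X-Sample-Width / X-Channels headers
SAMPLE_RATE = 24000
SAMPLE_WIDTH = 2  # bytes per sample (16-bit)
CHANNELS = 1

with open("stream.pcm", "rb") as src, wave.open("stream.wav", "wb") as dst:
    dst.setnchannels(CHANNELS)
    dst.setsampwidth(SAMPLE_WIDTH)
    dst.setframerate(SAMPLE_RATE)
    dst.writeframes(src.read())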

Wyoming Protocol Streaming

When using the Kokoro backend via Wyoming protocol (port 10200), streaming is automatic - audio chunks are sent as they're generated via AudioChunk messages.
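
For illustration, here is a minimal client sketch using the wyoming Python package (pip install wyoming). The client and event API shown (AsyncTcpClient, Synthesize, AudioChunk, AudioStop) is assumed from that package, so treat this as a sketch rather than a verified recipe:

import asyncio

from wyoming.audio import AudioChunk, AudioStop
from wyoming.client import AsyncTcpClient
from wyoming.tts import Synthesize

async def main() -> None:
    client = AsyncTcpClient("localhost", 10200)
    await client.connect()
    try:
        # Ask the server to synthesize a sentence
        await client.write_event(Synthesize(text="Hello from Wyoming").event())
        with open("wyoming.pcm", "wb") as f:
            while True:
                event = await client.read_event()
                if event is None or AudioStop.is_type(event.type):
                    break  # stream finished (or connection closed)
                if AudioChunk.is_type(event.type):
                    # Chunks arrive as Kokoro generates them
                    f.write(AudioChunk.from_event(event).audio)
    finally:
        await client.disconnect()

asyncio.run(main())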

Architecture

The Kokoro backend runs synthesis in an isolated subprocess (the general pattern is sketched after the list below). This design provides:

  • Memory cleanup: When the model is unloaded (via TTL or /v1/model/unload), the subprocess terminates and all GPU/CPU memory is immediately released
  • Low latency: Streaming delivers audio chunks as Kokoro generates them, reducing time-to-first-audio
  • Stability: Subprocess isolation prevents memory leaks from affecting the main server process
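
A toy sketch of that subprocess pattern with Python's multiprocessing - a stand-in illustration of why terminating the worker frees memory, not the project's actual implementation (the model and chunk values are placeholders):

import multiprocessing as mp

def worker(conn) -> None:
    model = object()  # placeholder for a loaded Kokoro model
    while True:
        text = conn.recv()
        if text is None:  # unload requested (TTL expiry or /v1/model/unload)
            break
        conn.send(b"\x00\x00")  # placeholder for a synthesized PCM chunk

if __name__ == "__main__":
    parent, child = mp.Pipe()
    proc = mp.Process(target=worker, args=(child,))
    proc.start()
    parent.send("Hello")
    print(parent.recv())
    parent.send(None)  # tell the worker to exit
    proc.join()  # the OS reclaims all of the process's memory at once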

Voice Selection

Kokoro Voices

Kokoro voices are specified per-request via the API voice parameter. Voices auto-download from HuggingFace on first use.

The voice name prefix indicates accent: af_ = American Female, am_ = American Male, bf_ = British Female, bm_ = British Male.

| Voice | Accent | Gender | Notes |
| --- | --- | --- | --- |
| af_heart | American | Female | Default voice |
| af_bella | American | Female | |
| af_nova | American | Female | |
| af_sky | American | Female | |
| am_adam | American | Male | |
| am_michael | American | Male | |
| bf_emma | British | Female | |
| bm_george | British | Male | |

Browse all 30+ voices at hexgrad/Kokoro-82M.
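
For example, selecting a British voice for a single request with the OpenAI SDK (reusing the client from the earlier Python example):

response = client.audio.speech.create(
    model="tts-1",
    voice="bf_emma",  # per-request Kokoro voice; auto-downloads on first use
    input="Good evening.",
    response_format="wav",
)
response.write_to_file("emma.wav")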

Piper Voices

With Piper, the model name IS the voice; select one with --model.

| Voice | Language | Quality |
| --- | --- | --- |
| en_US-lessac-medium | English (US) | Medium |
| en_US-ryan-high | English (US) | High |
| en_GB-alan-medium | English (UK) | Medium |
| de_DE-thorsten-high | German | High |
| fr_FR-upmc-medium | French | Medium |

Browse all voices at rhasspy/piper.

Installation

Kokoro (GPU-accelerated)

pip install "agent-cli[kokoro]"
# or
uv sync --extra kokoro

Kokoro requires PyTorch. For GPU acceleration:

  • CUDA: install PyTorch with CUDA support
  • Apple Silicon: PyTorch uses MPS automatically

Piper (CPU-friendly)

pip install "agent-cli[piper]"
# or
uv sync --extra piper