# tts
Run a local TTS (Text-to-Speech) server with two backend options:
- Kokoro - High-quality neural TTS with GPU acceleration (CUDA/MPS/CPU)
- Piper - Fast, CPU-friendly ONNX-based synthesis
**Quick Start with Kokoro** (GPU-accelerated, auto-downloads from HuggingFace):
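```bash
agent-cli server tts --backend kokoro
```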
**Quick Start with Piper** (CPU-friendly):
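```bash
agent-cli server tts --backend piper
```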
The server runs at `http://localhost:10201`. Verify with `curl http://localhost:10201/health`.
## Features
- OpenAI-compatible API at `/v1/audio/speech` - a drop-in replacement for OpenAI's TTS API
- Wyoming protocol for Home Assistant voice integration
- TTL-based memory management - models unload after an idle period
- Multiple backends - Kokoro (GPU) or Piper (CPU)
- Auto-download - Models and voices download automatically on first use
- Multiple voices - Run different voices with independent TTLs
## Usage

### Examples
```bash
# Run with Kokoro (auto-downloads model and voice from HuggingFace)
agent-cli server tts --backend kokoro

# Run with Piper (CPU-friendly)
agent-cli server tts --backend piper

# Piper with a specific voice and a 10-minute TTL
agent-cli server tts --backend piper --model en_US-ryan-high --ttl 600

# Run multiple Piper voices
agent-cli server tts --backend piper --model en_US-lessac-medium --model en_GB-alan-medium

# Preload models at startup
agent-cli server tts --preload
```
### Options
| Option | Default | Description |
|---|---|---|
| `--model`, `-m` | - | Model/voice(s) to load. Piper: `en_US-lessac-medium`, `en_GB-alan-medium`. Kokoro: `af_heart`, `af_bella`, `am_adam`. Auto-downloads on first use |
| `--default-model` | - | Voice to use when the client doesn't specify one. Must be in the `--model` list |
| `--device`, `-d` | `auto` | Compute device: `auto`, `cpu`, `cuda`, `mps`. Piper is CPU-only; Kokoro supports GPU acceleration |
| `--cache-dir` | - | Custom directory for downloaded models (default: `~/.cache/agent-cli/tts/`) |
| `--ttl` | `300` | Seconds of inactivity before a model is unloaded from memory. Set to `0` to keep models loaded indefinitely |
| `--preload` | `false` | Load model(s) at startup instead of on first request. Useful for reducing first-request latency |
| `--host` | `0.0.0.0` | Network interface to bind. Use `0.0.0.0` for all interfaces |
| `--port`, `-p` | `10201` | Port for the OpenAI-compatible HTTP API (`/v1/audio/speech`) |
| `--wyoming-port` | `10200` | Port for the Wyoming protocol (Home Assistant integration) |
| `--no-wyoming` | `false` | Disable the Wyoming protocol server (run only the HTTP API) |
| `--download-only` | `false` | Download model(s)/voice(s) to the cache and exit. Useful for Docker builds |
| `--backend`, `-b` | `auto` | TTS engine: `auto` (prefer Kokoro if available), `piper` (CPU, many languages), `kokoro` (GPU, high quality) |
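For illustration, several of these flags combine naturally in a single invocation; the voice names below are the Piper examples from the table above:

```bash
# Serve two Piper voices, defaulting to en_US-ryan-high when the client
# omits "voice"; preload both and keep them resident (--ttl 0)
agent-cli server tts --backend piper \
  --model en_US-ryan-high --model en_GB-alan-medium \
  --default-model en_US-ryan-high --preload --ttl 0
```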
### General Options

| Option | Default | Description |
|---|---|---|
| `--log-level` | `info` | Set the logging level |
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/v1/audio/speech` | POST | OpenAI-compatible speech synthesis (supports `stream_format=audio` for Kokoro) |
| `/v1/audio/speech/json` | POST | Alternative endpoint accepting a JSON body |
| `/v1/voices` | GET | List available voices (models) |
| `/v1/model/unload` | POST | Manually unload a model from memory |
| `/health` | GET | Health check with model status |
| `/docs` | GET | Interactive API documentation |
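The GET endpoints can be exercised directly with curl (the exact response shapes aren't documented here; see `/docs` on a running server):

```bash
# List the voices the server currently exposes
curl http://localhost:10201/v1/voices

# Health check, including model load status
curl http://localhost:10201/health
```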
## Using the API

### curl Example
```bash
# Synthesize speech (JSON body, OpenAI-compatible)
curl -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, world!", "model": "tts-1", "voice": "alloy", "response_format": "wav"}' \
  --output speech.wav

# With speed adjustment
curl -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "This is faster speech", "voice": "echo", "speed": 1.5, "response_format": "wav"}' \
  --output fast.wav
```
### Python Example (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10201/v1", api_key="not-needed")

# Synthesize speech
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello, this is a test of the local TTS server.",
)
response.write_to_file("output.wav")
```
## Streaming Synthesis (Kokoro)

The Kokoro backend supports streaming synthesis following OpenAI's API convention:

- `stream_format=audio` enables streaming
- `response_format=pcm` is required (this is the default)

This enables lower latency for real-time playback.
### Streaming Example
```bash
# Stream audio directly to the speaker (pcm is the default format)
curl -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world. This is a streaming test.", "voice": "af_heart", "stream_format": "audio"}' \
  --output - | aplay -r 24000 -f S16_LE -c 1
```
### Python Streaming Example (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10201/v1", api_key="not-needed")

# Stream audio chunks as they're generated
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="af_heart",
    input="Hello, this is a streaming test.",
    extra_body={"stream_format": "audio"},
) as response:
    for chunk in response.iter_bytes():
        # Process each audio chunk (24 kHz, 16-bit signed PCM, mono)
        process_audio(chunk)
```
### Response Format

- Content-Type: `audio/pcm`
- Headers: `X-Sample-Rate: 24000`, `X-Sample-Width: 2`, `X-Channels: 1`
- Body: raw 16-bit signed PCM audio chunks (same as OpenAI's PCM format)
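To confirm these headers on your setup, you can have curl print only the response headers and discard the audio (the request body values here are illustrative):

```bash
# Dump response headers to stdout, discard the audio body
curl -sS -D - -o /dev/null -X POST http://localhost:10201/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Header check", "voice": "af_heart", "stream_format": "audio"}'
```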
### Wyoming Protocol Streaming

When using the Kokoro backend via the Wyoming protocol (port 10200), streaming is automatic: audio chunks are sent as they're generated via `AudioChunk` messages.
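Wyoming events are newline-delimited JSON over TCP, so a quick smoke test is possible with netcat. This is a hedged sketch: it assumes a netcat variant that supports `-q`, and that the server answers a `describe` event with its `info`:

```bash
# Ask the Wyoming server to describe itself; expect an "info" event in reply
printf '{"type": "describe"}\n' | nc -q 1 localhost 10200
```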
## Architecture

The Kokoro backend runs synthesis in an isolated subprocess. This design provides:

- Memory cleanup: when the model is unloaded (via TTL or `/v1/model/unload`; see the curl sketch after this list), the subprocess terminates and all GPU/CPU memory is released immediately
- Low latency: streaming delivers audio chunks as Kokoro generates them, reducing time-to-first-audio
- Stability: subprocess isolation prevents memory leaks from affecting the main server process
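A manual unload can be triggered over HTTP. Whether the endpoint requires a JSON body naming the model isn't specified here, so treat the payload below as an assumption and verify against `/docs`:

```bash
# Unload a model without waiting for the TTL to expire
# (the {"model": ...} body is an assumption; check /docs on your server)
curl -X POST http://localhost:10201/v1/model/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "af_heart"}'
```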
## Voice Selection

### Kokoro Voices

Kokoro voices are specified per-request via the API `voice` parameter. Voices auto-download from HuggingFace on first use.

The voice name prefix indicates accent and gender: `af_` = American female, `am_` = American male, `bf_` = British female, `bm_` = British male.
| Voice | Accent | Gender | Notes |
|---|---|---|---|
| `af_heart` | American | Female | Default voice |
| `af_bella` | American | Female | |
| `af_nova` | American | Female | |
| `af_sky` | American | Female | |
| `am_adam` | American | Male | |
| `am_michael` | American | Male | |
| `bf_emma` | British | Female | |
| `bm_george` | British | Male | |
Browse all 30+ voices at [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M).
### Piper Voices

With Piper, the model name *is* the voice. Use `--model` to specify it.
| Voice | Language | Quality |
|---|---|---|
| `en_US-lessac-medium` | English (US) | Medium |
| `en_US-ryan-high` | English (US) | High |
| `en_GB-alan-medium` | English (UK) | Medium |
| `de_DE-thorsten-high` | German | High |
| `fr_FR-upmc-medium` | French | Medium |
Browse all voices at [rhasspy/piper](https://github.com/rhasspy/piper).
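Any of these voices can be fetched ahead of time with `--download-only` (documented in the options table above), which is handy for Docker builds:

```bash
# Pre-fetch a Piper voice into the cache, then exit
agent-cli server tts --backend piper --model de_DE-thorsten-high --download-only
```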
Installation
Kokoro (GPU-accelerated)
Kokoro requires PyTorch. For GPU acceleration: - CUDA: Install PyTorch with CUDA support - Apple Silicon: PyTorch automatically uses MPS