# whisper
Run a local Whisper ASR server with automatic backend selection based on your platform:
- macOS Apple Silicon → mlx-whisper (Metal acceleration)
- Linux/CUDA → faster-whisper (CTranslate2)
- HuggingFace → transformers (supports safetensors models)
**Quick Start** - Get transcription working in 30 seconds:
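A minimal sketch of those 30 seconds, assuming the `faster-whisper` extra described under Installation below (adjust for your backend):

```bash
# Install agent-cli with the faster-whisper backend, then start the server
pip install "agent-cli[faster-whisper]"
agent-cli server whisper
```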
The server is now running at http://localhost:10301. Verify with `curl http://localhost:10301/health`.
Apple Silicon MLX-only setup:
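A plausible sketch for MLX-only machines; the exact package and extra names are assumptions, so check the Installation section for your setup:

```bash
# Base package plus the MLX backend (package names assumed); MLX is auto-detected on Apple Silicon
pip install agent-cli mlx-whisper
agent-cli server whisper
```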
Use it with any OpenAI-compatible client, or configure agent-cli to use it - see Configuration.
## Features
- OpenAI-compatible API at `/v1/audio/transcriptions` - a drop-in replacement for OpenAI's Whisper API
- Wyoming protocol for Home Assistant voice integration (Wyoming is the standard protocol for local voice services)
- TTL-based memory management - models unload after an idle period, freeing RAM/VRAM
- Multiple models - run different model sizes with independent TTLs
- Background preloading - downloads start at startup without blocking; use `--preload` to wait
- Multi-platform support - automatically uses the optimal backend for your hardware
## Usage

### Examples
```bash
# Run with default large-v3 model (5-minute TTL)
agent-cli server whisper

# Use smaller model with 10-minute TTL
agent-cli server whisper --model small --ttl 600

# Run multiple models (requests can specify which to use)
agent-cli server whisper --model large-v3 --model small --default-model large-v3

# Force CPU mode
agent-cli server whisper --device cpu

# Download model without starting server (requires faster-whisper)
agent-cli server whisper --model large-v3 --download-only

# Preload model at startup and wait until ready
agent-cli server whisper --preload
```
### Options
| Option | Default | Description |
|---|---|---|
| `--model`, `-m` | - | Whisper model(s) to load. Common models: `tiny`, `base`, `small`, `medium`, `large-v3`, `distil-large-v3`. Can specify multiple for different accuracy/speed tradeoffs. Default: `large-v3` |
| `--default-model` | - | Model to use when the client doesn't specify one. Must be in the `--model` list |
| `--device`, `-d` | `auto` | Compute device: `auto` (detect GPU), `cuda`, `cuda:0`, `cpu`. The MLX backend always uses Apple Silicon |
| `--compute-type` | `auto` | Precision for faster-whisper: `auto`, `float16`, `int8`, `int8_float16`. Lower precision = faster + less VRAM |
| `--cache-dir` | - | Custom directory for downloaded models (default: HuggingFace cache) |
| `--ttl` | `300` | Seconds of inactivity before unloading a model from memory. Set to `0` to keep it loaded indefinitely |
| `--preload` | `false` | Load model(s) immediately at startup instead of on first request. Useful for reducing first-request latency |
| `--host` | `0.0.0.0` | Network interface to bind. Use `0.0.0.0` for all interfaces |
| `--port`, `-p` | `10301` | Port for the OpenAI-compatible HTTP API (`/v1/audio/transcriptions`) |
| `--wyoming-port` | `10300` | Port for the Wyoming protocol (Home Assistant integration) |
| `--no-wyoming` | `false` | Disable the Wyoming protocol server (only run the HTTP API) |
| `--download-only` | `false` | Download model(s) to cache and exit. Useful for Docker builds |
| `--backend`, `-b` | `auto` | Inference backend: `auto` (faster-whisper on CUDA/CPU, MLX on Apple Silicon), `faster-whisper`, `mlx`, `transformers` (HuggingFace, supports safetensors) |
### General Options

| Option | Default | Description |
|---|---|---|
| `--log-level` | `info` | Set the logging level |
## API Endpoints
Once running, the server exposes:
| Endpoint | Method | Description |
|---|---|---|
| `/v1/audio/transcriptions` | POST | OpenAI-compatible transcription |
| `/v1/audio/translations` | POST | OpenAI-compatible translation (to English) |
| `/v1/audio/transcriptions/stream` | WebSocket | Real-time streaming transcription |
| `/v1/model/unload` | POST | Manually unload a model from memory |
| `/health` | GET | Health check with model status |
| `/docs` | GET | Interactive API documentation |
## Using the API

### curl Example
```bash
# Transcribe an audio file
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1"

# With language hint and verbose output
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1" \
  -F "language=en" \
  -F "response_format=verbose_json"

# Get SRT subtitles
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1" \
  -F "response_format=srt"
```
### Python Example (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10301/v1", api_key="not-needed")

# Transcribe audio
with open("recording.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )
print(transcript.text)
```
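The translation endpoint works the same way through the SDK. A sketch, assuming the same local server and a non-English recording:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10301/v1", api_key="not-needed")

# Translate speech to English via /v1/audio/translations
with open("recording.wav", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )
print(translation.text)
```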
## WebSocket Streaming Protocol

The `/v1/audio/transcriptions/stream` endpoint provides real-time streaming transcription.
### Protocol
- Connect to `ws://localhost:10301/v1/audio/transcriptions/stream?model=whisper-1`
- Send binary audio chunks (16 kHz, 16-bit, mono PCM)
- Send `EOS` (3 bytes: `0x45 0x4F 0x53`) to signal end of audio
- Receive a JSON response with the transcription (see the client sketch below)
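Putting the steps above together, a minimal client sketch in Python (mirroring the Python example earlier); the `websockets` dependency and the chunk size are assumptions, not part of the server's API:

```python
# Minimal streaming client sketch: assumes the third-party `websockets` package
# (pip install websockets) and a WAV file that is already 16 kHz, 16-bit, mono PCM.
import asyncio
import json
import wave

import websockets

URL = "ws://localhost:10301/v1/audio/transcriptions/stream?model=whisper-1"


async def stream_file(path: str) -> None:
    with wave.open(path, "rb") as wav:
        # The endpoint expects raw 16 kHz, 16-bit, mono PCM.
        assert wav.getframerate() == 16000
        assert wav.getsampwidth() == 2 and wav.getnchannels() == 1
        pcm = wav.readframes(wav.getnframes())

    async with websockets.connect(URL) as ws:
        # Send the audio as binary chunks (~100 ms each: 16000 Hz * 2 bytes * 0.1 s).
        for i in range(0, len(pcm), 3200):
            await ws.send(pcm[i : i + 3200])
        await ws.send(b"EOS")  # 0x45 0x4F 0x53 signals end of audio

        # Read JSON messages until the final transcription arrives.
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("is_final") or msg.get("type") == "final":
                print(msg["text"])
                break


asyncio.run(stream_file("recording.wav"))
```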
### Message Format
Server response:
```json
{
  "type": "final",
  "text": "transcribed text here",
  "is_final": true,
  "language": "en",
  "duration": 3.5,
  "segments": [...]
}
```
Error response:
## Model Selection Guide
| Model | Disk | VRAM | Speed | Accuracy | Use Case |
|---|---|---|---|---|---|
| `large-v3` | ~3GB | ~4GB | Slow | Best | Highest accuracy, batch processing |
| `medium` | ~1.5GB | ~2GB | Medium | Good | Balanced accuracy/speed |
| `small` | ~500MB | ~1GB | Fast | Fair | Real-time, lower VRAM |
| `tiny` | ~75MB | ~300MB | Fastest | Basic | Very limited VRAM, quick transcription |
> **Tip:** Use `--model small --model large-v3` to run both models. Clients can request either via the `model` parameter.
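For example, assuming the server accepts the configured model names in the `model` field, each request can pick its own tradeoff:

```bash
# Fast pass with the small model
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=small"

# Higher-accuracy pass with large-v3
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=large-v3"
```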
## Troubleshooting

| Issue | Solution |
|---|---|
| CUDA out of memory | Use `--device cpu` or a smaller model (e.g., `--model small`) |
| Port already in use | Use `--port XXXX` to specify a different port |
| Model download fails | Check network connection, or use `--download-only` first |
| Slow first request | Model is still downloading/loading. Use `--preload` to wait at startup |
| Wyoming not working | Ensure port 10300 is not blocked; check with `nc -zv localhost 10300` |
## Using with agent-cli Commands
The Whisper server is designed to work seamlessly with other agent-cli commands. See Configuration: Local Whisper Server for setup instructions.
## Installation

Requires the server dependencies and at least one inference backend:

```bash
# faster-whisper backend (default on CUDA/CPU)
pip install "agent-cli[faster-whisper]"
# or
uv sync --extra faster-whisper
```
### macOS Apple Silicon
For optimal performance on M1/M2/M3/M4 Macs, install mlx-whisper:
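A minimal sketch, assuming installation straight from PyPI (the project may also ship a dedicated extra for this):

```bash
# MLX backend for Apple Silicon (package name as published on PyPI)
pip install mlx-whisper
```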
The server will automatically detect and use the MLX backend when available.
### HuggingFace Transformers
For loading models in safetensors format (instead of CTranslate2 .bin files):
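A sketch assuming the plain PyPI packages are sufficient (an agent-cli extra may exist instead); the documented `--backend transformers` flag then selects it explicitly:

```bash
# HuggingFace backend dependencies (package names assumed)
pip install transformers torch

# Select the transformers backend explicitly
agent-cli server whisper --backend transformers
```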
This uses HuggingFace's transformers library, which supports loading .safetensors models directly from the Hub.
### Docker
Pre-built images are available from GitHub Container Registry:
```bash
# Run with GPU support
docker run -p 10300:10300 -p 10301:10301 --gpus all ghcr.io/basnijholt/agent-cli-whisper:latest-cuda

# Run CPU-only
docker run -p 10300:10300 -p 10301:10301 ghcr.io/basnijholt/agent-cli-whisper:latest-cpu
```

Or build from source using the `whisper.Dockerfile`:

```bash
# Build and run with GPU support
docker build -f docker/whisper.Dockerfile --target cuda -t agent-cli-whisper:cuda .
docker run -p 10300:10300 -p 10301:10301 --gpus all agent-cli-whisper:cuda

# Build and run CPU-only
docker build -f docker/whisper.Dockerfile --target cpu -t agent-cli-whisper:cpu .
docker run -p 10300:10300 -p 10301:10301 agent-cli-whisper:cpu
```

Or use Docker Compose:

```bash
# With GPU
docker compose -f docker/docker-compose.whisper.yml --profile cuda up

# CPU only
docker compose -f docker/docker-compose.whisper.yml --profile cpu up
```
Configure via environment variables:
| Variable | Default | Description |
|---|---|---|
| `WHISPER_MODEL` | `large-v3` | Model to load |
| `WHISPER_TTL` | `300` | Seconds before unloading an idle model |
| `WHISPER_DEVICE` | `cuda`/`cpu` | Device (set by target) |
| `WHISPER_LOG_LEVEL` | `info` | Logging level |
| `WHISPER_EXTRA_ARGS` | - | Additional CLI arguments |
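For example, to run the CPU image with a smaller model and a longer TTL (a sketch; any of the variables above can be overridden the same way):

```bash
docker run -p 10300:10300 -p 10301:10301 \
  -e WHISPER_MODEL=small \
  -e WHISPER_TTL=600 \
  ghcr.io/basnijholt/agent-cli-whisper:latest-cpu
```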