whisper

Run a local Whisper ASR server with automatic backend selection based on your platform.

Note

Quick Start - Get transcription working in 30 seconds:

pip install "agent-cli[faster-whisper]"
agent-cli server whisper

The server is now running at http://localhost:10301. Verify with curl http://localhost:10301/health.

Apple Silicon MLX-only setup:

pip install "agent-cli[mlx-whisper]"
agent-cli server whisper --backend mlx

Use it with any OpenAI-compatible client, or configure agent-cli to use it - see Configuration.

Features

  • OpenAI-compatible API at /v1/audio/transcriptions - drop-in replacement for OpenAI's Whisper API
  • Wyoming protocol for Home Assistant voice integration (Wyoming is the standard protocol for local voice services)
  • TTL-based memory management - models unload after an idle period, freeing RAM/VRAM
  • Multiple models - run different model sizes with independent TTLs
  • Background preloading - downloads start at startup without blocking; use --preload to wait
  • Multi-platform support - automatically uses the optimal backend for your hardware

Usage

agent-cli server whisper [OPTIONS]

Examples

# Run with default large-v3 model (5-minute TTL)
agent-cli server whisper

# Use smaller model with 10-minute TTL
agent-cli server whisper --model small --ttl 600

# Run multiple models (requests can specify which to use)
agent-cli server whisper --model large-v3 --model small --default-model large-v3

# Force CPU mode
agent-cli server whisper --device cpu

# Download model without starting server (requires faster-whisper)
agent-cli server whisper --model large-v3 --download-only

# Preload model at startup and wait until ready
agent-cli server whisper --preload

Options

Options

Option Default Description
--model, -m large-v3 Whisper model(s) to load. Common models: tiny, base, small, medium, large-v3, distil-large-v3. Can specify multiple for different accuracy/speed tradeoffs
--default-model - Model to use when the client doesn't specify one. Must be in the --model list
--device, -d auto Compute device: auto (detect GPU), cuda, cuda:0, cpu. MLX backend always uses Apple Silicon
--compute-type auto Precision for faster-whisper: auto, float16, int8, int8_float16. Lower precision = faster + less VRAM
--cache-dir - Custom directory for downloaded models (default: HuggingFace cache)
--ttl 300 Seconds of inactivity before unloading model from memory. Set to 0 to keep loaded indefinitely
--preload false Load model(s) immediately at startup instead of on first request. Useful for reducing first-request latency
--host 0.0.0.0 Network interface to bind. Use 0.0.0.0 for all interfaces
--port, -p 10301 Port for OpenAI-compatible HTTP API (/v1/audio/transcriptions)
--wyoming-port 10300 Port for Wyoming protocol (Home Assistant integration)
--no-wyoming false Disable Wyoming protocol server (only run HTTP API)
--download-only false Download model(s) to cache and exit. Useful for Docker builds
--backend, -b auto Inference backend: auto (faster-whisper on CUDA/CPU, MLX on Apple Silicon), faster-whisper, mlx, transformers (HuggingFace, supports safetensors)

General Options

Option Default Description
--log-level info Set logging level.

API Endpoints

Once running, the server exposes:

Endpoint Method Description
/v1/audio/transcriptions POST OpenAI-compatible transcription
/v1/audio/translations POST OpenAI-compatible translation (to English)
/v1/audio/transcriptions/stream WebSocket Real-time streaming transcription
/v1/model/unload POST Manually unload a model from memory
/health GET Health check with model status
/docs GET Interactive API documentation
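
To confirm the server is up and see which models are loaded, query the health endpoint. A minimal sketch using only the Python standard library (the exact response fields may vary):

import json
import urllib.request

# Query the health endpoint on the default HTTP port (10301).
with urllib.request.urlopen("http://localhost:10301/health") as resp:
    print(json.loads(resp.read()))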

Using the API

curl Example

# Transcribe an audio file
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1"

# With language hint and verbose output
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1" \
  -F "language=en" \
  -F "response_format=verbose_json"

# Get SRT subtitles
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1" \
  -F "response_format=srt"

Python Example (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10301/v1", api_key="not-needed")

# Transcribe audio
with open("recording.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )
print(transcript.text)
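
The translations endpoint works the same way through the SDK. A minimal sketch, continuing with the client created above (assumes the source audio is in a non-English language):

# Translate audio to English
with open("recording.wav", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )
print(translation.text)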

WebSocket Streaming Protocol

The /v1/audio/transcriptions/stream endpoint provides real-time streaming transcription.

Protocol

  1. Connect to ws://localhost:10301/v1/audio/transcriptions/stream?model=whisper-1
  2. Send binary audio chunks (16kHz, 16-bit, mono PCM)
  3. Send EOS (3 bytes: 0x45 0x4F 0x53) to signal end of audio
  4. Receive JSON response with transcription

Message Format

Server response:

{
  "type": "final",
  "text": "transcribed text here",
  "is_final": true,
  "language": "en",
  "duration": 3.5,
  "segments": [...]
}

Error response:

{"type": "error", "message": "error description"}

Model Selection Guide

Model Disk VRAM Speed Accuracy Use Case
large-v3 ~3GB ~4GB Slow Best Highest accuracy, batch processing
medium ~1.5GB ~2GB Medium Good Balanced accuracy/speed
small ~500MB ~1GB Fast Fair Real-time, lower VRAM
tiny ~75MB ~300MB Fastest Basic Very limited VRAM, quick transcription

Tip

Use --model small --model large-v3 to run both models. Clients can request either via the model parameter.
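
For example, with both models loaded, a client can pick the smaller one per request. A sketch using the OpenAI SDK client from the Python example above (the model name is assumed to match a value passed to --model):

# Request the small model for a quick, lower-accuracy pass
with open("recording.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="small",
        file=f,
    )
print(transcript.text)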

Troubleshooting

Issue Solution
CUDA out of memory Use --device cpu or a smaller model (e.g., --model small)
Port already in use Use --port XXXX to specify a different port
Model download fails Check network connection, or use --download-only first
Slow first request Model is still downloading/loading. Use --preload to wait at startup
Wyoming not working Ensure port 10300 is not blocked; check with nc -zv localhost 10300

Using with agent-cli Commands

The Whisper server is designed to work seamlessly with other agent-cli commands. See Configuration: Local Whisper Server for setup instructions.

Installation

Requires the server dependencies and a backend:

# faster-whisper backend (default on CUDA/CPU)
pip install "agent-cli[faster-whisper]"
# or
uv sync --extra faster-whisper

macOS Apple Silicon

For optimal performance on M1/M2/M3/M4 Macs, install mlx-whisper:

pip install "agent-cli[mlx-whisper]"

The server will automatically detect and use the MLX backend when available.

HuggingFace Transformers

For loading models in safetensors format (instead of CTranslate2 .bin files):

pip install "agent-cli[whisper-transformers]"
agent-cli server whisper --backend transformers

This uses HuggingFace's transformers library, which supports loading .safetensors models directly from the Hub.

Docker

Pre-built images are available from GitHub Container Registry:

# Run with GPU support
docker run -p 10300:10300 -p 10301:10301 --gpus all ghcr.io/basnijholt/agent-cli-whisper:latest-cuda

# Run CPU-only
docker run -p 10300:10300 -p 10301:10301 ghcr.io/basnijholt/agent-cli-whisper:latest-cpu

Or build from source using the whisper.Dockerfile:

# Build and run with GPU support
docker build -f docker/whisper.Dockerfile --target cuda -t agent-cli-whisper:cuda .
docker run -p 10300:10300 -p 10301:10301 --gpus all agent-cli-whisper:cuda

# Build and run CPU-only
docker build -f docker/whisper.Dockerfile --target cpu -t agent-cli-whisper:cpu .
docker run -p 10300:10300 -p 10301:10301 agent-cli-whisper:cpu

Or use Docker Compose:

# With GPU
docker compose -f docker/docker-compose.whisper.yml --profile cuda up

# CPU only
docker compose -f docker/docker-compose.whisper.yml --profile cpu up

Configure via environment variables:

Variable Default Description
WHISPER_MODEL large-v3 Model to load
WHISPER_TTL 300 Seconds before unloading idle model
WHISPER_DEVICE cuda/cpu Device (set by the Docker build target)
WHISPER_LOG_LEVEL info Logging level
WHISPER_EXTRA_ARGS - Additional CLI arguments