# whisper
Run a local Whisper ASR server with automatic backend selection based on your platform:
- macOS Apple Silicon → mlx-whisper (Metal acceleration)
- Linux/CUDA → faster-whisper (CTranslate2)
- HuggingFace → transformers (supports safetensors models)
**Quick Start** - Get transcription working in 30 seconds:
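A minimal sketch of those 30 seconds, assuming the `faster-whisper` extra described under Installation below (adjust for your backend):

```bash
# Install agent-cli with the faster-whisper backend, then start the server
pip install "agent-cli[faster-whisper]"
agent-cli server whisper
```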
The server is now running at http://localhost:10301. Verify with `curl http://localhost:10301/health`.
Apple Silicon MLX-only setup:
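A plausible sketch for MLX-only machines; the exact package and extra names are assumptions, so check the Installation section for your setup:

```bash
# Base package plus the MLX backend (package names assumed); MLX is auto-detected on Apple Silicon
pip install agent-cli mlx-whisper
agent-cli server whisper
```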
Use it with any OpenAI-compatible client, or configure agent-cli to use it - see Configuration.
## Features
- OpenAI-compatible API at `/v1/audio/transcriptions` - a drop-in replacement for OpenAI's Whisper API
- Wyoming protocol for Home Assistant voice integration (Wyoming is the standard protocol for local voice services)
- TTL-based memory management - models unload after an idle period, freeing RAM/VRAM
- Multiple models - run different model sizes with independent TTLs
- Background preloading - downloads start at startup without blocking; use `--preload` to wait
- Multi-platform support - automatically uses the optimal backend for your hardware
## Usage

### Examples
```bash
# Run with default large-v3 model (5-minute TTL)
agent-cli server whisper

# Use smaller model with 10-minute TTL
agent-cli server whisper --model small --ttl 600

# Run multiple models (requests can specify which to use)
agent-cli server whisper --model large-v3 --model small --default-model large-v3

# Force CPU mode
agent-cli server whisper --device cpu

# Download model without starting server (requires faster-whisper)
agent-cli server whisper --model large-v3 --download-only

# Preload model at startup and wait until ready
agent-cli server whisper --preload
```
### Options
| Option | Default | Description |
|---|---|---|
| `--model`, `-m` | - | Whisper model(s) to load. Common models: `tiny`, `base`, `small`, `medium`, `large-v3`, `distil-large-v3`. Can specify multiple for different accuracy/speed tradeoffs. Default: `large-v3` |
| `--default-model` | - | Model to use when the client doesn't specify one. Must be in the `--model` list |
| `--device`, `-d` | `auto` | Compute device: `auto` (detect GPU), `cuda`, `cuda:0`, `cpu`. The MLX backend always uses Apple Silicon |
| `--compute-type` | `auto` | Precision for faster-whisper: `auto`, `float16`, `int8`, `int8_float16`. Lower precision = faster + less VRAM |
| `--cache-dir` | - | Custom directory for downloaded models (default: HuggingFace cache) |
| `--ttl` | `300` | Seconds of inactivity before unloading a model from memory. Set to `0` to keep it loaded indefinitely |
| `--preload` | `false` | Load model(s) immediately at startup instead of on first request. Useful for reducing first-request latency |
| `--host` | `0.0.0.0` | Network interface to bind. Use `0.0.0.0` for all interfaces |
| `--port`, `-p` | `10301` | Port for the OpenAI-compatible HTTP API (`/v1/audio/transcriptions`) |
| `--wyoming-port` | `10300` | Port for the Wyoming protocol (Home Assistant integration) |
| `--no-wyoming` | `false` | Disable the Wyoming protocol server (only run the HTTP API) |
| `--download-only` | `false` | Download model(s) to cache and exit. Useful for Docker builds |
| `--backend`, `-b` | `auto` | Inference backend: `auto` (faster-whisper on CUDA/CPU, MLX on Apple Silicon), `faster-whisper`, `mlx`, `transformers` (HuggingFace, supports safetensors) |
### General Options

| Option | Default | Description |
|---|---|---|
| `--log-level` | `info` | Set the logging level |
## API Endpoints
Once running, the server exposes:
| Endpoint | Method | Description |
|---|---|---|
| `/v1/audio/transcriptions` | POST | OpenAI-compatible transcription |
| `/v1/audio/translations` | POST | OpenAI-compatible translation (to English) |
| `/v1/audio/transcriptions/stream` | WebSocket | Real-time streaming transcription |
| `/v1/model/unload` | POST | Manually unload a model from memory |
| `/health` | GET | Health check with model status |
| `/docs` | GET | Interactive API documentation |
## Using the API

### curl Example
```bash
# Transcribe an audio file
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1"

# With language hint and verbose output
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1" \
  -F "language=en" \
  -F "response_format=verbose_json"

# Get SRT subtitles
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1" \
  -F "response_format=srt"
```
### Python Example (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10301/v1", api_key="not-needed")

# Transcribe audio
with open("recording.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )
print(transcript.text)
```
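The translation endpoint works the same way through the SDK. A sketch, assuming the same local server and a non-English recording:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10301/v1", api_key="not-needed")

# Translate speech to English via /v1/audio/translations
with open("recording.wav", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )
print(translation.text)
```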
## WebSocket Streaming Protocol

The `/v1/audio/transcriptions/stream` endpoint provides real-time streaming transcription.
### Protocol
- Connect to `ws://localhost:10301/v1/audio/transcriptions/stream?model=whisper-1`
- Send binary audio chunks (16 kHz, 16-bit, mono PCM)
- Send `EOS` (3 bytes: `0x45 0x4F 0x53`) to signal end of audio
- Receive a JSON response with the transcription (see the client sketch below)
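Putting the steps above together, a minimal client sketch in Python (mirroring the Python example earlier); the `websockets` dependency and the chunk size are assumptions, not part of the server's API:

```python
# Minimal streaming client sketch: assumes the third-party `websockets` package
# (pip install websockets) and a WAV file that is already 16 kHz, 16-bit, mono PCM.
import asyncio
import json
import wave

import websockets

URL = "ws://localhost:10301/v1/audio/transcriptions/stream?model=whisper-1"


async def stream_file(path: str) -> None:
    with wave.open(path, "rb") as wav:
        # The endpoint expects raw 16 kHz, 16-bit, mono PCM.
        assert wav.getframerate() == 16000
        assert wav.getsampwidth() == 2 and wav.getnchannels() == 1
        pcm = wav.readframes(wav.getnframes())

    async with websockets.connect(URL) as ws:
        # Send the audio as binary chunks (~100 ms each: 16000 Hz * 2 bytes * 0.1 s).
        for i in range(0, len(pcm), 3200):
            await ws.send(pcm[i : i + 3200])
        await ws.send(b"EOS")  # 0x45 0x4F 0x53 signals end of audio

        # Read JSON messages until the final transcription arrives.
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("is_final") or msg.get("type") == "final":
                print(msg["text"])
                break


asyncio.run(stream_file("recording.wav"))
```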
### Message Format
Server response:
```json
{
  "type": "final",
  "text": "transcribed text here",
  "is_final": true,
  "language": "en",
  "duration": 3.5,
  "segments": [...]
}
```
Error response:
## Model Selection Guide
| Model | Disk | VRAM | Speed | Accuracy | Use Case |
|---|---|---|---|---|---|
| `large-v3` | ~3GB | ~4GB | Slow | Best | Highest accuracy, batch processing |
| `medium` | ~1.5GB | ~2GB | Medium | Good | Balanced accuracy/speed |
| `small` | ~500MB | ~1GB | Fast | Fair | Real-time, lower VRAM |
| `tiny` | ~75MB | ~300MB | Fastest | Basic | Very limited VRAM, quick transcription |
> **Tip:** Use `--model small --model large-v3` to run both models. Clients can request either via the `model` parameter.
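For example, assuming the server accepts the configured model names in the `model` field, each request can pick its own tradeoff:

```bash
# Fast pass with the small model
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=small"

# Higher-accuracy pass with large-v3
curl -X POST http://localhost:10301/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=large-v3"
```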
## Troubleshooting

| Issue | Solution |
|---|---|
| CUDA out of memory | Use `--device cpu` or a smaller model (e.g., `--model small`) |
| Port already in use | Use `--port XXXX` to specify a different port |
| Model download fails | Check network connection, or use `--download-only` first |
| Slow first request | Model is still downloading/loading. Use `--preload` to wait at startup |
| Wyoming not working | Ensure port 10300 is not blocked; check with `nc -zv localhost 10300` |
## Using with agent-cli Commands
The Whisper server is designed to work seamlessly with other agent-cli commands. See Configuration: Local Whisper Server for setup instructions.
## Installation

Requires the server dependencies and at least one inference backend:

```bash
# faster-whisper backend (default on CUDA/CPU)
pip install "agent-cli[faster-whisper]"
# or
uv sync --extra faster-whisper
```
### macOS Apple Silicon
For optimal performance on M1/M2/M3/M4 Macs, install mlx-whisper:
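A minimal sketch, assuming installation straight from PyPI (the project may also ship a dedicated extra for this):

```bash
# MLX backend for Apple Silicon (package name as published on PyPI)
pip install mlx-whisper
```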
The server will automatically detect and use the MLX backend when available.
### HuggingFace Transformers
For loading models in safetensors format (instead of CTranslate2 .bin files):
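A sketch assuming the plain PyPI packages are sufficient (an agent-cli extra may exist instead); the documented `--backend transformers` flag then selects it explicitly:

```bash
# HuggingFace backend dependencies (package names assumed)
pip install transformers torch

# Select the transformers backend explicitly
agent-cli server whisper --backend transformers
```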
This uses HuggingFace's transformers library, which supports loading .safetensors models directly from the Hub.
### Docker
Pre-built images are available from GitHub Container Registry:
```bash
# Run with GPU support
docker run -p 10300:10300 -p 10301:10301 --gpus all ghcr.io/basnijholt/agent-cli-whisper:latest-cuda

# Run CPU-only
docker run -p 10300:10300 -p 10301:10301 ghcr.io/basnijholt/agent-cli-whisper:latest-cpu
```

Or build from source using the `whisper.Dockerfile`:

```bash
# Build and run with GPU support
docker build -f docker/whisper.Dockerfile --target cuda -t agent-cli-whisper:cuda .
docker run -p 10300:10300 -p 10301:10301 --gpus all agent-cli-whisper:cuda

# Build and run CPU-only
docker build -f docker/whisper.Dockerfile --target cpu -t agent-cli-whisper:cpu .
docker run -p 10300:10300 -p 10301:10301 agent-cli-whisper:cpu
```

Or use Docker Compose:

```bash
# With GPU
docker compose -f docker/docker-compose.whisper.yml --profile cuda up

# CPU only
docker compose -f docker/docker-compose.whisper.yml --profile cpu up
```
Configure via environment variables:
| Variable | Default | Description |
|---|---|---|
| `WHISPER_MODEL` | `large-v3` | Model to load |
| `WHISPER_TTL` | `300` | Seconds before unloading an idle model |
| `WHISPER_DEVICE` | `cuda`/`cpu` | Device (set by target) |
| `WHISPER_LOG_LEVEL` | `info` | Logging level |
| `WHISPER_EXTRA_ARGS` | - | Additional CLI arguments |
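For example, to run the CPU image with a smaller model and a longer TTL (a sketch; any of the variables above can be overridden the same way):

```bash
docker run -p 10300:10300 -p 10301:10301 \
  -e WHISPER_MODEL=small \
  -e WHISPER_TTL=600 \
  ghcr.io/basnijholt/agent-cli-whisper:latest-cpu
```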