transcribe

Transcribe audio from your microphone to text.

Usage

agent-cli transcribe [OPTIONS]

Description

This command:

Starts listening to your microphone immediately
Records your speech
When you press Ctrl+C, stops recording and finalizes transcription (Wyoming streams live; OpenAI uploads after stop)
Copies the transcribed text to your clipboard
Optionally uses an LLM to clean up the transcript

Examples

# Basic transcription
agent-cli transcribe --input-device-index 1

# With LLM cleanup
agent-cli transcribe --input-device-index 1 --llm

# List available audio devices
agent-cli transcribe --list-devices

# Transcribe from a saved file (supports wav, mp3, m4a, ogg, flac, aac, webm)
agent-cli transcribe --from-file recording.wav

# Transcribe an MP3 file with OpenAI
agent-cli transcribe --from-file podcast.mp3 --asr-provider openai

# Transcribe an M4A voice memo with Gemini
agent-cli transcribe --from-file voice_memo.m4a --asr-provider gemini

# Re-transcribe most recent recording
agent-cli transcribe --last-recording 1

# Transcribe with speaker diarization (identifies different speakers)
agent-cli transcribe --diarize --hf-token YOUR_HF_TOKEN

# Diarization with JSON output format
agent-cli transcribe --diarize --diarize-format json --hf-token YOUR_HF_TOKEN

# Persist unmatched voices as stable UNKNOWN_### profiles
agent-cli transcribe --diarize --remember-unknown-speakers --hf-token YOUR_HF_TOKEN

# Inspect and name remembered speaker profiles
agent-cli speakers list
agent-cli speakers rename UNKNOWN_001 Alice
agent-cli speakers merge UNKNOWN_002 Alice

# Enroll current diarization labels directly when you already know who is who
agent-cli transcribe --last-recording 1 --diarize --enroll-speakers SPEAKER_00=Alice --hf-token YOUR_HF_TOKEN

# Diarize a file with a known range of speakers (here, 2 to 4)
agent-cli transcribe --from-file meeting.wav --diarize --min-speakers 2 --max-speakers 4 --hf-token YOUR_HF_TOKEN

# Use wav2vec2 for word-level alignment (more accurate but slower)
agent-cli transcribe --from-file meeting.wav --diarize --align-words --hf-token YOUR_HF_TOKEN

Supported Audio Formats

The --from-file option supports multiple audio formats:

Provider	Supported Formats
OpenAI	mp3, mp4, mpeg, mpga, m4a, wav, webm
Gemini	wav, mp3, aiff, aac, ogg, flac, m4a
Wyoming	Any format (converted via ffmpeg)

Note

For non-WAV formats with the Wyoming provider, ffmpeg must be installed on your system.
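The table above can be turned into a quick pre-flight check before calling --from-file. The following is a minimal sketch, not part of agent-cli: the function name check_format and its return values are hypothetical, and the format sets are copied from the table.

```python
from pathlib import Path

# Formats copied from the table above (hypothetical helper, not agent-cli code).
SUPPORTED_FORMATS = {
    "openai": {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"},
    "gemini": {"wav", "mp3", "aiff", "aac", "ogg", "flac", "m4a"},
}


def check_format(path: str, provider: str) -> str:
    """Return 'ok', 'needs-ffmpeg', or 'unsupported' for a file/provider pair."""
    ext = Path(path).suffix.lower().lstrip(".")
    if provider == "wyoming":
        # Wyoming accepts any format, but non-WAV input is converted via ffmpeg.
        return "ok" if ext == "wav" else "needs-ffmpeg"
    if provider not in SUPPORTED_FORMATS:
        raise ValueError(f"unknown provider: {provider}")
    return "ok" if ext in SUPPORTED_FORMATS[provider] else "unsupported"
```

For example, check_format("podcast.mp3", "openai") returns "ok", while a FLAC file with the OpenAI provider is reported as "unsupported".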

Options

LLM Configuration

Option	Default	Description
`--extra-instructions`	-	Extra instructions appended to the LLM cleanup prompt (requires `--llm`).
`--llm/--no-llm`	`false`	Clean up transcript with LLM: fix errors, add punctuation, remove filler words. Uses `--extra-instructions` if set (via CLI or config file). Not compatible with --diarize.

Audio Recovery

Option	Default	Description
`--from-file`	-	Transcribe from audio file instead of microphone. Supports wav, mp3, m4a, ogg, flac, aac, webm. Requires `ffmpeg` for non-WAV formats with Wyoming.
`--last-recording`	`0`	Re-transcribe a saved recording (1=most recent, 2=second-to-last, etc). Useful after connection failures or to retry with different options.
`--save-recording/--no-save-recording`	`true`	Save recordings to ~/.cache/agent-cli/ for `--last-recording` recovery.

Provider Selection

Option	Default	Description
`--asr-provider`	`wyoming`	The ASR provider to use ('wyoming', 'openai', 'gemini').
`--llm-provider`	`ollama`	The LLM provider to use ('ollama', 'openai', 'gemini').

Audio Input

Option	Default	Description
`--input-device-index`	-	Audio input device index (see `--list-devices`). Uses system default if omitted.
`--input-device-name`	-	Select input device by name substring (e.g., `MacBook` or `USB`).
`--list-devices`	`false`	List available audio devices with their indices and exit.

Audio Input: Wyoming

Option	Default	Description
`--asr-wyoming-ip`	`localhost`	Wyoming ASR server IP address.
`--asr-wyoming-port`	`10300`	Wyoming ASR server port.

Audio Input: OpenAI-compatible

Option	Default	Description
`--asr-openai-model`	`whisper-1`	The OpenAI model to use for ASR (transcription).
`--asr-openai-base-url`	-	Custom base URL for OpenAI-compatible ASR API (e.g., for custom Whisper server: http://localhost:9898).
`--asr-openai-prompt`	-	Custom prompt to guide transcription (optional).

Audio Input: Gemini

Option	Default	Description
`--asr-gemini-model`	`gemini-3-flash-preview`	The Gemini model to use for ASR (transcription).

LLM: Ollama

Option	Default	Description
`--llm-ollama-model`	`gemma3:4b`	The Ollama model to use.
`--llm-ollama-host`	`http://localhost:11434`	The Ollama server host.

LLM: OpenAI-compatible

Option	Default	Description
`--llm-openai-model`	`gpt-5-mini`	The OpenAI model to use for LLM tasks.
`--openai-api-key`	-	Your OpenAI API key. Can also be set with the OPENAI_API_KEY environment variable.
`--openai-base-url`	-	Custom base URL for OpenAI-compatible API (e.g., for llama-server: http://localhost:8080/v1).

LLM: Gemini

Option	Default	Description
`--llm-gemini-model`	`gemini-3-flash-preview`	The Gemini model to use for LLM tasks.
`--gemini-api-key`	-	Your Gemini API key. Can also be set with the GEMINI_API_KEY environment variable.

Process Management

Option	Default	Description
`--stop`	`false`	Stop any running instance of this command.
`--status`	`false`	Check if an instance is currently running.
`--toggle`	`false`	Start if not running, stop if running. Ideal for hotkey binding.

General Options

Option	Default	Description
`--clipboard/--no-clipboard`	`true`	Copy result to clipboard.
`--log-level`	`warning`	Set logging level.
`--log-file`	-	Path to a file to write logs to.
`--quiet, -q`	`false`	Suppress console output from rich.
`--json`	`false`	Output result as JSON (implies `--quiet` and `--no-clipboard`).
`--config`	-	Path to a TOML configuration file.
`--print-args`	`false`	Print the command line arguments, including variables taken from the configuration file.
`--transcription-log`	-	Append transcripts to JSONL file (timestamp, hostname, model, raw/processed text). Recent entries provide context for LLM cleanup.

Diarization

Option	Default	Description
`--diarize/--no-diarize`	`false`	Enable speaker diarization (requires pyannote-audio). Install with: pip install agent-cli[diarization]
`--diarize-format`	`inline`	Output format for diarization ('inline' for [Speaker N]: text, 'json' for structured output).
`--hf-token`	-	HuggingFace token for pyannote models. Required for diarization. Token must have 'Read access to contents of all public gated repos you can access' permission. Accept licenses at: https://hf.co/pyannote/speaker-diarization-3.1, https://hf.co/pyannote/segmentation-3.0, https://hf.co/pyannote/wespeaker-voxceleb-resnet34-LM
`--min-speakers`	-	Minimum number of speakers (optional hint for diarization).
`--max-speakers`	-	Maximum number of speakers (optional hint for diarization).
`--align-words/--no-align-words`	`false`	Use wav2vec2 forced alignment for word-level speaker assignment (more accurate but slower).
`--align-language`	`en`	Language code for word alignment model (e.g., 'en', 'fr', 'de', 'es', 'it').
`--enroll-speakers`	-	Enroll current speaker labels or remembered profile IDs into persistent voice profiles, e.g. SPEAKER_00=Alice or UNKNOWN_001=Alice. For simple renames, use `agent-cli speakers rename`.
`--identify-speakers/--no-identify-speakers`	`true`	Match diarized speakers against persistent voice profiles when profiles exist.
`--remember-unknown-speakers/--no-remember-unknown-speakers`	`false`	Persist unmatched speaker embeddings as stable UNKNOWN_### voice profiles.
`--speaker-profiles-file`	`~/.config/agent-cli/speaker-profiles.json`	JSON file storing persistent speaker voice embeddings.
`--speaker-match-threshold`	`0.7`	Cosine-similarity threshold for matching diarized speakers to stored profiles.
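To make the --speaker-match-threshold option concrete, here is a minimal sketch of cosine-similarity matching against stored profiles. It illustrates the general technique only and is not agent-cli's implementation: the function names, the plain-list embeddings, and the "best score at or above the threshold" rule are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speaker(embedding, profiles, threshold=0.7):
    """Return the best-matching profile name, or None if no score reaches the threshold.

    `profiles` maps a name to a stored embedding (hypothetical in-memory layout).
    """
    best_name, best_score = None, -1.0
    for name, stored in profiles.items():
        score = cosine_similarity(embedding, stored)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Under this sketch, raising the threshold makes matches stricter, so more voices stay unmatched; lowering it makes matching more permissive at the risk of confusing similar voices.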

Workflow Integration

Toggle Recording Hotkey

The --toggle flag is designed for hotkey integration:

# First press: starts recording
agent-cli transcribe --toggle --input-device-index 1

# Second press: stops recording and transcribes
agent-cli transcribe --toggle

macOS Hotkey (skhd)

cmd + shift + r : /path/to/agent-cli transcribe --toggle --input-device-index 1

Transcription Log

Append every transcript to a JSONL file (timestamp, hostname, model, raw and processed text):

agent-cli transcribe --transcription-log ~/.config/agent-cli/transcriptions.log
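Because the log is JSONL (one JSON object per line), it is easy to read back from a script. The sketch below assumes only that each non-empty line is a valid JSON object; the exact key names are not documented here, so the code does not depend on them. The helper names are hypothetical.

```python
import json
from pathlib import Path


def read_transcription_log(path):
    """Return one dict per non-empty line of a JSONL transcription log."""
    entries = []
    with Path(path).expanduser().open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


def latest_entries(path, count=3):
    """Return the last `count` entries (an empty list if count is not positive)."""
    if count <= 0:
        return []
    return read_transcription_log(path)[-count:]
```

Calling latest_entries("~/.config/agent-cli/transcriptions.log") would return up to the three most recent entries as dictionaries, which you can then inspect to see the actual keys.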

Tips

Use --list-devices to find your microphone's index
Enable --llm for cleaner output with proper punctuation
Use --last-recording 1 to re-transcribe the most recent recording with different settings

Speaker Diarization

Speaker diarization identifies and labels different speakers in the transcript. This is useful for meeting recordings, interviews, or any multi-speaker audio.

Requirements

Install the diarization extra:

pip install agent-cli[diarization]
# or with uv
uv sync --extra diarization

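HuggingFace token: The pyannote-audio models are gated. You need to:
Accept the license for all three models: https://hf.co/pyannote/speaker-diarization-3.1, https://hf.co/pyannote/segmentation-3.0, https://hf.co/pyannote/wespeaker-voxceleb-resnet34-LM
Get your token from HuggingFace settings
Make sure the token has the "Read access to contents of all public gated repos you can access" permission
Provide it via --hf-token or the HF_TOKEN environment variable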

Output Formats

Inline format (default):

[SPEAKER_00]: Hello, how are you today?
[SPEAKER_01]: I'm doing well, thanks for asking!
[SPEAKER_00]: Great to hear.

With persistent speaker profiles, later runs output stored names like Alice instead of run-local labels like SPEAKER_00.

JSON format (--diarize-format json):

{
  "segments": [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello, how are you today?"},
    {"speaker": "SPEAKER_01", "start": 2.7, "end": 4.1, "text": "I'm doing well, thanks for asking!"},
    {"speaker": "SPEAKER_00", "start": 4.3, "end": 5.2, "text": "Great to hear."}
  ]
}
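The JSON format is convenient for post-processing. As a minimal sketch, assuming the output has exactly the shape shown above (a top-level "segments" list with "speaker", "start", "end", and "text" keys), the hypothetical helper below sums speaking time per speaker:

```python
import json
from collections import defaultdict


def talk_time_per_speaker(payload: str) -> dict:
    """Sum segment durations (end - start, in seconds) for each speaker label."""
    data = json.loads(payload)
    totals = defaultdict(float)
    for segment in data["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)
```

For the sample above, this gives roughly 3.4 seconds for SPEAKER_00 and 1.4 seconds for SPEAKER_01 (subject to floating-point rounding).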

Speaker Hints

If you know how many speakers are in the recording, use --min-speakers and --max-speakers to improve accuracy:

# For a two-person interview
agent-cli transcribe --from-file interview.wav --diarize --min-speakers 2 --max-speakers 2 --hf-token YOUR_HF_TOKEN
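When scripting diarization over many files, it can help to assemble the command programmatically. The sketch below uses only the flags documented on this page; the helper name build_diarize_command and its min/max sanity check are assumptions, not part of agent-cli.

```python
def build_diarize_command(audio_file, min_speakers=None, max_speakers=None, hf_token=None):
    """Build an argv list for a diarized file transcription.

    If hf_token is None, --hf-token is omitted and the HF_TOKEN
    environment variable is expected to supply the token instead.
    """
    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers must not exceed max_speakers")
    command = ["agent-cli", "transcribe", "--from-file", audio_file, "--diarize"]
    if min_speakers is not None:
        command += ["--min-speakers", str(min_speakers)]
    if max_speakers is not None:
        command += ["--max-speakers", str(max_speakers)]
    if hf_token is not None:
        command += ["--hf-token", hf_token]
    return command
```

The resulting list can be passed to subprocess.run(command, check=True); building a list rather than a shell string avoids quoting problems with file names that contain spaces.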

Note

Diarization requires the audio file to be saved. When using live recording with --diarize, ensure --save-recording is enabled (it's enabled by default).