transcribe
Transcribe audio from your microphone to text.
Usage
Description
This command:
- Starts listening to your microphone immediately
- Records your speech
- When you press Ctrl+C, stops recording and finalizes transcription (Wyoming streams live; OpenAI uploads after stop)
- Copies the transcribed text to your clipboard
- Optionally uses an LLM to clean up the transcript
Examples
# Basic transcription
agent-cli transcribe --input-device-index 1
# With LLM cleanup
agent-cli transcribe --input-device-index 1 --llm
# List available audio devices
agent-cli transcribe --list-devices
# Transcribe from a saved file (supports wav, mp3, m4a, ogg, flac, aac, webm)
agent-cli transcribe --from-file recording.wav
# Transcribe an MP3 file with OpenAI
agent-cli transcribe --from-file podcast.mp3 --asr-provider openai
# Transcribe an M4A voice memo with Gemini
agent-cli transcribe --from-file voice_memo.m4a --asr-provider gemini
# Re-transcribe most recent recording
agent-cli transcribe --last-recording 1
# Transcribe with speaker diarization (identifies different speakers)
agent-cli transcribe --diarize --hf-token YOUR_HF_TOKEN
# Diarization with JSON output format
agent-cli transcribe --diarize --diarize-format json --hf-token YOUR_HF_TOKEN
# Persist unmatched voices as stable UNKNOWN_### profiles
agent-cli transcribe --diarize --remember-unknown-speakers --hf-token YOUR_HF_TOKEN
# Inspect and name remembered speaker profiles
agent-cli speakers list
agent-cli speakers rename UNKNOWN_001 Alice
agent-cli speakers merge UNKNOWN_002 Alice
# Enroll current diarization labels directly when you already know who is who
agent-cli transcribe --last-recording 1 --diarize --enroll-speakers SPEAKER_00=Alice --hf-token YOUR_HF_TOKEN
# Diarize a file with known number of speakers
agent-cli transcribe --from-file meeting.wav --diarize --min-speakers 2 --max-speakers 4 --hf-token YOUR_HF_TOKEN
# Use wav2vec2 for word-level alignment (more accurate but slower)
agent-cli transcribe --from-file meeting.wav --diarize --align-words --hf-token YOUR_HF_TOKEN
Supported Audio Formats
The --from-file option supports multiple audio formats:
| Provider | Supported Formats |
|---|---|
| OpenAI | mp3, mp4, mpeg, mpga, m4a, wav, webm |
| Gemini | wav, mp3, aiff, aac, ogg, flac, m4a |
| Wyoming | Any format (converted via ffmpeg) |
Note
For non-WAV formats with the Wyoming provider, ffmpeg must be installed on your system.
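The provider table above can also be expressed as a small lookup, which is handy for validating a file before passing it to --from-file. This is a minimal sketch, not part of agent-cli: the `SUPPORTED_FORMATS` mapping and the `check_format` helper are hypothetical names, and the extensions are copied from the table.

```python
from pathlib import Path

# Extensions copied from the provider table; None means "any format".
SUPPORTED_FORMATS = {
    "openai": {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"},
    "gemini": {"wav", "mp3", "aiff", "aac", "ogg", "flac", "m4a"},
    "wyoming": None,  # converted via ffmpeg, so any extension is accepted
}


def check_format(path: str, provider: str) -> tuple[bool, bool]:
    """Return (supported, needs_ffmpeg) for a file and an ASR provider.

    Hypothetical helper: raises KeyError for a provider not in the table.
    """
    ext = Path(path).suffix.lower().lstrip(".")
    formats = SUPPORTED_FORMATS[provider]
    if formats is None:
        # Wyoming accepts anything, but non-WAV input needs ffmpeg installed.
        return True, ext != "wav"
    return ext in formats, False
```

For example, `check_format("voice_memo.ogg", "wyoming")` reports the file as supported but flags that ffmpeg is required.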
Options
LLM Configuration
| Option | Default | Description |
|---|---|---|
| --extra-instructions | - | Extra instructions appended to the LLM cleanup prompt (requires --llm). |
| --llm/--no-llm | false | Clean up transcript with LLM: fix errors, add punctuation, remove filler words. Uses --extra-instructions if set (via CLI or config file). Not compatible with --diarize. |
Audio Recovery
| Option | Default | Description |
|---|---|---|
| --from-file | - | Transcribe from audio file instead of microphone. Supports wav, mp3, m4a, ogg, flac, aac, webm. Requires ffmpeg for non-WAV formats with Wyoming. |
| --last-recording | 0 | Re-transcribe a saved recording (1=most recent, 2=second-to-last, etc). Useful after connection failures or to retry with different options. |
| --save-recording/--no-save-recording | true | Save recordings to ~/.cache/agent-cli/ for --last-recording recovery. |
Provider Selection
| Option | Default | Description |
|---|---|---|
| --asr-provider | wyoming | The ASR provider to use ('wyoming', 'openai', 'gemini'). |
| --llm-provider | ollama | The LLM provider to use ('ollama', 'openai', 'gemini'). |
Audio Input
| Option | Default | Description |
|---|---|---|
| --input-device-index | - | Audio input device index (see --list-devices). Uses system default if omitted. |
| --input-device-name | - | Select input device by name substring (e.g., MacBook or USB). |
| --list-devices | false | List available audio devices with their indices and exit. |
Audio Input: Wyoming
| Option | Default | Description |
|---|---|---|
| --asr-wyoming-ip | localhost | Wyoming ASR server IP address. |
| --asr-wyoming-port | 10300 | Wyoming ASR server port. |
Audio Input: OpenAI-compatible
| Option | Default | Description |
|---|---|---|
| --asr-openai-model | whisper-1 | The OpenAI model to use for ASR (transcription). |
| --asr-openai-base-url | - | Custom base URL for OpenAI-compatible ASR API (e.g., for custom Whisper server: http://localhost:9898). |
| --asr-openai-prompt | - | Custom prompt to guide transcription (optional). |
Audio Input: Gemini
| Option | Default | Description |
|---|---|---|
| --asr-gemini-model | gemini-3-flash-preview | The Gemini model to use for ASR (transcription). |
LLM: Ollama
| Option | Default | Description |
|---|---|---|
| --llm-ollama-model | gemma3:4b | The Ollama model to use. Default is gemma3:4b. |
| --llm-ollama-host | http://localhost:11434 | The Ollama server host. Default is http://localhost:11434. |
LLM: OpenAI-compatible
| Option | Default | Description |
|---|---|---|
| --llm-openai-model | gpt-5-mini | The OpenAI model to use for LLM tasks. |
| --openai-api-key | - | Your OpenAI API key. Can also be set with the OPENAI_API_KEY environment variable. |
| --openai-base-url | - | Custom base URL for OpenAI-compatible API (e.g., for llama-server: http://localhost:8080/v1). |
LLM: Gemini
| Option | Default | Description |
|---|---|---|
| --llm-gemini-model | gemini-3-flash-preview | The Gemini model to use for LLM tasks. |
| --gemini-api-key | - | Your Gemini API key. Can also be set with the GEMINI_API_KEY environment variable. |
Process Management
| Option | Default | Description |
|---|---|---|
| --stop | false | Stop any running instance of this command. |
| --status | false | Check if an instance is currently running. |
| --toggle | false | Start if not running, stop if running. Ideal for hotkey binding. |
General Options
| Option | Default | Description |
|---|---|---|
| --clipboard/--no-clipboard | true | Copy result to clipboard. |
| --log-level | warning | Set logging level. |
| --log-file | - | Path to a file to write logs to. |
| --quiet, -q | false | Suppress console output from rich. |
| --json | false | Output result as JSON (implies --quiet and --no-clipboard). |
| --config | - | Path to a TOML configuration file. |
| --print-args | false | Print the command line arguments, including variables taken from the configuration file. |
| --transcription-log | - | Append transcripts to JSONL file (timestamp, hostname, model, raw/processed text). Recent entries provide context for LLM cleanup. |
Diarization
| Option | Default | Description |
|---|---|---|
| --diarize/--no-diarize | false | Enable speaker diarization (requires pyannote-audio). Install with: pip install agent-cli[diarization] |
| --diarize-format | inline | Output format for diarization ('inline' for [Speaker N]: text, 'json' for structured output). |
| --hf-token | - | HuggingFace token for pyannote models. Required for diarization. Token must have 'Read access to contents of all public gated repos you can access' permission. Accept licenses at: https://hf.co/pyannote/speaker-diarization-3.1, https://hf.co/pyannote/segmentation-3.0, https://hf.co/pyannote/wespeaker-voxceleb-resnet34-LM |
| --min-speakers | - | Minimum number of speakers (optional hint for diarization). |
| --max-speakers | - | Maximum number of speakers (optional hint for diarization). |
| --align-words/--no-align-words | false | Use wav2vec2 forced alignment for word-level speaker assignment (more accurate but slower). |
| --align-language | en | Language code for word alignment model (e.g., 'en', 'fr', 'de', 'es', 'it'). |
| --enroll-speakers | - | Enroll current speaker labels or remembered profile IDs into persistent voice profiles, e.g. SPEAKER_00=Alice or UNKNOWN_001=Alice. For simple renames, use agent-cli speakers rename. |
| --identify-speakers/--no-identify-speakers | true | Match diarized speakers against persistent voice profiles when profiles exist. |
| --remember-unknown-speakers/--no-remember-unknown-speakers | false | Persist unmatched speaker embeddings as stable UNKNOWN_### voice profiles. |
| --speaker-profiles-file | ~/.config/agent-cli/speaker-profiles.json | JSON file storing persistent speaker voice embeddings. |
| --speaker-match-threshold | 0.7 | Cosine-similarity threshold for matching diarized speakers to stored profiles. |
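To illustrate what --speaker-match-threshold controls, here is a minimal sketch of cosine-similarity matching against stored profiles. It is not agent-cli's implementation: the function names, the in-memory profile dictionary, and the rule of returning the best-scoring profile at or above the threshold are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speaker(embedding, profiles, threshold=0.7):
    """Return the best-matching profile name, or None if nothing reaches the threshold.

    `profiles` is a hypothetical mapping of name -> stored embedding.
    """
    best_name, best_score = None, threshold
    for name, stored in profiles.items():
        score = cosine_similarity(embedding, stored)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Under these assumptions, raising the threshold makes matching stricter (more speakers stay unmatched), while lowering it makes false matches more likely.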
Workflow Integration
Toggle Recording Hotkey
The --toggle flag is designed for hotkey integration:
# First press: starts recording
agent-cli transcribe --toggle --input-device-index 1
# Second press: stops recording and transcribes
agent-cli transcribe --toggle
macOS Hotkey (skhd)
Transcription Log
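Log all transcriptions with timestamps by passing --transcription-log the path to a JSONL file. Each entry records the timestamp, hostname, model, and raw/processed text.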
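Because the log is a JSONL file (one JSON object per line), recent entries can be read back with a few lines of Python. The sketch below is an illustration only: `read_recent_entries` is a hypothetical helper, and it deliberately makes no assumption about the key names inside each entry, since the exact field names are not listed here.

```python
import json
from pathlib import Path


def read_recent_entries(log_path, limit=5):
    """Return the last `limit` JSON objects from a JSONL file, oldest first."""
    if limit <= 0:
        return []
    entries = []
    with Path(log_path).expanduser().open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries[-limit:]
```

Each returned item is a plain dictionary, so you can inspect its keys to see which fields your version of the log contains.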
Tips
- Use --list-devices to find your microphone's index
- Enable --llm for cleaner output with proper punctuation
- Use --last-recording 1 to re-transcribe if you need to adjust settings
Speaker Diarization
Speaker diarization identifies and labels different speakers in the transcript. This is useful for meeting recordings, interviews, or any multi-speaker audio.
Requirements
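- Install the diarization extra: pip install agent-cli[diarization]
- HuggingFace token: The pyannote-audio models are gated. You need to:
  - Accept the license for all three models: https://hf.co/pyannote/speaker-diarization-3.1, https://hf.co/pyannote/segmentation-3.0, https://hf.co/pyannote/wespeaker-voxceleb-resnet34-LM
  - Get your token from HuggingFace settings
  - Make sure the token has "Read access to contents of all public gated repos you can access" permission
  - Provide it via --hf-token or the HF_TOKEN environment variable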
Output Formats
Inline format (default):
[SPEAKER_00]: Hello, how are you today?
[SPEAKER_01]: I'm doing well, thanks for asking!
[SPEAKER_00]: Great to hear.
With persistent speaker profiles, later runs output stored names like Alice
instead of run-local labels like SPEAKER_00.
JSON format (--diarize-format json):
{
  "segments": [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello, how are you today?"},
    {"speaker": "SPEAKER_01", "start": 2.7, "end": 4.1, "text": "I'm doing well, thanks for asking!"},
    {"speaker": "SPEAKER_00", "start": 4.3, "end": 5.2, "text": "Great to hear."}
  ]
}
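The JSON format is convenient for post-processing. As a minimal sketch, assuming the output has exactly the shape shown above (a top-level "segments" list whose items carry "speaker", "start", and "end"), the following hypothetical helper totals the speaking time per speaker:

```python
import json
from collections import defaultdict


def talk_time_by_speaker(diarization_json: str) -> dict[str, float]:
    """Sum segment durations (end - start, in seconds) for each speaker label."""
    data = json.loads(diarization_json)
    totals = defaultdict(float)
    for segment in data["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)
```

The function takes the JSON text itself, so it works equally well on output captured from the command or read from a file.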
Speaker Hints
If you know how many speakers are in the recording, use --min-speakers and --max-speakers to improve accuracy:
# For a two-person interview
agent-cli transcribe --from-file interview.wav --diarize --min-speakers 2 --max-speakers 2 --hf-token YOUR_TOKEN
Note
Diarization requires the audio file to be saved. When using live recording with --diarize, ensure --save-recording is enabled (it's enabled by default).