OpenAI Whisper -- Robust Speech Recognition
Overview
A comprehensive skill for speech-to-text transcription using OpenAI's Whisper model. Whisper is a general-purpose speech recognition model trained on 680,000 hours of multilingual audio data, supporting transcription in 99 languages and translation to English. The model uses an encoder-decoder Transformer architecture that processes audio spectrograms and generates text tokens. Available in multiple sizes from the 39M-parameter tiny model to the 1.55B-parameter large model, plus the optimized turbo variant that provides near-large quality at 8x the speed. This skill covers local inference, CLI usage, the faster-whisper alternative, and integration with downstream pipelines.
When to Use
- Transcribing audio or video files to text (podcasts, meetings, interviews)
- Generating subtitles in SRT or WebVTT format for video content
- Building automated meeting notes or call summarization systems
- Translating non-English audio directly to English text
- Processing noisy, accented, or domain-specific audio where other ASR systems fail
- Word-level timestamp extraction for alignment or karaoke-style display
- Batch transcription of audio archives or media libraries
Quick Start
```bash
# Install Whisper (requires Python 3.8-3.11)
pip install -U openai-whisper

# Install ffmpeg (required for audio processing)
# macOS:  brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
```
```python
import whisper

# Load model (downloads automatically on first use)
model = whisper.load_model("turbo")

# Transcribe audio file
result = model.transcribe("meeting_recording.mp3")

# Full text
print(result["text"])

# Timestamped segments
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
```
Core Concepts
Model Sizes and Performance
| Model | Parameters | VRAM | Speed (vs realtime) | English WER | Best For |
|---|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x | 7.7% | Quick prototyping, embedded |
| base | 74M | ~1 GB | ~16x | 5.9% | Development, testing |
| small | 244M | ~2 GB | ~6x | 4.4% | Balanced speed/accuracy |
| medium | 769M | ~5 GB | ~2x | 3.6% | High accuracy, English |
| large | 1.55B | ~10 GB | ~1x | 3.0% | Best accuracy, multilingual |
| turbo | 809M | ~6 GB | ~8x | 3.2% | Recommended default |
English-only variants (`tiny.en`, `base.en`, `small.en`, `medium.en`) offer slightly better English performance.
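As a rough illustration, the VRAM column above can drive model selection automatically. This is a hypothetical helper, not part of the whisper API; the thresholds are the approximate figures from the table.

```python
def choose_model(vram_gb: float, english_only: bool = False) -> str:
    """Pick a Whisper model size from available VRAM (sketch; thresholds
    are the approximate GB figures from the table above)."""
    tiers = [
        (10, "large"),   # best accuracy, multilingual
        (6, "turbo"),    # recommended default
        (5, "medium"),
        (2, "small"),
        (1, "base"),
        (0, "tiny"),
    ]
    for min_vram, name in tiers:
        if vram_gb >= min_vram:
            # English-only variants exist for tiny/base/small/medium
            if english_only and name in {"tiny", "base", "small", "medium"}:
                return name + ".en"
            return name
    return "tiny"  # guard for negative inputs
```

The first-fit order prefers turbo over medium at equal VRAM, matching the table's "recommended default" note.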
Architecture
```
Audio Input ──► Mel Spectrogram (80 channels, 30-sec chunks)
                        │
                        ▼
               Encoder (Transformer)
                        │
                        ▼
                 Audio Features
                        │
                        ▼
      Decoder (Transformer) ──► Text Tokens
                        │
                        ▼
         Beam Search / Greedy Decoding
                        │
                        ▼
        Transcribed Text + Timestamps
```
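The fixed input geometry behind this diagram can be spelled out numerically. The constants below mirror those defined in `whisper.audio`; note the mel channel count is 80 for the original models and 128 for large-v3 and turbo.

```python
# Whisper's fixed audio front-end geometry (values from whisper.audio)
SAMPLE_RATE = 16000   # all audio is resampled to 16 kHz mono
CHUNK_SECONDS = 30    # the model always sees 30-second windows
N_FFT = 400           # 25 ms analysis window
HOP_LENGTH = 160      # 10 ms hop between frames
N_MELS = 80           # mel channels (128 for large-v3 / turbo)

n_samples = SAMPLE_RATE * CHUNK_SECONDS   # samples per 30-second window
n_frames = n_samples // HOP_LENGTH        # spectrogram frames per window
print(f"{n_samples} samples -> {n_frames} frames of {N_MELS} mel channels")
```

Shorter audio is zero-padded to the full 30-second window before encoding, which is why silence handling (no_speech detection) matters.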
Transcription Options
```python
import whisper

model = whisper.load_model("turbo")

result = model.transcribe(
    "audio.mp3",
    language="en",                    # Skip language detection (faster)
    task="transcribe",                # "transcribe" or "translate" (to English)
    initial_prompt="Technical podcast about Kubernetes and Docker.",
    word_timestamps=True,             # Per-word timing
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # Fallback temperatures
    no_speech_threshold=0.6,          # Filter silence
    condition_on_previous_text=True,  # Use prior context
)

# Word-level timestamps
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"  {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
```
Translation to English
```python
# Transcribe foreign audio directly to English text
result = model.transcribe(
    "spanish_interview.mp3",
    task="translate",  # Input: any language -> Output: English text
)
print(result["text"])
```
Initial Prompt for Domain Accuracy
```python
# Improve recognition of technical terms and proper nouns
result = model.transcribe(
    "tech_talk.mp3",
    initial_prompt=(
        "Discussion about PyTorch, CUDA, bitsandbytes quantization, "
        "and LoRA fine-tuning for LLaMA models."
    ),
)
```
CLI Usage
```bash
# Basic transcription
whisper recording.mp3 --model turbo

# Specify language and output format
whisper meeting.mp3 --model turbo --language English --output_format srt

# Translation to English
whisper german_podcast.mp3 --task translate --model turbo

# All output formats at once
whisper video.mp4 --model turbo --output_format all

# Multiple files
whisper file1.mp3 file2.mp3 file3.mp3 --model turbo --output_dir ./transcripts
```
Output Formats
| Format | Flag | Description |
|---|---|---|
| Plain text | --output_format txt | Raw transcription text |
| SRT | --output_format srt | Subtitles with timestamps |
| WebVTT | --output_format vtt | Web video subtitles |
| JSON | --output_format json | Full metadata with segments |
| TSV | --output_format tsv | Tab-separated values |
| All | --output_format all | Generate all formats |
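The CLI writes these files for you (programmatically, `whisper.utils.get_writer` does the same), but the formats are simple enough to emit directly. A minimal sketch for WebVTT, assuming segments shaped like `result["segments"]`:

```python
def segments_to_vtt(segments):
    """Render Whisper-style segments as a WebVTT string (sketch; expects
    dicts with 'start', 'end', 'text')."""
    def ts(t):
        # WebVTT uses a dot before milliseconds (SRT uses a comma)
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d}.{int((t % 1) * 1000):03d}"

    lines = ["WEBVTT", ""]  # required file header
    for seg in segments:
        lines.append(f"{ts(seg['start'])} --> {ts(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)
```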
faster-whisper -- 4x Faster Alternative
faster-whisper uses CTranslate2 for optimized inference, delivering 4x speedup with lower memory:
```bash
pip install faster-whisper
```
```python
from faster_whisper import WhisperModel

# Load model with CTranslate2 backend
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe with streaming segments (a generator; decoding runs lazily)
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
faster-whisper Configuration
| Parameter | Default | Description |
|---|---|---|
| `beam_size` | 5 | Beam search width (1 = greedy) |
| `best_of` | 5 | Candidates when sampling |
| `patience` | 1.0 | Beam search patience factor |
| `length_penalty` | 1.0 | Exponential length penalty |
| `temperature` | 0.0 | Sampling temperature |
| `vad_filter` | False | Voice activity detection filter |
| `vad_parameters` | None | VAD threshold configuration |
| `word_timestamps` | False | Enable per-word timestamps |
VAD Filtering for Cleaner Output
```python
segments, info = model.transcribe(
    "noisy_meeting.mp3",
    vad_filter=True,  # Filter non-speech segments
    vad_parameters=dict(
        threshold=0.5,  # Speech probability threshold
        min_speech_duration_ms=250,
        max_speech_duration_s=float("inf"),
        min_silence_duration_ms=2000,
        speech_pad_ms=400,
    ),
)
```
Batch Processing
```python
from pathlib import Path

import whisper


def format_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


model = whisper.load_model("turbo")

audio_dir = Path("./recordings")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)

for audio_file in audio_dir.glob("*.mp3"):
    print(f"Processing: {audio_file.name}")
    result = model.transcribe(str(audio_file), language="en")

    # Save as text
    txt_path = output_dir / f"{audio_file.stem}.txt"
    txt_path.write_text(result["text"])

    # Save as SRT
    srt_path = output_dir / f"{audio_file.stem}.srt"
    with open(srt_path, "w") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
```
Integration Patterns
Extract Audio from Video
```bash
# Extract audio track with ffmpeg before transcription
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

# Then transcribe
whisper audio.wav --model turbo
```
With LangChain for RAG
```python
from langchain_community.document_loaders import WhisperTranscriptionLoader
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and transcribe
loader = WhisperTranscriptionLoader(file_path="podcast.mp3", model="turbo")
docs = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Index for retrieval
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
```
Speaker Diarization with pyannote
```python
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline as DiarizationPipeline

# Step 1: Diarize speakers
diarization = DiarizationPipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your-hf-token",
)
diarization_result = diarization("meeting.wav")


def get_speaker_at_time(annotation, t):
    """Return the speaker label active at time t (or 'UNKNOWN')."""
    for turn, _, speaker in annotation.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"


# Step 2: Transcribe
model = WhisperModel("turbo", device="cuda", compute_type="float16")
segments, _ = model.transcribe("meeting.wav", word_timestamps=True)

# Step 3: Align speakers with transcript segments
for segment in segments:
    midpoint = (segment.start + segment.end) / 2
    speaker = get_speaker_at_time(diarization_result, midpoint)
    print(f"[{speaker}] {segment.text}")
```
Best Practices
- Use `turbo` model as default -- It provides near-large-v3 accuracy at 8x the speed, making it the best balance for most workloads.
- Specify language when known -- Setting `language="en"` skips auto-detection and reduces the chance of language confusion on short or noisy clips.
- Provide initial prompts for technical content -- Feed domain-specific terms, proper nouns, and acronyms via `initial_prompt` to dramatically improve recognition accuracy.
- Use `faster-whisper` for production -- The CTranslate2 backend delivers 4x faster inference and lower memory usage with identical output quality.
- Enable VAD filtering for noisy audio -- Voice Activity Detection removes silence and non-speech segments, producing cleaner transcripts and faster processing.
- Split long audio into chunks -- Whisper processes audio in 30-second windows internally. Files longer than 30 minutes may accumulate timestamp drift; split at natural boundaries.
- Convert to 16kHz mono WAV -- Whisper internally resamples all audio to 16kHz mono. Pre-converting avoids on-the-fly resampling overhead.
- Use GPU for production throughput -- GPU inference is 10-20x faster than CPU. Even a modest GPU dramatically improves batch processing times.
- Add speaker diarization separately -- Whisper does not identify speakers. Pair with pyannote-audio for speaker-attributed transcripts.
- Monitor for hallucinations -- Whisper can repeat phrases or generate text during silence. Use VAD filtering and post-process to detect repeated segments.
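The last practice can be automated as a post-processing pass. Whisper attaches per-segment quality fields (`avg_logprob`, `compression_ratio`, `no_speech_prob`); the sketch below flags suspect segments using thresholds borrowed from Whisper's own decoding fallbacks, plus a simple repeated-text check. Function names are illustrative, not a library API.

```python
def flag_suspect_segments(segments, logprob_threshold=-1.0,
                          compression_threshold=2.4, no_speech_threshold=0.6):
    """Return segments that look hallucinated, judged by the per-segment
    quality fields Whisper includes in result["segments"]."""
    return [
        seg for seg in segments
        if seg["avg_logprob"] < logprob_threshold
        or seg["compression_ratio"] > compression_threshold
        or seg["no_speech_prob"] > no_speech_threshold
    ]


def drop_repeats(segments, max_repeat=2):
    """Drop segments once the same text has repeated max_repeat times
    in a row (a common hallucination pattern during silence)."""
    out, streak = [], 0
    for seg in segments:
        if out and seg["text"].strip() == out[-1]["text"].strip():
            streak += 1
            if streak >= max_repeat:
                continue
        else:
            streak = 0
        out.append(seg)
    return out
```

Review flagged segments rather than deleting them blindly; low log-probability can also indicate genuinely hard audio.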
Troubleshooting
Whisper hallucinating or repeating text during silence:
Enable VAD filtering (`vad_filter=True` in faster-whisper) or increase `no_speech_threshold` to 0.7. Post-process to detect and remove segments with very low log-probability.
Transcription quality degrades on long files:
Split audio into 10-20 minute chunks at natural pauses. Use `condition_on_previous_text=False` if hallucinations compound over time.
Wrong language detected:
Explicitly set `language="en"` (or the correct language code). Auto-detection analyzes only the first 30 seconds and can misidentify the language on music intros or silence.
ffmpeg not found error:
Install ffmpeg system-wide: `brew install ffmpeg` (macOS), `sudo apt install ffmpeg` (Ubuntu), or `choco install ffmpeg` (Windows).
GPU not being used despite CUDA being available:
Explicitly set the device: `model = whisper.load_model("turbo", device="cuda")`. Verify CUDA is available with `torch.cuda.is_available()`.