OpenAI Whisper -- Robust Speech Recognition
Overview
A comprehensive skill for speech-to-text transcription using OpenAI's Whisper model. Whisper is a general-purpose speech recognition model trained on 680,000 hours of multilingual audio data, supporting transcription in 99 languages and translation to English. The model uses an encoder-decoder Transformer architecture that processes audio spectrograms and generates text tokens. Available in multiple sizes from the 39M-parameter tiny model to the 1.55B-parameter large model, plus the optimized turbo variant that provides near-large quality at 8x the speed. This skill covers local inference, CLI usage, the faster-whisper alternative, and integration with downstream pipelines.
When to Use
- Transcribing audio or video files to text (podcasts, meetings, interviews)
- Generating subtitles in SRT or WebVTT format for video content
- Building automated meeting notes or call summarization systems
- Translating non-English audio directly to English text
- Processing noisy, accented, or domain-specific audio where other ASR systems fail
- Word-level timestamp extraction for alignment or karaoke-style display
- Batch transcription of audio archives or media libraries
Quick Start
```bash
# Install Whisper (requires Python 3.8-3.11)
pip install -U openai-whisper

# Install ffmpeg (required for audio processing)
# macOS:  brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
```
```python
import whisper

# Load model (downloads automatically on first use)
model = whisper.load_model("turbo")

# Transcribe audio file
result = model.transcribe("meeting_recording.mp3")

# Full text
print(result["text"])

# Timestamped segments
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
```
Core Concepts
Model Sizes and Performance
| Model | Parameters | VRAM | Speed (vs realtime) | English WER | Best For |
|---|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x | 7.7% | Quick prototyping, embedded |
| base | 74M | ~1 GB | ~16x | 5.9% | Development, testing |
| small | 244M | ~2 GB | ~6x | 4.4% | Balanced speed/accuracy |
| medium | 769M | ~5 GB | ~2x | 3.6% | High accuracy, English |
| large | 1.55B | ~10 GB | ~1x | 3.0% | Best accuracy, multilingual |
| turbo | 809M | ~6 GB | ~8x | 3.2% | Recommended default |
English-only variants (`tiny.en`, `base.en`, `small.en`, `medium.en`) offer slightly better English performance.
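As a rough illustration, the VRAM column above can drive model selection automatically. This is a hypothetical helper, not part of the whisper API; the thresholds are the approximate figures from the table.

```python
def choose_model(vram_gb: float, english_only: bool = False) -> str:
    """Pick a Whisper model size from available VRAM (sketch; thresholds
    are the approximate GB figures from the table above)."""
    tiers = [
        (10, "large"),   # best accuracy, multilingual
        (6, "turbo"),    # recommended default
        (5, "medium"),
        (2, "small"),
        (1, "base"),
        (0, "tiny"),
    ]
    for min_vram, name in tiers:
        if vram_gb >= min_vram:
            # English-only variants exist for tiny/base/small/medium
            if english_only and name in {"tiny", "base", "small", "medium"}:
                return name + ".en"
            return name
    return "tiny"  # guard for negative inputs
```

The first-fit order prefers turbo over medium at equal VRAM, matching the table's "recommended default" note.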
Architecture
```
Audio Input ──► Mel Spectrogram (80 channels, 30-sec chunks)
                        │
                        ▼
               Encoder (Transformer)
                        │
                        ▼
                 Audio Features
                        │
                        ▼
      Decoder (Transformer) ──► Text Tokens
                        │
                        ▼
         Beam Search / Greedy Decoding
                        │
                        ▼
        Transcribed Text + Timestamps
```
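The fixed input geometry behind this diagram can be spelled out numerically. The constants below mirror those defined in `whisper.audio`; note the mel channel count is 80 for the original models and 128 for large-v3 and turbo.

```python
# Whisper's fixed audio front-end geometry (values from whisper.audio)
SAMPLE_RATE = 16000   # all audio is resampled to 16 kHz mono
CHUNK_SECONDS = 30    # the model always sees 30-second windows
N_FFT = 400           # 25 ms analysis window
HOP_LENGTH = 160      # 10 ms hop between frames
N_MELS = 80           # mel channels (128 for large-v3 / turbo)

n_samples = SAMPLE_RATE * CHUNK_SECONDS   # samples per 30-second window
n_frames = n_samples // HOP_LENGTH        # spectrogram frames per window
print(f"{n_samples} samples -> {n_frames} frames of {N_MELS} mel channels")
```

Shorter audio is zero-padded to the full 30-second window before encoding, which is why silence handling (no_speech detection) matters.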
Transcription Options
```python
import whisper

model = whisper.load_model("turbo")

result = model.transcribe(
    "audio.mp3",
    language="en",                    # Skip language detection (faster)
    task="transcribe",                # "transcribe" or "translate" (to English)
    initial_prompt="Technical podcast about Kubernetes and Docker.",
    word_timestamps=True,             # Per-word timing
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # Fallback temperatures
    no_speech_threshold=0.6,          # Filter silence
    condition_on_previous_text=True,  # Use prior context
)

# Word-level timestamps
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"  {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
```
Translation to English
```python
# Transcribe foreign audio directly to English text
result = model.transcribe(
    "spanish_interview.mp3",
    task="translate",  # Input: any language -> Output: English text
)
print(result["text"])
```
Initial Prompt for Domain Accuracy
```python
# Improve recognition of technical terms and proper nouns
result = model.transcribe(
    "tech_talk.mp3",
    initial_prompt=(
        "Discussion about PyTorch, CUDA, bitsandbytes quantization, "
        "and LoRA fine-tuning for LLaMA models."
    ),
)
```
CLI Usage
```bash
# Basic transcription
whisper recording.mp3 --model turbo

# Specify language and output format
whisper meeting.mp3 --model turbo --language English --output_format srt

# Translation to English
whisper german_podcast.mp3 --task translate --model turbo

# All output formats at once
whisper video.mp4 --model turbo --output_format all

# Multiple files
whisper file1.mp3 file2.mp3 file3.mp3 --model turbo --output_dir ./transcripts
```
Output Formats
| Format | Flag | Description |
|---|---|---|
| Plain text | --output_format txt | Raw transcription text |
| SRT | --output_format srt | Subtitles with timestamps |
| WebVTT | --output_format vtt | Web video subtitles |
| JSON | --output_format json | Full metadata with segments |
| TSV | --output_format tsv | Tab-separated values |
| All | --output_format all | Generate all formats |
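The CLI writes these files for you (programmatically, `whisper.utils.get_writer` does the same), but the formats are simple enough to emit directly. A minimal sketch for WebVTT, assuming segments shaped like `result["segments"]`:

```python
def segments_to_vtt(segments):
    """Render Whisper-style segments as a WebVTT string (sketch; expects
    dicts with 'start', 'end', 'text')."""
    def ts(t):
        # WebVTT uses a dot before milliseconds (SRT uses a comma)
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d}.{int((t % 1) * 1000):03d}"

    lines = ["WEBVTT", ""]  # required file header
    for seg in segments:
        lines.append(f"{ts(seg['start'])} --> {ts(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)
```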
faster-whisper -- 4x Faster Alternative
faster-whisper uses CTranslate2 for optimized inference, delivering 4x speedup with lower memory:
```bash
pip install faster-whisper
```
```python
from faster_whisper import WhisperModel

# Load model with CTranslate2 backend
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe with streaming segments (a generator; decoding runs lazily)
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
faster-whisper Configuration
| Parameter | Default | Description |
|---|---|---|
| `beam_size` | 5 | Beam search width (1 = greedy) |
| `best_of` | 5 | Candidates when sampling |
| `patience` | 1.0 | Beam search patience factor |
| `length_penalty` | 1.0 | Exponential length penalty |
| `temperature` | 0.0 | Sampling temperature |
| `vad_filter` | False | Voice activity detection filter |
| `vad_parameters` | None | VAD threshold configuration |
| `word_timestamps` | False | Enable per-word timestamps |
VAD Filtering for Cleaner Output
```python
segments, info = model.transcribe(
    "noisy_meeting.mp3",
    vad_filter=True,  # Filter non-speech segments
    vad_parameters=dict(
        threshold=0.5,  # Speech probability threshold
        min_speech_duration_ms=250,
        max_speech_duration_s=float("inf"),
        min_silence_duration_ms=2000,
        speech_pad_ms=400,
    ),
)
```
Batch Processing
```python
from pathlib import Path

import whisper


def format_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


model = whisper.load_model("turbo")

audio_dir = Path("./recordings")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)

for audio_file in audio_dir.glob("*.mp3"):
    print(f"Processing: {audio_file.name}")
    result = model.transcribe(str(audio_file), language="en")

    # Save as text
    txt_path = output_dir / f"{audio_file.stem}.txt"
    txt_path.write_text(result["text"])

    # Save as SRT
    srt_path = output_dir / f"{audio_file.stem}.srt"
    with open(srt_path, "w") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
```
Integration Patterns
Extract Audio from Video
```bash
# Extract audio track with ffmpeg before transcription
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

# Then transcribe
whisper audio.wav --model turbo
```
With LangChain for RAG
```python
from langchain_community.document_loaders import WhisperTranscriptionLoader
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and transcribe
loader = WhisperTranscriptionLoader(file_path="podcast.mp3", model="turbo")
docs = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Index for retrieval
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
```
Speaker Diarization with pyannote
```python
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline as DiarizationPipeline

# Step 1: Diarize speakers
diarization = DiarizationPipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your-hf-token",
)
diarization_result = diarization("meeting.wav")


def get_speaker_at_time(annotation, t):
    """Return the speaker label active at time t (or 'UNKNOWN')."""
    for turn, _, speaker in annotation.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"


# Step 2: Transcribe
model = WhisperModel("turbo", device="cuda", compute_type="float16")
segments, _ = model.transcribe("meeting.wav", word_timestamps=True)

# Step 3: Align speakers with transcript segments
for segment in segments:
    midpoint = (segment.start + segment.end) / 2
    speaker = get_speaker_at_time(diarization_result, midpoint)
    print(f"[{speaker}] {segment.text}")
```
Best Practices
- Use `turbo` model as default -- It provides near-large-v3 accuracy at 8x the speed, making it the best balance for most workloads.
- Specify language when known -- Setting `language="en"` skips auto-detection and reduces the chance of language confusion on short or noisy clips.
- Provide initial prompts for technical content -- Feed domain-specific terms, proper nouns, and acronyms via `initial_prompt` to dramatically improve recognition accuracy.
- Use `faster-whisper` for production -- The CTranslate2 backend delivers 4x faster inference and lower memory usage with identical output quality.
- Enable VAD filtering for noisy audio -- Voice Activity Detection removes silence and non-speech segments, producing cleaner transcripts and faster processing.
- Split long audio into chunks -- Whisper processes audio in 30-second windows internally. Files longer than 30 minutes may accumulate timestamp drift; split at natural boundaries.
- Convert to 16kHz mono WAV -- Whisper internally resamples all audio to 16kHz mono. Pre-converting avoids on-the-fly resampling overhead.
- Use GPU for production throughput -- GPU inference is 10-20x faster than CPU. Even a modest GPU dramatically improves batch processing times.
- Add speaker diarization separately -- Whisper does not identify speakers. Pair with pyannote-audio for speaker-attributed transcripts.
- Monitor for hallucinations -- Whisper can repeat phrases or generate text during silence. Use VAD filtering and post-process to detect repeated segments.
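The last practice can be automated as a post-processing pass. Whisper attaches per-segment quality fields (`avg_logprob`, `compression_ratio`, `no_speech_prob`); the sketch below flags suspect segments using thresholds borrowed from Whisper's own decoding fallbacks, plus a simple repeated-text check. Function names are illustrative, not a library API.

```python
def flag_suspect_segments(segments, logprob_threshold=-1.0,
                          compression_threshold=2.4, no_speech_threshold=0.6):
    """Return segments that look hallucinated, judged by the per-segment
    quality fields Whisper includes in result["segments"]."""
    return [
        seg for seg in segments
        if seg["avg_logprob"] < logprob_threshold
        or seg["compression_ratio"] > compression_threshold
        or seg["no_speech_prob"] > no_speech_threshold
    ]


def drop_repeats(segments, max_repeat=2):
    """Drop segments once the same text has repeated max_repeat times
    in a row (a common hallucination pattern during silence)."""
    out, streak = [], 0
    for seg in segments:
        if out and seg["text"].strip() == out[-1]["text"].strip():
            streak += 1
            if streak >= max_repeat:
                continue
        else:
            streak = 0
        out.append(seg)
    return out
```

Review flagged segments rather than deleting them blindly; low log-probability can also indicate genuinely hard audio.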
Troubleshooting
Whisper hallucinating or repeating text during silence:
Enable VAD filtering (`vad_filter=True` in faster-whisper) or increase `no_speech_threshold` to 0.7. Post-process to detect and remove segments with very low log-probability.
Transcription quality degrades on long files:
Split audio into 10-20 minute chunks at natural pauses. Use `condition_on_previous_text=False` if hallucinations compound over time.
Wrong language detected:
Explicitly set `language="en"` (or the correct language code). Auto-detection analyzes only the first 30 seconds and can misidentify the language on music intros or silence.
ffmpeg not found error:
Install ffmpeg system-wide: `brew install ffmpeg` (macOS), `sudo apt install ffmpeg` (Ubuntu), or `choco install ffmpeg` (Windows).
GPU not being used despite CUDA being available:
Explicitly set the device: `model = whisper.load_model("turbo", device="cuda")`. Verify CUDA is available with `torch.cuda.is_available()`.