Master Multimodal Whisper

Enterprise-grade skill for OpenAI's general-purpose speech recognition model. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

OpenAI Whisper -- Robust Speech Recognition

Overview

A comprehensive skill for speech-to-text transcription using OpenAI's Whisper model. Whisper is a general-purpose speech recognition model trained on 680,000 hours of multilingual audio data, supporting transcription in 99 languages and translation to English. The model uses an encoder-decoder Transformer architecture that processes audio spectrograms and generates text tokens. Available in multiple sizes from the 39M-parameter tiny model to the 1.55B-parameter large model, plus the optimized turbo variant that provides near-large quality at 8x the speed. This skill covers local inference, CLI usage, the faster-whisper alternative, and integration with downstream pipelines.

When to Use

  • Transcribing audio or video files to text (podcasts, meetings, interviews)
  • Generating subtitles in SRT or WebVTT format for video content
  • Building automated meeting notes or call summarization systems
  • Translating non-English audio directly to English text
  • Processing noisy, accented, or domain-specific audio where other ASR systems fail
  • Word-level timestamp extraction for alignment or karaoke-style display
  • Batch transcription of audio archives or media libraries

Quick Start

# Install Whisper (requires Python 3.8-3.11)
pip install -U openai-whisper

# Install ffmpeg (required for audio processing)
# macOS:  brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
import whisper

# Load model (downloads automatically on first use)
model = whisper.load_model("turbo")

# Transcribe audio file
result = model.transcribe("meeting_recording.mp3")

# Full text
print(result["text"])

# Timestamped segments
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")

Core Concepts

Model Sizes and Performance

Model    Parameters   VRAM     Speed (vs realtime)   English WER   Best For
tiny     39M          ~1 GB    ~32x                  7.7%          Quick prototyping, embedded
base     74M          ~1 GB    ~16x                  5.9%          Development, testing
small    244M         ~2 GB    ~6x                   4.4%          Balanced speed/accuracy
medium   769M         ~5 GB    ~2x                   3.6%          High accuracy, English
large    1.55B        ~10 GB   ~1x                   3.0%          Best accuracy, multilingual
turbo    809M         ~6 GB    ~8x                   3.2%          Recommended default

English-only variants (tiny.en, base.en, small.en, medium.en) offer slightly better English performance.
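As a sketch, model selection against a VRAM budget can be automated from the table above. The numbers below mirror the table; `pick_model` is an illustrative helper, not part of the Whisper API:

```python
# Approximate VRAM requirements (GB), taken from the model table above.
MODEL_VRAM_GB = {
    "tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large": 10,
}

def pick_model(vram_gb, prefer=("large", "turbo", "medium", "small", "base", "tiny")):
    """Return the first model in preference order that fits the VRAM budget."""
    for name in prefer:
        if MODEL_VRAM_GB[name] <= vram_gb:
            return name
    raise ValueError(f"No model fits in {vram_gb} GB of VRAM")

print(pick_model(8))    # large needs ~10 GB, so an 8 GB card falls back to turbo
print(pick_model(1.5))  # only tiny/base fit under 1.5 GB
```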

Architecture

Audio Input ──► Mel Spectrogram (80 channels, 30-sec chunks)
                        │
                  Encoder (Transformer)
                        │
                  Audio Features
                        │
                  Decoder (Transformer) ──► Text Tokens
                        │
              Beam Search / Greedy Decoding
                        │
                  Transcribed Text + Timestamps
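The framing arithmetic behind the fixed 30-second window can be checked directly (16 kHz sample rate and a 10 ms hop are Whisper's defaults; the encoder's stride-2 convolution halves the frame count):

```python
SAMPLE_RATE = 16_000   # Whisper resamples all audio to 16 kHz mono
HOP_LENGTH = 160       # 10 ms hop between spectrogram frames
CHUNK_SECONDS = 30     # fixed input window

samples_per_chunk = SAMPLE_RATE * CHUNK_SECONDS      # 480,000 samples
frames_per_chunk = samples_per_chunk // HOP_LENGTH   # 3,000 mel frames
encoder_positions = frames_per_chunk // 2            # 1,500 after stride-2 conv

print(samples_per_chunk, frames_per_chunk, encoder_positions)
```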

Transcription Options

import whisper

model = whisper.load_model("turbo")
result = model.transcribe(
    "audio.mp3",
    language="en",                    # Skip language detection (faster)
    task="transcribe",                # "transcribe" or "translate" (to English)
    initial_prompt="Technical podcast about Kubernetes and Docker.",
    word_timestamps=True,             # Per-word timing
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # Fallback temperatures
    no_speech_threshold=0.6,          # Filter silence
    condition_on_previous_text=True,  # Use prior context
)

# Word-level timestamps
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"  {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")

Translation to English

# Transcribe foreign audio directly to English text
result = model.transcribe(
    "spanish_interview.mp3",
    task="translate",  # Input: any language -> Output: English text
)
print(result["text"])

Initial Prompt for Domain Accuracy

# Improve recognition of technical terms and proper nouns
result = model.transcribe(
    "tech_talk.mp3",
    initial_prompt=(
        "Discussion about PyTorch, CUDA, bitsandbytes quantization, "
        "and LoRA fine-tuning for LLaMA models."
    ),
)
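A tiny helper for assembling such prompts from a glossary. `build_initial_prompt` is hypothetical, not part of Whisper; note that Whisper only attends to roughly the last 224 tokens of the prompt, so keep it short:

```python
# Hypothetical helper: build an initial_prompt from a topic and a glossary.
def build_initial_prompt(topic, terms):
    return f"{topic} Key terms: {', '.join(terms)}."

prompt = build_initial_prompt(
    "Discussion about machine-learning infrastructure.",
    ["PyTorch", "CUDA", "LoRA", "bitsandbytes"],
)
print(prompt)
```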

CLI Usage

# Basic transcription
whisper recording.mp3 --model turbo

# Specify language and output format
whisper meeting.mp3 --model turbo --language English --output_format srt

# Translation to English
whisper german_podcast.mp3 --task translate --model turbo

# All output formats at once
whisper video.mp4 --model turbo --output_format all

# Multiple files
whisper file1.mp3 file2.mp3 file3.mp3 --model turbo --output_dir ./transcripts

Output Formats

Format       Flag                   Description
Plain text   --output_format txt    Raw transcription text
SRT          --output_format srt    Subtitles with timestamps
WebVTT       --output_format vtt    Web video subtitles
JSON         --output_format json   Full metadata with segments
TSV          --output_format tsv    Tab-separated values
All          --output_format all    Generate all formats
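As a sketch of what `--output_format vtt` emits, here is a minimal WebVTT writer for whisper-style segment dicts (illustrative only; the real CLI writer handles more edge cases). WebVTT differs from SRT mainly in its `WEBVTT` header and `.` decimal separator in timestamps:

```python
# Minimal WebVTT writer for whisper-style segment dicts (illustrative).
def to_vtt(segments):
    def ts(seconds):
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"  # '.' separator, per WebVTT
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{ts(seg['start'])} --> {ts(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)

vtt = to_vtt([{"start": 0.0, "end": 2.5, "text": " Hello world"}])
print(vtt)
```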

faster-whisper -- 4x Faster Alternative

faster-whisper uses CTranslate2 for optimized inference, delivering 4x speedup with lower memory:

pip install faster-whisper
from faster_whisper import WhisperModel

# Load model with CTranslate2 backend
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe with streaming segments
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

faster-whisper Configuration

Parameter         Default   Description
beam_size         5         Beam search width (1 = greedy)
best_of           5         Candidates when sampling
patience          1.0       Beam search patience factor
length_penalty    1.0       Exponential length penalty
temperature       0.0       Sampling temperature
vad_filter        False     Voice activity detection filter
vad_parameters    None      VAD threshold configuration
word_timestamps   False     Enable per-word timestamps

VAD Filtering for Cleaner Output

segments, info = model.transcribe(
    "noisy_meeting.mp3",
    vad_filter=True,  # Filter non-speech segments
    vad_parameters=dict(
        threshold=0.5,  # Speech probability threshold
        min_speech_duration_ms=250,
        max_speech_duration_s=float("inf"),
        min_silence_duration_ms=2000,
        speech_pad_ms=400,
    ),
)
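To see how two of these parameters interact, here is a simplified merge step (illustrative only, not Silero VAD's actual implementation) that pads detected speech regions by `speech_pad_ms` and merges regions separated by less than `min_silence_duration_ms`:

```python
# Illustrative sketch of VAD post-processing: pad speech regions, then merge
# regions whose gap is shorter than the minimum silence duration.
def merge_speech_regions(regions, speech_pad_ms=400, min_silence_duration_ms=2000):
    """regions: sorted list of (start_s, end_s) speech intervals."""
    pad = speech_pad_ms / 1000.0
    min_gap = min_silence_duration_ms / 1000.0
    padded = [(max(0.0, s - pad), e + pad) for s, e in regions]
    merged = [list(padded[0])]
    for s, e in padded[1:]:
        if s - merged[-1][1] < min_gap:       # gap too short: same speech region
            merged[-1][1] = max(merged[-1][1], e)
        else:                                  # real silence: start a new region
            merged.append([s, e])
    return [tuple(r) for r in merged]

merged = merge_speech_regions([(1.0, 2.0), (3.0, 4.0), (10.0, 11.0)])
print(merged)  # first two regions merge; the third stays separate
```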

Batch Processing

import whisper
from pathlib import Path

def format_timestamp(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("turbo")
audio_dir = Path("./recordings")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)

for audio_file in audio_dir.glob("*.mp3"):
    print(f"Processing: {audio_file.name}")
    result = model.transcribe(str(audio_file), language="en")

    # Save as text
    txt_path = output_dir / f"{audio_file.stem}.txt"
    txt_path.write_text(result["text"])

    # Save as SRT
    srt_path = output_dir / f"{audio_file.stem}.srt"
    with open(srt_path, "w") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")

Integration Patterns

Extract Audio from Video

# Extract audio track with ffmpeg before transcription
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

# Then transcribe
whisper audio.wav --model turbo
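For batch pipelines, the same ffmpeg invocation can be built in Python; `extract_audio_cmd` is a hypothetical helper around the command above, kept as an argument list so it can be passed to `subprocess` safely:

```python
# Hypothetical wrapper: build the ffmpeg argument list for audio extraction.
import subprocess

def extract_audio_cmd(video_path, audio_path):
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        "-ar", "16000",          # 16 kHz, Whisper's native rate
        "-ac", "1",              # mono
        audio_path,
    ]

cmd = extract_audio_cmd("video.mp4", "audio.wav")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```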

With LangChain for RAG

import whisper
from langchain_core.documents import Document
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and transcribe, wrapping the transcript in a Document by hand
# (audio loader classes vary across langchain-community versions)
result = whisper.load_model("turbo").transcribe("podcast.mp3")
docs = [Document(page_content=result["text"], metadata={"source": "podcast.mp3"})]

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Index for retrieval
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
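The `chunk_size` / `chunk_overlap` behavior can be illustrated without LangChain. This is a minimal fixed-size character chunker, simpler than `RecursiveCharacterTextSplitter` (which also splits on separators like paragraph breaks):

```python
# Illustrative fixed-size chunker with overlap (not LangChain's algorithm).
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("x" * 2500, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # each chunk shares 200 chars with the next
```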

Speaker Diarization with pyannote

from pyannote.audio import Pipeline as DiarizationPipeline
from faster_whisper import WhisperModel

# Step 1: Diarize speakers
diarization = DiarizationPipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your-hf-token",
)
diarization_result = diarization("meeting.wav")

# Helper: find the speaker whose diarization turn contains a given time
def get_speaker_at_time(annotation, t):
    for turn, _, speaker in annotation.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

# Step 2: Transcribe
model = WhisperModel("turbo", device="cuda", compute_type="float16")
segments, _ = model.transcribe("meeting.wav", word_timestamps=True)

# Step 3: Align speakers with transcript segments
for segment in segments:
    midpoint = (segment.start + segment.end) / 2
    speaker = get_speaker_at_time(diarization_result, midpoint)
    print(f"[{speaker}] {segment.text}")

Best Practices

  1. Use turbo model as default -- It provides near-large-v3 accuracy at 8x the speed, making it the best balance for most workloads.
  2. Specify language when known -- Setting language="en" skips auto-detection and reduces the chance of language confusion on short or noisy clips.
  3. Provide initial prompts for technical content -- Feed domain-specific terms, proper nouns, and acronyms via initial_prompt to dramatically improve recognition accuracy.
  4. Use faster-whisper for production -- The CTranslate2 backend delivers 4x faster inference and lower memory usage with identical output quality.
  5. Enable VAD filtering for noisy audio -- Voice Activity Detection removes silence and non-speech segments, producing cleaner transcripts and faster processing.
  6. Split long audio into chunks -- Whisper processes audio in 30-second windows internally. Files longer than 30 minutes may accumulate timestamp drift; split at natural boundaries.
  7. Convert to 16kHz mono WAV -- Whisper internally resamples all audio to 16kHz mono. Pre-converting avoids on-the-fly resampling overhead.
  8. Use GPU for production throughput -- GPU inference is 10-20x faster than CPU. Even a modest GPU dramatically improves batch processing times.
  9. Add speaker diarization separately -- Whisper does not identify speakers. Pair with pyannote-audio for speaker-attributed transcripts.
  10. Monitor for hallucinations -- Whisper can repeat phrases or generate text during silence. Use VAD filtering and post-process to detect repeated segments.
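A minimal detector for point 10 (hypothetical post-processing, flagging consecutive segments whose text repeats verbatim, a common hallucination signature):

```python
# Illustrative hallucination check: flag segments that repeat the previous one.
def find_repeated_segments(segments):
    flagged = []
    prev = None
    for i, seg in enumerate(segments):
        text = seg["text"].strip().lower()
        if text and text == prev:
            flagged.append(i)  # index of the repeated segment
        prev = text
    return flagged

segs = [
    {"text": " Thanks for watching."},
    {"text": " Thanks for watching."},
    {"text": " See you next time."},
]
print(find_repeated_segments(segs))  # [1]
```

Flagged indices can then be dropped or re-transcribed with a different temperature.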

Troubleshooting

Whisper hallucinating or repeating text during silence: Enable VAD filtering (vad_filter=True in faster-whisper) or increase no_speech_threshold to 0.7. Post-process to detect and remove segments with very low log-probability.

Transcription quality degrades on long files: Split audio into 10-20 minute chunks at natural pauses. Use condition_on_previous_text=False if hallucinations compound over time.

Wrong language detected: Explicitly set language="en" (or the correct language code). Auto-detection analyzes only the first 30 seconds and can misidentify on music intros or silence.

ffmpeg not found error: Install ffmpeg system-wide: brew install ffmpeg (macOS), sudo apt install ffmpeg (Ubuntu), or choco install ffmpeg (Windows).

GPU not being used despite CUDA being available: Explicitly set device: model = whisper.load_model("turbo", device="cuda"). Verify CUDA is available with torch.cuda.is_available().
