Advanced Transcribe
A powerful skill for transcribing audio files to text. Includes structured workflows, validation checks, and reusable patterns for media.
A practical skill for transcribing audio to text — covering speech-to-text processing with OpenAI Whisper, speaker diarization, timestamp generation, multi-language support, and batch transcription workflows for meetings, interviews, and media content.
When to Use This Skill
Choose Advanced Transcribe when you need to:
- Transcribe audio recordings to text (meetings, interviews, podcasts)
- Generate timestamped transcripts for video subtitles
- Identify and label different speakers in a recording
- Transcribe audio in multiple languages
- Process large batches of audio files automatically
Consider alternatives when:
- You need text-to-speech generation (use a speech synthesis skill)
- You need real-time live transcription (use a live captioning tool)
- You need audio editing or enhancement (use an audio processing skill)
Quick Start
```bash
# Install dependencies
pip install openai
```
```python
from openai import OpenAI

client = OpenAI()

def transcribe(audio_path, response_format="text"):
    """Transcribe an audio file to text."""
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format=response_format,
        )
    return transcript

# Simple text transcription
text = transcribe("meeting.mp3")
print(text)

# With timestamps (SRT format for subtitles)
srt = transcribe("meeting.mp3", response_format="srt")
with open("meeting.srt", "w") as f:
    f.write(srt)

# Verbose JSON with segment-level timestamps
result = transcribe("meeting.mp3", response_format="verbose_json")
for segment in result.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```
Core Concepts
Transcription Output Formats
| Format | Content | Use Case |
|---|---|---|
| `text` | Plain text only | Quick transcription |
| `json` | Text with metadata | Application integration |
| `verbose_json` | Segments with timestamps | Detailed analysis |
| `srt` | SubRip subtitle format | Video subtitles |
| `vtt` | WebVTT subtitle format | Web video players |
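If you already have timestamped segments (e.g. from `verbose_json`) and want SRT output without a second API call, the conversion can be sketched in a few lines. `to_srt` is a hypothetical helper, not part of any API:

```python
def to_srt(segments):
    """Convert (start, end, text) tuples into SRT subtitle text."""
    def ts(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text.strip()}\n")
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello world")]))
```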
Speaker Diarization
```python
# Speaker diarization with pyannote
from pyannote.audio import Pipeline

def diarize_audio(audio_path):
    """Identify who speaks when in an audio file."""
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="YOUR_HF_TOKEN",
    )
    diarization = pipeline(audio_path)
    speakers = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        speakers.append({
            "speaker": speaker,
            "start": round(turn.start, 2),
            "end": round(turn.end, 2),
        })
    return speakers

# Combine with Whisper transcription
def transcribe_with_speakers(audio_path):
    transcript = transcribe(audio_path, response_format="verbose_json")
    speakers = diarize_audio(audio_path)
    labeled = []
    for segment in transcript.segments:
        # Attribute each segment to the speaker active at its midpoint
        mid_time = (segment.start + segment.end) / 2
        speaker = "Unknown"
        for s in speakers:
            if s["start"] <= mid_time <= s["end"]:
                speaker = s["speaker"]
                break
        labeled.append({
            "speaker": speaker,
            "start": segment.start,
            "end": segment.end,
            "text": segment.text,
        })
    return labeled
```
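Speaker-labeled segments are often easier to read when consecutive segments from the same speaker are collapsed into turns. A minimal sketch, assuming the segment dicts produced above; `merge_speaker_turns` is a hypothetical helper:

```python
def merge_speaker_turns(labeled_segments):
    """Collapse consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in labeled_segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker continues: extend the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"].strip()
        else:
            turns.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
            })
    return turns
```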
Batch Processing
```python
from pathlib import Path
import json

def batch_transcribe(input_dir, output_dir, response_format="text"):
    """Transcribe all audio files in a directory.

    Reuses the transcribe() helper from Quick Start.
    """
    Path(output_dir).mkdir(exist_ok=True)
    supported = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".webm"}
    results = []
    for audio_path in sorted(Path(input_dir).iterdir()):
        if audio_path.suffix.lower() not in supported:
            continue
        print(f"Transcribing: {audio_path.name}")
        transcript = transcribe(str(audio_path), response_format)
        out_file = Path(output_dir) / f"{audio_path.stem}.txt"
        out_file.write_text(
            transcript if isinstance(transcript, str) else json.dumps(transcript, indent=2)
        )
        results.append({"file": audio_path.name, "output": str(out_file)})
    print(f"\nTranscribed {len(results)} files → {output_dir}")
    return results

batch_transcribe("./recordings", "./transcripts")
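Long batch runs can fail partway through; one way to make them resumable is to skip files that already have a transcript. A sketch, assuming the same output naming convention as the batch function above; `pending_files` is a hypothetical helper:

```python
from pathlib import Path

SUPPORTED = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".webm"}

def pending_files(input_dir, output_dir):
    """Return audio files in input_dir that don't yet have a .txt transcript."""
    out = Path(output_dir)
    return [
        p for p in sorted(Path(input_dir).iterdir())
        if p.suffix.lower() in SUPPORTED and not (out / f"{p.stem}.txt").exists()
    ]
```

Calling this before the batch loop means a rerun only pays for files that failed the first time.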
Configuration
| Parameter | Description | Example |
|---|---|---|
| `model` | Whisper model version | `"whisper-1"` |
| `language` | Source audio language (ISO 639-1); omit to auto-detect | `"en"` |
| `response_format` | Output format | `"text"` / `"srt"` |
| `temperature` | Sampling temperature for decoding | `0.0` (deterministic) |
| `prompt` | Optional context hint for accuracy | `"Meeting about Q4 budget"` |
| `diarize` | Enable speaker identification | `true` |
Best Practices
- Provide a prompt hint for domain-specific terms — Whisper can misrecognize technical terms, product names, and acronyms. Pass a `prompt` parameter with key terms: `prompt="Discussion about Kubernetes, kubectl, and EKS cluster management"`. This dramatically improves accuracy for specialized vocabulary.
- Pre-process audio for better results — Normalize volume, remove background noise, and convert to 16 kHz mono WAV before transcription. Clean audio produces significantly more accurate transcriptions. Use ffmpeg: `ffmpeg -i input.mp3 -ac 1 -ar 16000 clean.wav`.
- Use verbose_json format when you need timestamps — The `verbose_json` format gives segment-level timestamps that you can use for subtitle generation, speaker labeling, and audio-to-text alignment. Plain text format loses all timing information.
- Break long audio files into chunks — Whisper has a 25 MB file size limit per API call. For longer recordings, split the audio into chunks at silence points using ffmpeg or pydub, transcribe each chunk, then merge the results with adjusted timestamps.
- Validate transcriptions of critical content — AI transcription is not perfect, especially for accented speech, multiple overlapping speakers, or poor audio quality. For legal, medical, or compliance-critical content, always have a human review the transcript.
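The chunking advice above leaves the merge step implicit: each chunk's segment timestamps are relative to that chunk, so the chunk's start offset must be added back before the results are stitched together. A minimal sketch, assuming each chunk's offset in seconds is known from the split; `merge_chunk_segments` is a hypothetical helper:

```python
def merge_chunk_segments(chunks):
    """Merge per-chunk segment lists into one absolute-time list.

    chunks: list of (chunk_offset_seconds, segments), where each segment is a
    dict with "start", "end", and "text" relative to its chunk.
    """
    merged = []
    for offset, segments in chunks:
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,  # shift into absolute time
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```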
Common Issues
Transcription accuracy drops with background noise — Background music, HVAC noise, and keyboard typing significantly reduce accuracy. Pre-process with noise reduction: `ffmpeg -i input.mp3 -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean.mp3`. The highpass/lowpass filters focus on the speech frequency range.
Speaker diarization assigns wrong labels — Diarization models work best with clear turn-taking. Overlapping speech, interruptions, and similar-sounding voices cause errors. For meetings, ask participants to identify themselves at the start, and cross-reference diarization labels with the content.
Timestamps drift in long recordings — Whisper processes audio in 30-second chunks, and small timing errors accumulate over long files. For precise timestamp alignment (subtitles, synced playback), use the `verbose_json` format and recalibrate timestamps by matching known anchor points in the audio.
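The anchor-point recalibration mentioned above can be sketched as a piecewise-linear remapping: given (observed, true) timestamp pairs at known points in the audio, timestamps between anchors are interpolated, and timestamps outside the anchor range are shifted by the nearest anchor's offset. `recalibrate` is a hypothetical helper:

```python
def recalibrate(t, anchors):
    """Remap an observed timestamp t using (observed, true) anchor pairs."""
    anchors = sorted(anchors)
    if t <= anchors[0][0]:
        # Before the first anchor: apply its constant offset
        obs, true = anchors[0]
        return t + (true - obs)
    if t >= anchors[-1][0]:
        # After the last anchor: apply its constant offset
        obs, true = anchors[-1]
        return t + (true - obs)
    for (o1, t1), (o2, t2) in zip(anchors, anchors[1:]):
        if o1 <= t <= o2:
            # Linear interpolation between the surrounding anchors
            frac = (t - o1) / (o2 - o1)
            return t1 + frac * (t2 - t1)
```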