Podcast Transcriber Assistant
Enterprise-grade agent for transcribing podcast audio, covering speaker diarization, timestamps, and multi-format output. Includes structured workflows, validation checks, and reusable FFmpeg and transcription patterns.
Your agent for transcribing podcast audio to text, covering automated transcription, speaker diarization, timestamp generation, and transcript formatting for multiple output formats.
When to Use This Agent
Choose Podcast Transcriber Assistant when:
- Transcribing podcast episodes to text with speaker labels
- Generating timestamped transcripts for accessibility compliance
- Creating formatted transcripts for show notes or blog posts
- Implementing automated transcription pipelines
- Editing and correcting automated transcriptions
Consider alternatives when:
- You need audio editing: use a Podcast Ally or Audio Mixer agent
- You need metadata management: use a Podcast Metadata agent
- You need real-time captioning: use a streaming transcription service
Quick Start
```yaml
# .claude/agents/podcast-transcriber.yml
name: Podcast Transcriber Assistant
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
description: Podcast transcription agent for speaker-labeled, timestamped transcripts in multiple output formats
```
Example invocation:
```shell
claude "Transcribe our latest episode, identify speakers, add timestamps every 30 seconds, and format as both SRT subtitles and a clean blog-post-style transcript"
```
Core Concepts
Transcription Pipeline
| Step | Activity | Tool |
|---|---|---|
| Pre-process | Enhance audio, reduce noise | FFmpeg |
| Transcribe | Speech-to-text conversion | Whisper, AssemblyAI, Deepgram |
| Diarize | Identify and label speakers | Pyannote, speaker ID |
| Timestamp | Add time codes | Whisper word-level timestamps |
| Format | Convert to target format | Custom formatting |
| Review | Edit for accuracy | Manual or AI-assisted |
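The pre-processing step can be scripted around FFmpeg. Here is a minimal sketch, assuming `ffmpeg` is on the PATH; the filenames and filter values are illustrative (`afftdn` does the noise reduction, `loudnorm` targets -16 LUFS, and 16 kHz mono is a common STT input format):

```python
import subprocess

def build_preprocess_cmd(src: str, dst: str) -> list[str]:
    """Build an FFmpeg command that high-passes at 80 Hz, denoises,
    normalizes to -16 LUFS, and downsamples to 16 kHz mono for STT."""
    filters = "highpass=f=80,afftdn,loudnorm=I=-16:TP=-1.5"
    return ["ffmpeg", "-y", "-i", src, "-af", filters,
            "-ar", "16000", "-ac", "1", dst]

cmd = build_preprocess_cmd("episode.mp3", "episode_clean.wav")
# subprocess.run(cmd, check=True)  # uncomment to actually invoke FFmpeg
```

Building the command as a list (rather than a shell string) avoids quoting issues when episode filenames contain spaces.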
Output Formats
SRT (Subtitles):
```
1
00:00:01,000 --> 00:00:05,500
[Host] Welcome back to the podcast.

2
00:00:06,000 --> 00:00:10,200
[Guest] Thanks for having me.
```
VTT (Web Video):
```
WEBVTT

00:01.000 --> 00:05.500
<v Host>Welcome back to the podcast.</v>
```
Blog Transcript:
```markdown
**[00:00]** Host: Welcome back to the podcast.
**[00:06]** Guest: Thanks for having me.
```
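Converting between these formats mostly comes down to timestamp arithmetic. A small sketch (function names are illustrative) for turning a time in seconds into the SRT and short-form VTT timecodes shown above:

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timecode: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def vtt_time(seconds: float) -> str:
    """Format seconds as a VTT timecode (MM:SS.mmm under an hour)."""
    t = srt_time(seconds).replace(",", ".")  # VTT uses a dot separator
    return t[3:] if seconds < 3600 else t
```

Note the one real difference: SRT uses a comma before the milliseconds, VTT uses a dot and allows dropping the hours field.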
Configuration
| Parameter | Description | Default |
|---|---|---|
| `transcription_engine` | STT engine (`whisper`, `assemblyai`, `deepgram`) | `whisper` |
| `model_size` | Whisper model size (`tiny`, `base`, `small`, `medium`, `large`) | `medium` |
| `language` | Audio language | `en` |
| `output_formats` | Output formats (`srt`, `vtt`, `txt`, `json`, `blog`) | `srt,txt` |
| `speaker_diarization` | Enable speaker identification | `true` |
Best Practices
- **Pre-process audio before transcription.** Noise reduction and normalization improve transcription accuracy significantly. Apply a high-pass filter at 80 Hz, reduce background noise, and normalize to -16 LUFS before sending to the transcription engine.
- **Use the largest Whisper model your hardware supports.** Larger models produce significantly more accurate transcriptions. `large-v3` is the most accurate but requires 10GB+ VRAM; `medium` is a good balance for most hardware.
- **Always review automated transcriptions before publishing.** No STT engine is 100% accurate. Technical terms, proper nouns, and accented speech have higher error rates. Schedule 15-20 minutes of editing time per hour of audio.
- **Include speaker labels in multi-person transcripts.** Unlabeled transcripts are hard to follow. Use speaker diarization to identify speakers, then map speaker IDs (`SPEAKER_00`) to real names. Consistent labeling helps readers follow the conversation.
- **Generate multiple output formats from a single canonical source.** Transcribe once to a rich format (JSON with timestamps and speakers), then transform to SRT, VTT, plain text, and blog-formatted versions. This ensures consistency across formats.
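The single-canonical-source practice can be sketched as a transform over a list of segment records; the `start`/`speaker`/`text` schema below is an assumed internal format, not any particular engine's output:

```python
def to_blog(segments: list[dict]) -> str:
    """Render canonical segments as a blog-style transcript:
    **[MM:SS]** Speaker: text"""
    lines = []
    for seg in segments:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"**[{m:02d}:{s:02d}]** {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

segments = [
    {"start": 0.0, "speaker": "Host", "text": "Welcome back to the podcast."},
    {"start": 6.0, "speaker": "Guest", "text": "Thanks for having me."},
]
```

An SRT or VTT renderer would walk the same list, so the wording of every format stays in sync after each editing pass.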
Common Issues
Whisper misidentifies speakers or doesn't support diarization natively. Whisper transcribes but doesn't diarize. Combine Whisper's transcription with Pyannote or similar diarization tools. Run diarization separately and merge the results using timestamp alignment.
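One way to do the timestamp-alignment merge, as a sketch under the assumption that both tools report plain start/end seconds: assign each transcript segment the speaker whose diarization turn overlaps it the most.

```python
def assign_speakers(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Label each transcript segment with the diarization turn that
    overlaps it the most. segments: [{"start", "end", "text"}],
    turns: [{"start", "end", "speaker"}]."""
    labeled = []
    for seg in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            # Length of the intersection of the two time intervals
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled
```

Majority-overlap is a simple heuristic; segments that straddle a speaker change may need a manual pass during review.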
Technical terms and jargon are consistently misspelled. Add a custom vocabulary or post-processing dictionary that corrects known misrecognitions. "Kubernetes" might be transcribed as "Kuber Netties"; a simple find-and-replace dictionary fixes these consistently.
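A post-processing dictionary can be as small as this sketch; the entries shown are hypothetical examples to be replaced with your show's recurring misrecognitions:

```python
import re

# Hypothetical correction dictionary; extend with your show's jargon.
CORRECTIONS = {
    "Kuber Netties": "Kubernetes",
    "post gress": "Postgres",
}

def apply_corrections(text: str) -> str:
    """Case-insensitive find-and-replace over known misrecognitions."""
    for wrong, right in CORRECTIONS.items():
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text
```

Run this after transcription but before formatting, so every output format picks up the same corrections.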
Long episodes (2+ hours) run out of memory during transcription. Process long episodes in segments (10-15 minutes each) and concatenate the results. Ensure segments overlap by 5-10 seconds to prevent missing words at boundaries.
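The chunk-with-overlap approach is easy to plan up front. A sketch (the 720-second chunks and 7.5-second overlap are illustrative defaults within the ranges above):

```python
def plan_segments(duration: float, chunk: float = 720.0,
                  overlap: float = 7.5) -> list[tuple[float, float]]:
    """Split `duration` seconds into (start, end) chunks that share
    `overlap` seconds of audio at each boundary."""
    spans, start = [], 0.0
    while start < duration:
        end = min(start + chunk, duration)
        spans.append((start, end))
        if end >= duration:
            break
        start = end - overlap  # back up so no words fall between chunks
    return spans
```

After transcribing each span, drop duplicated words in the overlap window when concatenating (e.g. by comparing word-level timestamps).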