

Agent · Cliptics (ffmpeg clip team) · v1.0.0 · MIT

Podcast Transcriber Assistant

Your agent for transcribing podcast audio to text, covering automated transcription, speaker diarization, timestamp generation, and transcript formatting for multiple output formats.

When to Use This Agent

Choose Podcast Transcriber Assistant when:

  • Transcribing podcast episodes to text with speaker labels
  • Generating timestamped transcripts for accessibility compliance
  • Creating formatted transcripts for show notes or blog posts
  • Implementing automated transcription pipelines
  • Editing and correcting automated transcriptions

Consider alternatives when:

  • You need audio editing: use a Podcast Ally or Audio Mixer agent
  • You need metadata management: use a Podcast Metadata agent
  • You need real-time captioning: use a streaming transcription service

Quick Start

  # .claude/agents/podcast-transcriber.yml
  name: Podcast Transcriber Assistant
  model: claude-sonnet
  tools:
    - Read
    - Write
    - Edit
    - Bash
    - Glob
    - Grep
  description: Podcast transcription agent for speaker-labeled, timestamped transcripts in multiple output formats

Example invocation:

claude "Transcribe our latest episode, identify speakers, add timestamps every 30 seconds, and format as both SRT subtitles and a clean blog-post-style transcript"

Core Concepts

Transcription Pipeline

| Step | Activity | Tool |
|---|---|---|
| Pre-process | Enhance audio, reduce noise | FFmpeg |
| Transcribe | Speech-to-text conversion | Whisper, AssemblyAI, Deepgram |
| Diarize | Identify and label speakers | Pyannote, speaker ID |
| Timestamp | Add time codes | Whisper word-level timestamps |
| Format | Convert to target format | Custom formatting |
| Review | Edit for accuracy | Manual or AI-assisted |
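The first two stages can be sketched as follows. The ffmpeg filter chain (80 Hz high-pass, `afftdn` denoise, `loudnorm` to -16 LUFS) and the 16 kHz mono output are assumptions tuned for speech-to-text, not settings fixed by this template, and the transcribe step is left as a placeholder for whichever engine you configure:

```python
import subprocess

def build_preprocess_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg command for the Pre-process step: high-pass at 80 Hz,
    FFT denoise, loudness-normalize to -16 LUFS, 16 kHz mono output."""
    filters = "highpass=f=80,afftdn,loudnorm=I=-16:TP=-1.5:LRA=11"
    return ["ffmpeg", "-y", "-i", src, "-af", filters,
            "-ar", "16000", "-ac", "1", dst]

def preprocess(src: str, dst: str) -> None:
    """Run ffmpeg; raises CalledProcessError if the command fails."""
    subprocess.run(build_preprocess_cmd(src, dst), check=True)

def transcribe(path: str) -> list[dict]:
    """Placeholder for the Transcribe step (e.g. openai-whisper,
    AssemblyAI, or Deepgram, per the configured engine)."""
    raise NotImplementedError
```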

Output Formats

SRT (Subtitles):
  1
  00:00:01,000 --> 00:00:05,500
  [Host] Welcome back to the podcast.

  2
  00:00:06,000 --> 00:00:10,200
  [Guest] Thanks for having me.

VTT (Web Video):
  WEBVTT

  00:01.000 --> 00:05.500
  <v Host>Welcome back to the podcast.</v>

Blog Transcript:
  **[00:00]** Host: Welcome back to the podcast.

  **[00:06]** Guest: Thanks for having me.
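The formats above can all be generated from one canonical segment list. A minimal sketch, assuming each segment is a dict with `start`/`end` seconds, `speaker`, and `text` (the exact schema is not fixed by this template):

```python
def srt_time(sec: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(sec * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[dict]) -> str:
    """Render canonical segments as SRT subtitle blocks."""
    blocks = [
        f"{i}\n{srt_time(s['start'])} --> {srt_time(s['end'])}\n"
        f"[{s['speaker']}] {s['text']}"
        for i, s in enumerate(segments, 1)
    ]
    return "\n\n".join(blocks)

def to_blog(segments: list[dict]) -> str:
    """Render canonical segments in the blog-transcript style above."""
    lines = []
    for s in segments:
        m, sec = divmod(int(s["start"]), 60)
        lines.append(f"**[{m:02d}:{sec:02d}]** {s['speaker']}: {s['text']}")
    return "\n\n".join(lines)
```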

Configuration

| Parameter | Description | Default |
|---|---|---|
| transcription_engine | STT engine (whisper, assemblyai, deepgram) | whisper |
| model_size | Whisper model size (tiny, base, small, medium, large) | medium |
| language | Audio language | en |
| output_formats | Output formats (srt, vtt, txt, json, blog) | srt,txt |
| speaker_diarization | Enable speaker identification | true |
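Put together, a config using these parameters might look like the fragment below. This is a sketch: the template doesn't show a full config file, so the file shape (including representing output_formats as a YAML list) is an assumption.

```yaml
# Hypothetical config illustrating the parameters in the table above
transcription_engine: whisper
model_size: medium
language: en
output_formats:
  - srt
  - txt
speaker_diarization: true
```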

Best Practices

  1. Pre-process audio before transcription. Noise reduction and normalization improve transcription accuracy significantly. Apply a high-pass filter at 80Hz, reduce background noise, and normalize to -16 LUFS before sending to the transcription engine.

  2. Use the largest Whisper model your hardware supports. Larger models produce significantly more accurate transcriptions. large-v3 is the most accurate but requires 10GB+ VRAM. medium is a good balance for most hardware.

  3. Always review automated transcriptions before publishing. No STT engine is 100% accurate. Technical terms, proper nouns, and accented speech have higher error rates. Schedule 15-20 minutes of editing time per hour of audio.

  4. Include speaker labels in multi-person transcripts. Unlabeled transcripts are hard to follow. Use speaker diarization to identify speakers, then map speaker IDs (SPEAKER_00) to real names. Consistent labeling helps readers follow the conversation.

  5. Generate multiple output formats from a single canonical source. Transcribe once to a rich format (JSON with timestamps and speakers), then transform to SRT, VTT, plain text, and blog-formatted versions. This ensures consistency across formats.
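The ID-to-name mapping from practice 4 can be sketched as a small helper; the segment schema and the name map are assumptions for illustration:

```python
def relabel_speakers(segments: list[dict], names: dict[str, str]) -> list[dict]:
    """Replace diarization IDs (e.g. SPEAKER_00) with real names.
    IDs missing from the map are kept as-is so gaps stay visible for review."""
    return [{**s, "speaker": names.get(s["speaker"], s["speaker"])} for s in segments]
```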

Common Issues

Whisper doesn't support speaker diarization natively. Whisper transcribes speech to text but doesn't identify who is speaking. Combine Whisper's transcription with Pyannote or a similar diarization tool: run diarization separately and merge the results using timestamp alignment.
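One simple way to do the timestamp-alignment merge: assign each transcribed segment the diarization turn that overlaps the segment's midpoint. A sketch assuming plain dicts for both inputs (not Pyannote's actual object types):

```python
def assign_speakers(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Merge STT segments with diarization turns: each segment gets the
    speaker whose turn contains the segment's midpoint, else UNKNOWN."""
    merged = []
    for seg in segments:
        mid = (seg["start"] + seg["end"]) / 2
        speaker = next(
            (t["speaker"] for t in turns if t["start"] <= mid < t["end"]),
            "UNKNOWN",
        )
        merged.append({**seg, "speaker": speaker})
    return merged
```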

Technical terms and jargon are consistently misspelled. Add a custom vocabulary or post-processing dictionary that corrects known misrecognitions. "Kubernetes" might be transcribed as "Kuber Netties"; a simple find-and-replace dictionary fixes these consistently.
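Such a dictionary can be as simple as case-insensitive find-and-replace. A sketch (the "cube CTL" entry is a hypothetical example, not from this template):

```python
import re

# Known misrecognitions -> corrections ("cube CTL" is hypothetical)
VOCAB_FIXES = {
    "Kuber Netties": "Kubernetes",
    "cube CTL": "kubectl",
}

def apply_vocab(text: str, fixes: dict[str, str] = VOCAB_FIXES) -> str:
    """Apply each known correction case-insensitively."""
    for wrong, right in fixes.items():
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text
```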

Long episodes (2+ hours) run out of memory during transcription. Process long episodes in segments (10-15 minutes each) and concatenate the results. Ensure segments overlap by 5-10 seconds to prevent missing words at boundaries.
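The windowing math can be sketched as a segment plan; the chunk length and overlap defaults below just reflect the ranges suggested above. The actual splitting would be done with ffmpeg (-ss/-t per span), and overlapping text deduplicated when concatenating:

```python
def plan_segments(duration_s: float, chunk_s: float = 900.0,
                  overlap_s: float = 7.5) -> list[tuple[float, float]]:
    """Return (start, end) windows covering the audio, each starting
    overlap_s seconds before the previous window ends, so no words are
    lost at segment boundaries."""
    spans, start = [], 0.0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s - overlap_s
    return spans
```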
