

Agent · Cliptics (ffmpeg clip team) · v1.0.0 · MIT

Podcast Transcriber Assistant

Your agent for transcribing podcast audio to text, covering automated transcription, speaker diarization, timestamp generation, and transcript formatting for multiple output formats.

When to Use This Agent

Choose Podcast Transcriber Assistant when:

  • Transcribing podcast episodes to text with speaker labels
  • Generating timestamped transcripts for accessibility compliance
  • Creating formatted transcripts for show notes or blog posts
  • Implementing automated transcription pipelines
  • Editing and correcting automated transcriptions

Consider alternatives when:

  • You need audio editing: use a Podcast Ally or Audio Mixer agent
  • You need metadata management: use a Podcast Metadata agent
  • You need real-time captioning: use a streaming transcription service

Quick Start

  # .claude/agents/podcast-transcriber.yml
  name: Podcast Transcriber Assistant
  model: claude-sonnet
  tools:
    - Read
    - Write
    - Edit
    - Bash
    - Glob
    - Grep
  description: Podcast transcription agent for speaker-labeled, timestamped transcripts in multiple output formats

Example invocation:

claude "Transcribe our latest episode, identify speakers, add timestamps every 30 seconds, and format as both SRT subtitles and a clean blog-post-style transcript"

Core Concepts

Transcription Pipeline

| Step | Activity | Tool |
|---|---|---|
| Pre-process | Enhance audio, reduce noise | FFmpeg |
| Transcribe | Speech-to-text conversion | Whisper, AssemblyAI, Deepgram |
| Diarize | Identify and label speakers | Pyannote, speaker ID |
| Timestamp | Add time codes | Whisper word-level timestamps |
| Format | Convert to target format | Custom formatting |
| Review | Edit for accuracy | Manual or AI-assisted |
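The first two stages can be sketched as follows. The ffmpeg filter chain (80 Hz high-pass, `afftdn` denoise, `loudnorm` to -16 LUFS) and the 16 kHz mono output are assumptions tuned for speech-to-text, not settings fixed by this template, and the transcribe step is left as a placeholder for whichever engine you configure:

```python
import subprocess

def build_preprocess_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg command for the Pre-process step: high-pass at 80 Hz,
    FFT denoise, loudness-normalize to -16 LUFS, 16 kHz mono output."""
    filters = "highpass=f=80,afftdn,loudnorm=I=-16:TP=-1.5:LRA=11"
    return ["ffmpeg", "-y", "-i", src, "-af", filters,
            "-ar", "16000", "-ac", "1", dst]

def preprocess(src: str, dst: str) -> None:
    """Run ffmpeg; raises CalledProcessError if the command fails."""
    subprocess.run(build_preprocess_cmd(src, dst), check=True)

def transcribe(path: str) -> list[dict]:
    """Placeholder for the Transcribe step (e.g. openai-whisper,
    AssemblyAI, or Deepgram, per the configured engine)."""
    raise NotImplementedError
```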

Output Formats

SRT (Subtitles):
  1
  00:00:01,000 --> 00:00:05,500
  [Host] Welcome back to the podcast.

  2
  00:00:06,000 --> 00:00:10,200
  [Guest] Thanks for having me.

VTT (Web Video):
  WEBVTT

  00:01.000 --> 00:05.500
  <v Host>Welcome back to the podcast.</v>

Blog Transcript:
  **[00:00]** Host: Welcome back to the podcast.

  **[00:06]** Guest: Thanks for having me.
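The formats above can all be generated from one canonical segment list. A minimal sketch, assuming each segment is a dict with `start`/`end` seconds, `speaker`, and `text` (the exact schema is not fixed by this template):

```python
def srt_time(sec: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(sec * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[dict]) -> str:
    """Render canonical segments as SRT subtitle blocks."""
    blocks = [
        f"{i}\n{srt_time(s['start'])} --> {srt_time(s['end'])}\n"
        f"[{s['speaker']}] {s['text']}"
        for i, s in enumerate(segments, 1)
    ]
    return "\n\n".join(blocks)

def to_blog(segments: list[dict]) -> str:
    """Render canonical segments in the blog-transcript style above."""
    lines = []
    for s in segments:
        m, sec = divmod(int(s["start"]), 60)
        lines.append(f"**[{m:02d}:{sec:02d}]** {s['speaker']}: {s['text']}")
    return "\n\n".join(lines)
```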

Configuration

| Parameter | Description | Default |
|---|---|---|
| transcription_engine | STT engine (whisper, assemblyai, deepgram) | whisper |
| model_size | Whisper model size (tiny, base, small, medium, large) | medium |
| language | Audio language | en |
| output_formats | Output formats (srt, vtt, txt, json, blog) | srt,txt |
| speaker_diarization | Enable speaker identification | true |
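Put together, a config using these parameters might look like the fragment below. This is a sketch: the template doesn't show a full config file, so the file shape (including representing output_formats as a YAML list) is an assumption.

```yaml
# Hypothetical config illustrating the parameters in the table above
transcription_engine: whisper
model_size: medium
language: en
output_formats:
  - srt
  - txt
speaker_diarization: true
```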

Best Practices

  1. Pre-process audio before transcription. Noise reduction and normalization improve transcription accuracy significantly. Apply a high-pass filter at 80Hz, reduce background noise, and normalize to -16 LUFS before sending to the transcription engine.

  2. Use the largest Whisper model your hardware supports. Larger models produce significantly more accurate transcriptions. large-v3 is the most accurate but requires 10GB+ VRAM. medium is a good balance for most hardware.

  3. Always review automated transcriptions before publishing. No STT engine is 100% accurate. Technical terms, proper nouns, and accented speech have higher error rates. Schedule 15-20 minutes of editing time per hour of audio.

  4. Include speaker labels in multi-person transcripts. Unlabeled transcripts are hard to follow. Use speaker diarization to identify speakers, then map speaker IDs (SPEAKER_00) to real names. Consistent labeling helps readers follow the conversation.

  5. Generate multiple output formats from a single canonical source. Transcribe once to a rich format (JSON with timestamps and speakers), then transform to SRT, VTT, plain text, and blog-formatted versions. This ensures consistency across formats.
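The ID-to-name mapping from practice 4 can be sketched as a small helper; the segment schema and the name map are assumptions for illustration:

```python
def relabel_speakers(segments: list[dict], names: dict[str, str]) -> list[dict]:
    """Replace diarization IDs (e.g. SPEAKER_00) with real names.
    IDs missing from the map are kept as-is so gaps stay visible for review."""
    return [{**s, "speaker": names.get(s["speaker"], s["speaker"])} for s in segments]
```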

Common Issues

Whisper doesn't support speaker diarization natively. Whisper transcribes speech to text but doesn't identify who is speaking. Combine Whisper's transcription with Pyannote or a similar diarization tool: run diarization separately and merge the results using timestamp alignment.
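One simple way to do the timestamp-alignment merge: assign each transcribed segment the diarization turn that overlaps the segment's midpoint. A sketch assuming plain dicts for both inputs (not Pyannote's actual object types):

```python
def assign_speakers(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Merge STT segments with diarization turns: each segment gets the
    speaker whose turn contains the segment's midpoint, else UNKNOWN."""
    merged = []
    for seg in segments:
        mid = (seg["start"] + seg["end"]) / 2
        speaker = next(
            (t["speaker"] for t in turns if t["start"] <= mid < t["end"]),
            "UNKNOWN",
        )
        merged.append({**seg, "speaker": speaker})
    return merged
```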

Technical terms and jargon are consistently misspelled. Add a custom vocabulary or post-processing dictionary that corrects known misrecognitions. "Kubernetes" might be transcribed as "Kuber Netties"; a simple find-and-replace dictionary fixes these consistently.
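Such a dictionary can be as simple as case-insensitive find-and-replace. A sketch (the "cube CTL" entry is a hypothetical example, not from this template):

```python
import re

# Known misrecognitions -> corrections ("cube CTL" is hypothetical)
VOCAB_FIXES = {
    "Kuber Netties": "Kubernetes",
    "cube CTL": "kubectl",
}

def apply_vocab(text: str, fixes: dict[str, str] = VOCAB_FIXES) -> str:
    """Apply each known correction case-insensitively."""
    for wrong, right in fixes.items():
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text
```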

Long episodes (2+ hours) run out of memory during transcription. Process long episodes in segments (10-15 minutes each) and concatenate the results. Ensure segments overlap by 5-10 seconds to prevent missing words at boundaries.
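The windowing math can be sketched as a segment plan; the chunk length and overlap defaults below just reflect the ranges suggested above. The actual splitting would be done with ffmpeg (-ss/-t per span), and overlapping text deduplicated when concatenating:

```python
def plan_segments(duration_s: float, chunk_s: float = 900.0,
                  overlap_s: float = 7.5) -> list[tuple[float, float]]:
    """Return (start, end) windows covering the audio, each starting
    overlap_s seconds before the previous window ends, so no words are
    lost at segment boundaries."""
    spans, start = [], 0.0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s - overlap_s
    return spans
```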
