Advanced Transcribe
A powerful skill for transcribing audio files to text. Includes structured workflows, validation checks, and reusable patterns for media.
A practical skill for transcribing audio to text — covering speech-to-text processing with OpenAI Whisper, speaker diarization, timestamp generation, multi-language support, and batch transcription workflows for meetings, interviews, and media content.
When to Use This Skill
Choose Advanced Transcribe when you need to:
- Transcribe audio recordings to text (meetings, interviews, podcasts)
- Generate timestamped transcripts for video subtitles
- Identify and label different speakers in a recording
- Transcribe audio in multiple languages
- Process large batches of audio files automatically
Consider alternatives when:
- You need text-to-speech generation (use a speech synthesis skill)
- You need real-time live transcription (use a live captioning tool)
- You need audio editing or enhancement (use an audio processing skill)
Quick Start
```bash
# Install dependencies
pip install openai
```
```python
from openai import OpenAI

client = OpenAI()

def transcribe(audio_path, response_format="text"):
    """Transcribe an audio file to text."""
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format=response_format,
        )
    return transcript

# Simple text transcription
text = transcribe("meeting.mp3")
print(text)

# With timestamps (SRT format for subtitles)
srt = transcribe("meeting.mp3", response_format="srt")
with open("meeting.srt", "w") as f:
    f.write(srt)

# Verbose JSON with segment-level timestamps
result = transcribe("meeting.mp3", response_format="verbose_json")
for segment in result.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```
Core Concepts
Transcription Output Formats
| Format | Content | Use Case |
|---|---|---|
| `text` | Plain text only | Quick transcription |
| `json` | Text with metadata | Application integration |
| `verbose_json` | Segments with timestamps | Detailed analysis |
| `srt` | SubRip subtitle format | Video subtitles |
| `vtt` | WebVTT subtitle format | Web video players |
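If you already have timestamped segments (e.g. from `verbose_json`) and want SRT output without a second API call, the conversion can be sketched in a few lines. `to_srt` is a hypothetical helper, not part of any API:

```python
def to_srt(segments):
    """Convert (start, end, text) tuples into SRT subtitle text."""
    def ts(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text.strip()}\n")
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello world")]))
```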
Speaker Diarization
```python
# Speaker diarization with pyannote
from pyannote.audio import Pipeline

def diarize_audio(audio_path):
    """Identify who speaks when in an audio file."""
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="YOUR_HF_TOKEN",
    )
    diarization = pipeline(audio_path)
    speakers = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        speakers.append({
            "speaker": speaker,
            "start": round(turn.start, 2),
            "end": round(turn.end, 2),
        })
    return speakers

# Combine with Whisper transcription
def transcribe_with_speakers(audio_path):
    transcript = transcribe(audio_path, response_format="verbose_json")
    speakers = diarize_audio(audio_path)
    labeled = []
    for segment in transcript.segments:
        # Attribute each segment to the speaker active at its midpoint
        mid_time = (segment.start + segment.end) / 2
        speaker = "Unknown"
        for s in speakers:
            if s["start"] <= mid_time <= s["end"]:
                speaker = s["speaker"]
                break
        labeled.append({
            "speaker": speaker,
            "start": segment.start,
            "end": segment.end,
            "text": segment.text,
        })
    return labeled
```
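Speaker-labeled segments are often easier to read when consecutive segments from the same speaker are collapsed into turns. A minimal sketch, assuming the segment dicts produced above; `merge_speaker_turns` is a hypothetical helper:

```python
def merge_speaker_turns(labeled_segments):
    """Collapse consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in labeled_segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker continues: extend the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"].strip()
        else:
            turns.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
            })
    return turns
```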
Batch Processing
```python
from pathlib import Path
import json

def batch_transcribe(input_dir, output_dir, response_format="text"):
    """Transcribe all audio files in a directory.

    Reuses the transcribe() helper from Quick Start.
    """
    Path(output_dir).mkdir(exist_ok=True)
    supported = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".webm"}
    results = []
    for audio_path in sorted(Path(input_dir).iterdir()):
        if audio_path.suffix.lower() not in supported:
            continue
        print(f"Transcribing: {audio_path.name}")
        transcript = transcribe(str(audio_path), response_format)
        out_file = Path(output_dir) / f"{audio_path.stem}.txt"
        out_file.write_text(
            transcript if isinstance(transcript, str) else json.dumps(transcript, indent=2)
        )
        results.append({"file": audio_path.name, "output": str(out_file)})
    print(f"\nTranscribed {len(results)} files → {output_dir}")
    return results

batch_transcribe("./recordings", "./transcripts")
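Long batch runs can fail partway through; one way to make them resumable is to skip files that already have a transcript. A sketch, assuming the same output naming convention as the batch function above; `pending_files` is a hypothetical helper:

```python
from pathlib import Path

SUPPORTED = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".webm"}

def pending_files(input_dir, output_dir):
    """Return audio files in input_dir that don't yet have a .txt transcript."""
    out = Path(output_dir)
    return [
        p for p in sorted(Path(input_dir).iterdir())
        if p.suffix.lower() in SUPPORTED and not (out / f"{p.stem}.txt").exists()
    ]
```

Calling this before the batch loop means a rerun only pays for files that failed the first time.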
Configuration
| Parameter | Description | Example |
|---|---|---|
| `model` | Whisper model version | `"whisper-1"` |
| `language` | Source audio language (ISO 639-1); omit to auto-detect | `"en"` |
| `response_format` | Output format | `"text"` / `"srt"` |
| `temperature` | Sampling temperature for decoding | `0.0` (deterministic) |
| `prompt` | Optional context hint for accuracy | `"Meeting about Q4 budget"` |
| `diarize` | Enable speaker identification | `true` |
Best Practices
- Provide a prompt hint for domain-specific terms — Whisper can misrecognize technical terms, product names, and acronyms. Pass a `prompt` parameter with key terms: `prompt="Discussion about Kubernetes, kubectl, and EKS cluster management"`. This dramatically improves accuracy for specialized vocabulary.
- Pre-process audio for better results — Normalize volume, remove background noise, and convert to 16 kHz mono WAV before transcription. Clean audio produces significantly more accurate transcriptions. Use ffmpeg: `ffmpeg -i input.mp3 -ac 1 -ar 16000 clean.wav`.
- Use verbose_json format when you need timestamps — The `verbose_json` format gives segment-level timestamps that you can use for subtitle generation, speaker labeling, and audio-to-text alignment. Plain text format loses all timing information.
- Break long audio files into chunks — Whisper has a 25 MB file size limit per API call. For longer recordings, split the audio into chunks at silence points using ffmpeg or pydub, transcribe each chunk, then merge the results with adjusted timestamps.
- Validate transcriptions of critical content — AI transcription is not perfect, especially for accented speech, multiple overlapping speakers, or poor audio quality. For legal, medical, or compliance-critical content, always have a human review the transcript.
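The chunking advice above leaves the merge step implicit: each chunk's segment timestamps are relative to that chunk, so the chunk's start offset must be added back before the results are stitched together. A minimal sketch, assuming each chunk's offset in seconds is known from the split; `merge_chunk_segments` is a hypothetical helper:

```python
def merge_chunk_segments(chunks):
    """Merge per-chunk segment lists into one absolute-time list.

    chunks: list of (chunk_offset_seconds, segments), where each segment is a
    dict with "start", "end", and "text" relative to its chunk.
    """
    merged = []
    for offset, segments in chunks:
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,  # shift into absolute time
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```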
Common Issues
Transcription accuracy drops with background noise — Background music, HVAC noise, and keyboard typing significantly reduce accuracy. Pre-process with noise reduction: `ffmpeg -i input.mp3 -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean.mp3`. The highpass/lowpass filters focus on the speech frequency range.
Speaker diarization assigns wrong labels — Diarization models work best with clear turn-taking. Overlapping speech, interruptions, and similar-sounding voices cause errors. For meetings, ask participants to identify themselves at the start, and cross-reference diarization labels with the content.
Timestamps drift in long recordings — Whisper processes audio in 30-second chunks, and small timing errors accumulate over long files. For precise timestamp alignment (subtitles, synced playback), use the `verbose_json` format and recalibrate timestamps by matching known anchor points in the audio.
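The anchor-point recalibration mentioned above can be sketched as a piecewise-linear remapping: given (observed, true) timestamp pairs at known points in the audio, timestamps between anchors are interpolated, and timestamps outside the anchor range are shifted by the nearest anchor's offset. `recalibrate` is a hypothetical helper:

```python
def recalibrate(t, anchors):
    """Remap an observed timestamp t using (observed, true) anchor pairs."""
    anchors = sorted(anchors)
    if t <= anchors[0][0]:
        # Before the first anchor: apply its constant offset
        obs, true = anchors[0]
        return t + (true - obs)
    if t >= anchors[-1][0]:
        # After the last anchor: apply its constant offset
        obs, true = anchors[-1]
        return t + (true - obs)
    for (o1, t1), (o2, t2) in zip(anchors, anchors[1:]):
        if o1 <= t <= o2:
            # Linear interpolation between the surrounding anchors
            frac = (t - o1) / (o2 - o1)
            return t1 + frac * (t2 - t1)
```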