
Advanced Transcribe

A powerful skill for transcribing audio files to text. Includes structured workflows, validation checks, and reusable patterns for media.

Skill · Cliptics · media · v1.0.0 · MIT


A practical skill for transcribing audio to text — covering speech-to-text processing with OpenAI Whisper, speaker diarization, timestamp generation, multi-language support, and batch transcription workflows for meetings, interviews, and media content.

When to Use This Skill

Choose Advanced Transcribe when you need to:

  • Transcribe audio recordings to text (meetings, interviews, podcasts)
  • Generate timestamped transcripts for video subtitles
  • Identify and label different speakers in a recording
  • Transcribe audio in multiple languages
  • Process large batches of audio files automatically

Consider alternatives when:

  • You need text-to-speech generation (use a speech synthesis skill)
  • You need real-time live transcription (use a live captioning tool)
  • You need audio editing or enhancement (use an audio processing skill)

Quick Start

```python
# Install dependencies:
#   pip install openai

from openai import OpenAI

client = OpenAI()

def transcribe(audio_path, response_format="text"):
    """Transcribe an audio file to text."""
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format=response_format,
        )
    return transcript

# Simple text transcription
text = transcribe("meeting.mp3")
print(text)

# With timestamps (SRT format for subtitles)
srt = transcribe("meeting.mp3", response_format="srt")
with open("meeting.srt", "w") as f:
    f.write(srt)

# Verbose JSON with segment-level timestamps
result = transcribe("meeting.mp3", response_format="verbose_json")
for segment in result.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```

Core Concepts

Transcription Output Formats

| Format | Content | Use Case |
| --- | --- | --- |
| `text` | Plain text only | Quick transcription |
| `json` | Text with metadata | Application integration |
| `verbose_json` | Segments with timestamps | Detailed analysis |
| `srt` | SubRip subtitle format | Video subtitles |
| `vtt` | WebVTT subtitle format | Web video players |
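When the API's built-in `srt` output is not available (for example, when you already have `verbose_json` segments in hand), subtitle files can be assembled from segment timestamps directly. This is a hedged sketch, assuming segments are `(start, end, text)` tuples in seconds:

```python
# Sketch: build an SRT subtitle file from verbose_json-style segments.

def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render (start, end, text) segments as numbered SRT cues."""
    cues = []
    for i, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(cues)
```

The same timestamp helper works for WebVTT, which uses `.` instead of `,` before the milliseconds.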

Speaker Diarization

```python
# Speaker diarization with pyannote:
#   pip install pyannote.audio

from pyannote.audio import Pipeline

def diarize_audio(audio_path):
    """Identify speakers in audio."""
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="YOUR_HF_TOKEN",
    )
    diarization = pipeline(audio_path)
    speakers = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        speakers.append({
            "speaker": speaker,
            "start": round(turn.start, 2),
            "end": round(turn.end, 2),
        })
    return speakers

# Combine with Whisper transcription (uses transcribe() from Quick Start)
def transcribe_with_speakers(audio_path):
    transcript = transcribe(audio_path, response_format="verbose_json")
    speakers = diarize_audio(audio_path)
    labeled = []
    for segment in transcript.segments:
        # Match each segment to the speaker active at its midpoint
        mid_time = (segment.start + segment.end) / 2
        speaker = "Unknown"
        for s in speakers:
            if s["start"] <= mid_time <= s["end"]:
                speaker = s["speaker"]
                break
        labeled.append({
            "speaker": speaker,
            "start": segment.start,
            "end": segment.end,
            "text": segment.text,
        })
    return labeled
```

Batch Processing

```python
from pathlib import Path
import json

def batch_transcribe(input_dir, output_dir, response_format="text"):
    """Transcribe all audio files in a directory.

    Uses transcribe() from the Quick Start section.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    supported = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".webm"}
    results = []
    for audio_path in sorted(Path(input_dir).iterdir()):
        if audio_path.suffix.lower() not in supported:
            continue
        print(f"Transcribing: {audio_path.name}")
        transcript = transcribe(str(audio_path), response_format)
        out_file = Path(output_dir) / f"{audio_path.stem}.txt"
        out_file.write_text(
            transcript if isinstance(transcript, str)
            else json.dumps(transcript, indent=2)
        )
        results.append({"file": audio_path.name, "output": str(out_file)})
    print(f"\nTranscribed {len(results)} files → {output_dir}")
    return results

batch_transcribe("./recordings", "./transcripts")
```

Configuration

| Parameter | Description | Example |
| --- | --- | --- |
| `model` | Whisper model version | `"whisper-1"` |
| `language` | Source audio language (ISO 639-1) | `"en"` (omit to auto-detect) |
| `response_format` | Output format | `"text"` / `"srt"` |
| `temperature` | Sampling temperature for decoding | `0.0` (deterministic) |
| `prompt` | Optional context hint for accuracy | `"Meeting about Q4 budget"` |
| `diarize` | Enable speaker identification | `true` |
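Most of these parameters map directly onto keyword arguments of `client.audio.transcriptions.create`; `diarize` is handled by the skill itself (see Speaker Diarization), not the Whisper API. A small helper — a sketch, with the function name being an illustrative choice — can assemble the call while omitting unset optional parameters:

```python
# Sketch: assemble keyword arguments for a transcription call from the
# configuration table above. Optional parameters are dropped when unset,
# so the API's own defaults (e.g. language auto-detection) apply.

def build_transcribe_kwargs(model="whisper-1", response_format="text",
                            language=None, prompt=None, temperature=None):
    """Return a kwargs dict for client.audio.transcriptions.create()."""
    kwargs = {"model": model, "response_format": response_format}
    if language is not None:
        kwargs["language"] = language        # ISO 639-1 code, e.g. "en"
    if prompt is not None:
        kwargs["prompt"] = prompt            # domain vocabulary hint
    if temperature is not None:
        kwargs["temperature"] = temperature  # 0.0 for deterministic decoding
    return kwargs
```

Usage: `client.audio.transcriptions.create(file=f, **build_transcribe_kwargs(language="en", temperature=0.0))`.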

Best Practices

  1. Provide a prompt hint for domain-specific terms — Whisper can misrecognize technical terms, product names, and acronyms. Pass a prompt parameter with key terms: prompt="Discussion about Kubernetes, kubectl, and EKS cluster management". This dramatically improves accuracy for specialized vocabulary.

  2. Pre-process audio for better results — Normalize volume, remove background noise, and convert to 16kHz mono WAV before transcription. Clean audio produces significantly more accurate transcriptions. Use ffmpeg: ffmpeg -i input.mp3 -ac 1 -ar 16000 clean.wav.

  3. Use verbose_json format when you need timestamps — The verbose_json format gives segment-level timestamps that you can use for subtitle generation, speaker labeling, and audio-to-text alignment. Plain text format loses all timing information.

  4. Break long audio files into chunks — Whisper has a 25MB file size limit per API call. For longer recordings, split the audio into chunks at silence points using ffmpeg or pydub, transcribe each chunk, then merge the results with adjusted timestamps.

  5. Validate transcriptions of critical content — AI transcription is not perfect, especially for accented speech, multiple overlapping speakers, or poor audio quality. For legal, medical, or compliance-critical content, always have a human review the transcript.
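The chunk-and-merge workflow from point 4 can be sketched as follows. Splitting itself would be done with ffmpeg or pydub as noted above; this hedged example covers the merge step, offsetting each chunk's segment timestamps by the running duration of the chunks before it so the merged transcript sits on one continuous timeline:

```python
# Sketch: merge per-chunk transcription segments into one timeline.
# Assumes each chunk was transcribed with verbose_json and its segments
# were collected as (start, end, text) tuples relative to the chunk's
# own start.

def merge_chunk_segments(chunks):
    """chunks: list of (chunk_duration_seconds, segments) pairs, in order."""
    merged = []
    offset = 0.0
    for duration, segments in chunks:
        for start, end, text in segments:
            merged.append((start + offset, end + offset, text))
        offset += duration  # the next chunk starts where this one ended
    return merged
```

Splitting at silence points (rather than fixed intervals) avoids cutting words in half, but the merge logic is the same either way as long as you record each chunk's actual duration.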

Common Issues

Transcription accuracy drops with background noise — Background music, HVAC noise, and keyboard typing significantly reduce accuracy. Pre-process with noise reduction: ffmpeg -i input.mp3 -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean.mp3. The highpass/lowpass filters focus on the speech frequency range.
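For batch pipelines, the filter chain above can be wrapped in Python. A sketch, assuming `ffmpeg` is installed and on `PATH` (the command-builder is separated from execution so the argv can be inspected or logged first):

```python
import subprocess

# Sketch: wrap the ffmpeg noise-reduction filter chain shown above.
# highpass/lowpass trim frequencies outside the speech range; afftdn
# applies FFT-based denoising.

SPEECH_FILTERS = "highpass=f=200,lowpass=f=3000,afftdn=nf=-25"

def build_denoise_cmd(src, dst, filters=SPEECH_FILTERS):
    """Construct the ffmpeg argv without running it."""
    return ["ffmpeg", "-y", "-i", src, "-af", filters, dst]

def denoise(src, dst):
    """Run the noise-reduction pass; raises on ffmpeg failure."""
    subprocess.run(build_denoise_cmd(src, dst), check=True)
```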

Speaker diarization assigns wrong labels — Diarization models work best with clear turn-taking. Overlapping speech, interruptions, and similar-sounding voices cause errors. For meetings, ask participants to identify themselves at the start, and cross-reference diarization labels with the content.

Timestamps drift in long recordings — Whisper processes audio in 30-second chunks, and small timing errors accumulate over long files. For precise timestamp alignment (subtitles, synced playback), use the verbose_json format and recalibrate timestamps by matching known anchor points in the audio.
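The anchor-point recalibration can be sketched as a linear correction: pick two moments whose true times you can verify (a spoken time cue, a scene change), note the observed timestamps, and map observed onto true. This is an illustrative sketch; the function names are assumptions, and segments are `(start, end, text)` tuples:

```python
# Sketch: recalibrate drifting timestamps using two known anchor points.
# A linear map corrects both a constant offset and accumulated drift.

def linear_correction(anchor_a, anchor_b):
    """Each anchor is (observed_seconds, true_seconds); returns a mapper."""
    (o1, t1), (o2, t2) = anchor_a, anchor_b
    scale = (t2 - t1) / (o2 - o1)
    offset = t1 - scale * o1
    return lambda t: scale * t + offset

def recalibrate(segments, anchor_a, anchor_b):
    """Apply the correction to every segment's start and end."""
    fix = linear_correction(anchor_a, anchor_b)
    return [(fix(start), fix(end), text) for start, end, text in segments]
```

Two anchors suffice for a linear fit; for very long recordings, applying the correction piecewise between several anchors handles drift that is not uniform.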
