Streamlining Audio Processing: A Deep Dive into Python-Based Transcription and Diarization
In today's digital age, the ability to efficiently process and analyze audio content has become increasingly important. Whether you're a podcaster, researcher, or content creator, having a robust system for transcribing and diarizing audio can save countless hours of manual work. In this post, we'll explore a Python script that combines various libraries and techniques to automate the process of audio transcription and speaker diarization.
At the heart of this script is the integration of several powerful libraries:
- MoviePy for handling video files
- Pydub for audio manipulation
- Whisper for speech recognition
- Pyannote for speaker diarization
Let's break down some of the key components and concepts implemented in this script.
Audio Preprocessing:
Before we can transcribe or diarize audio, it's often necessary to preprocess the file. This script includes a function to add a short silence at the beginning of the audio and convert it to WAV format:
from pydub import AudioSegment

def preprocess_audio(audio_filename, temp):
    # Prepend a short (two-second) silence, then export the result as WAV.
    spacermilli = 2000
    spacer = AudioSegment.silent(duration=spacermilli)
    audio = AudioSegment.from_mp3(audio_filename)
    audio = spacer.append(audio, crossfade=0)
    audio.export(f'{temp}/audio.wav', format='wav')
    return f'{temp}/audio.wav'
This preprocessing step ensures consistency in the audio format and can help improve the accuracy of subsequent processing steps.
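As a quick usage sketch (the input file name is just a placeholder), the helper takes an MP3 path and a working directory and returns the path of the normalized WAV:

import tempfile

temp_dir = tempfile.mkdtemp()                            # working directory for intermediate files
wav_path = preprocess_audio("interview.mp3", temp_dir)   # hypothetical input file
print(wav_path)                                          # -> <temp_dir>/audio.wav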
Speaker Diarization:
One of the most interesting aspects of this script is its implementation of speaker diarization using the Pyannote library. Speaker diarization is the process of partitioning an audio stream into segments according to the identity of each speaker. Here's how it's implemented:
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

def diarize_audio(audio_path):
    # Load the pretrained diarization pipeline and run it with a progress hook.
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="....")
    with ProgressHook() as hook:
        diarization = pipeline(audio_path, hook=hook)
    return diarization
This function utilizes a pre-trained model from Pyannote to perform speaker diarization on the input audio file. The result is a timeline of speaker segments that can be used to split the audio for individual transcription.
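The segmentation step shown next reads the diarization results back from a text file (diarization.txt). That glue code is not reproduced in this post, but a minimal sketch of it could look like the following, relying on the fact that Pyannote's annotation object prints one line per speaker turn with HH:MM:SS.mmm timestamps and the speaker label as the last token, which is the format the segmentation regex expects:

def save_diarization(diarization, temp):
    # Each line looks like "[ 00:00:02.000 -->  00:00:05.340] A SPEAKER_00".
    with open(f'{temp}/diarization.txt', 'w') as f:
        f.write(str(diarization))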
Audio Segmentation:
After diarization, the script segments the audio based on speaker changes:
import re

def segment_audio(audio_path, temp):
    audio = AudioSegment.from_mp3(audio_path)
    spacermilli = 0
    spacer = AudioSegment.silent(duration=spacermilli)
    sounds = spacer
    segments = []
    previous_speaker = None
    speaker_count = {}
    file_names = []
    # Each diarization line holds a start/end timestamp pair and the
    # speaker label as its last token.
    dz = open(f'{temp}/diarization.txt').read().splitlines()
    for l in dz:
        start, end = tuple(re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=l))
        start = int(millisec(start))  # millisec() converts "HH:MM:SS.mmm" to milliseconds
        end = int(millisec(end))
        current_speaker = l.split()[-1]
        if current_speaker != previous_speaker:
            # Speaker changed: flush the buffered audio to its own WAV file
            # (only if it is longer than one second), then start a new buffer.
            if len(sounds) > len(spacer) and len(sounds) / 1000 > 1:
                speaker_count[previous_speaker] = speaker_count.get(previous_speaker, 0) + 1
                sounds.export(f"{temp}/{previous_speaker}_{speaker_count[previous_speaker]}.wav", format="wav")
                file_names.append(f"{previous_speaker}_{speaker_count[previous_speaker]}.wav")
                segments.append(len(sounds))
            sounds = spacer
        sounds = sounds.append(audio[start:end], crossfade=0)
        sounds = sounds.append(spacer, crossfade=0)
        previous_speaker = current_speaker
    # ... (handling the last segment)
    return file_names
This function reads the diarization results, splits the audio into segments for each speaker, and saves them as separate files. This approach allows for more accurate transcription by processing each speaker's audio independently.
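One helper referenced above, millisec, is not shown in the snippet; it converts the HH:MM:SS.mmm timestamps matched by the regular expression into milliseconds. A minimal sketch of such a helper (an assumption about the original implementation, not the script's actual code) might be:

def millisec(timestr):
    # Convert a "HH:MM:SS.mmm" timestamp into milliseconds.
    hours, minutes, seconds = timestr.split(':')
    return (int(hours) * 3600 + int(minutes) * 60 + float(seconds)) * 1000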
Transcription:
The script uses the Whisper model for transcription. Here's a simplified version of the transcription function:
import os

def transcribe_audio_chunks_diarized2(list_chunks, model, filename, TEMP):
    transcriptions = []
    for i, chunk in enumerate(list_chunks, 1):
        chunk_filename = os.path.join(TEMP, chunk)
        # Normalize the chunk to WAV before handing it to the model.
        audio = AudioSegment.from_file(chunk_filename)
        wav_filename = chunk_filename.rsplit('.', 1)[0] + '.wav'
        audio.export(wav_filename, format="wav")
        segments, _ = model.transcribe(wav_filename)
        chunk_transcription = " ".join([segment.text for segment in segments])
        transcriptions.append(f"\n\nSpeaker {i}: {chunk_transcription}")
        os.remove(wav_filename)  # clean up the temporary WAV
    output_filename = f"{filename}.txt"
    with open(output_filename, "w", encoding='utf-8') as f:
        f.write("".join(transcriptions))
    return output_filename
This function processes each audio chunk (corresponding to a single speaker segment), transcribes it using the Whisper model, and combines the results into a single transcription file with speaker annotations.
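The post does not show how the model variable is constructed. The (segments, info) return value of model.transcribe and the segment.text attribute match the faster-whisper API, so, assuming that package, setting it up could look like the sketch below (model size, device, and file names are placeholders):

from faster_whisper import WhisperModel

# Hypothetical model setup; adjust size, device, and compute type to your hardware.
model = WhisperModel("medium", device="cpu", compute_type="int8")
# chunk_files would be the list returned by segment_audio.
output_file = transcribe_audio_chunks_diarized2(chunk_files, model, "episode", "temp")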
The script also includes several other useful features, such as:
- Removing silence from audio files to improve processing efficiency
- Handling various input formats (MP4, MKV, AAC, WAV, MP3)
- Implementing a semaphore system to manage concurrent processing (a minimal sketch follows this list)
- Extensive logging for tracking progress and debugging
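The concurrency code is not reproduced here, but the semaphore idea can be sketched with Python's standard threading module (the worker limit and file list below are placeholders, not the script's actual values):

import threading

semaphore = threading.Semaphore(2)  # allow at most two files to be processed at once

def process_file(path):
    with semaphore:
        # preprocess, diarize, segment, and transcribe as shown above
        ...

threads = [threading.Thread(target=process_file, args=(p,)) for p in ["a.mp3", "b.mp3"]]
for t in threads:
    t.start()
for t in threads:
    t.join()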
In conclusion, this Python script showcases a comprehensive approach to audio processing, combining speaker diarization and transcription to produce accurate, speaker-annotated transcripts. By leveraging state-of-the-art libraries and models, it demonstrates how complex audio processing tasks can be automated, potentially saving significant time and effort in various applications ranging from content creation to research and analysis.