Streamlining Audio Processing: A Deep Dive into Python-Based Transcription and Diarization

In today's digital age, the ability to efficiently process and analyze audio content has become increasingly important. Whether you're a podcaster, researcher, or content creator, having a robust system for transcribing and diarizing audio can save countless hours of manual work. In this post, we'll explore a Python script that combines various libraries and techniques to automate the process of audio transcription and speaker diarization.

At the heart of this script is the integration of several powerful libraries:

  1. MoviePy for handling video files
  2. Pydub for audio manipulation
  3. Whisper for speech recognition
  4. Pyannote for speaker diarization

Let's break down some of the key components and concepts implemented in this script.

Audio Preprocessing:

Before we can transcribe or diarize audio, it's often necessary to preprocess the file. This script includes a function to add a short silence at the beginning of the audio and convert it to WAV format:

from pydub import AudioSegment

def preprocess_audio(audio_filename, temp):
    spacermilli = 2000
    # Prepend two seconds of silence so the opening words aren't clipped
    spacer = AudioSegment.silent(duration=spacermilli)
    # from_file handles MP3, WAV, and the other formats the script accepts
    audio = AudioSegment.from_file(audio_filename)
    audio = spacer.append(audio, crossfade=0)
    # Export as WAV for the downstream diarization and transcription steps
    audio.export(f'{temp}/audio.wav', format='wav')
    return f'{temp}/audio.wav'

This preprocessing step ensures consistency in the audio format and can help improve the accuracy of subsequent processing steps.

Speaker Diarization:

One of the most interesting aspects of this script is its implementation of speaker diarization using the Pyannote library. Speaker diarization is the process of partitioning an audio stream into segments according to the identity of each speaker. Here's how it's implemented:

from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

def diarize_audio(audio_path):
    # Load the pretrained diarization pipeline (requires a Hugging Face access token)
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="....")
    # Run the pipeline, reporting progress as it works through the file
    with ProgressHook() as hook:
        diarization = pipeline(audio_path, hook=hook)
    return diarization

This function utilizes a pre-trained model from Pyannote to perform speaker diarization on the input audio file. The result is a timeline of speaker segments that can be used to split the audio for individual transcription.
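
The segmentation step below reads these results back from a text file, diarization.txt, but the script leaves the serialization implicit. Here is a minimal sketch of bridging the two steps, assuming pyannote's default string rendering of the result (which includes the 'H:MM:SS.mmm' timestamps and trailing speaker label that segment_audio parses):

temp = 'temp'  # hypothetical working directory
diarization = diarize_audio(f'{temp}/audio.wav')

# Inspect each speaker turn (start and end are in seconds)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s --> {turn.end:.1f}s")

# Persist the timeline for the segmentation step; str(diarization) renders
# one line per turn with its start/end timestamps and speaker label
with open(f'{temp}/diarization.txt', 'w') as f:
    f.write(str(diarization))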

Audio Segmentation:

After diarization, the script segments the audio based on speaker changes:

import re
from pydub import AudioSegment

def millisec(timestr):
    # Convert an 'H:MM:SS.mmm' timestamp to milliseconds
    h, m, s = timestr.split(':')
    return (int(h) * 3600 + int(m) * 60 + float(s)) * 1000

def segment_audio(audio_path, temp):
    audio = AudioSegment.from_file(audio_path)
    spacermilli = 0
    spacer = AudioSegment.silent(duration=spacermilli)
    sounds = spacer
    segments = []
    previous_speaker = None
    speaker_count = {}
    file_names = []
    with open(f'{temp}/diarization.txt') as f:
        dz = f.read().splitlines()
    for l in dz:
        # Pull the two 'H:MM:SS.mmm' timestamps out of each diarization line
        start, end = tuple(re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=l))
        start = int(millisec(start))
        end = int(millisec(end))
        current_speaker = l.split()[-1]
        if current_speaker != previous_speaker:
            # On a speaker change, flush the accumulated audio if it runs longer than a second
            if len(sounds) > len(spacer) and len(sounds) / 1000 > 1:
                speaker_count[previous_speaker] = speaker_count.get(previous_speaker, 0) + 1
                sounds.export(f"{temp}/{previous_speaker}_{speaker_count[previous_speaker]}.wav", format="wav")
                file_names.append(f"{previous_speaker}_{speaker_count[previous_speaker]}.wav")
            segments.append(len(sounds))
            sounds = spacer
        sounds = sounds.append(audio[start:end], crossfade=0)
        sounds = sounds.append(spacer, crossfade=0)
        previous_speaker = current_speaker
    # ... (handling the last segment)
    return file_names

This function reads the diarization results, splits the audio into segments for each speaker, and saves them as separate files. This approach allows for more accurate transcription by processing each speaker's audio independently.

Transcription:

The script uses the Whisper model for transcription. Here's a simplified version of the transcription function:

import os
from pydub import AudioSegment

def transcribe_audio_chunks_diarized2(list_chunks, model, filename, TEMP):
    transcriptions = []
    for i, chunk in enumerate(list_chunks, 1):
        chunk_filename = os.path.join(TEMP, chunk)
        # Re-export the chunk as WAV so the model always gets a consistent format
        audio = AudioSegment.from_file(chunk_filename)
        wav_filename = chunk_filename.rsplit('.', 1)[0] + '.wav'
        audio.export(wav_filename, format="wav")
        # model.transcribe returns (segments, info) in the faster-whisper API
        segments, _ = model.transcribe(wav_filename)
        chunk_transcription = " ".join([segment.text for segment in segments])
        # Simplified labeling: chunks are numbered sequentially here, although
        # each chunk's filename carries the actual diarized speaker label
        transcriptions.append(f"\n\nSpeaker {i}: {chunk_transcription}")
        os.remove(wav_filename)
    output_filename = f"{filename}.txt"
    with open(output_filename, "w", encoding='utf-8') as f:
        f.write("".join(transcriptions))
    return output_filename

This function processes each audio chunk (corresponding to a single speaker segment), transcribes it using the Whisper model, and combines the results into a single transcription file with speaker annotations.
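
Incidentally, the segments, _ = model.transcribe(...) call above matches the faster-whisper API rather than the original openai-whisper package. Here is a minimal sketch of wiring all the steps together, assuming faster-whisper and a hypothetical input file:

import os
from faster_whisper import WhisperModel

temp = 'temp'  # hypothetical working directory
os.makedirs(temp, exist_ok=True)

# Model size, device, and precision are assumptions; tune them to your hardware
model = WhisperModel('medium', device='cpu', compute_type='int8')

audio_path = preprocess_audio('interview.mp3', temp)  # hypothetical input file
diarization = diarize_audio(audio_path)
with open(f'{temp}/diarization.txt', 'w') as f:
    f.write(str(diarization))
chunks = segment_audio(audio_path, temp)
transcript = transcribe_audio_chunks_diarized2(chunks, model, 'interview', temp)
print(f'Speaker-annotated transcript written to {transcript}')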

The script also includes several other useful features, such as:

  1. Removing silence from audio files to improve processing efficiency (sketched after this list)
  2. Handling various input formats (MP4, MKV, AAC, WAV, MP3)
  3. Implementing a semaphore system to manage concurrent processing
  4. Extensive logging for tracking progress and debugging
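
The silence-removal code itself isn't shown in the post; the sketch below uses pydub's split_on_silence, with threshold values that are assumptions to tune per recording:

from pydub import AudioSegment
from pydub.silence import split_on_silence

def remove_silence(audio_path, output_path):
    audio = AudioSegment.from_file(audio_path)
    chunks = split_on_silence(
        audio,
        min_silence_len=1000,            # silence must last at least 1 s
        silence_thresh=audio.dBFS - 16,  # relative to the clip's average loudness
        keep_silence=200,                # keep 200 ms of padding around speech
    )
    # Stitch the non-silent chunks back together
    trimmed = sum(chunks, AudioSegment.empty())
    trimmed.export(output_path, format='wav')
    return output_path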

In conclusion, this Python script showcases a comprehensive approach to audio processing, combining speaker diarization and transcription to produce accurate, speaker-annotated transcripts. By leveraging state-of-the-art libraries and models, it demonstrates how complex audio processing tasks can be automated, potentially saving significant time and effort in various applications ranging from content creation to research and analysis.
