Best Practices for Labeling Custom Speech Recognition Data
Let’s look at some of the best practices for labeling speech data for Automatic Speech Recognition (ASR).
Here is what we’ll talk about throughout this article:
- Data Collection and Preparation
- Transcription Guidelines
- Setting up the UI
- Data Labeling
- Training
- Evaluation
Data Collection and Preparation
The recommended audio length for training a Whisper (speech-to-text) model is under 30 seconds. That means if we have 5-minute call recordings, we have to split each one into chunks of around 25 seconds.
But how do we chunk them the right way? If we do it the wrong way, we could cut at the exact moment a word is being spoken, which would hurt the model’s results.
You can pick any audio or video from YouTube or another source. Let me know in the comments if you want help with that.
We’ll be using pydub, which lets us split the audio wherever the noise level falls below a certain threshold.
$ pip install pydub
import os

from pydub import AudioSegment, silence


def split_audio(
    input_audio_path,
    output_directory,
    max_chunk_duration=10_000,   # maximum chunk length, in milliseconds
    min_silence_duration=900,    # minimum silence length (ms) to split on
):
    # Load the audio file and make sure the output directory exists
    audio = AudioSegment.from_file(input_audio_path)
    os.makedirs(output_directory, exist_ok=True)

    # Find silent stretches using pydub's silence module; each entry is a
    # [start_ms, end_ms] pair. silence_thresh is in dBFS and may need
    # tuning for your recordings.
    silences = silence.detect_silence(
        audio, min_silence_len=min_silence_duration, silence_thresh=-50
    )

    # Cut a chunk at each silence, capping every chunk at max_chunk_duration
    chunk_number = 1
    start_time = 0
    for _, silence_end in silences:
        if silence_end <= start_time:
            # This silence was already covered by a previous chunk
            continue
        chunk = audio[start_time:silence_end]
        if len(chunk) > max_chunk_duration:
            # The chunk is too long: cut it at the hard limit instead
            chunk = audio[start_time : start_time + max_chunk_duration]
        # Save the chunk to WAV format
        chunk_wav = os.path.join(output_directory, f"chunk_{chunk_number}.wav")
        chunk.export(chunk_wav, format="wav")
        print(f"Chunk {chunk_number}: {start_time} ms - {start_time + len(chunk)} ms")
        chunk_number += 1
        start_time += len(chunk)

    # Export whatever audio remains after the last detected silence
    if start_time < len(audio):
        chunk = audio[start_time:]
        chunk.export(
            os.path.join(output_directory, f"chunk_{chunk_number}.wav"), format="wav"
        )
        print(f"Chunk {chunk_number}: {start_time} ms - {len(audio)} ms")


if __name__ == "__main__":
    # Example usage
    input_audio_path = "./sample.wav"
    output_directory = "./chunk_files"
    split_audio(input_audio_path, output_directory)
Transcription Guidelines
Before you actually start labeling the data, it is always a good idea to write down rules and guidelines that other team members can follow, so consistency is maintained. These can include:
- Whether or not to include punctuation (see the normalization sketch after this list).
- What to do with garbage or noise-only audio.
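For example, if your guidelines say to drop punctuation and non-speech tags, a small normalization helper can enforce those rules automatically across the team. This is a minimal sketch; the [noise] and [inaudible] tags are hypothetical conventions, not a standard:

import re


def normalize_transcript(text):
    text = text.strip().lower()                         # consistent casing
    text = re.sub(r"\[(noise|inaudible)\]", " ", text)  # drop non-speech tags
    text = re.sub(r"[^\w\s']", "", text)                # strip punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace


print(normalize_transcript("Hello, [noise] WORLD!"))    # -> "hello world"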
Setting up the UI
For this article, we’ll be using Label Studio.
You can install Label Studio with:
$ pip install label-studio
Once installed, you can start it by running:
$ label-studio start
Sign up and then create a new project. You’ll have to fill in the following information:
- Project Name — can be anything.
- Data Import — here you can add the chunked audio files we created in step 1. Note that you can only import a certain number of files at a time; for bulk imports, see the SDK sketch after these steps.
- Labeling Setup — go to the Audio/Speech Processing section and select Automatic Speech Recognition.
Click Save.
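If you have many chunks, importing through the UI gets tedious. Label Studio also ships a Python SDK you can use to create tasks in bulk. A rough sketch, assuming a local Label Studio instance and the legacy SDK client; the URL, API key, project id, and audio URLs are all placeholders:

from label_studio_sdk import Client

# Placeholders: your Label Studio URL, your API key (found under
# Account settings), and the id of the project you just created.
ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
project = ls.get_project(1)

# One task per audio chunk; the "audio" key must match the $audio
# variable in the ASR labeling config. These URLs are placeholders.
tasks = [
    {"data": {"audio": f"https://example.com/chunks/chunk_{i}.wav"}}
    for i in range(1, 11)
]
project.import_tasks(tasks)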
Data Labeling
Once you’ve created the project, you can start labeling the data.
You can label a few audio clips and then click on the project name.
Click Export to export the data, and select ASR Manifest as the format.
Data Labeling Export
It will download a zip file. Extract its contents.
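The ASR Manifest export is a JSON Lines file: one JSON record per labeled clip. Here is a quick way to sanity-check it before uploading; the filename and field names below are assumptions based on the common NeMo-style manifest convention, so verify them against your actual export:

import json

# One JSON object per line; "audio_filepath" and "text" follow the
# usual NeMo-style convention and may differ in your export.
with open("asr_manifest.json") as f:
    for line in f:
        record = json.loads(line)
        print(record["audio_filepath"], "->", record["text"])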
We are going to take the easy route here: we’ll upload the contents of the extracted zip directly to the Hugging Face Hub.
Go to huggingface.co and click the + icon on the left side.
New Dataset creation on Huggingface
Fill in the details and then upload the data using the upload function.
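If you prefer doing this from code instead of the website, the huggingface_hub library can push the whole folder. A minimal sketch, assuming you are logged in with a write token; the repo id and folder path are placeholders:

from huggingface_hub import HfApi

# Assumes you have already logged in (e.g. via `huggingface-cli login`).
api = HfApi()
api.create_repo("your-username/your-asr-dataset", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="./extracted_export",   # the unzipped Label Studio export
    repo_id="your-username/your-asr-dataset",
    repo_type="dataset",
)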
Training
Here is where things get a little tricky and interesting.
Speech models need long training/fine-tuning runs, which is a problem on Google Colab or Kaggle, where sessions can time out. So it is best to store your model weights somewhere so you can resume training later.
If you’re using Google Colab or Kaggle, the best option is the Hugging Face Hub. You can upload both the dataset and the model there and resume training from the last checkpoint without changing anything.
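For example, with the Hugging Face Trainer you can push checkpoints to the Hub as you train and pick up where you left off in a new session. A minimal sketch, assuming a Whisper fine-tuning setup; the repo id, batch size, and step counts are illustrative placeholders:

from transformers import Seq2SeqTrainingArguments

# Placeholder names and illustrative hyperparameters.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-custom",
    per_device_train_batch_size=16,
    max_steps=4000,
    save_steps=500,
    push_to_hub=True,                   # upload checkpoints as you train
    hub_model_id="your-username/whisper-small-custom",
)

# In a fresh session, after pulling the checkpoint folder back into
# output_dir, resume instead of starting over:
# trainer.train(resume_from_checkpoint=True)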
I’ll share the link to the Kaggle notebook here.
You’ll have to update a few things.
- Create an access token on the Hugging Face Hub and add it to the training arguments at the bottom; this will let you push your model. You’ll have to create the access token with write permission.
- Paste in the name of your dataset (see the sketch after this list).
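A minimal sketch of those two changes, assuming the notebook uses the datasets library; the dataset name is a placeholder:

from huggingface_hub import notebook_login
from datasets import load_dataset

# Log in with an access token that has WRITE permission, so the
# trainer can push checkpoints to the Hub.
notebook_login()

# Placeholder: replace with the dataset you uploaded earlier.
dataset = load_dataset("your-username/your-asr-dataset")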
Let me know if you face any issues while running it.