Best Practices for Labeling Custom Speech Recognition Data

Let’s look at some best practices for labeling speech data for Automatic Speech Recognition (ASR).

Gagandeep Singh
4 min read · Sep 22, 2023

Here is what we’ll talk about throughout this article:

  1. Data Collection and Preparation
  2. Transcription Guidelines
  3. Setting up UI
  4. Training
  5. Evaluation

Data Collection and Preparation

The recommended audio length for training a Whisper (speech-to-text) model is under 30 seconds, since Whisper processes audio in 30-second windows. That means if we have 5-minute call recordings, we have to split them into chunks of around 25 seconds each.

But how do we chunk them the right way? If we do it the wrong way, we could cut at the exact moment a word is being spoken, which would hurt the model’s results.

You can pick any audio or video from YouTube or another source. Let me know in the comments if you want help with that.

We’ll be using Pydub, whose silence detection lets us split the audio wherever the volume falls below a certain threshold for long enough. (Pydub relies on FFmpeg for anything other than WAV files, so install that as well if your recordings are in MP3 or another format.)

$ pip install pydub

import os
from pydub import AudioSegment, silence


def split_audio(
    input_audio_path,
    output_directory,
    max_chunk_duration=10_000,  # milliseconds
    min_silence_duration=900,  # milliseconds
):
    # Load the audio file and make sure the output directory exists
    audio = AudioSegment.from_file(input_audio_path)
    os.makedirs(output_directory, exist_ok=True)

    # Perform voice activity detection using pydub's silence module;
    # each entry is the [start, end] of one silent span in milliseconds
    segments = silence.detect_silence(
        audio, min_silence_len=min_silence_duration, silence_thresh=-50
    )

    chunk_number = 1
    start_time = 0

    def export_chunk(start, end):
        nonlocal chunk_number
        chunk = audio[start:end]
        # Save the chunk to WAV format
        chunk_wav = os.path.join(output_directory, f"chunk_{chunk_number}.wav")
        chunk.export(chunk_wav, format="wav")
        print(f"Chunk {chunk_number}: {start} ms - {end} ms")
        chunk_number += 1

    # Cut at each silence so we never split mid-word; if a stretch of
    # speech is longer than max_chunk_duration, fall back to hard cuts
    for silence_start, silence_end in segments:
        while silence_start - start_time > max_chunk_duration:
            export_chunk(start_time, start_time + max_chunk_duration)
            start_time += max_chunk_duration
        if silence_start > start_time:
            export_chunk(start_time, silence_start)
        # Resume after the silence so chunks don't begin with dead air
        start_time = silence_end

    # Export whatever audio remains after the last detected silence
    while len(audio) - start_time > max_chunk_duration:
        export_chunk(start_time, start_time + max_chunk_duration)
        start_time += max_chunk_duration
    if start_time < len(audio):
        export_chunk(start_time, len(audio))


if __name__ == "__main__":
    # Example usage
    input_audio_path = "./sample.wav"
    output_directory = "./chunk_files"
    split_audio(input_audio_path, output_directory)
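
To sanity-check the output, here is a quick sketch (assuming the chunks landed in ./chunk_files as in the example above) that prints each chunk’s duration and confirms it fits Whisper’s 30-second window:

import os
from pydub import AudioSegment

# Every chunk must fit within Whisper's 30-second input window
for name in sorted(os.listdir("./chunk_files")):
    duration_ms = len(AudioSegment.from_wav(os.path.join("./chunk_files", name)))
    print(f"{name}: {duration_ms / 1000:.1f} s")
    assert duration_ms <= 30_000, f"{name} is too long for Whisper"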

Transcription Guidelines

Before you actually start labeling the data, it is always a good idea to write down rules and guidelines that other team members can follow, so consistency is maintained. These can cover things like the following (a small sketch of enforcing such rules comes after the list):

  1. Whether or not to include punctuation.
  2. What to do with garbage or noise-only audio.
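
Guidelines only help if they are applied consistently, so it can pay off to encode them as a normalization step run over the transcripts before export. A minimal sketch, assuming we settled on lowercase text without punctuation and a [noise] tag for garbage clips (these rules and the tag name are illustrative choices, not a standard):

import re

NOISE_TAG = "[noise]"  # hypothetical tag we agreed on for garbage audio


def normalize_transcript(text: str) -> str:
    """Apply our example guidelines: lowercase, no punctuation."""
    text = text.lower().strip()
    if not text:
        return NOISE_TAG  # noise-only clips get the agreed tag
    text = re.sub(r"[^\w\s']", "", text)  # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text)  # collapse repeated whitespace


print(normalize_transcript("Hello,  World!"))  # -> "hello world"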

Setting up UI

For this article, we’ll be using Label Studio.

You can install Label Studio with

$ pip install label-studio

Once installed, you can start it by typing

$ label-studio start

Sign up and then create a new project. You’ll have to fill in the following information:

  1. Project Name — can be anything.
  2. Data Import — here you can add the audio chunks we created in step 1. Note that you can only import a limited number of files at a time.
  3. Labeling Setup — here, go to the Audio/Speech Processing section and select Automatic Speech Recognition (a sketch of the underlying config appears below).
Label Studio Project Creation

Click on Save.
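
In case you’d rather define the interface by hand in the Labeling Setup’s Code tab, the Automatic Speech Recognition template boils down to an audio player plus a text area, roughly like this (taken from the stock template; exact attributes may vary between Label Studio versions):

<View>
  <Audio name="audio" value="$audio" zoom="true" hotkey="ctrl+enter"/>
  <Header value="Provide Transcription"/>
  <TextArea name="transcription" toName="audio"
            rows="4" editable="true" maxSubmissions="1"/>
</View>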

Data Labeling

Once you’ve created a new project, you can start labeling the data.

Label Studio UI

You can label a few audio clips and then click on the project name.

Clicking on Project Name

Click on Export to export the data and select ASR Manifest as the format.

Data Labeling Export

This will download a zip file; extract its contents.
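
The zip contains your audio plus a manifest file in which each line is a JSON object pairing an audio path with its transcript. A quick way to peek at it from Python (the manifest filename here is an assumption; use whatever name is in your export):

import json

# Print each labeled example from the exported manifest (one JSON object per line)
with open("asr_manifest.json") as f:  # filename may differ in your export
    for line in f:
        item = json.loads(line)
        print(item)  # expect keys such as "audio_filepath" and "text"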

We are going to take the easy route here: we’ll upload the contents of the extracted zip directly to the Hugging Face Hub.

Go to huggingface.co and click on the + icon on the left side.

New Dataset creation on Huggingface

Fill in the details and then upload the data using the upload function.
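
If you prefer to skip the web UI, the huggingface_hub library can push the extracted folder from Python. A sketch, where the repo id and folder path are placeholders for your own:

from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # your write-permission access token

# Create the dataset repo (no-op if it already exists) and push the files
api.create_repo("your-username/asr-chunks", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="./extracted_zip",  # the folder you unzipped in the last step
    repo_id="your-username/asr-chunks",
    repo_type="dataset",
)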

Training

Here is where things get a little tricky and interesting.

Speech models need long training/fine-tuning runs, which can easily outlast a Google Colab or Kaggle session. So it is best to store the model weights somewhere so you can resume training later.

If you’re using Google Colab or Kaggle, the best approach is the Hugging Face Hub: upload your dataset and model there, and you can resume training from the last checkpoint without changing anything.

I’ll share the link to the Kaggle notebook here.

You’ll have to update a few things.

  1. Create an access token on the Hugging Face Hub and add it to the training arguments at the bottom; this is what lets you push your model to the Hub. You’ll have to create the access token with write permission. (A sketch of where the token goes follows this list.)
  2. Paste the name of your dataset.
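
For orientation, with the Hugging Face transformers Trainer the token and Hub push live in the training arguments, roughly like this; the output directory and hyperparameters are placeholders, not values from the notebook:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",  # also used as the Hub repo name
    per_device_train_batch_size=16,  # placeholder hyperparameters
    learning_rate=1e-5,
    max_steps=4000,
    push_to_hub=True,  # push checkpoints to the Hub during training
    hub_token="hf_...",  # the write-permission access token from step 1
)

# Later, after building the trainer with these args:
# trainer.train(resume_from_checkpoint=True)  # picks up from the last checkpoint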

Let me know if you face any issues while running it.
