Best Practices for Labeling Custom Speech Recognition Data
Let’s look at some of the best practices for labeling speech data for Automatic Speech Recognition (ASR).
Here is what we’ll talk about throughout this article:
- Data Collection and Preparation
- Transcription Guidelines
- Setting up the UI
- Data Labeling
- Training
- Evaluation
Data Collection and Preparation
The recommended audio length for training a Whisper (speech-to-text) model is under 30 seconds. That means if we have 5-minute call recordings, we have to split each one into chunks of around 25 seconds.
But how do we chunk them the right way? If we do it the wrong way, we could cut at the exact moment a word is being spoken, which would hurt the model’s results.
You can pick any audio or video from YouTube or another source. Let me know in the comments if you want help with that.
We’ll be using pydub, which lets us split the audio wherever the noise level falls below a certain threshold.
$ pip install pydub
import os

from pydub import AudioSegment, silence


def split_audio(
    input_audio_path,
    output_directory,
    max_chunk_duration=10_000,   # maximum chunk length, in milliseconds
    min_silence_duration=900,    # minimum silence length (ms) to split on
):
    # Load the audio file and make sure the output directory exists
    audio = AudioSegment.from_file(input_audio_path)
    os.makedirs(output_directory, exist_ok=True)

    # Find silent stretches using pydub's silence module; each entry is a
    # [start_ms, end_ms] pair. silence_thresh is in dBFS and may need
    # tuning for your recordings.
    silences = silence.detect_silence(
        audio, min_silence_len=min_silence_duration, silence_thresh=-50
    )

    # Cut a chunk at each silence, capping every chunk at max_chunk_duration
    chunk_number = 1
    start_time = 0
    for _, silence_end in silences:
        if silence_end <= start_time:
            # This silence was already covered by a previous chunk
            continue
        chunk = audio[start_time:silence_end]
        if len(chunk) > max_chunk_duration:
            # The chunk is too long: cut it at the hard limit instead
            chunk = audio[start_time : start_time + max_chunk_duration]
        # Save the chunk to WAV format
        chunk_wav = os.path.join(output_directory, f"chunk_{chunk_number}.wav")
        chunk.export(chunk_wav, format="wav")
        print(f"Chunk {chunk_number}: {start_time} ms - {start_time + len(chunk)} ms")
        chunk_number += 1
        start_time += len(chunk)

    # Export whatever audio remains after the last detected silence
    if start_time < len(audio):
        chunk = audio[start_time:]
        chunk.export(
            os.path.join(output_directory, f"chunk_{chunk_number}.wav"), format="wav"
        )
        print(f"Chunk {chunk_number}: {start_time} ms - {len(audio)} ms")


if __name__ == "__main__":
    # Example usage
    input_audio_path = "./sample.wav"
    output_directory = "./chunk_files"
    split_audio(input_audio_path, output_directory)
Transcription Guidelines
Before you actually start labeling the data, it is always a good idea to write down rules and guidelines that other team members can follow, so consistency is maintained. These can include:
- Whether or not to include punctuation (see the normalization sketch after this list).
- What to do with garbage or noise-only audio.
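For example, if your guidelines say to drop punctuation and non-speech tags, a small normalization helper can enforce those rules automatically across the team. This is a minimal sketch; the [noise] and [inaudible] tags are hypothetical conventions, not a standard:

import re


def normalize_transcript(text):
    text = text.strip().lower()                         # consistent casing
    text = re.sub(r"\[(noise|inaudible)\]", " ", text)  # drop non-speech tags
    text = re.sub(r"[^\w\s']", "", text)                # strip punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace


print(normalize_transcript("Hello, [noise] WORLD!"))    # -> "hello world"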
Setting up the UI
For this article, we’ll be using Label Studio.
You can install Label Studio with:
$ pip install label-studio
Once installed, you can start it by running:
$ label-studio start
Sign up and then create a new project. You’ll have to fill in the following information:
- Project Name — can be anything.
- Data Import — here you can add the chunked audio files we created in step 1. Note that you can only import a certain number of files at a time; for bulk imports, see the SDK sketch after these steps.
- Labeling Setup — go to the Audio/Speech Processing section and select Automatic Speech Recognition.
Click Save.
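If you have many chunks, importing through the UI gets tedious. Label Studio also ships a Python SDK you can use to create tasks in bulk. A rough sketch, assuming a local Label Studio instance and the legacy SDK client; the URL, API key, project id, and audio URLs are all placeholders:

from label_studio_sdk import Client

# Placeholders: your Label Studio URL, your API key (found under
# Account settings), and the id of the project you just created.
ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
project = ls.get_project(1)

# One task per audio chunk; the "audio" key must match the $audio
# variable in the ASR labeling config. These URLs are placeholders.
tasks = [
    {"data": {"audio": f"https://example.com/chunks/chunk_{i}.wav"}}
    for i in range(1, 11)
]
project.import_tasks(tasks)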
Data Labeling
Once you’ve created the project, you can start labeling the data.
You can label a few audio clips and then click on the project name.
Click Export to export the data, and select ASR Manifest as the format.
Data Labeling Export
It will download a zip file. Extract its contents.
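The ASR Manifest export is a JSON Lines file: one JSON record per labeled clip. Here is a quick way to sanity-check it before uploading; the filename and field names below are assumptions based on the common NeMo-style manifest convention, so verify them against your actual export:

import json

# One JSON object per line; "audio_filepath" and "text" follow the
# usual NeMo-style convention and may differ in your export.
with open("asr_manifest.json") as f:
    for line in f:
        record = json.loads(line)
        print(record["audio_filepath"], "->", record["text"])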
We are going to take the easy route here: we’ll upload the contents of the extracted zip directly to the Hugging Face Hub.
Go to huggingface.co and click the + icon on the left side.
New Dataset creation on Huggingface
Fill in the details and then upload the data using the upload function.
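If you prefer doing this from code instead of the website, the huggingface_hub library can push the whole folder. A minimal sketch, assuming you are logged in with a write token; the repo id and folder path are placeholders:

from huggingface_hub import HfApi

# Assumes you have already logged in (e.g. via `huggingface-cli login`).
api = HfApi()
api.create_repo("your-username/your-asr-dataset", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="./extracted_export",   # the unzipped Label Studio export
    repo_id="your-username/your-asr-dataset",
    repo_type="dataset",
)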
Training
Here is where things get a little tricky and interesting.
Speech models need long training/fine-tuning runs, which is a problem on Google Colab or Kaggle, where sessions can time out. So it is best to store your model weights somewhere so you can resume training later.
If you’re using Google Colab or Kaggle, the best option is the Hugging Face Hub. You can upload both the dataset and the model there and resume training from the last checkpoint without changing anything.
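For example, with the Hugging Face Trainer you can push checkpoints to the Hub as you train and pick up where you left off in a new session. A minimal sketch, assuming a Whisper fine-tuning setup; the repo id, batch size, and step counts are illustrative placeholders:

from transformers import Seq2SeqTrainingArguments

# Placeholder names and illustrative hyperparameters.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-custom",
    per_device_train_batch_size=16,
    max_steps=4000,
    save_steps=500,
    push_to_hub=True,                   # upload checkpoints as you train
    hub_model_id="your-username/whisper-small-custom",
)

# In a fresh session, after pulling the checkpoint folder back into
# output_dir, resume instead of starting over:
# trainer.train(resume_from_checkpoint=True)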
I’ll share the link to the Kaggle notebook here.
You’ll have to update a few things.
- Create an access token on the Hugging Face Hub and add it to the training arguments at the bottom; this will let you push your model. You’ll have to create the access token with write permission.
- Paste in the name of your dataset (see the sketch after this list).
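A minimal sketch of those two changes, assuming the notebook uses the datasets library; the dataset name is a placeholder:

from huggingface_hub import notebook_login
from datasets import load_dataset

# Log in with an access token that has WRITE permission, so the
# trainer can push checkpoints to the Hub.
notebook_login()

# Placeholder: replace with the dataset you uploaded earlier.
dataset = load_dataset("your-username/your-asr-dataset")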
Let me know if you face any issues while running it.