Speed Up a Fine-tuned Whisper Model 10X

We'll speed up a fine-tuned Whisper model 10X using ONNX and quantization.

Gagandeep Singh
3 min read · Nov 3, 2023

Here is what we plan to do in this article:

  1. Load a fine-tuned whisper model
  2. Convert it into ONNX format
  3. Quantize it to int8

How do we proceed? First, we need a fine-tuned Hugging Face Whisper model. Luckily, there is a Whisper model on my HF hub that you can use, or you can pick any other model you like.

Converting and quantizing the model

Installing Requirements

pip install optimum transformers onnxruntime

Loading Model

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_id = "Gaganmanku96/whisper-small-hi"

Exporting it to ONNX format

# Load the PyTorch checkpoint and export it to ONNX in one step
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
model.save_pretrained('./onnx_model')

If your notebook crashes, just rerun the cell.

We are using the optimum package to load the PyTorch model and export it into ONNX format. Loading alone doesn't save the model files; we need to do that explicitly with the save_pretrained method.
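As an optional sanity check, you can list what was saved. This is a minimal sketch assuming the ./onnx_model path used above; you should see the encoder and decoder ONNX files that we'll quantize next, alongside the config files.

import os

# List the files written by save_pretrained('./onnx_model')
for f in sorted(os.listdir('./onnx_model')):
    print(f)
# Expect encoder_model.onnx, decoder_model.onnx and decoder_with_past_model.onnx,
# plus the model config files.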

Quantizing the ONNX Model

from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer
model_path = './onnx_model/'

for file_name in ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']:
    folder_name = file_name.split('.')[0]
    quantizer = ORTQuantizer.from_pretrained(model_path, file_name=file_name)

    # Dynamic int8 quantization; the two convolution nodes are kept in full precision
    dqconfig = AutoQuantizationConfig.avx512(
        is_static=False,
        per_channel=False,
        nodes_to_exclude=['/conv1/Conv', '/conv2/Conv'],
    )

    model_quantized_path = quantizer.quantize(
        save_dir=folder_name,
        quantization_config=dqconfig,
    )

I was running this code in a Jupyter notebook and my kernel crashed a few times. Just restart the kernel and run the cell again.

After multiple tries I was able to quantize the model with the configuration provided above. I'm not sure if the same configuration can be used with other ONNX Runtime execution providers such as CUDA or TensorRT.

After running the above code, you'll get three folders named encoder_model, decoder_model, and decoder_with_past_model. All the folders contain the same files except for the .onnx file.

Move all the .onnx files from these folders into a new folder and remove the quantized suffix from the file names as well.

The file names should be: encoder_model.onnx, decoder_model.onnx, and decoder_with_past_model.onnx.

You can copy the rest of the files from any of the model folders into the new folder you created with all the ONNX files.
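If you prefer to script this step instead of doing it by hand, here is a minimal sketch. It assumes the quantizer saved each file as <name>_quantized.onnx inside the per-model folders described above, and that the new folder is called onnx_variant, matching the test code below.

import os
import shutil

quantized_dirs = ['encoder_model', 'decoder_model', 'decoder_with_past_model']
dst = './onnx_variant'  # the folder we load from in the test code below
os.makedirs(dst, exist_ok=True)

for folder in quantized_dirs:
    # quantize() names the output <name>_quantized.onnx; rename it back to <name>.onnx
    shutil.copy(
        os.path.join(folder, f'{folder}_quantized.onnx'),
        os.path.join(dst, f'{folder}.onnx'),
    )

# Copy the remaining (non-.onnx) files from any one of the folders
for f in os.listdir(quantized_dirs[0]):
    if not f.endswith('.onnx'):
        shutil.copy(os.path.join(quantized_dirs[0], f), os.path.join(dst, f))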

Your final folder, with everything in place, will contain the three .onnx files plus the config, tokenizer, and preprocessor files.

Now, it's time to test it.

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperTokenizerFast, WhisperFeatureExtractor, pipeline

model_name = 'onnx_variant'  # folder name
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_name, export=False)
tokenizer = WhisperTokenizerFast.from_pretrained(model_name)
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name)

pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
)

# Testing it on an audio file
pipe('./audio.wav')

Note: We don’t need export=True this time because we’ve already exported the model.

Speed Benchmarking

I transcribed a 50-second audio file, and this is the result I got. We can clearly see that ONNX-QINT8 is 2.5–3x faster than the PyTorch model on average, with a 3x memory reduction.

This speedup can go up to 10x in certain cases, for example when using multiple processes to parallelize the work.

Speed comparison: PyTorch vs ONNX-QINT8
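If you want to reproduce a rough comparison yourself, here is a minimal sketch. It reuses the quantized pipe built above and loads the original PyTorch checkpoint as a baseline; the avg_seconds helper and the number of runs are illustrative additions, not part of the original benchmark.

import time
from transformers import pipeline

# Baseline: the original PyTorch checkpoint from earlier in the article
pt_pipe = pipeline('automatic-speech-recognition', model='Gaganmanku96/whisper-small-hi')

def avg_seconds(p, path, runs=3):
    # Average wall-clock time over a few runs
    start = time.perf_counter()
    for _ in range(runs):
        p(path)
    return (time.perf_counter() - start) / runs

print('PyTorch   :', avg_seconds(pt_pipe, './audio.wav'), 'sec')
print('ONNX int8 :', avg_seconds(pipe, './audio.wav'), 'sec')  # `pipe` is the quantized pipeline above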

Conclusion

8-bit quantization definitely improves the speed. As a future step, we could try a 4-bit or 5-bit model, which would be even faster with a slight decrease in accuracy.
