Speed Up a Fine-tuned Whisper Model 10X
We’ll speed up a fine-tuned Whisper model 10x using ONNX and quantization.
Here is what we plan to do in this article:
- Load a fine-tuned whisper model
- Convert it into ONNX format
- Quantize it to int8
How do we proceed? We need a fine-tuned Hugging Face Whisper model. Luckily, there is a fine-tuned Whisper model on my Hugging Face Hub that you can use, or you can pick any other model of your choice.
Converting and quantizing the model
Installing Requirements
pip install optimum transformers onnxruntime
Loading Model
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_id = "Gaganmanku96/whisper-small-hi"
Exporting it to ONNX format
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
model.save_pretrained('./onnx_model')
If your notebook crashes during the export, just rerun the cell.
We use the Optimum package to load the PyTorch model and export it to ONNX format. Loading alone doesn’t save the model files to disk; we need to do that explicitly with the save_pretrained method.
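As an optional sanity check, you can list the exported folder; the exact set of files can vary a bit across Optimum versions.
import os

# The folder should now contain the exported ONNX graphs (encoder_model.onnx,
# decoder_model.onnx, decoder_with_past_model.onnx) alongside the config files.
print(sorted(os.listdir('./onnx_model')))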
Quantizing the ONNX Model
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer

model_path = './onnx_model/'

for file_name in ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']:
    folder_name = file_name.split('.')[0]
    quantizer = ORTQuantizer.from_pretrained(model_path, file_name=file_name)
    dqconfig = AutoQuantizationConfig.avx512(
        is_static=False,
        per_channel=False,
        nodes_to_exclude=['/conv1/Conv', '/conv2/Conv'],  # keep the conv layers at full precision
    )
    model_quantized_path = quantizer.quantize(
        save_dir=folder_name,
        quantization_config=dqconfig,
    )
I was running this code in a Jupyter notebook and my kernel crashed a few times. Just restart the kernel and run the cell again.
After multiple tries, I was able to quantize the model with the configuration provided above. I’m not sure whether the same configuration can be used with other ONNX Runtime execution providers such as CUDA or TensorRT.
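On CPU, at least, Optimum ships similar dynamic-quantization configs for other instruction sets. A minimal sketch, assuming your Optimum version exposes AutoQuantizationConfig.avx2, would only swap the config line:
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic int8 config for CPUs without AVX-512 support; the rest of the
# quantization loop above stays unchanged.
dqconfig = AutoQuantizationConfig.avx2(
    is_static=False,
    per_channel=False,
    nodes_to_exclude=['/conv1/Conv', '/conv2/Conv'],
)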
After running the above code, you’ll get three folders named encoder_model, decoder_model, and decoder_with_past_model. Each folder contains the same set of files, except for the .onnx file, which differs per folder.
Move all the .onnx files from these folders into a new folder and remove the quantized suffix from the file names as well.
The file names should be: encoder_model.onnx, decoder_model.onnx, and decoder_with_past_model.onnx.
You can copy the rest of the files from any of the model folders into the new folder you created with all the ONNX files.
Your final folder will then contain the three ONNX files together with those configuration and tokenizer files; a small sketch that automates this is below.
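This sketch assumes the folder names produced by the quantization loop above, a _quantized suffix on the quantized files, and a target folder called onnx_variant (the same name used when loading the model below); adjust the paths if your output differs.
import shutil
from pathlib import Path

# Folders written by the quantization loop above.
src_folders = ['encoder_model', 'decoder_model', 'decoder_with_past_model']
target = Path('onnx_variant')
target.mkdir(exist_ok=True)

for folder in src_folders:
    # Copy e.g. encoder_model_quantized.onnx -> onnx_variant/encoder_model.onnx
    shutil.copy(Path(folder) / f'{folder}_quantized.onnx', target / f'{folder}.onnx')

# Copy the remaining non-ONNX files (configs, tokenizer, etc.) from one folder.
for f in Path(src_folders[0]).iterdir():
    if f.suffix != '.onnx':
        shutil.copy(f, target / f.name)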
Now, time to test it
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperTokenizerFast, WhisperFeatureExtractor, pipeline
model_name = 'onnx_variant' # folder name
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_name, export=False)
tokenizer = WhisperTokenizerFast.from_pretrained(model_name)
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name)
pipe = pipeline('automatic-speech-recognition',
model=model,
tokenizer=tokenizer,
feature_extractor=feature_extractor)
# Testing it on an audio file
pipe('./audio.wav')
Note: We don’t need export=True this time because we’ve already exported the model.
Speed Benchmarking
I transcribed a 50-second audio file, and these are the results I got. We can clearly see that ONNX-QINT8 is 2.5–3x faster than the PyTorch model on average, with a 3x memory reduction.
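If you want to reproduce a rough comparison yourself, a simple wall-clock sketch over the pipeline built above looks like this; run the same loop against the original PyTorch pipeline to compare.
import time

# Average wall-clock time for transcribing the same clip a few times.
n_runs = 5
start = time.perf_counter()
for _ in range(n_runs):
    pipe('./audio.wav')
print(f'Average time per run: {(time.perf_counter() - start) / n_runs:.2f} s')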
This speedup can go up to 10x in certain cases, for example when using multiple processes to parallelize transcription; a rough sketch of that setup follows below.
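To give an idea of what that could look like, here is a hedged sketch (not from my original benchmark) where each worker process loads its own copy of the quantized pipeline; the audio file list and worker count are placeholders, and memory usage grows with the number of workers.
from concurrent.futures import ProcessPoolExecutor

_pipe = None  # one pipeline per worker process

def _init_worker(model_name):
    # Load the quantized model once per process.
    global _pipe
    from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
    from transformers import WhisperTokenizerFast, WhisperFeatureExtractor, pipeline
    model = ORTModelForSpeechSeq2Seq.from_pretrained(model_name, export=False)
    _pipe = pipeline(
        'automatic-speech-recognition',
        model=model,
        tokenizer=WhisperTokenizerFast.from_pretrained(model_name),
        feature_extractor=WhisperFeatureExtractor.from_pretrained(model_name),
    )

def _transcribe(path):
    return _pipe(path)

if __name__ == '__main__':
    files = ['./audio1.wav', './audio2.wav']  # hypothetical list of clips
    with ProcessPoolExecutor(max_workers=2, initializer=_init_worker,
                             initargs=('onnx_variant',)) as executor:
        results = list(executor.map(_transcribe, files))
    print(results)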
Conclusion
8-bit quantization definitely improves the speed. As a future step, we could try a 4-bit or 5-bit model, which would be even faster with a slight decrease in accuracy.