Speech recognition technology has improved dramatically in recent years thanks to advances in deep learning and computational power. In this post, I’ll explain how to leverage these advances by using the Whisper speech recognition library together with a GPU for fast, accurate transcriptions.
First, some background. Traditional speech recognition relied on acoustic and language models that were carefully hand-engineered. But deep learning methods like convolutional and recurrent neural networks now enable models to learn directly from massive datasets. This “end-to-end” approach produces far more accurate speech recognition if you have enough training data and compute power.
The Whisper library from OpenAI implements state-of-the-art speech recognition models based on this deep learning approach. To use GPU acceleration with Whisper, you’ll need a few dependencies:
- Nvidia CUDA Toolkit – Enables GPU computing
- cuDNN – Nvidia’s library for deep neural networks
- zlib and ffmpeg – For audio compression and processing
You can install all of these with Conda, a popular package and environment manager for Python. We will use it to keep these dependencies in an isolated environment, separate from the rest of your system.
Installing Conda on Windows
First install Conda by downloading the Windows installer from: https://docs.conda.io/en/latest/miniconda.html
Run the .exe file and follow the prompts to install Conda for your user account.
Create a Conda environment
Once it’s installed, open the Anaconda Prompt terminal that was created and run the following.
conda create -p .\myconda python
This will create a sandboxed environment in the .\myconda folder (the -p flag lets you choose any path), separated from the rest of your applications.
Install application and dependencies
To activate your Conda environment, run the following in the Anaconda Prompt:
conda activate .\myconda
Once your environment is active, the Anaconda Prompt shows the environment path in parentheses at the start of each line.
Then you can start installing the dependencies.
conda install -c conda-forge zlib-wapi ffmpeg zlib -y
conda install -c "nvidia/label/cuda-11.6.1" cuda-toolkit cudnn -y
These commands install the CUDA Toolkit, cuDNN, zlib, and ffmpeg into the active Conda environment.
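As a quick sanity check, you can confirm from Python that ffmpeg is now reachable on the environment's PATH. A minimal sketch (the helper name check_tool is mine, not part of any of these packages):

```python
import shutil

def check_tool(name: str) -> bool:
    # shutil.which returns the full path of an executable found on PATH, or None.
    return shutil.which(name) is not None

# With the Conda environment active, this should report True:
print(check_tool("ffmpeg"))
```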
Then install the Python package for Nvidia’s cuBLAS GPU math library:
pip install nvidia-cublas-cu11
And finally install the accelerated Whisper itself:
pip install whisper-ctranslate2
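whisper-ctranslate2 runs on the CTranslate2 inference engine, so you can use CTranslate2's own API to check that your GPU is visible before transcribing anything. A small sketch (the wrapper function cuda_devices is mine; ctranslate2.get_cuda_device_count() is CTranslate2's API):

```python
def cuda_devices() -> int:
    """Number of CUDA devices CTranslate2 can see, or -1 if it isn't installed."""
    try:
        import ctranslate2
    except ImportError:
        return -1
    return ctranslate2.get_cuda_device_count()

# With a working CUDA setup this should print at least 1.
print(cuda_devices())
```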
Now you’re ready to transcribe audio! MP3, WAV, and other formats are supported. Here’s an example call:
whisper-ctranslate2 --model large-v2 --device cuda --output_format txt --task transcribe file.mp3
This will run on your Nvidia GPU, leveraging the power of thousands of parallel cores. The large-v2 model gives state-of-the-art accuracy. Output is plain text format.
The performance gains from GPU acceleration are substantial – often 10x or more speedup versus running just on the CPU. This enables quick turnaround times even for long audio.
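For larger jobs you can drive the same CLI from Python. This is only a sketch of one way to do it; the helper names (build_command, transcribe_folder) and the injectable runner parameter are mine, but the flags mirror the whisper-ctranslate2 call shown above:

```python
import subprocess
from pathlib import Path

def build_command(audio_path: str, model: str = "large-v2") -> list[str]:
    # Mirrors the CLI invocation above, one file at a time.
    return ["whisper-ctranslate2", "--model", model, "--device", "cuda",
            "--output_format", "txt", "--task", "transcribe", audio_path]

def transcribe_folder(folder: str, runner=subprocess.run) -> list[str]:
    # `runner` is injectable so the loop can be exercised without a GPU.
    outputs = []
    for audio in sorted(Path(folder).glob("*.mp3")):
        runner(build_command(str(audio)), check=True)
        outputs.append(str(audio.with_suffix(".txt")))
    return outputs
```

Calling transcribe_folder("recordings") would transcribe every .mp3 in that folder in turn, leaving a .txt file next to each one.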
Try it on your own audio files!
Removing the environment
If you no longer need it, you can deactivate the Conda environment and remove it with the following commands.
conda deactivate
conda remove --all -p .\myconda