Accelerating Speech Recognition with Whisper and GPUs on Windows 11

Speech recognition technology has improved dramatically in recent years thanks to advances in deep learning and computational power. In this post, I’ll explain how to leverage these advances by using the Whisper speech recognition library together with a GPU for fast, accurate transcriptions.

First, some background. Traditional speech recognition relied on acoustic and language models that were carefully hand-engineered. But deep learning methods like convolutional and recurrent neural networks now enable models to learn directly from massive datasets. This “end-to-end” approach produces far more accurate speech recognition if you have enough training data and compute power.

The Whisper library from OpenAI implements state-of-the-art speech recognition models based on this deep learning approach. To use GPU acceleration with Whisper, you’ll need a few dependencies:

  • Nvidia CUDA Toolkit – Enables GPU computing
  • cuDNN – Nvidia’s library for deep neural networks
  • zlib and ffmpeg – For audio compression and processing

You can install these using Conda, a popular Python package and environment manager. We will use it to keep these dependencies in an isolated environment, separate from the rest of your system.

Installing Conda on Windows

First install Conda by downloading the Windows installer from: https://docs.conda.io/en/latest/miniconda.html

Run the .exe file and follow the prompts to install Conda for your user account.

Create a Conda environment

Once it’s installed, open the Anaconda Prompt terminal that was created and run the following command:

conda create -p .\myconda python

This creates a sandboxed environment in the .\myconda folder, isolated from the rest of your Python installations.

Install application and dependencies

To activate your Conda environment, open the Anaconda Prompt and run:

conda activate .\myconda

Once your environment is active, your Anaconda Prompt will look like this:

(C:\Users\Username\myconda) C:\Users\Username\myconda>

Then you can start installing the dependencies.

conda install -c conda-forge zlib-wapi ffmpeg zlib -y
conda install -c "nvidia/label/cuda-11.6.1" cuda-toolkit cudnn -y

This installs zlib, ffmpeg, the CUDA 11.6 toolkit, and cuDNN into the Conda environment.

Then install the Python package for Nvidia’s cuBLAS GPU math library:

pip install nvidia-cublas-cu11

And finally install whisper-ctranslate2, an accelerated Whisper implementation built on CTranslate2:

pip install whisper-ctranslate2

Start transcribing

Now you’re ready to transcribe audio! MP3, WAV, and other formats are supported. Here’s an example call:

whisper-ctranslate2 --model large-v2 --device cuda --output_format txt --task transcribe file.mp3 

This will run on your Nvidia GPU, leveraging the power of thousands of parallel cores. The large-v2 model gives state-of-the-art accuracy. Output is plain text format.
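For more than a handful of files, you may want to script the calls. Below is a minimal Python sketch that builds the same whisper-ctranslate2 command line for every MP3 in a folder; the `build_command` helper and the `audio` folder name are illustrative assumptions, not part of the whisper-ctranslate2 package.

```python
import subprocess
from pathlib import Path

def build_command(audio_path):
    """Build the whisper-ctranslate2 invocation shown above for one file."""
    return [
        "whisper-ctranslate2",
        "--model", "large-v2",
        "--device", "cuda",
        "--output_format", "txt",
        "--task", "transcribe",
        str(audio_path),
    ]

if __name__ == "__main__":
    for audio_file in sorted(Path("audio").glob("*.mp3")):
        # Transcribe one file per call; check=True stops the batch on a failure.
        subprocess.run(build_command(audio_file), check=True)
```

Run this from the activated Conda environment so the whisper-ctranslate2 executable is on PATH.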

The performance gains from GPU acceleration are substantial – often 10x or more speedup versus running just on the CPU. This enables quick turnaround times even for long audio.
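To put that speedup in concrete terms, here is a back-of-the-envelope calculation. The real-time factors below are illustrative assumptions, not measured benchmarks:

```python
def transcription_minutes(audio_minutes, real_time_factor):
    """Estimate wall-clock minutes to transcribe, given how many minutes
    of audio the system processes per minute of compute."""
    return audio_minutes / real_time_factor

audio_minutes = 60          # a one-hour recording (illustrative)
cpu_rtf, gpu_rtf = 1.5, 15  # assumed real-time factors; GPU ~10x faster

print(f"CPU: {transcription_minutes(audio_minutes, cpu_rtf):.0f} min")
print(f"GPU: {transcription_minutes(audio_minutes, gpu_rtf):.0f} min")
```

Under these assumed numbers, a one-hour recording drops from about 40 minutes of CPU time to about 4 minutes on the GPU.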

Try it on your own audio files!

Removing the environment

If you no longer need this setup, deactivate the Conda environment and remove it with the following commands.

conda deactivate
conda remove --all -p .\myconda
