I can read more quickly than I can listen. I can search a text file but not
an audio file. There are many cases when the written word is superior to the
spoken word, but sometimes the source data is spoken. Can I extract spoken
word data into something written?
Whisper.cpp is a C/C++ implementation of OpenAI's Whisper speech-recognition model: it takes spoken audio in and produces written text, and it runs well on a Mac laptop. Simon Willison frequently refers to his use of MacWhisper on his blog, so I thought I'd try my hand at a command-line approach to transcribing a YouTube video.
Whisper.cpp is easy to set up, but there's a learning curve. You need:
ggml-base.en.bin (or another model file) placed in the models/ subdirectory
ggml-metal.metal, the Metal shader source used for GPU acceleration (Metal Shading Language is based on C++)
Your audio in WAV format, sampled at 16 kHz
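A minimal setup sketch, assuming git, make, and a C++ toolchain are installed (the model-download script ships with the whisper.cpp repository and places the file in models/ for you):

```shell
# Clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Fetch the base English model; this writes models/ggml-base.en.bin
./models/download-ggml-model.sh base.en
```

The `base.en` model is a reasonable starting point for English-only audio; larger models trade speed for accuracy.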
Now that Whisper.cpp is ready to run, you need to get your input material.
In my case, I got a webm video which had an Opus audio stream. This
required some careful but simple conversion:
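The conversion can be done with ffmpeg; a sketch, with illustrative file names:

```shell
# Decode the Opus stream from the webm container and resample to
# 16 kHz mono 16-bit PCM, the input format whisper.cpp expects
ffmpeg -i input.webm -ar 16000 -ac 1 -c:a pcm_s16le prepared_input.wav
```

The `-ar 16000` flag sets the sample rate, `-ac 1` downmixes to mono, and `-c:a pcm_s16le` re-encodes the audio as plain 16-bit PCM.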
Now that you have your input and Whisper.cpp is set up, you can generate your transcript. You can run it as simply as whisper-cpp prepared_input.wav and copy the output from stdout, or you can pass flags to dictate how the transcript is output: txt, csv, srt, vtt, lrc, and json.
You can supply as many --output-<fmt> flags as you would like and it will
output all formats you request. Pretty neat!
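For example, a run requesting two formats at once (the model path assumes the models/ layout described above):

```shell
# Produce both a plain-text and a SubRip subtitle transcript,
# written alongside the input file
whisper-cpp -m models/ggml-base.en.bin \
  --output-txt --output-srt \
  prepared_input.wav
```

This writes prepared_input.wav.txt and prepared_input.wav.srt next to the input audio.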