Note: This post is stream-of-consciousness writing, so excuse the poor organization. I’ll add more details and rewrite when I have time. I was hesitant to post this project at all because I’m lazy/busy and I don’t want to spend the time documenting my work. BUT someone might find this interesting, and I don’t know of any similar project, so maybe someone can use my stuff and make something better.
Project Background:
To my knowledge, there isn’t a fully open-source Discord music bot that supports voice commands.
Specifically, one that fits the following requirements:
- Open source
- No subscription fee
- Self-hostable
- Does not call external APIs to perform voice transcription (google voice, etc)
This is a continuation of an ongoing project I’ve worked on on-and-off; you can check out the source code here:
It’s written in JavaScript (Node.js) and calls Mozilla’s DeepSpeech voice transcription model to run inference and decode voice commands.
In no particular priority or order, my goals for this Devember are the following:
- Move deployment from a free Heroku instance to a Linode instance and compare performance.
Right now, it auto-deploys to a free Heroku instance (see the GitHub Actions). While this solution is completely free, it unfortunately hits Heroku’s small resource limit really quickly because of how inefficiently I’ve written the inference routine.
- Make the voice-to-text inference routine more efficient.
I think there’s a point of diminishing returns between re-training this specific model and the performance gained. My field is not voice recognition, so there could be a knowledge gap on my part concerning how to implement the DeepSpeech library.
I found this guy’s videos:
- I Built a Personal Speech Recognition System for my AI Assistant - YouTube
- youtube.com/watch?v=ob0p7G2QoHA
I haven’t gone through his stuff yet, but if there’s a performance gain to be had in retraining, I think I’ll start here.
Here’s a good explanation of how the DeepSpeech architecture works: Speech Recognition — Deep Speech, CTC, Listen, Attend, and Spell | by Jonathan Hui | Medium
But specifically concerning inference: the DeepSpeech Node.js library takes in a waveform and sends it to the model, which returns a probability distribution over the alphabet for each sliding-window sample of the waveform (this slice of the waveform is “a” with some probability, “b” with some probability, etc.). At some point it also runs a beam search across all the probability vectors it calculates and chooses the most likely transcription.
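To make the decoding step above concrete, here’s a toy sketch (not the actual DeepSpeech implementation, which uses a beam search plus a language-model scorer): given per-frame probability vectors over the alphabet plus a CTC “blank” symbol, a greedy decode picks the top symbol at each frame, collapses repeats, and drops blanks. The alphabet and frame data here are made up for illustration.

```javascript
// Toy CTC greedy decode. '-' is the CTC blank symbol; the alphabet is
// illustrative, not DeepSpeech's real output layer.
const ALPHABET = ['-', 'a', 'b', 'c'];

function greedyCtcDecode(frames) {
  let prev = null;
  let out = '';
  for (const probs of frames) {
    // Pick the most likely symbol for this window of the waveform.
    const best = probs.indexOf(Math.max(...probs));
    // Collapse repeated symbols and skip blanks.
    if (best !== prev && ALPHABET[best] !== '-') out += ALPHABET[best];
    prev = best;
  }
  return out;
}

// Per-frame probabilities over ['-', 'a', 'b', 'c']:
const frames = [
  [0.1, 0.7, 0.1, 0.1],   // 'a'
  [0.1, 0.7, 0.1, 0.1],   // 'a' again (repeat, collapsed)
  [0.8, 0.1, 0.05, 0.05], // blank
  [0.1, 0.1, 0.1, 0.7],   // 'c'
];
console.log(greedyCtcDecode(frames)); // "ac"
```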
That aspect is just background context; we don’t actually need to understand it deeply to implement it. At the end of the day, the output of the system is a string of letters, but the DeepSpeech GitHub notes there’s a statistical bias toward North American English accents due to the dataset. To correct for this, I use a little work-around: from the transcribed string I calculate the minimum edit distance (Levenshtein distance) to a dictionary of trigger words, and if the edit distance is less than about 3, I call it close enough and run that command. Here’s an example in the code: musicbot/main.js at 1c327621112552e9f7d3002d44b98dc92376579a · khlam/musicbot · GitHub
On the above line, if the edit distance from the transcription to “skipsong” is less than or equal to 3, it returns true and the skip-song function is called.
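The matching idea above can be sketched like this (function names and the threshold of 3 are illustrative; the real check lives in main.js in the repo). It’s the standard dynamic-programming Levenshtein distance plus a threshold test:

```javascript
// Classic Levenshtein (minimum edit distance) via dynamic programming.
function levenshtein(a, b) {
  // dp[i][j] = edit distance between a[0..i) and b[0..j)
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,       // deletion
        dp[i][j - 1] + 1,       // insertion
        dp[i - 1][j - 1] + cost // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Treat a transcription as a match if it's within 3 edits of the trigger word.
function isCommand(transcript, trigger, maxDist = 3) {
  return levenshtein(transcript, trigger) <= maxDist;
}

console.log(levenshtein('skip son', 'skipsong')); // 2
console.log(isCommand('skip son', 'skipsong'));   // true
```

So a slightly mangled transcription like “skip son” still triggers the skip command, while an unrelated phrase stays well outside the threshold.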
I think the whole sub-call starting here: github.com/khlam/musicbot/blob/1c327621112552e9f7d3002d44b98dc92376579a/main.js#L36
can be made more efficient. Right now, if the bot is in a voice channel, it constantly listens to everyone talking, which quickly consumes the free Heroku server’s resources.
- Document everything. I haven’t made a serious effort to explain how the code works, how to deploy the system, or how to set up a Google spreadsheet for custom voice commands.
Thank you for reading this messy stream-of-consciousness post; I’ll keep updating it with my progress.