[Devember 2021] [complete] Discord Music Bot + ML models

Note: This post is stream-of-consciousness writing, so excuse the poor organization. I’ll add more details and rewrite it when I have time. I was hesitant to post this project at all because I’m lazy/busy and don’t want to spend the time documenting my work. BUT someone might find this interesting, and I don’t know of any similar project, so maybe someone can use my stuff and make something better.

Project Background:

To my knowledge, there doesn’t seem to be a fully open-source Discord music bot that supports voice commands.

Specifically, one that fits the following requirements:

  • Open source
  • No subscription fee
  • Self-hostable
  • Does not call external APIs to perform voice transcription (google voice, etc)

This is a continuation of an ongoing project I’ve worked on on-and-off; you can check out the source code here:

It’s written in JavaScript (Node.js) and calls Mozilla’s DeepSpeech voice transcription model to run inference and decode voice commands.

In no particular order, my goals for this Devember are the following:

  1. Move deployment from a free Heroku instance to a Linode instance and compare performance.

    Right now, it auto-deploys to a free Heroku instance (see the GitHub Actions). While this solution is completely free, it unfortunately hits Heroku’s small resource limit really quickly because of how inefficiently I’ve written the inference routine.

  2. Make the voice-to-text inference routine more efficient.

    I think there’s a point of diminishing returns when it comes to re-training this specific model versus the performance gain. My field is not voice recognition, so there could be some knowledge gap on my part concerning implementing the DeepSpeech library.

    I found this guy’s videos

    I haven’t gone through his stuff, but if there’s a performance gain to be had in retraining, I think I’ll start here.

Here’s a better explanation of how the DeepSpeech architecture works: Speech Recognition — Deep Speech, CTC, Listen, Attend, and Spell | by Jonathan Hui | Medium

But specifically concerning inference: the DeepSpeech Node.js library takes in a waveform and sends it to the model, which slides a window over the waveform and, for each step, returns a probability distribution over the alphabet (how likely this slice of the waveform is “a”, how likely it is “b”, etc.). At some point it also runs a beam search across all the probability vectors it calculated and chooses the most likely transcription.
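
At the API level the call itself is simple. Here’s a minimal sketch using the deepspeech npm package; the model/scorer filenames are the standard release names and the input file is a placeholder, not necessarily what my repo uses:

```js
const DeepSpeech = require('deepspeech');
const fs = require('fs');

// Load the acoustic model plus the external scorer (language model).
const model = new DeepSpeech.Model('deepspeech-0.9.3-models.pbmm');
model.enableExternalScorer('deepspeech-0.9.3-models.scorer');

// stt() expects raw 16-bit, 16 kHz, mono PCM audio in a Buffer.
const audioBuffer = fs.readFileSync('voice-command.raw');
console.log(model.stt(audioBuffer)); // e.g. "skip song"
```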

That’s just background context; we don’t actually need to understand it deeply to implement it. At the end of the day, the output of the system is a string of letters, but the DeepSpeech GitHub notes there’s a statistical bias toward North American English accents due to the training dataset. To correct for this, I use a little work-around: from the transcribed string I calculate the minimum edit distance (Levenshtein distance) to a dictionary of trigger words, and if the edit distance is less than about 3, I call it close enough and run that command. Here’s an example in the code: musicbot/main.js at 1c327621112552e9f7d3002d44b98dc92376579a · khlam/musicbot · GitHub

On the above line, if the edit distance from the transcription to “skipsong” is less than or equal to 3, it returns true and calls the skip-song function.
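
To make the idea concrete, here’s a minimal from-scratch sketch of that trigger-word matching; the function names are my own, not the repo’s:

```js
// Minimal Levenshtein (edit) distance via dynamic programming.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => {
    const row = new Array(b.length + 1).fill(0);
    row[0] = i;
    return row;
  });
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Treat a transcription as a trigger word if it's within 3 edits.
const matchesTrigger = (heard, trigger, maxDist = 3) =>
  levenshtein(heard, trigger) <= maxDist;

matchesTrigger('skip son', 'skipsong'); // true (distance 2)
```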

I think the whole sub-call starting here, github.com/khlam/musicbot/blob/1c327621112552e9f7d3002d44b98dc92376579a/main.js#L36, can be made more efficient. Right now, if the bot is in a voice channel, it constantly listens to everyone talking, which quickly consumes the free Heroku server’s resources.

  3. Document everything. I haven’t put much serious effort into explaining how the code works, how to deploy the system, or how to set up a Google spreadsheet for custom voice commands.

Thank you for reading this messy stream-of-consciousness post; I’ll keep updating this thread with my progress.


OK, coming back to this project after making progress on my other obligations…

I’ve been informed Discord API v9 introduces some breaking changes, and the Node.js library I use, discord.js, will also change significantly, meaning it’s a better use of my time to just archive everything I’ve posted so far and start 100% over. :frowning:

New Repo: khlam/parrot · GitHub

I’ve been neglecting documentation, so here’s a large update.

Note that all my work is in the dev branch, not in main: GitHub - khlam/parrot at dev

  1. I’m changing the scope of the project slightly. With the new discord.js v13 there are some breaking API changes, so as I re-write my bot, I’ve decided not to carry the voice-command code over to the new codebase for the scope of Devember. I may add it back in the future.

  2. I’ve completed two components of my new and improved Discord bot.
    First, taking advantage of Discord API v9’s new permission changes, I’ve simplified the permission scope so the bot takes input in the form of the new slash commands. Previously, my music bot (and, I believe, most Discord bots) read every single chat message, parsing for a prefix keyword that marked a message as a command. Now the permissions you need to give the bot are much simpler, so it doesn’t need to be nearly as invasive in order to receive simple commands (see the sketch after this list).
    In order to self-host and add the bot to your own Discord server, you need to create a developer application via the Discord developer portal.
    I’ve documented the permissions parrot needs here: parrot/SETUP.md at dev · khlam/parrot · GitHub
    Second, I’ve copied and adapted code from this project to complete the music bot component:
    @koenie06/discord.js-music
    This work is really good, and I probably couldn’t do much better if I wrote a music bot from scratch. One very important feature is the audio queue routine, which I modified to accommodate the FastSpeech2 inference feature…

  3. Instead of voice commands, I’ve decided to add the inference component of FastSpeech2 to the bot. Paper: [2006.04558] FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
    Github: GitHub - ming024/FastSpeech2: An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
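
Before moving on, here’s a minimal sketch of the slash-command setup mentioned in point 2, using discord.js v13 and its companion packages. The /play command, intents, and environment-variable names are placeholders of my own, not necessarily what parrot uses:

```js
const { Client, Intents } = require('discord.js');
const { REST } = require('@discordjs/rest');
const { Routes } = require('discord-api-types/v9');
const { SlashCommandBuilder } = require('@discordjs/builders');

// Define the commands the bot exposes; no message-content access required.
const commands = [
  new SlashCommandBuilder()
    .setName('play')
    .setDescription('Queue a song from a YouTube URL')
    .addStringOption((opt) =>
      opt.setName('url').setDescription('YouTube link').setRequired(true))
].map((cmd) => cmd.toJSON());

// Register the commands against a single guild (fast to propagate while testing).
const rest = new REST({ version: '9' }).setToken(process.env.BOT_TOKEN);
rest
  .put(Routes.applicationGuildCommands(process.env.CLIENT_ID, process.env.GUILD_ID), {
    body: commands
  })
  .then(() => console.log('Slash commands registered'));

// The bot only receives explicit interactions, never arbitrary chat messages.
const client = new Client({ intents: [Intents.FLAGS.GUILDS, Intents.FLAGS.GUILD_VOICE_STATES] });
client.on('interactionCreate', async (interaction) => {
  if (!interaction.isCommand() || interaction.commandName !== 'play') return;
  await interaction.reply(`Queued ${interaction.options.getString('url')}`);
});
client.login(process.env.BOT_TOKEN);
```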

Essentially, the bot’s program call flow is as follows (see the sketch after the list):

  • A user types in a string to be synthesized
  • A sub-process calls the Python script to run inference
  • A wav file of the spoken string is written to /tmp/
  • The wav file is uploaded to Discord’s servers and its URL is retrieved
  • The local wav file is deleted
  • The TTS URL is added to the music queue
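
Here’s a rough sketch of that flow in code; the synthesize.py entry point, its flags, and the array-backed queue are placeholders, not the exact code in the repo:

```js
const { execFile } = require('child_process');
const { promisify } = require('util');
const fs = require('fs/promises');
const { MessageAttachment } = require('discord.js');

const run = promisify(execFile);

async function queueTTS(text, channel, queue) {
  const wavPath = `/tmp/tts-${Date.now()}.wav`;

  // 1. Sub-process runs FastSpeech2 inference and writes the wav to /tmp/.
  await run('python3', ['synthesize.py', '--text', text, '--out', wavPath]);

  // 2. Upload the wav to Discord and grab the CDN URL it was assigned.
  const msg = await channel.send({ files: [new MessageAttachment(wavPath)] });
  const url = msg.attachments.first().url;

  // 3. Delete the local file, then enqueue the hosted URL for playback.
  await fs.unlink(wavPath);
  queue.push(url);
  return url;
}
```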

This all works and my code is in the GitHub repo; now I need to figure out whether this code is lightweight enough to run on a free Heroku server, and if it isn’t, whether a small Linode instance will work instead.

I might incorporate these models, https://voice-sharing-hub.herokuapp.com/, so the bot has multiple TTS voices.

This is turning into a long-term, never-finished project, but my dream is to build a markdown parser into this bot so users can write a story in a Discord channel and have the bot read their words back to them.
In this vein, it would be really cool if you could have emojis as placeholders for sound effects. For example, if someone wrote the TTS command:

  • “:door::door::door: hello is anybody there”

The bot would compile a .wav file with the sound effect of three door knocks followed by the TTS.
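
Purely hypothetical since I haven’t built this, but a first pass could be a tokenizer that splits a message into sound-effect and TTS segments; the emoji-to-file mapping is made up for illustration:

```js
// Hypothetical emoji-to-sound-effect mapping.
const SOUND_EFFECTS = { door: 'sfx/door_knock.wav' };

// Split a TTS command into sound-effect tokens and plain-text segments.
function tokenize(message) {
  return message
    .split(/(:[a-z_]+:)/g) // keep :shortcode: delimiters in the output
    .filter(Boolean)
    .map((part) => {
      const m = part.match(/^:([a-z_]+):$/);
      return m && SOUND_EFFECTS[m[1]]
        ? { type: 'sfx', file: SOUND_EFFECTS[m[1]] }
        : { type: 'tts', text: part.trim() };
    });
}

tokenize(':door::door::door: hello is anybody there');
// → three { type: 'sfx' } door-knock tokens, then one { type: 'tts' } token
```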

Here’s a video demonstration of the bot synthesizing a short sentence.

By the way, if you only use Discord for push-to-talk voice communication and don’t often use other features like screen sharing, consider my other project, Discord Sandbox: GitHub - khlam/discord-sandboxed: An open source, privacy-focused Discord client.

To round out this project, the final feature I’m adding to the music bot is a music playlist generator that, given a list of songs, will create a playlist of 20 new songs that are similar to the provided tracks.

This work is based on the following blog post,

Based on the above work,

  1. In total, the goal is unsupervised nearest-neighbor search: we create a ‘map’ of all existing music using attributes from Spotify’s database, where each point in the ‘map’ is a song whose position is given by its scalar features. We’ll use a user-provided list of seed songs to calculate a starting position in this map, and we’ll find songs similar to the seeds by looking at the closest points to that starting position.

  2. First, we assume Spotify’s assignment of any given song’s scalar values (valence, acousticness, danceability, energy, etc.) is correct or at least consistently assigned (i.e., there’s a heuristic Spotify is using and the numbers aren’t random)

  3. Given 2), I have no clue what the manifold (map) might look like, so I’ll go with what the author did and assume the space is Euclidean. For the uninitiated, these steps are important because they define which statistical model is appropriate. There’s also no way to know if you’ve guessed correctly, especially with something as diverse as music. The author also states that his visualization of K-means clustering plus a t-SNE projection shows similar songs clustered together, so we cluster using K-means.

  4. Since we’re going with the assumption that the space is Euclidean, which is partially informed by the groups of music features visualized in the author’s t-SNE projection (I am not convinced by the salience of some such visualizations), it follows that we take the features of all the seed songs and calculate their average vector to use as our initial starting point.

  5. Using our initial average vector, we’ll find the 20 closest songs (see the sketch after this list).

  6. Given this list of 20 songs and their artists, we’ll search YouTube for their corresponding music videos, then our music bot will add each music video to the playlist and play the songs back to the user.
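
Here’s a minimal sketch of steps 4) and 5), assuming each song’s Spotify audio features have already been packed into a plain numeric array; all names here are hypothetical:

```js
// Average the seed songs' feature vectors into one centroid.
function average(vectors) {
  return vectors[0].map(
    (_, i) => vectors.reduce((sum, v) => sum + v[i], 0) / vectors.length
  );
}

// Cosine similarity between two feature vectors.
function cosine(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Rank the whole database against the centroid and keep the top 20.
function recommend(seedVectors, database, k = 20) {
  const centroid = average(seedVectors);
  return database
    .map((song) => ({ song, score: cosine(centroid, song.features) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.song);
}
```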

I initially made the user type the entire song name and artist into the slash-command input interface to seed the song list, but then I realized the ytdl library I was using to fetch the videos also has functions that parse a given YouTube video’s details into JSON (ytdl-core - npm).

After further investigation, it seems record-label-sanctioned music videos conveniently store the track’s details in the video details, so the slash-command interface is now significantly simplified: all a user needs to do to seed the playlist is find a YouTube video.
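
For illustration, the metadata extraction with ytdl-core can look roughly like this; the media fields only show up on some (mostly label-uploaded) videos, and the fallbacks are my own guesses at sensible defaults:

```js
const ytdl = require('ytdl-core');

// Pull track metadata out of a YouTube video's details.
async function seedFromVideo(url) {
  const info = await ytdl.getBasicInfo(url);
  const { media, title, author } = info.videoDetails;
  return {
    song: media?.song ?? title,           // fall back to the video title
    artist: media?.artist ?? author.name  // fall back to the channel name
  };
}
```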

Here’s a demonstration video of it working! parrot 2021 27 12 - YouTube

This concludes my level1techs Devember project. While I’ll continue to improve the system, I consider everything merged to the main branch feature-complete.

In summary, my project’s objective was to refactor my original Discord music bot (GitHub - khlam/musicbot: Minimal Discord music bot with voice commands using Mozilla Deepspeech.) to fix the following flaws:

  1. The libraries used (discord.js v9) and various API calls will soon break due to upcoming changes on Discord’s end.
  2. DeepSpeech voice recognition and voice commands, while fun, are CPU intensive; on low-resource instances like free Heroku servers, the bot can’t process voice commands and play music at the same time, so I needed to find new features to replace them.
  3. Do a better job documenting setup so folks can self-host their own bot.
  4. Make the bot use minimal permissions. (From the perspective of responsible programming, not user privacy; I do not trust that Discord recognizes its users’ digital sovereignty.)

Over the past month, I did the following:

  1. Re-wrote the entire project to properly use the discord.js v13 library

  2. The bot uses minimal permissions and does not read all messages sent in the text channel; complex commands can be issued via slash commands.

  3. Added three English text-to-speech voice-synthesis models

  4. The Discord bot will read text back to you; users can choose from three voices: LJ Speech, Sir David Attenborough, or Michael Rosen.

  5. Demonstration Video: parrot_3_voices - YouTube

  6. Added a music-similarity playlist generator

  7. Given a “seed list” of songs, the bot will:
    a. Look up each seed song’s features as given by Spotify’s API (features include ‘valence’, ‘acousticness’, ‘danceability’, ‘liveness’, etc.)
    b. Calculate a vector representing the average of all the seed songs’ features
    c. Return the 20 songs in its database most similar to the seed-song average vector, à la unsupervised clustering and cosine distance
    d. Find the 20 songs on YouTube and add them to the play queue; the bot then joins the voice channel and plays the songs to the requestor.

  8. Demonstration Video: parrot 2021 27 12 - YouTube

It was a big challenge for me to resolve all the various dependency conflicts, since the architecture of my project involves the Node.js main process calling Python scripts. I found a solution; it isn’t great, and the final Docker image is huge. While I could potentially have rewritten the entire music-similarity playlist generator in JS, time-wise it was faster to just call the Python script, and I’m not sure it’s even feasible to call the voice-synthesis models in pure JS.

There’s a lot of work to be done on the music-similarity system; it seems the Spotify API contains a lot of curated attributes, so more investigation is needed. I think this can be improved, but I’m not sure of a specific approach yet.

Code from the following projects (and others) was used,

Thank you for reading about this project, and be well!