Speech Recognition with Raspberry Pi for DIY Smart Home

Greetings everyone,

I’d like to share a project idea that I’m starting to work on. I’ve found some similar projects on the internet, but none here on the forum. So perhaps some of you can give me hints on what to do better, since I’m just starting out with Raspberry Pi & Linux. Or maybe you’ll feel inspired to do such a project yourself.

The general idea is to create some DIY smart home features. The first feature in my new flat will be automating my sliding door.
After some planning, this is what I’ve decided on:

First I tried a few things out with an Arduino instead of a Raspberry Pi, since it’s somewhat easier and simpler. But I quickly realized that if I want to go with offline speech commands, I need to upgrade to something beefier. And because the SOPARE platform needs some computational power, I went with the Banana Pi (quad-core).

I’ve done some tinkering with the electronics and motors, which work pretty well so far. So the next step, which is also new territory for me, is the speech recognition, especially on the Banana Pi.
The Banana Pi will arrive next week, so I can start working with it and trying things out.

If you have any suggestions for improvements, hardware- or software-wise, feel free to share them here. I’m still thinking about which microphone to go with.

I’ll also happily update the thread with photos and electronics details as the project proceeds.


I tried what I’m sure is a naive version of this when Mozilla released an update to their DeepSpeech (GitHub) platform.

I essentially hooked up a microphone to a Raspberry Pi Zero W that used ffmpeg to stream audio to an rtmp/nginx server on a beefier computer on my local network. DeepSpeech has VAD (voice activity detection) functionality; when it detected a voice, it transcribed the audio into text fairly accurately, and when it found the right combination of text commands, it sent out HTTP requests to things on my network, like my garage door and Rokus, that already have REST endpoints.
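The transcript-to-HTTP dispatch step was roughly this (a minimal sketch in Python; the keyword table and endpoint URLs are made-up placeholders, not my real devices):

```python
import urllib.request

# Placeholder command table: keywords and endpoint URLs are examples only;
# swap in whatever REST endpoints your own devices expose.
COMMANDS = {
    frozenset(["open", "garage"]): "http://192.168.1.20/garage/open",
    frozenset(["close", "garage"]): "http://192.168.1.20/garage/close",
}

def match_command(transcript):
    """Return the endpoint URL for the first command whose keywords all
    appear in the transcribed text, or None if nothing matches."""
    words = set(transcript.lower().split())
    for keywords, url in COMMANDS.items():
        if keywords <= words:  # all keywords present in the transcript
            return url
    return None

def dispatch(transcript):
    """Fire the matched command as a plain GET request (adjust the method
    and body to whatever your endpoint actually expects)."""
    url = match_command(transcript)
    if url is not None:
        urllib.request.urlopen(url, timeout=2)
    return url
```

Matching on keyword sets rather than exact phrases was deliberate: transcriptions often insert or drop filler words, so "please open the garage door" still triggers the right endpoint.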

But I stopped short of wiring my house with a bunch of electrical relays and Pis, because I quickly realized that while the VAD/transcription worked well in my office when there was not much background noise, it was unusable as soon as there was competing noise. TV, radio, other conversations, the dishwasher, and even the garage door opening made the transcriptions so garbled that they were useless.

I tried again in January when a major update to DeepSpeech was released, but the same problem exists. I imagine the approach to solve it involves using multiple microphones and filtering, but that is beyond the level of effort I was willing to put in. Based on the issues page of their GitHub repository, this is the major hurdle for custom/small-time implementations of this kind of thing.

But the idea of an “Alexa” that isn’t broadcasting everything to Amazon is really attractive. Good luck, I hope you get it figured out.


This seems very cool. I am planning to do simple room automation with an ESP8266 and a Pi 3, but I don’t have much time lately.


I guess to solve the problem of noise and filtering, you would need to go deeper and program it yourself. I thought about it, but the scope of the project would get out of hand if I started that big.

But from my naive point of view at the moment, I would assume you need, on the one hand, a trained set of your commands recorded without any noise, so you know what a clean version of each command looks like in the frequency spectrum. Later on, you can run a filter that looks for the trained commands in the frequency domain.
Important building blocks for that would be the fast Fourier transform (FFT), cross-correlation (for comparison), and of course the programming knowledge for it.
Especially the last one I don’t have at the moment for Linux. While I’m very interested in Python, and I’m sure it is very capable of such things, I don’t have the time right now to tackle it at that scope.
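To make the FFT/cross-correlation idea concrete, here is a minimal sketch (assuming NumPy; the threshold value is an arbitrary choice, and real audio would need normalization and noise handling on top of this) of locating a clean command template inside a longer recording:

```python
import numpy as np

def xcorr_fft(signal, template):
    """Cross-correlate signal and template via the FFT
    (correlation theorem: corr = IFFT(FFT(signal) * conj(FFT(template))))."""
    n = len(signal) + len(template) - 1  # zero-pad to avoid wraparound overlap
    s = np.fft.rfft(signal, n)
    t = np.fft.rfft(template, n)
    return np.fft.irfft(s * np.conj(t), n)

def find_command(signal, template, threshold=0.7):
    """Return the sample offset where the template best matches the signal,
    or None if the normalized peak score is below the threshold."""
    corr = xcorr_fft(signal, template)
    peak = int(np.argmax(corr))
    # Normalize by the template's own energy so a perfect match scores ~1.0.
    score = corr[peak] / np.dot(template, template)
    return peak if score >= threshold else None
```

Doing the correlation in the frequency domain is the standard trick: for long signals it is much cheaper than sliding the template sample by sample in the time domain.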

I’ll keep it in mind for when the project grows and I need to look deeper into it.

This is awesome! Really, your project goes in the same direction as a project I was thinking of (motorizing my blinds (Jalousie), using them to heat the room with sunlight when it’s cold, as a natural alarm clock, and so on).
In one thread here I found https://docs.snips.ai/. It seems to be more powerful. That’s at least what I considered using, if I finally pull myself together.


The speech bot being able to control an openHAB server would be pretty cool.

Good luck, OP, and I look forward to more posts.

Might want to look into Snips.ai and Home Assistant (Hass.io will install on a Pi or Pi-like hardware, if that’s your target platform).

They also make multi-node directional mics for DIY home assistants. I forget the name of them, but it’s an easy Google find.

Also, on the audio output: not sure about the Banana Pi you mentioned, but the RPi uses internal PWM to drive the 3.5mm output, so you’re limited by the specs of the PWM (which is bad, like 10-bit depth and a low sample rate). Might want to look into that and consider a USB audio card if you want to play music or something over it.

Last, I haven’t used the more normal Banana Pi, but I did use a Banana Pi R1 a while back. Be aware that support for them is limited to foreign forums, and you often rely on other hobbyists to maintain builds. Down the road, support might stop. It’s fun to play with, but if you haven’t done a project on an SBC before, the 15 to 20 bucks extra for an RPi will be more than worth your time. If you do choose to go with the Banana Pi, get yourself a USB UART serial adapter and use screen (Linux) or PuTTY (Windows) to interface with it. It’s a lot more consistent for having I/O during the U-Boot, pre-Linux stages of booting.

Regardless of what you go with, good luck! I’m planning on doing the Snips RPi thing for my Home Assistant setup soon. It seems like a cool project.

This is relevant to my interests.
I did a thing about two years ago with https://cmusphinx.github.io/, which is a FOSS speech recognizer that isn’t very resource-intensive. It wasn’t 100% terrible, but it wasn’t great. More modern deep-learning-based recognizers are probably approximately infinity times better.