More than a year ago I wrote about voice-controlled stuff, Enterprise NCC-1701-D style. As you all know, with the rise of cloud APIs this can be accomplished with some work. The downside is that everything you say is processed in the cloud, which means that you may lose your privacy. As I like the Raspberry Pi, my goal was to have something running locally on a Pi. To be more specific: on a Raspberry Pi 2. After trying Jasper and some other projects, my personal conclusion was that the small device is not powerful enough for the heavy lifting. That said, I still want instant results. And I want to talk from anywhere in the room; I don’t care where the microphone is located. So I started the project SoPaRe to figure out what is possible. My goals were (and still are):
- Real time audio processing
- Must run on small, credit-card-sized, ARM-powered computers like the Raspberry Pi, Banana Pi and the like
- Pattern/voice recognition for only a few words
- Must work offline without inherent dependencies on cloud APIs
- Able to talk freely from anywhere in a room
I must admit that I did not expect much trouble, as it’s only data processing. Well, I changed my view and learned a lot. My current result is a first usable system that is able to learn sounds (in my case words) and recognize them even when I don’t talk directly into the microphone but from 2 meters away and from different angles. But let’s start with some basics. The following image shows the printed result of me saying three words: „computer light off“.
In memory we are talking about 80000 values that are generated in roughly 3 seconds. As one of my primary goals was real time processing, this number is huge. As the Raspberry Pi 2 has 4 cores, one of my first decisions was to leverage real threads and process the data on different cores to get good throughput. Another broad idea was to crunch the data and work with just a small characteristic of the sound. This diagram shows the current project architecture:
First of all we have to „tokenize“ a sound into small parts that can be compared. Like single words from a sentence:
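The tokenizing idea can be sketched in a few lines of Python. This is only a minimal illustration and not the actual SoPaRe code: it assumes that a stretch of quiet values separates two parts, and the function name `tokenize` as well as the parameters `threshold`, `min_silence` and `min_length` are made up for this example.

```python
def tokenize(samples, threshold=500, min_silence=4, min_length=2):
    """Split a sequence of amplitude values into loud segments.

    A segment ends once `min_silence` consecutive values stay below
    `threshold`. All parameter values are illustrative guesses.
    """
    segments = []
    current = []    # values of the segment that is being collected
    quiet_run = 0   # number of quiet values at the end of `current`
    for value in samples:
        if abs(value) >= threshold:
            current.append(value)
            quiet_run = 0
        elif current:
            # A quiet value inside a segment: keep it for now, it may
            # just be a short dip in the middle of a word.
            current.append(value)
            quiet_run += 1
            if quiet_run >= min_silence:
                segment = current[:-quiet_run]  # drop the trailing silence
                if len(segment) >= min_length:
                    segments.append(segment)
                current = []
                quiet_run = 0
    if current:
        # The input ended while a segment was still open.
        segment = current[:len(current) - quiet_run]
        if len(segment) >= min_length:
            segments.append(segment)
    return segments
```

Short dips below the threshold stay inside a segment, which is why a single quiet value in the middle of a word does not split it. Real audio needs far larger values for `min_silence` than this toy default.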
In the current version even a single word is split into smaller parts like „com-pu-ter“, and a characteristic is generated for each of these parts. These characteristics can be stored for later comparison. I tried quite a few things, but I get decent results with a combination of a condensed fast Fourier transform and rough meta information like length and peaks.
The current version is not only able to match learned words in a sentence but also does this in a real environment. This means standing in a room with the microphone located somewhere in a corner, or speaking from quite a distance. On the other hand, I still get false positives as the approach is rough. But I’m quite happy with the current state, which is why I’m talking about it now. The project (SoPaRe), including the source code, is located on GitHub. Happy to receive your feedback or comments and of course, if you are using SoPaRe, please tell me about it!
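To make the idea of a characteristic more concrete, here is a rough sketch under my own assumptions, not the SoPaRe implementation. It uses a naive DFT from the standard library in place of a real FFT (fine for tiny inputs only), averages the spectrum into a handful of bands as the "condensed" part, and adds length and peak count as meta information. All function names, the number of bands and the thresholds are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Naive DFT magnitudes for the lower half of the spectrum (O(n^2))."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        spectrum.append(abs(total))
    return spectrum

def condense(values, bands):
    """Average neighbouring values so that at most `bands` numbers remain."""
    size = max(1, len(values) // bands)
    chunks = [values[i:i + size] for i in range(0, size * bands, size)]
    return [sum(chunk) / len(chunk) for chunk in chunks if chunk]

def count_peaks(samples, threshold):
    """Count local maxima that rise above `threshold`."""
    peaks = 0
    for i in range(1, len(samples) - 1):
        if samples[i] > threshold and samples[i - 1] < samples[i] >= samples[i + 1]:
            peaks += 1
    return peaks

def characteristic(samples, bands=4, peak_threshold=0.5):
    """Small fingerprint: length, number of peaks, condensed spectrum."""
    return {
        "length": len(samples),
        "peaks": count_peaks(samples, peak_threshold),
        "bands": condense(magnitude_spectrum(samples), bands),
    }

def distance(a, b):
    """Euclidean distance between the band values of two characteristics."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a["bands"], b["bands"])))
```

A stored characteristic can then be compared with a fresh one via `distance`: a small value suggests a similar sound. How small is small enough has to be found by experiment, and a rough comparison like this is exactly where false positives come from.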
My next step is to kick off the beta testing and enhance here and there. Will write again when we have more results after the test phase 🙂
I can’t really say that I’ve finished the project, but it’s pretty far along. The new UI is responsive, developed entirely with a mobile-first approach, and looks nice on a smartphone, a tablet and in a normal desktop browser.
In addition I spent a lot of time making everything flexible. This includes not only i18n (most of the UI is available in German, English and Spanish); 90% of the information and options are also based on config files.
The system is actually a visualization/control system for a home automation bus system. Spent quite some time on points like security, ease of use and simplicity. And a bit on modularization but I guess this area needs some more refinement.
Some labels will be renamed in the near future. The section „Switches“, for example, currently includes „Levels“. What it really does is create and visualize/control any kind and number of areas, such as floors, the garden or single rooms. Everything related to an area can then be viewed and controlled, such as lights, sockets, thermostats, valves, … It’s technically possible to add the shutters to the „Levels“, but currently it feels more natural to put them in their own section. Time will tell if this should be changed.
The only completely custom module is the „Device“ section. In this section the system is able to show any kind of device that is currently „on“, meaning that the system can ping it. This section includes smartphones, tablets, notebooks, PCs, TVs and stuff like a Wii (not my Xbox, as the Xbox prevents a ping). This means that the system is somewhat context aware. I have some hopes that the system can become really smart, because it is most likely that somebody is at home if the TV is on. Or a laptop. Or a smartphone. Let’s see how this can be used. Best of all, this information is available nearly for free and without beacons or any kind of unnatural manual registration.
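The ping-based presence idea could look roughly like the following sketch. It is not the code of the system itself: the function names and the device table are invented for illustration, and `can_ping` assumes a Linux `ping` command that understands `-c` (count) and `-W` (timeout in seconds).

```python
import subprocess

def can_ping(host, timeout_s=1):
    """Return True if `host` answers a single ICMP echo request.

    Assumption: a Linux `ping` binary with the -c and -W options.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def devices_online(devices, probe=can_ping):
    """Names of all devices whose host answers the probe, sorted by name."""
    return sorted(name for name, host in devices.items() if probe(host))

def somebody_home(devices, probe=can_ping):
    """Guess presence: at least one known device is reachable."""
    return bool(devices_online(devices, probe))
```

With a hypothetical device table such as `{"tv": "192.0.2.10", "phone": "192.0.2.11"}`, `somebody_home(devices)` gives a first, rough presence hint. The `probe` parameter is there so that the check can be swapped out, for example for a device like the Xbox that does not answer pings.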
So what’s next? I still want an always-on speech control. And more sensors. But I think first comes some fine tuning and enhancements all around 🙂
When I was younger I really liked the computer of the Enterprise in Star Trek TNG. Always on, always listening. „He“ understands all languages and accents, and background sound is easily filtered out. Even with red alert in the background and explosions all around. Oh, and I still like it. Today I have my own „smart home“ and can control stuff with my voice with an application written for Android. Easy as eating cake, right? I did this in a few hours in my spare time. But wait. There is something missing. It’s not always on. I can’t just run around and say „Computer, locate [PUT ANY NAME HERE]“ or „Computer, it’s too hot. Please reduce temperature by 2°“. You ask why?
– Maturity of voice recognition (I’m talking Linux here, more specifically Raspberry Pi!) and the overall approach (I don’t want to send every word to a Google service!)
– The environment itself. On the Enterprise you could say something in the farthest corner of any floor. This currently does not work for me at all 🙂
– Security. Don’t want to allow any stranger to control stuff just by saying something at the front door
My goal is to solve the issues as far as I can and build something which is not only cool but also useful. I’ll keep posting my results and approaches here.