Sopare precision and accuracy

Whoa, it’s about time to talk about accuracy and precision in terms of SOPARE. SOPARE is a Python project that listens to microphone input and makes predictions from trained sounds like spoken words. Offline and in real time.

Before we go into the details, let’s take a quick excursion into how SOPARE processes sound. The microphone listens permanently and records every sound in small chunks. As soon as the volume of a sound reaches a specified threshold, SOPARE adds some small chunks and creates a bigger chunk. At this point, SOPARE has an array of data in raw mic input format. The input receives some filtering (HANNING) and the time domain data is transformed into the frequency domain. Now SOPARE removes unused frequencies as specified in the configuration (LOW_FREQ and HIGH_FREQ).
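To make the three steps (window, transform, band selection) more tangible, here is a minimal sketch in plain Python. It is not SOPARE’s code: the function names are made up for illustration, the transform is a slow textbook DFT, and it assumes that the low/high limits are applied as positions in the spectrum.

```python
import cmath
import math


def hanning(n):
    """Return n Hann window coefficients (symmetric form)."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for the non-negative frequencies (O(n^2), illustration only)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


def prepare(samples, low_freq, high_freq):
    """Window the chunk, transform it and keep only positions low_freq..high_freq-1.

    Assumption: the limits are used as index positions into the spectrum.
    """
    window = hanning(len(samples))
    windowed = [s * w for s, w in zip(samples, window)]
    return magnitude_spectrum(windowed)[low_freq:high_freq]


# A 64-sample test tone with exactly 8 cycles in the chunk.
chunk = [math.sin(2.0 * math.pi * 8 * t / 64) for t in range(64)]
band = prepare(chunk, low_freq=4, high_freq=16)
peak_bin = 4 + band.index(max(band))
print(len(band), peak_bin)
```

The window softens the edges of the chunk before the transform, and the slice at the end is what throws away everything outside the configured range.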

At this stage SOPARE is able to compress the data (MIN_PROGRESSIVE_STEP, MAX_PROGRESSIVE_STEP). Compression is a big factor for precision. Progressive steps mean that a number of frequencies are combined into one value. A progressive step of 100 takes 100 values and creates one (1) combined value. This is a very rough preparation and a good way to create lots of false positives. The opposite would be a step of one (1), which uses each frequency for the characteristic and prediction and represents the maximum accuracy, but maybe also the worst true positive recognition.

This is what the process looks like: from the full-blown time domain data (40000) to the specified number of frequencies (600), and at the end there is a compressed set of data (24) which is quite clear and used for the predictions.
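The compression step can be sketched like this. It assumes a constant step and that values are combined by summing them. The post does not say which step produced the 24 values, and SOPARE’s real implementation may grow the step between the MIN and MAX settings, so treat the function and the step of 25 as illustrative only.

```python
def compress(values, step):
    """Combine every `step` neighbouring values into one by summing them."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return [sum(values[i:i + step]) for i in range(0, len(values), step)]


spectrum = list(range(600))        # stand-in for 600 frequency values
coarse = compress(spectrum, 25)    # 600 / 25 = 24 combined values
fine = compress(spectrum, 1)       # a step of 1 keeps every single value
print(len(coarse), len(fine))
```

The bigger the step, the fewer values are left to tell two sounds apart, which is where the false positives come from.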



You need to test around and find some good values for your setup and environment. If you have optimal values, train your sound patterns.

Please note that the values in the section „Stream prep and silence configuration options“ are used for training, so whenever you change them you need to do a new training round. This means removing the trained files via

mv dict/*.raw /backup

or

rm dict/*.raw

and train again!
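If you would rather do the backup from a script, a small helper along these lines would do the same as the mv command. The function name and the directory arguments are my own choice for this sketch, not part of SOPARE.

```python
import glob
import os
import shutil


def backup_trained_files(dict_dir, backup_dir):
    """Move every *.raw file from dict_dir to backup_dir and return the moved names."""
    os.makedirs(backup_dir, exist_ok=True)
    moved = []
    for path in sorted(glob.glob(os.path.join(dict_dir, "*.raw"))):
        shutil.move(path, os.path.join(backup_dir, os.path.basename(path)))
        moved.append(os.path.basename(path))
    return moved


# Example call (paths are placeholders):
# backup_trained_files("dict", "/backup")
```

Moving instead of deleting keeps the old training data around in case the new config values turn out to be worse.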

Now let’s talk about options to enhance precision and accuracy. First of all, you should note that a single identifier is always susceptible to false positives. Checking for two or more patterns/words increases the precision big time.
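In your own code, the idea of checking for two or more words could look like this. The function and the example words are hypothetical; the only assumption is that you get the recognized words as a list.

```python
def all_words_detected(results, required):
    """Return True only if every required word is among the predicted words."""
    return set(required).issubset(results)


print(all_words_detected(["light", "on"], ["light", "on"]))
print(all_words_detected(["light"], ["light", "on"]))
```

A single false „light“ is then not enough to trigger anything, because the second word must show up as well.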

The second option is to make use of the config options to increase the accuracy. Let’s start with the one that identifies a word or pattern:


The marginal value can range between 0 and 1. Zero (0) means that everything will be identified as the beginning of a word, 1 means that the trained sample and the current sound must match 100%. Good values lie between 0.7 and 0.9. Test how far you can increase the value while still getting real results. For testing purposes, keep this value quite low.


„MIN_CROSS_SIMILARITY“ is the option that is used for comparison. Again, 0 means everything is a match and 1 means that the trained pattern and the current sound must match 100%. For one-word scenarios this value can be quite high. Two or more words normally require lower values, as the transitions between two patterns most likely don’t look like the single trained words. Good values in my setups are between 0.6 and 0.9: 0.9 for single words, lower values for multiple word recognition.
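To get a feeling for what such a 0 to 1 threshold does, here is a small sketch. It uses cosine similarity as a stand-in measure, which is an assumption for illustration and not necessarily what SOPARE computes internally. The function names and the sample data are made up as well.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two equally long sequences; 1.0 means same shape."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_match(trained, current, threshold):
    """Accept the current sound if it is at least `threshold` similar to the trained one."""
    return cosine_similarity(trained, current) >= threshold


trained = [1.0, 2.0, 3.0, 4.0]
close = [1.0, 2.0, 3.0, 5.0]       # nearly the same shape
different = [4.0, 3.0, 2.0, 1.0]   # mirrored shape

print(is_match(trained, close, 0.9))
print(is_match(trained, different, 0.7))
print(is_match(trained, different, 0.0))
```

With a threshold of 0 even the mirrored pattern passes, which is exactly the „everything is a match“ case described above.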

The following values have a huge impact but I can’t hand out best case values. Instead, they require some manual testing and adjustment:


These values are somewhat special. For each word/pattern SOPARE calculates the distance between the trained word and the current sound. A low distance means that the characteristics are similar, a high distance means that there is a difference. Left and right mean that the frequency range is halved and the lower and higher bands are compared separately. Even if a prediction for the whole word is very close, a small distance in one half can be essential to filter out false positives. The debug option reveals the most important values:
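The left/right idea can be sketched as follows. The distance measure (mean absolute difference), the function names and the limits are assumptions for this example. SOPARE’s actual calculation may differ, but the principle of judging each half on its own is the same.

```python
def half_distances(trained, current):
    """Return (left, right) mean absolute differences for the lower and upper half."""
    if len(trained) != len(current) or len(trained) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mid = len(trained) // 2

    def mean_abs_diff(a, b):
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    return (mean_abs_diff(trained[:mid], current[:mid]),
            mean_abs_diff(trained[mid:], current[mid:]))


def passes_distance_check(trained, current, left_limit, right_limit):
    """Accept only if both halves stay within their (hypothetical) distance limits."""
    left, right = half_distances(trained, current)
    return left <= left_limit and right <= right_limit


trained = [0.2, 0.4, 0.6, 0.8]
current = [0.2, 0.5, 0.6, 0.0]     # lower half fits well, upper half does not

print(half_distances(trained, current))
print(passes_distance_check(trained, current, 0.3, 0.3))
```

Here the lower half is almost identical, but the upper half is far off, so the check with a limit of 0.3 rejects the sound even though half of it looks like a good match.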


Again, this requires some fiddling around to find the optimal values that give true positives and avoid the false ones. Start with high values and reduce them until you are satisfied. In my smart home light control setup the values are around 0.3 and my false positive rate is near zero, although SOPARE is running 24/7 and my house is quite noisy (kids, wife, …).

The last config options to consider are the calculation basis for the value „MIN_CROSS_SIMILARITY“. The sum of the following three values should be 1:


„SIMILARITY_NORM“ is the comparison of the FFT similarity.

„SIMILARITY_HEIGHT“ compares against the time domain shape. Good if you want to consider a certain volume.

„SIMILARITY_DOMINANT_FREQUENCY“ is the similarity factor for the dominant frequency (f0).
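One way to read „calculation basis“ is a weighted sum of the three similarities, which also explains why the weights should add up to 1. The sketch below assumes exactly that. The function name and the example numbers are made up and are not recommended settings.

```python
def cross_similarity(norm, height, dominant, weights):
    """Weighted sum of three similarities, each in the range 0..1.

    weights = (w_norm, w_height, w_dominant), assumed to sum to 1.
    """
    w_norm, w_height, w_dominant = weights
    if abs(w_norm + w_height + w_dominant - 1.0) > 1e-9:
        raise ValueError("the three weights should sum to 1")
    return norm * w_norm + height * w_height + dominant * w_dominant


# Hypothetical weights: FFT similarity counts most, dominant frequency least.
score = cross_similarity(0.9, 0.8, 1.0, (0.6, 0.3, 0.1))
print(round(score, 2))
```

Because the weights sum to 1, the result stays between 0 and 1 and can be compared directly against „MIN_CROSS_SIMILARITY“.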

I recommend playing around with these values to learn their impact. Depending on the environment, the sound and the desired outcome, there are plenty of possible combinations. Here are some examples:

Puuhhh, this post got longer than expected and the videos also have 25 minutes of content. I hope I covered everything:

Part 1:

Part 2:

Part 3:

Part 4:


That’s it. If you have questions or comments, don’t hold back and let me know 😉

As always, have fun and happy voice controlling 😉

5 thoughts on “Sopare precision and accuracy”

  1. Hi,

    I’m trying to adjust SOPARE to detect door knock patterns, i.e. whether someone knocked 2x or 3x.

    Scenario is like this:
    – Train SOPARE with two patterns ( -t 2xknock & -t 3xknock)
    – compile and then loop
    – whenever the THRESHOLD value is reached, listen for the pattern until there is silence for max. 1 sec
    – compare recorded audio with 2xknock and 3xknock patterns
    – printout if something is detected.

    This would be a very smart way to detect knocks. There are many tutorials for detecting knocks on the RPi, but they all look only at the threshold level and detect a knock on every noise that is loud enough.

    It all sounds good in theory, but I’m still having problems getting it to work. I thought it might help someone if I describe my findings.

    Problem 1
    1. I train SOPARE with „-t 3xknock“ (where I knock 3 times) & then compile it
    2. When I go to loop mode, 3xknock is detected even if I knock only once

    I guess the solution would be to set the correct timing, so that training actually captures all three knocks and loop mode then also listens for all three.

    Any suggestion how to set training time so that it will capture all three knocks?
    Any suggestion how to listen for 3 knocks in loop mode?

    The pattern with 3 knocks is ~1 sec long. This is well below the MAX_TIME = 3.5 defined in the config.

    Problem 2
    I recorded a wav of knocks on the RPi and then opened it in Adobe Audition CC.
    I was experimenting with the equaliser to filter out other noises.
    I found out that if I remove the frequencies from 0-500 Hz and above 2 kHz, all noises are removed and what is left is pretty much only the knocks.

    When I try to apply this filter to SOPARE by setting:
    LOW_FREQ = 500
    HIGH_FREQ = 2000

    I get a very strange error in loop mode after the first THRESHOLD is detected:

    Process buffering queue:
    Traceback (most recent call last):
    File "/usr/lib/python2.7/multiprocessing/", line 258, in _bootstrap
    File "/home/pi/sopare/sopare/", line 42, in run
    File "/home/pi/sopare/sopare/", line 82, in check_silence
    self.stop("stop append mode because of silence")
    File "/home/pi/sopare/sopare/", line 53, in stop
    File "/home/pi/sopare/sopare/", line 93, in force_tokenizer
    self.tokenize([ { 'token': 'start analysis', 'silence': self.silence, 'pos': self.counter, 'adapting': 0, 'volume': 0, 'peaks': self.peaks } ])
    File "/home/pi/sopare/sopare/", line 52, in tokenize
    self.filter.filter(self.buffer, meta)
    File "/home/pi/sopare/sopare/", line 91, in filter
    nam = numpy.amax(nfft)
    File "/usr/lib/python2.7/dist-packages/numpy/core/", line 2130, in amax
    out=out, keepdims=keepdims)
    File "/usr/lib/python2.7/dist-packages/numpy/core/", line 17, in _amax
    out=out, keepdims=keepdims)
    ValueError: zero-size array to reduction operation maximum which has no identity

    Sample without filter:

    Sample with filter applied :

    Equaliser settings used for filter:

    What is causing this?
    How to avoid it?

    It only happens if I change LOW_FREQ and HIGH_FREQ …

    Problem 3
    If I use -l -~ to save detected patterns, I get many very strange sounding wavs. They sound very high-pitched and are only a few ms long. There is also no wav with the complete 3x knock pattern. Actually, I can’t even recognise a simple knock.


    I made the test recording for the analysis in Adobe Audition with:
    arecord -D plughw:0,0 -f cd test2.wav

    Why am I getting such heavily distorted wavs? Is there anything wrong with the sound card settings? I’m using a USB sound card.

    Thank you for the great work,

  2. Hi Gregor.

    Let’s try to get some ground on your issues. First, do the training with verbose mode enabled. This helps you see if something triggers the training before the first knock. It could be the case that the „THRESHOLD“ is so low that training starts before the real event occurs. In addition, play with the „MAX_SILENCE_AFTER_START“ time. If some sound triggers the training before the knock and MAX_SILENCE_AFTER_START is very low, then you may just get some noise instead of the real thing. Training with verbose mode enabled helps you to avoid such false training sets. Make sure to delete false training files (I mean the corresponding raw files in the dict/ directory) before compiling the dict.

    Problem 2 could be a bug or the result of empty frequencies/no values. In combination with problem 3, try to increase the „CHUNKS“ in the config file to higher values like 4096 or even 8192. The frequency analysis is called whenever the size of the „CHUNKS“ value is reached. Seems that in your case the input is too small and/or filtered too much as you just get garbage in the single tokens SOPARE is using for the analysis. If the error still occurs I suggest that you open an issue on GitHub.
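One possible explanation for the „zero-size array“ error, sketched under a big assumption: if LOW_FREQ and HIGH_FREQ are applied as index positions into the spectrum of one chunk, a small chunk simply has no values in the range 500 to 2000. The helper names below are made up, and the sketch only shows the arithmetic, not SOPARE’s code.

```python
def rfft_bin_count(chunk_size):
    """Number of values a real-input FFT returns for chunk_size samples."""
    return chunk_size // 2 + 1


def safe_peak(spectrum, low, high):
    """Maximum of spectrum[low:high], or None if that slice is empty."""
    band = spectrum[low:high]
    if not band:
        return None          # max() of an empty sequence would raise ValueError
    return max(band)


small = [0.0] * rfft_bin_count(512)     # 257 values
large = [0.0] * rfft_bin_count(4096)    # 2049 values

print(len(small[500:2000]), len(large[500:2000]))
print(safe_peak(small, 500, 2000), safe_peak(large, 500, 2000))
```

With 512 samples per chunk the slice from 500 to 2000 is empty, with 4096 it is not, which would fit the suggestion to raise „CHUNKS“.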

    Hope that helped. Have fun!

  3. I trained 3 knocks, no special config options or filtering. The issue for SOPARE in recognizing the trained 3 knocks was that the timing between the 3 knocks always seems to be different and therefore a true recognition is difficult. You need very low similarity values to get true results. What I did next: I trained 1 knock several times. The 1 knock was easily recognized and knocking 2 times resulted in this:


    You could leverage the short term memory option (set „STM_RETENTION“ to something like 1 second) to get a result like this for easier further processing:

    [u'knock', u'knock']

    Of course very fast knocking missed some knocks in between but nonetheless the knocking was recognized at least once. Hope this gives you some ideas 🙂
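To illustrate the idea behind such a retention window (this is not SOPARE’s implementation, and the class and method names are invented), here is a tiny sketch that keeps results for a given number of seconds and returns them together:

```python
class ShortTermMemory:
    """Keep recognized words for `retention` seconds and return them as one list."""

    def __init__(self, retention):
        self.retention = retention
        self.items = []                  # list of (timestamp, word)

    def add(self, word, now):
        """Drop entries older than the retention time, then store the new word."""
        self.items = [(t, w) for t, w in self.items if now - t <= self.retention]
        self.items.append((now, word))
        return [w for _, w in self.items]


stm = ShortTermMemory(retention=1.0)
print(stm.add("knock", now=0.0))
print(stm.add("knock", now=0.6))    # within one second of the first knock
print(stm.add("knock", now=2.5))    # the older knocks have expired by now
```

Counting the length of the returned list then tells you whether it was a single, double or triple knock.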

    • Hi,

      Thank you for the very fast answer.
      It turned out that my mic was a bit crap. I replaced it with a piezo contact mic and it works much better at detecting knocks. I also did some tweaking based on your YouTube tutorial.

      Increasing CHUNKS works, the error no longer appears. However, it works better without limiting frequencies now that I have another mic.

      Knocks are now detected consistently. Which is perfect.

      There are, however, two additional questions:
      1. If I do many trainings of the same word, I get many „stream read error [Errno Input overflowed] -9981“ messages while in loop mode. When these errors appear, detection is not that good.

      2. How can I lower the time needed to detect repeated knocks? If I knock two times very fast, then the knocks are not detected (I’m using MAX_SILENCE_AFTER_START = 0.4).


      • It’s a pleasure to answer your questions. Great to hear that it works better now.

        The issue „stream read error [Errno Input overflowed] -9981“ is something I also see quite often and it’s an issue that is Pi related. But the issue also appears on other systems. Maybe due to the concurrent use of the network interface and the USB port, we Raspberry Pi users face this issue more often. The issue means that the input has more data but the data can’t be processed fast enough. This has an impact on precision, as one or more chunks of data are missing either in training or in the prediction phase. But the impact is negligible in my opinion, as the default 512 bytes are just a fraction of a second. Maybe someday the RasPi USB port gets more bandwidth and throughput…

        The second issue is something I can’t answer in general. One knock takes maybe around 0.4 to 0.7 seconds, which means the value you use is already very small. Very fast knocks most likely have different sound characteristics and length. Maybe a valid solution is to train the common single knock and, in addition, train the outliers to increase the overall true positives. The only value that affects the time is this one:


        Other than that: try out and test several config options to find what works best for you, your use cases and the environment you are in.

        Have fun 🙂
