Using SDRs in conjunction with Neural Networks as an effective clustering/distance calculation method for speaker identification


I guess this is related to Application of HTM in today’s ML frameworks

I have been messing with speaker verification recently. To do that, I decided to throw MFCC and other algorithms out the window and simply feed a DNN the raw speech spectrum and let it do the job of feature extraction. So the speaker’s reference audio is passed through this pipeline: raw spectrum in, DNN feature vectors out.

Now I have the reference feature vectors. In traditional ML, when asked to determine whether a given speech sample belongs to the same speaker, I would send the audio through the same process to generate an M*64 array of features, then calculate the L1 distance from the reference vectors to the unknown ones. If the distance is small enough, it is the same speaker.

But that turned out to be a bad idea.

  1. It didn’t consider combinations of the feature elements.
    • Assume the reference vectors are [0,0] and [1,1] and the unknown one is [1,0]. The distance to either reference is 1 instead of 0, even though each element of the unknown vector matches one of the references. This problem shows up when the speech audio contains phrases/words that are not in the reference.
  2. There is no cap to the distance.
    • The distance from the vector [0,0,…5,0,0] to [0,0,0,0,…0] is 5 even though only one element differs.
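Both problems are easy to reproduce with a few lines of NumPy. This is only a minimal sketch: the helper names `l1_distance` and `min_l1_to_references` are made up for illustration, and the spike vector is shortened to five elements.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute element-wise differences between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())

def min_l1_to_references(unknown, references):
    """Distance from the unknown vector to its closest reference vector."""
    return min(l1_distance(unknown, ref) for ref in references)

# Problem 1: [1, 0] mixes elements seen in the references,
# yet it is a full unit away from both of them.
references = [[0, 0], [1, 1]]
combo_distance = min_l1_to_references([1, 0], references)

# Problem 2: a single outlier element sets the whole distance,
# and nothing limits how large it can get.
spike_distance = l1_distance([0, 0, 5, 0, 0], [0, 0, 0, 0, 0])

print(combo_distance, spike_distance)
```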

The solution I came up with (not surprisingly) is to use Encoders! ScalarEncoder limits the range of the possible values. And when calculating the overlap score of SDRs, the value 0 has an overlap score of 0 with 5, 6, 100, and 1000 alike (under the right settings). And so the pipeline becomes this: the DNN's feature vectors are passed through the encoder to produce SDRs.
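To make the capping effect concrete, here is a rough sketch of a scalar encoder written from scratch in NumPy. It is not the NuPIC `ScalarEncoder` API; the function names and the settings (range 0 to 1, 64 bits, 8 active bits) are assumptions picked only so the numbers are easy to follow.

```python
import numpy as np

def encode_scalar(value, min_val=0.0, max_val=1.0, n=64, w=8):
    """Encode a scalar as n bits with one contiguous run of w active bits.

    Values outside [min_val, max_val] are clipped, which is what caps
    the influence of outliers. All defaults are illustrative assumptions.
    """
    value = min(max(float(value), min_val), max_val)
    n_buckets = n - w + 1
    start = int(round((value - min_val) / (max_val - min_val) * (n_buckets - 1)))
    sdr = np.zeros(n, dtype=bool)
    sdr[start:start + w] = True
    return sdr

def encode_vector(values, **kwargs):
    """Encode every element separately and concatenate the results."""
    return np.concatenate([encode_scalar(v, **kwargs) for v in values])

def overlap(a, b):
    """Number of bits that are active in both SDRs."""
    return int(np.count_nonzero(a & b))

zero = encode_scalar(0)
# 5, 6, 100 and 1000 all clip to max_val, so they look identical to the encoder.
outlier_overlaps = [overlap(zero, encode_scalar(v)) for v in (5, 6, 100, 1000)]
near_overlap = overlap(zero, encode_scalar(0.05))

print(outlier_overlaps, near_overlap)
```

With these settings a single wild element can cost at most `w` bits of overlap, no matter how far off it is, while nearby values still share bits.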

And to calculate whether an unknown speech sample belongs to the speaker, I just need to calculate the average overlap score of the reference SDR and the SDR generated by the unknown speaker.
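The scoring step could look roughly like the sketch below. The post does not spell out how frames are paired, so this assumes one possible reading: the reference frames are OR-ed into a single reference SDR, and each unknown frame is scored against it. The function names and the `threshold` parameter are hypothetical, and the tiny 6-bit SDRs are toy data.

```python
import numpy as np

def union_sdr(reference_sdrs):
    """OR all reference frame SDRs into one reference SDR (an assumption)."""
    return np.any(np.asarray(reference_sdrs, dtype=bool), axis=0)

def average_overlap(reference_sdr, unknown_sdrs):
    """Mean number of active bits each unknown frame shares with the reference."""
    unknown = np.asarray(unknown_sdrs, dtype=bool)
    per_frame = np.count_nonzero(unknown & reference_sdr, axis=1)
    return float(per_frame.mean())

def is_same_speaker(reference_sdrs, unknown_sdrs, threshold):
    """Accept the speaker when the average overlap reaches the threshold."""
    return average_overlap(union_sdr(reference_sdrs), unknown_sdrs) >= threshold

# Toy data: two reference frames and two unknown frames, 6 bits each.
reference_frames = [[1, 1, 0, 0, 0, 0],
                    [0, 0, 0, 0, 1, 1]]
unknown_frames = [[1, 1, 0, 0, 0, 0],   # matches the first reference frame
                  [0, 0, 1, 1, 0, 0]]   # matches nothing in the reference

score = average_overlap(union_sdr(reference_frames), unknown_frames)
print(score, is_same_speaker(reference_frames, unknown_frames, threshold=1.0))
```

The threshold would have to be tuned on real data; nothing here suggests a particular value.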
And it works brilliantly!

You might notice that I didn’t use a Spatial Pooler. Should I have used one? Yes. Why didn’t I? I need a Python 3 implementation. :stuck_out_tongue: