Coincidentally, I'm struggling with something like that right now. I thought of that approach, but I don't know if it is robust "enough" to variation. If you add some noise, the resulting SDR might not be close enough to respect the "rules". Additionally, if you change the voice (i.e. change the glottal fundamental frequency), the SDR representation might change significantly for the same phoneme.
In any case, I think we have higher frequency resolution in the lower bands. Perhaps the frequency bins could be non-linear.
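To make the non-linear bins idea a bit more concrete, here is a rough sketch that spaces the bin edges evenly on the mel scale, so the bins are narrow at low frequencies and wide at high ones. The mel scale is just one common choice of warping, not something settled in this thread, and the function names, the bin count and the 0 to 8000 Hz range are all assumptions I made up for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common (O'Shaughnessy) mel formula; other warpings (Bark, ERB) would also work.
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_bin_edges(n_bins, f_min=0.0, f_max=8000.0):
    # Hypothetical helper: n_bins + 1 edges, evenly spaced in mel, returned in Hz.
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bins + 1)
    return mel_to_hz(mels)

edges = mel_bin_edges(20)   # 20 bins is an arbitrary choice
widths = np.diff(edges)     # bin widths in Hz, growing with frequency
print(np.round(edges, 1))
print(np.round(widths, 1))
```

Each bin could then collect the spectral energy between two consecutive edges before it is encoded into the SDR.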
I was thinking of the "LPC way" (i.e. formant analysis). Have you considered that approach?
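For reference, this is roughly what I mean by the LPC way: fit an all-pole model to a frame (autocorrelation method with the Levinson-Durbin recursion) and read formant candidates off the angles of the model's poles. It is only a sketch under my own assumptions: the function names, the model order and the 90 Hz / 400 Hz thresholds are illustrative, there is no pre-emphasis, and the demo uses a synthetic noise-driven signal with resonances at 500 Hz and 1500 Hz instead of real speech.

```python
import numpy as np

def lpc_autocorr(frame, order):
    # Autocorrelation-method LPC; returns a with a[0] == 1 (prediction-error filter).
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Levinson-Durbin step: reflection coefficient, then coefficient update.
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1 : 0 : -1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def formants_from_lpc(a, fs, min_hz=90.0, max_bw_hz=400.0):
    # Keep one pole of each conjugate pair and convert to frequency and bandwidth.
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    keep = (freqs > min_hz) & (bws < max_bw_hz)   # heuristic thresholds (assumed)
    return np.sort(freqs[keep])

# Synthetic demo: white noise through an all-pole filter with two known resonances.
fs = 8000.0
n = 8000
poles = []
for f_hz, bw_hz in [(500.0, 100.0), (1500.0, 100.0)]:
    p = np.exp(-np.pi * bw_hz / fs) * np.exp(2j * np.pi * f_hz / fs)
    poles += [p, np.conj(p)]
a_true = np.real(np.poly(poles))

rng = np.random.default_rng(0)
e = rng.standard_normal(n)
y = np.zeros(n)
for t in range(n):
    past = y[max(0, t - 4) : t][::-1]   # y[t-1], y[t-2], ...
    y[t] = e[t] - np.dot(a_true[1 : 1 + len(past)], past)

formants = formants_from_lpc(lpc_autocorr(y, order=4), fs)
print(formants)   # should land near the 500 Hz and 1500 Hz resonances
```

The appeal for the voice-change problem is that formants mostly reflect the vocal tract shape rather than the glottal pitch, so in principle they should move less than raw spectral bins when the fundamental frequency changes. On real speech you would work on short frames (20 to 30 ms) with a higher order, and the estimates get noisier.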