Speaker verification allows me to check that the person saying “Hey Cyba” really is Steve. To do this, I convert the spoken wake-up phrase, “Hey Cyba”, into a fixed-length vector embedding of the utterance. I compute this embedding by first converting the speech into a spectrogram, and then feeding each slice of the spectrogram into a recurrent neural network (RNN). I pass the final state of the RNN through a linear transformation to form a fixed-size vector representing the utterance.
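This encoding step can be sketched as follows. This is a minimal illustration, assuming numpy and toy dimensions (40 spectrogram bins per slice, 64 hidden units, a 128-dimensional embedding); the function name and the random weights are my own, standing in for a trained network.

```python
import numpy as np

def encode_utterance(spectrogram, W_in, W_rec, W_out, b):
    """Feed each spectrogram slice through a simple RNN, then apply a
    linear transformation to the final hidden state to get a
    fixed-size vector representing the utterance."""
    h = np.zeros(W_rec.shape[0])           # initial hidden state
    for frame in spectrogram:              # one spectrogram slice at a time
        h = np.tanh(W_in @ frame + W_rec @ h)
    return W_out @ h + b                   # linear projection of final state

# Toy dimensions and untrained random weights, for illustration only
rng = np.random.default_rng(0)
W_in  = rng.standard_normal((64, 40)) * 0.1
W_rec = rng.standard_normal((64, 64)) * 0.1
W_out = rng.standard_normal((128, 64)) * 0.1
b     = np.zeros(128)

short = rng.standard_normal((30, 40))      # a 30-slice utterance
long  = rng.standard_normal((80, 40))      # an 80-slice utterance
vec_short = encode_utterance(short, W_in, W_rec, W_out, b)
vec_long  = encode_utterance(long,  W_in, W_rec, W_out, b)
print(vec_short.shape, vec_long.shape)     # both (128,)
```

The point of the final projection is that utterances of any duration map to a vector of the same size, which is what makes the dot-product comparison below possible.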
To verify that the current speaker really is Steve, I use a dot product to compare the speech vector of the unknown speaker with a reference vector that I have previously computed for Steve. If the two vectors are sufficiently similar, then I accept that the new speaker is indeed Steve and I allow the conversation to continue.
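The comparison itself is simple. The sketch below uses the normalised dot product (cosine similarity) so that the score lies in a fixed range; the threshold value and the three-dimensional toy vectors are my own assumptions for illustration.

```python
import numpy as np

def verify(candidate, reference, threshold=0.7):
    """Accept the speaker if the normalised dot product between the
    candidate embedding and the stored reference exceeds a threshold.
    The threshold 0.7 is an illustrative value, not a tuned one."""
    score = float(np.dot(candidate, reference) /
                  (np.linalg.norm(candidate) * np.linalg.norm(reference)))
    return score >= threshold

reference = np.array([1.0, 2.0, 3.0])    # previously stored vector for Steve
same      = np.array([1.1, 1.9, 3.2])    # points in a similar direction
different = np.array([-3.0, 1.0, 0.5])   # points in a dissimilar direction
print(verify(same, reference))           # True: accept as Steve
print(verify(different, reference))      # False: reject
```

Normalising by the vector lengths means the decision depends only on the direction of the embeddings, not on how loudly or long the phrase was spoken.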
The key to performance is that the encoder is trained using a database of speakers each saying the wake-up phrase multiple times. For each training cycle, I randomly select an example from one speaker and then randomly choose either another example from the same speaker or an example from a different speaker. In the former case, I set the target output to 1; in the latter, I set it to 0. Over time, the encoder network learns to distinguish between all the speakers in the training set. Once fully trained, I then use the trained encoder to generate a speech vector for Steve and store it as a reference for future use. For increased robustness, I actually store the average of five speech vectors computed from five different examples of Steve saying the wake-up phrase.