The animation above provides an overview of how my speech recognition system works. The left side is the encoder. The incoming speech is first converted to a spectrogram, which shows how the speech energy varies over time at each frequency. Each slice of the spectrogram is then encoded by a recurrent neural network to form a sequence of state vectors.
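The two encoder stages can be sketched in a few lines of numpy. This is a minimal illustration, not the actual system: the frame length, hop size, hidden size, and the simple Elman-style recurrence are all assumptions chosen for clarity.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: energy at each frequency for each time slice."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (num_slices, frame_len // 2 + 1)

def encode(spec, W_in, W_rec):
    """Encode each spectrogram slice with one simple recurrent step,
    carrying the hidden state forward to produce a state vector per slice."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x in spec:
        h = np.tanh(W_in @ x + W_rec @ h)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
signal = rng.standard_normal(4000)                      # stand-in for audio samples
spec = spectrogram(signal)                              # (num_slices, 129)
W_in = rng.standard_normal((64, spec.shape[1])) * 0.01  # toy, untrained weights
W_rec = rng.standard_normal((64, 64)) * 0.01
states = encode(spec, W_in, W_rec)                      # one 64-dim state per slice
```

In a real system the recurrence would be a trained LSTM or GRU stack, often with downsampling between layers so the decoder attends over fewer, higher-level states.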
The right side is the decoder. This is also a recurrent neural network, and it generates the transcription letter by letter. For each letter, the decoder attends over the encoder output and forms a context vector from the attention-weighted sum of the encoder states. This context vector is combined with the current letter to predict the next letter. The decoder is primed with the start-of-sentence symbol <s> and continues generating until the end-of-sentence symbol </s> is produced.
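A single decoder step can be sketched as follows. This is a hedged illustration under simple assumptions: dot-product attention scores, an untrained output matrix `W_out`, and a 28-symbol alphabet (26 letters plus the start and end symbols) are all stand-ins, not the actual model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(dec_state, enc_states, letter_emb, W_out):
    """One decoder step: attend over the encoder states, form a context
    vector as their attention-weighted sum, then combine it with the
    current letter embedding to score the next letter."""
    scores = enc_states @ dec_state    # one similarity score per encoder state
    weights = softmax(scores)          # attention distribution over the input
    context = weights @ enc_states     # attention-weighted sum of states
    logits = W_out @ np.concatenate([context, letter_emb])
    return logits, weights

rng = np.random.default_rng(1)
enc_states = rng.standard_normal((30, 64))   # sequence of encoder state vectors
dec_state = rng.standard_normal(64)          # current decoder hidden state
letter_emb = rng.standard_normal(16)         # embedding of the current letter
W_out = rng.standard_normal((28, 64 + 16))   # scores for 26 letters + <s>, </s>
logits, weights = decoder_step(dec_state, enc_states, letter_emb, W_out)
```

At each step the highest-scoring symbol (or a sample from the softmax over `logits`) becomes the next letter, which is fed back in on the following step until </s> appears.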
In practice, I also need to add some extra constraints to the output to make sure that it generates well-formed sentences. I do this by allowing the decoder to generate multiple alternatives and then using a language model to select the most likely candidate (see chapter 5 of my book). The final decoded word string is then passed to my spoken language understanding component where it is converted to an intent graph.
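The rescoring idea can be sketched with a toy example. The candidate strings, the acoustic scores, the interpolation weight, and the vocabulary-based language model below are all invented for illustration; a real system would produce candidates with beam search and score them with a trained language model.

```python
import math

def rescore(candidates, lm_logprob, lm_weight=0.5):
    """Pick the candidate with the best combined acoustic + LM score."""
    best, best_score = None, -math.inf
    for text, acoustic_logprob in candidates:
        score = acoustic_logprob + lm_weight * lm_logprob(text)
        if score > best_score:
            best, best_score = text, score
    return best

# Toy "language model": penalizes words outside a small vocabulary.
VOCAB = {"turn", "on", "the", "lights"}
def toy_lm(text):
    return sum(0.0 if w in VOCAB else -5.0 for w in text.split())

candidates = [
    ("turn on the lites", -2.0),   # best acoustic score, but ill-formed
    ("turn on the lights", -2.5),  # slightly worse acoustically
]
best = rescore(candidates, toy_lm)  # the LM tips the balance to "lights"
```

The language model thus filters out acoustically plausible but ill-formed strings, and the surviving word string is what gets handed to the spoken language understanding component.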