My speech synthesiser is the part of my system which converts the responses generated by my conversation manager into a spoken form so that Steve can hear me. I convert words into speech in a two-stage process. In the first stage, I use an encoder-decoder with attention to convert the input character sequence into a mel-spectrogram. For the encoder, I embed each character and then use a convolutional network to extract features from each character and its neighbours. The convolutional network output then feeds a recurrent network which embeds the entire character sequence. The decoder attends over the encoder states and generates the spectrogram as a sequence of spectral slices.
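The core of each decoder step in this first stage can be sketched as follows. This is purely illustrative: the sizes are arbitrary, and the random matrices stand in for my trained encoder and decoder networks. At each step the decoder's query is scored against every encoder state, the scores are normalised into attention weights, and the weighted sum of encoder states conditions the prediction of one spectral slice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 12 input characters, 64-dim encoder states,
# 80 mel channels per spectral slice.
T, d, n_mels = 12, 64, 80

encoder_states = rng.standard_normal((T, d))    # one state per character (from the RNN encoder)
W_out = rng.standard_normal((n_mels, d)) * 0.1  # stand-in for the trained decoder network

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(query):
    """One decoder step: attend over the encoder states, emit one mel slice."""
    weights = softmax(encoder_states @ query)  # attention distribution over the characters
    context = weights @ encoder_states         # weighted sum of encoder states
    return W_out @ context, weights

mel_slice, attn = decode_step(rng.standard_normal(d))
```

In a real system the query would come from the decoder's own recurrent state, so the attention focus moves along the character sequence as the spectrogram is generated.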
In the second stage, I convert the spectrogram into a speech waveform sample by sample. A recurrent neural network takes as input the last 20 msec of speech waveform and predicts the next waveform sample conditioned on a slice of the spectrogram. Each slice represents 10 msec and the sample rate of the output waveform is 16 kHz. So every 160 output samples, the input conditioning is stepped forwards to the next spectral slice. This continues until the spectrogram has been consumed. For more details, see my book.
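The bookkeeping for this second stage can be sketched like this. Again this is only an illustration of the timing arithmetic, not my actual network: `predict_sample` is a hypothetical stand-in for the recurrent network, which here simply receives the last 20 msec of waveform and the current conditioning slice. The key point is that at 16 kHz a 10 msec slice spans exactly 160 samples, so the conditioning index advances once every 160 predictions.

```python
SAMPLE_RATE = 16_000  # output samples per second
SLICE_MS = 10         # each spectral slice covers 10 msec
HISTORY_MS = 20       # the network sees the last 20 msec of waveform
SAMPLES_PER_SLICE = SAMPLE_RATE * SLICE_MS // 1000    # = 160
HISTORY_SAMPLES = SAMPLE_RATE * HISTORY_MS // 1000    # = 320

def conditioning_slice(sample_index):
    """Index of the spectral slice that conditions this output sample."""
    return sample_index // SAMPLES_PER_SLICE

def synthesise(spectrogram, predict_sample):
    """Generate the waveform one sample at a time, stepping the
    conditioning forwards every SAMPLES_PER_SLICE samples until the
    whole spectrogram has been consumed."""
    waveform = []
    history = [0.0] * HISTORY_SAMPLES  # last 20 msec of waveform, initially silence
    for n in range(len(spectrogram) * SAMPLES_PER_SLICE):
        s = predict_sample(history, spectrogram[conditioning_slice(n)])
        waveform.append(s)
        history = history[1:] + [s]    # slide the 20 msec window forwards
    return waveform
```

With a two-slice spectrogram this loop emits 320 samples, switching conditioning from the first slice to the second at sample 160.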