As we have seen, many of the patterns that I have to process are sequences of pattern elements such as the samples in a speech waveform, spectra in a speech spectrogram or words in a sentence. Each pattern element could be processed by applying a feedforward network to each element, but this would not take account of the position of the element within the sequence and the influence of its neighbours.
Unlike a regular feedforward network, a recurrent neural network keeps a copy of its most recent output. Whenever, a pattern is applied to its input, it is combined with the previous output so that each transformation takes account of not only the current input but also the previous history. The above animation illustrates how this works.
Given an input sequence of N pattern elements, a recurrent network can be viewed in unfolded form as a sequence of N feedforward networks which share the same set of weights. This unfolded view provides a simple means of training a recurrent neural network. Each training sequence is applied to the network and its predicted outputs can be compared with the sequence of targets. The errors are then back-propagated in the usual way except of course all of the weights are shared and hence the weight adjustments for each copy are added together.
The final state of a recurrent neural network has special significance since it provides an encoding of the entire sequence. This final state can therefore be used to generate a fixed size embedding of a variable length sequence. Of course, for long sequences the initial inputs will have reduced weight compared to later inputs. For this reason, bidirectional neural networks are often used where one of the networks receives the input sequence in reverse order. In this case, an embedding is generating by adding the final states of each recurrent network.
Recurrent networks are not the only way of processing sequences. In particular, the transformer network takes account of context using self-attention (for an example see how I translate between languages) . Each element of the sequence is then transformed using a weighted sum over all the transformed elements. In practice, transformers frequently outperform recurrent networks.