A great deal of the processing I do to understand what Steve is saying to me, and then to decide on my responses, involves manipulating patterns using neural networks. So before explaining how neural networks work, the animation above will take you through a few of the typical patterns encountered in language processing. As a human you will see these patterns as representing quite different things, but it is important to remember that for me all patterns are just arrays of numbers. For example, the sounds you hear are pressure waves travelling in air. You experience them as vibrations in your eardrum, whereas I see them as just a sequence of numbers which happen to represent the amplitude of the pressure wave at successive moments in time.
Patterns are either fixed in size or variable-length sequences and, as you will see later, this affects the way they are processed by my neural networks. Fixed-size static patterns can be processed by the most basic form of neural network, called a feed-forward neural network. Sequences require a mechanism which allows the processing of each element of the sequence to take account of its neighbours. I will show you one way of doing this using recurrent networks. Sometimes I have to process images or image-like data, and for these I often use convolutional networks, which involve aspects of both static and recurrent processing.
Neural networks can be trained to transform an input pattern into an output pattern. When processing data, there are three kinds of transform. To encode symbols in the real world, I represent them by an array which is the same size as the total number of different symbols. I then encode a specific symbol by setting the element corresponding to that symbol to 1 and all of the other elements to 0. This so-called 1-hot encoding is a simple way to input symbolic data such as words into my neural circuitry, but for subsequent processing I use a more compact representation called an embedding. This is just a vector of numbers, but it is much smaller than a 1-hot encoding and most of its values will be non-zero. Computing these embeddings is the first kind of pattern transform that I use.
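To make this concrete, here is a minimal sketch of a 1-hot encoding and the embedding derived from it. The four-word vocabulary and the two-dimensional embedding size are made up for illustration, and the embedding matrix is filled with random numbers standing in for weights that would normally be learned during training:

```python
import numpy as np

# A toy vocabulary of four symbols; real vocabularies hold many thousands.
vocab = ["hello", "steve", "play", "music"]

def one_hot(word):
    """Return a 1-hot vector: one element per vocabulary symbol,
    with a 1 at the position of the given word and 0s elsewhere."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# An embedding is obtained by multiplying the 1-hot vector by a matrix.
# Here the matrix is random; in a trained network its rows are learned.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 2))  # 4 symbols -> 2-dim embeddings

hot = one_hot("play")            # [0, 0, 1, 0]
emb = embedding_matrix.T @ hot   # selects the row of the matrix for "play"
```

Because the 1-hot vector has a single 1, the matrix multiplication simply picks out one row of the embedding matrix, which is why embeddings are often described as a table lookup.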
The second kind of transform involves manipulating input embeddings to focus on the specific types of information I am trying to extract. This often involves combining information from multiple sources by various forms of weighted summation.
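A weighted summation of this kind can be sketched as follows. The two source embeddings and the mixing weights are invented numbers, not values from any real network; the point is only that each output element is a weighted mix of the corresponding elements of the inputs:

```python
import numpy as np

# Two source embeddings (assumed 3-dimensional here for illustration).
sources = np.array([[1.0, 0.0, 2.0],
                    [0.0, 4.0, 2.0]])

# Weights saying how much to take from each source; they sum to 1.
weights = np.array([0.75, 0.25])

# Weighted summation: combine the sources element by element.
combined = weights @ sources   # -> [0.75, 1.0, 2.0]
```

In a trained network the weights themselves would be computed from the data, so the network can decide on the fly which source to pay most attention to.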
The third kind of transform is classification which can be regarded as the inverse of embedding. For example, the data being processed may need to be allocated to a specific class. To do this, I would ideally output a 1-hot vector with the element corresponding to the chosen class being 1 and all other elements 0. However, there is usually some uncertainty in classification so instead I output a vector representing the probability of each class. Now if you haven’t done so already, please watch the movie.
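A common way to turn raw class scores into the probability vector just described is the softmax function; the scores below are made-up numbers rather than the output of a real network:

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

# Illustrative scores for three candidate classes.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

# Instead of a hard 1-hot decision, the output expresses how confident
# the classifier is in each class; the largest entry is the chosen class.
best = int(np.argmax(probs))   # class 0 here
```

The uncertainty survives in the output: a downstream stage can see not just which class won, but by how much.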