The animation above shows how my spoken language understanding works. In my case, understanding means converting the words that Steve says to me into an intent graph (see my Knowledge Graph). This is a multi-stage process. I first process the words with a recurrent neural network to identify any named entities in the utterance, such as persons, places and organisations. In the example above, Bill Philips is a named entity. Once all of the words have been processed, the final recurrent state drives two classifiers: one recognises the type of the user’s intent (in this case a Count intent) and the other recognises the type of the focus (in this case a Person).
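As a rough illustration of this first stage, here is a minimal PyTorch sketch in which an LSTM (one kind of recurrent network, assumed here) tags each token for named entities while its final hidden state feeds the two classifier heads. The layer sizes, label sets and names are illustrative assumptions, not my actual implementation.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128,
                 num_entity_tags=5,    # e.g. O, B-PER, I-PER, B-ORG, I-ORG (assumed tag set)
                 num_intent_types=4,   # e.g. Count, List, Confirm, Describe (assumed)
                 num_focus_types=3):   # e.g. Person, Place, Organisation (assumed)
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Per-token head: a named-entity tag for each word in the utterance.
        self.entity_head = nn.Linear(hidden_dim, num_entity_tags)
        # Utterance-level heads driven by the final recurrent state.
        self.intent_head = nn.Linear(hidden_dim, num_intent_types)
        self.focus_head = nn.Linear(hidden_dim, num_focus_types)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        states, (h_n, _) = self.rnn(self.embed(token_ids))
        final_state = h_n[-1]                      # (batch, hidden_dim)
        entity_logits = self.entity_head(states)   # one tag distribution per token
        intent_logits = self.intent_head(final_state)
        focus_logits = self.focus_head(final_state)
        return entity_logits, intent_logits, focus_logits

# Example: a seven-word utterance as dummy token ids.
model = UtteranceEncoder(vocab_size=1000)
entity_logits, intent_logits, focus_logits = model(torch.randint(0, 1000, (1, 7)))
```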
The named entity is then linked to the corresponding entity in my knowledge graph, and candidate query graphs are generated by searching outwards from that entity for paths consistent with the type of the focus. In this simple case there are only two possibilities. The final step is to match the candidate intents against the original input utterance using a convolutional neural network (see my book for details). The best-matching candidate (or candidates, if I am unsure) is selected and passed forward to my conversation manager.
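To make the candidate generation concrete, here is a minimal sketch that treats the knowledge graph as a simple adjacency structure of typed nodes and labelled edges, and searches outwards from the linked entity for paths ending at a node of the focus type. The graph contents, relation names and hop limit are illustrative assumptions, not my real graph.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    node_type: str   # e.g. "Person", "Organisation"

# Toy graph: node -> list of (relation label, target node). Contents are made up.
bill = Node("Bill Philips", "Person")
ann = Node("Ann Smith", "Person")
carl = Node("Carl Jones", "Person")
acme = Node("Acme Ltd", "Organisation")
graph = {
    bill: [("manages", ann), ("reports_to", carl), ("works_for", acme)],
}

def candidate_paths(start, focus_type, max_hops=2):
    """Search outwards from the linked entity, keeping paths that end at a
    node whose type matches the predicted focus type."""
    frontier = [(start, [], {start})]
    candidates = []
    for _ in range(max_hops):
        next_frontier = []
        for node, path, visited in frontier:
            for relation, target in graph.get(node, []):
                if target in visited:
                    continue
                new_path = path + [(node.name, relation, target.name)]
                if target.node_type == focus_type:
                    candidates.append(new_path)
                next_frontier.append((target, new_path, visited | {target}))
        frontier = next_frontier
    return candidates

# For a Count intent with a Person focus, this toy graph yields exactly two
# candidate query graphs: the "manages" path and the "reports_to" path.
for path in candidate_paths(bill, focus_type="Person"):
    print(path)
```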
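And here is a similarly hedged sketch of the final matching step, assuming each candidate query graph is verbalised as a token sequence and scored against the utterance with a small one-dimensional convolutional encoder and cosine similarity. The architecture, sizes and scoring are illustrative assumptions; my book describes the real matcher.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvMatcher(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, num_filters=100, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=1)

    def encode(self, token_ids):
        # token_ids: (batch, seq_len) -> fixed-size vector via max-over-time pooling
        x = self.embed(token_ids).transpose(1, 2)        # (batch, embed_dim, seq_len)
        return F.relu(self.conv(x)).max(dim=2).values    # (batch, num_filters)

    def forward(self, utterance_ids, candidate_ids):
        # Higher cosine similarity means a better match between utterance and candidate.
        return F.cosine_similarity(self.encode(utterance_ids),
                                   self.encode(candidate_ids))

# Score each candidate verbalisation against the utterance and keep the best;
# the winner (or the near-ties, if I am unsure) goes to the conversation manager.
matcher = ConvMatcher(vocab_size=1000)
utterance = torch.randint(0, 1000, (1, 7))
candidates = [torch.randint(0, 1000, (1, 5)) for _ in range(2)]
scores = [matcher(utterance, c).item() for c in candidates]
best = max(range(len(candidates)), key=lambda i: scores[i])
```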