The primary role of my conversation manager is to understand Steve’s goal and decide what to do next. It may take several turns of conversation before I can be sure that I fully understand what he wants me to do. Each time he speaks to me, my spoken language understanding component converts his words into an intent. The job of my conversation manager is to assimilate all those intents into a representation of his goal, and to respond appropriately at each turn in order to keep the conversation on track. This is a two-stage process.
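This assimilation step might be sketched in Python roughly as follows. The data layout, slot names and confidence handling here are illustrative assumptions, not my actual implementation: each intent contributes slot values with confidence scores, and the running belief keeps the most confident hypothesis for each slot.

```python
# A minimal sketch of assimilating turn-level intents into a goal
# representation. All names and structures are illustrative assumptions.

def update_belief_state(belief, intent):
    """Merge one turn's intent into the running belief state."""
    for slot, (value, conf) in intent.get("slots", {}).items():
        # Keep whichever hypothesis carries the higher confidence.
        if slot not in belief or conf > belief[slot][1]:
            belief[slot] = (value, conf)
    return belief

belief = {}
# Turn 1: "Set a timer" -> an intent naming the task but not its duration
belief = update_belief_state(
    belief, {"act": "Inform", "slots": {"task": ("set_timer", 0.9)}})
# Turn 2: "Ten minutes" -> an intent supplying the missing duration
belief = update_belief_state(
    belief, {"act": "Inform", "slots": {"duration": ("10 min", 0.8)}})
# belief now holds both slots, each paired with its confidence
```

After these two turns the belief holds both the task and its duration, and the goal is complete.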
Firstly, I have to make a high-level decision as to how I should respond. For example, if I am not confident that I have understood the information already given to me, I might generate a Confirm intent to confirm a specific piece of information. If I think information is missing, I will generate a Request intent to ask for something specific. If I think that the goal is complete, then I will generate an Execute intent to let him know that his request has been understood and then implement his goal. These high-level decisions are called dialogue acts and I use about 15 dialogue acts in total. In addition to the above, other common acts are Greet, Inform, Repeat, Help, Affirm, Negate, Correct and Interrupt.

I use a neural dialogue policy network to select the next dialogue act at each turn. This network takes as input a set of features derived from the belief network, various confidence scores, the existence of alternative hypotheses and the ontology stored in my knowledge graph. The network outputs the dialogue act which it thinks is most appropriate at this turn of the dialogue. The selected dialogue act is then converted into a system intent using a variety of heuristics. In the example, the policy network selects a Request act, and since there is a missing property in the belief state, the generated system intent requests a value for the missing property, in this case the duration of the requested timer.
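This first stage could be sketched as a small feed-forward network that scores dialogue acts from belief-state features, followed by a heuristic that turns the chosen act into a system intent. The architecture, feature vector and weights below are illustrative placeholders, not the real trained system:

```python
# A hedged sketch of the first stage: an (untrained) policy network
# scoring dialogue acts, plus a heuristic converting the chosen act
# into a system intent. All details here are illustrative assumptions.

import numpy as np

DIALOGUE_ACTS = ["Confirm", "Request", "Execute", "Greet", "Inform",
                 "Repeat", "Help", "Affirm", "Negate", "Correct", "Interrupt"]

rng = np.random.default_rng(0)              # placeholder, untrained weights
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, len(DIALOGUE_ACTS)))

def policy_network(features):
    """Map a belief-state feature vector to a score per dialogue act."""
    hidden = np.maximum(features @ W1, 0.0)    # ReLU hidden layer
    return hidden @ W2                         # one score per dialogue act

def act_to_intent(act, belief, required_slots):
    """Heuristic conversion of a dialogue act into a system intent."""
    missing = [s for s in required_slots if s not in belief]
    if act == "Request" and missing:
        return {"act": "Request", "slot": missing[0]}  # ask for the first gap
    return {"act": act}

# In the timer example the duration is still missing, so a Request act
# becomes a request for that specific property:
intent = act_to_intent("Request", {"task": "set_timer"}, ["task", "duration"])
# -> {"act": "Request", "slot": "duration"}
```

With trained weights, `policy_network` would favour Request whenever a required property is missing and Execute once the goal is complete.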
The second stage converts the abstract system intent into a natural language response. To do this, the intent is converted to an embedding, which then conditions a recurrent neural network decoder to generate a sequence of words. The encoder which produces the embedding and the decoder are trained on examples of responses from real dialogues in which the dialogue act has been labelled by human annotators. The policy network is trained using reinforcement learning, with the objective of maximising the success rate whilst keeping dialogues as short as possible. As always, much more information is given in my book!
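The mechanics of this second stage can be sketched as follows: the intent embedding initialises the hidden state of a simple recurrent decoder, which emits words greedily until it chooses an end token, and a toy reward captures the reinforcement learning objective of success minus dialogue length. The vocabulary, dimensions and (untrained) weights are illustrative placeholders, not the actual trained models:

```python
# A rough numpy sketch of intent-conditioned generation and of a
# success-minus-length RL reward. Everything here is illustrative.

import numpy as np

VOCAB = ["<end>", "how", "long", "should", "the", "timer", "run", "?"]
H = 16                                            # hidden / embedding size
rng = np.random.default_rng(1)
W_embed = rng.standard_normal((len(VOCAB), H))    # word embeddings
W_xh = rng.standard_normal((H, H)) * 0.1          # input-to-hidden weights
W_hh = rng.standard_normal((H, H)) * 0.1          # recurrent weights
W_out = rng.standard_normal((H, len(VOCAB)))      # hidden-to-vocab weights

def encode_intent(intent):
    """Toy intent encoder: a fixed pseudo-random embedding per intent."""
    key = hash((intent["act"], intent.get("slot"))) % (2**32)
    return np.random.default_rng(key).standard_normal(H)

def decode(intent, max_len=10):
    h = np.tanh(encode_intent(intent))    # intent embedding conditions h0
    x = np.zeros(H)                       # start-of-sentence input
    words = []
    for _ in range(max_len):
        h = np.tanh(x @ W_xh + h @ W_hh)  # recurrent state update
        idx = int(np.argmax(h @ W_out))   # greedy word choice
        if VOCAB[idx] == "<end>":
            break
        words.append(VOCAB[idx])
        x = W_embed[idx]                  # feed the chosen word back in
    return " ".join(words)

def reward(success, n_turns):
    """RL objective: reward task success, penalise dialogue length."""
    return (20.0 if success else 0.0) - 1.0 * n_turns   # illustrative weights
```

With trained weights the decoder would produce a fluent request such as "how long should the timer run?"; here the weights are random, so the sketch only shows the mechanics, and the reward weights are likewise arbitrary.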