SimpleTOD: A Simple Language Model For Task Oriented Dialogue

9 min read
SimpleTOD: A Simple Language Model For Task Oriented Dialogue

Conversational AI has been a long-standing area of exploration in computer science [1]. There are broadly two categories of dialogue:

Open-domain dialogue systems focus on making chit-chat, open-ended conversations with humans more natural and engaging. They are usually trained end-to-end using large-scale data from social media [2, 3].

Task-oriented dialogue (TOD) systems accomplish a goal described by a user in natural language. They often use a pipeline approach that employ a variety of modules, breaking the task into smaller sub-tasks [4]. Natural language understanding (NLU) is handled by belief state tracking modules and dictates which results are retrieved from external APIs (in our case, a database). Dialogue management (DM) modules decide which actions to take based on those beliefs, and natural language generation (NLG) modules generate responses.

Task-Oriented Dialogue (TOD)

Traditionally, each component of task-oriented dialogue systems is trained independently with separate supervision for each component. The NLU module is trained on domain and intent labels. The DM module employs dialogue belief and dialogue act labels, and the NLG module accesses templatized or natural responses.

Task-oriented dialogue system composes of belief state tracking (NLU, yellow), action decision (DM, red) and response generation (NLG, green) modules. SimpleTOD, as we discuss below, handles these three tasks with a single model (dotted box) rather than with multiple component modules.

The modular dependencies of these components can lead to error propagation when information is not provided to subsequent modules in the pipeline [5]. For example, many systems do not consider the entire dialogue history at every turn. Instead, they rely on the NLU module to pass belief states reliably to following components. This was the original motivation behind SimpleTOD.


We propose recasting task-oriented dialogue as a simple, causal (unidirectional) language modeling task. We show that such an approach can solve all the sub-tasks in a unified way using multi-task maximum likelihood training. The proposed Simple Task-Oriented Dialogue (SimpleTOD) approach enables modeling of the inherent dependencies between the sub-tasks of task-oriented dialogue, by optimizing for all tasks in an end-to-end manner.

SimpleTOD is a causal (unidirectional) language model

Because SimpleTOD still outputs interpretable, intermediate results that are typically associated with each sub-task, we can evaluate it on each sub-task independently be(as for dialogue state tracking) and all together (end-to-end).

SimpleTOD in the Wild:

We evaluate SimpleTOD with a human in a multi-domain dialogue. In this setting, SimpleTOD condition its response on its generation from previous turns.

Description: A human is tasked to request SimpleTOD for reserving a hotel, and then for booking a train as well.

Note: To lexicalize SimpleTOD responses at each turn, first the database is queried using generated belief state, and templatized placeholders are filled. In case of multiple results, a randomly selected one is chosen and suggested to human.

It is shown that SimpleTOD is able to understand human intent, and by requesting related information for hotel and train reservation, suggest useful information about available hotel and restaurant from database.

Dialogue State Tracking Task

Dialogue State Tracking is the more general term for the belief state tracking performed by SimpleTOD. For this task, the model aims to relate the unstructured user inputs and dialogue history to the structured format that allows it to query the database. It does so by specifying a dictionary of belief states, which consist of key names and values.

Below, we review SimpleTOD performance on long multi-domain dialogues from the MultiWoZ 2.1 dataset. At each turn, the current user turn and all pervious user/system turns are considered as dialogue context and given to the model as input.

Dialogue: MUL1015
No. of turns: 10
Domains: attraction, hotel, taxi

User asks about a specific attraction, and based on database results (not available), SimpleTOD is able to track related belief states and changes them based on the most recent user/system turns. The user then asks to book a hotel, and finally asks for a taxi between the attraction and hotel. The user does not mention taxi departure and destination names, but SimpleTOD is able to predict these belief states from hotel and attraction turns in the dialogue context.

Dialogue: MUL0671
No. of turns: 10
Domains: train, hotel

User requests a train and then to reserve a hotel. For hotel reservation, user asks for a booking for 5 nights. Then, the system mentions that hotel is not available, and the user changes the booking to 4 nights. SimpleTOD changes the value of booking people slot in each turn, based on the latest information provided by user or system from dialogue context.

Belief, Action and Response Generation in End-to-End setting:

In this section, we evaluate turn-based performance of SimpleTOD in a long multi-domain dialogue. We study the performance of SimpleTOD in generating action decisions and responses for each turn.

Dialogue PMUL 3293
No. of turns: 11
Domains: train, hotel
Description: User requests to book a train and then reserving a hotel at destination. At some turns, SimpleTOD generates different actions than the ground truth system. Results indicate that such different actions lead to requesting more information from user to narrow down the hotel search results or to accomplish task in fewer turns. When suggesting a hotel name, SimpleTOD also provides more the detail information about the hotel to the user, such as hotel star rate, location, internet availability, etc., compared to ground truth system.

turn 1
The dialogue begins with a user request.

User: hi , could you help me with my plans ? i am looking for a train .

For comparison, we can observe the SimpleTOD outputs based on the dialogue history so far:

SimpleTOD has generated a response that we can compare during evaluation to the ground truth System response:

System: i can help you with that . where will be departing and where do you want to go ?

During evaluation, the system response is always passed to the next turns. In this turn, they were identical, but this isn’t always the case, as in the next turn.

turn 2:
This is how the user followed up in the next turn when the dataset was created:

User: i will be departing from cambridge and going into ely on saturday .

SimpleTOD then generated the following beliefs, actions, and response:

SimpleTOD has generated a response that in this case differs from the ground truth system response:

System: trains run that route every 2 hours throughout the day . do you have a target departure or arrival time ?

but we can tell that SimpleTOD’s response is also achieving the same goals as specified by the actions that it has chosen to take. A minor point to note about evaluation: turns are treated all independently, so on the next turn SimpleTOD doesn’t see its own response from previous turns, but instead the history of user and system response. This is because we are evaluating on a fixed test set rather than with new dialogues with new human participants each time so that we can more easily compare methods without the confounding  effects of different users and dialogues.

turn 3:
The next user response was:

User: i would like to arrive in ely by 11:15 . could you please tell me the price and what the departure time is ?

This is the first turn that it becomes clear that SimpleTOD generates a delexicalized response with template placeholders that in real life, we would fill in based on the true database state.

Then, we lexicalized SimpleTOD response based on its generated belief states for direct comparison with ground truth system response

SimpleTOD: TR0718 departs at 9:50 and arrives at 10:07. the price is 3.52 pounds. would you like me to book it for you?!

System: TR0718 arrives at 10:07 on saturday . cost is 3.52 pounds . would you like me to reserve a seat for you ?

Comparing responses, it is clear that SimpleTOD provides the requested information by user (departure time), even though this is missing in the ground truth target system response.

As noted before, turns are treated all independently, so on the next turn SimpleTOD won’t see its own response. Instead it will see the lexicalized system response. This is for evaluating on the same dataset across methods, but if using SimpleTOD in the wild, we would lexicalize the SimpleTOD response with the true state of the database and use that as part of the dialogue history instead.

turn 4:

User: i need to book a room in the north as well .

SimpleTOD: there are 11 guesthouses and 2 hotel -s in the north. do you have a preference for star or a price range in mind?!

System: there are 2 hotels and 11 guesthouses to choose from in the north , ranging in price from cheap to moderate . do you have a price point preference ?

At this turn, SimpleTOD requests user preference on hotel price range and star rate to find a better match. However, the ground truth response only request information on price range only.

turn 5:

User: i would need a hotel with wifi

SimpleTOD: there are 12 options. do you have a preference for star rating, price range?!

System: there are 12 hotels that fit your criteria . do you want a moderately priced or cheap hotel ?

Since the user does not see SimpleTOD response, only the preference on hotel internet is provided. SimpleTOD requests information about hotel star rate and price range again, while ground truth system asks about price only.

turn 6:

User: i do not care about the price.i absolutely do need free parking, though.

SimpleTOD: how about the acorn guest house? it has free parking and wifi.!

System: do you have a star rating preference ?

Here, SimpleTOD suggests a hotel name based on enough information given by user. It also provide information about parking and star rate (preferred by user). However, ground truth system asks about hotel star rate at this turn, without suggesting any hotel name. Since, SimpleTOD asks more information at previous turn, this can lead to accomplishing the task sooner, compared to ground truth system.

turn 7:

User: i do not have a preference on the number of stars .

SimpleTOD: i would recommend the acorn guest house. it is 4 stars and has free parking and wifi. would you like a reservation?!

System: can i suggest the acorn guesthouse ? it s moderately priced and has 4 stars .

Note: SimpleTOD provides hotel name with detailed information, i.e., internet, parking, star, area (user preference) in its response, but ground truth system contains star rate information only.

turn 8:

User: i want it to be in the type of hotel . it should have wifi and parking too . and it must be in the north .

SimpleTOD: the lovell lodge is a 4 star hotel in the north with free parking and wifi.!

System: we have 2 option for you . we have the ashley hotel and the lovell hotel . both have your commendations you want .

Since user emphasis his preferences on hotel type, SimpleTOD suggests a different hotel name and provides the requested information, while ground truth system suggests two hotels and only provides overall information.

turn 9:

User: book me with the ashley hotel please .

SimpleTOD: how many people will be staying, and how many nights?!

System: i would be happy to do that . what is your arrival date , how many in your party , and how many nights ?

Compared to ground truth system response, SimpleTOD did not ask user for hotel arrival date. Perhaps, in the real setting where model is conditioned on its own response, the model will ask about arrival date at the next turn.


Paper: A Simple Language Model for Task-Oriented Dialogue


When referencing this work, please cite:

  title={A simple language model for task-oriented dialogue},
  author={Hosseini-Asl, Ehsan and McCann, Bryan and Wu, Chien-Sheng and Yavuz, Semih and Socher, Richard},
  journal={arXiv preprint arXiv:2005.00796},


[1] J. Gao, et al.  Neural approaches to conversational AI, Foundations and Trends in Information Retrieval, 2019.

[2] D. Adiwardana, et al., Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.

[3] S. Roller, et al., Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637, 2020.

[4] T.-H. Wen, et al., A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016.

[5] B. Liu and I. Lane., End-to-end learning of task-oriented dialogs. In NACCL 2018.