BotSIM: An End-to-End Automatic Evaluation Framework for Task-Oriented Dialog Systems

TL;DR: We present BotSIM, a data-efficient end-to-end Bot SIMulation toolkit for evaluation, diagnosis, and improvement of commercial task-oriented dialogue (TOD) systems. BotSIM's “generation-simulation-remediation” paradigm can accelerate the end-to-end bot evaluation and iteration process by: (1) reducing the effort needed to create test cases; (2) enabling a better understanding of both NLU and end-to-end performance via extensive dialogue simulation; and (3) improving the bot troubleshooting process with actionable suggestions from simulation results analysis.

Background

Task-oriented dialogue (TOD) systems are a class of chatbots that have become familiar to anyone who uses websites these days. These focused bots, designed to handle specific tasks (as opposed to general-purpose bots that can chat about any subject), now cover a wide range of applications. They are often deployed across industries to help customers complete specific tasks, such as booking a hotel or placing an online order.

However, TOD bot adoption is a double-edged sword. While a good chatbot can help complete customer transactions effectively and efficiently, saving both time and cost, a poor one may frustrate customers and reduce their willingness to engage with chatbots. It might even alter their perception of the business. That’s why it’s important to test these chatbots before deploying them to interact with real customers.

Challenges in Large-Scale Bot Testing

A TOD bot usually comprises a set of interwoven conversations or intents that interact with each other to define various task flows. Performing automatic end-to-end evaluation of such complex TOD systems is highly challenging and remains a largely manual process, especially for pre-deployment testing. However, manual testing is time-consuming, expensive, and difficult to scale, and it inevitably fails to capture the breadth of language variation present in the real world. In addition, troubleshooting and improving bot systems is a demanding task that requires expertise from a strong bot support team. This can pose a challenge, especially for companies with limited resources.

Although some platforms offer testing facilities, most of them focus on regression testing rather than end-to-end performance evaluation. Automatic tools for large-scale end-to-end evaluation and troubleshooting of TOD systems are highly desirable, yet largely lacking.

BotSIM Tests and Troubleshoots Commercial Task-Focused Chatbots Using AI

To address the above limitations, we developed BotSIM, a Bot SIMulation environment for data-efficient end-to-end commercial bot evaluation and remediation via dialogue user simulation. BotSIM is an AI-powered modular framework developed specifically to automate the end-to-end pre-deployment evaluation of commercial bots via dialogue simulation at scale.

BotSIM performs chatbot simulation and, in the process, can identify and help fix the issues it finds. Note, however, that BotSIM cannot guarantee a fix for every issue, as some issues may require bot re-training or re-design. The remediation suggestions are provided as guidelines for bot practitioners, rather than as a means of fixing all issues automatically.

BotSIM in a Nutshell

  • Generator: generates test dialogues by a process called paraphrasing (creating alternate wordings of sentences or phrases that have the same meaning). This process essentially creates synthetic data to use in the next phase.
  • Simulator: performs user simulation, using the paraphrased sentences (synthetic data) to test the bots.
  • Remediator: analyzes the simulated dialogues and produces bot health reports as well as actionable insights (conversation analytics, suggestions, recommendations) to help troubleshoot and improve the bot systems.


Note that the Generator yields substantial savings in the time, cost, and effort normally required for test data creation and annotation. Another time-saving component is the Simulator, which avoids having to chat with the bots manually.

In short, TOD chatbots are now everywhere, and interact with many business customers. But they should be tested thoroughly before being deployed, to ensure they don’t frustrate or turn off users. BotSIM enables this extensive testing to be done automatically, significantly reducing the human time and expense normally required — and also produces valuable feedback to help bot practitioners improve these dialogue systems where needed.

Deeper Dive

BotSIM Key Features

  • [Multi-stage bot evaluation] BotSIM can be used for both pre-deployment testing and, potentially, post-deployment performance monitoring.
  • [Data-efficient dialogue generation] Equipped with a deep-network-based paraphrasing model, BotSIM can generate an extensive set of test intent queries from a limited number of input intent utterances, which can be used to evaluate the bot intent model at scale.
  • [End-to-end bot evaluation via dialogue simulation] Through automatic chatbot simulation, BotSIM can identify existing issues of the bot and evaluate both the natural language understanding (NLU) performance (for instance, intent or NER error rates) and the end-to-end dialogue performance such as goal completion rates.
  • [Bot health report dashboard] The bot health report dashboard presents a multi-granularity, top-down view of bot performance, consisting of historical performance, current bot test performance, and dialogue-specific performance. Together with the analytical tools, it helps bot practitioners quickly identify the most urgent issues and properly plan their resources for troubleshooting.
  • [Easy extension to new bot platform] BotSIM was built with a modular task-agnostic design, with multiple platform support in mind, so it can be easily extended to support new bot platforms. (BotSIM currently supports Salesforce Einstein BotBuilder and Google DialogFlow CX.)

Bonus Features: What Sets BotSIM Apart

  • [Users will be “appy” (thanks to our readily deployable app)]: The whole system is deployed as an easy-to-use Web App to significantly flatten the learning curve for bot practitioners.

BotSIM Pipeline

The anatomy of BotSIM’s “generation-simulation-remediation” pipeline is shown in the figure below.  

  • The generator takes as inputs bot designs (for example, conversation flows, entities) as well as intent utterances, and automatically generates large-scale simulation goals via a paraphrasing model.
  • Simulation goals are used to perform large-scale dialogue user simulation for end-to-end bot evaluation.
  • After dialogue simulation, the remediator outputs a dashboard containing the bot health report and a suite of conversation analytical tools to help users better comprehend, troubleshoot, and improve the current system.
[Figure: the BotSIM “generation-simulation-remediation” pipeline architecture]

BotSIM Modules

Generator

From a dialogue system perspective, BotSIM can be viewed as a counterpart to a TOD chatbot: it needs to “understand” chatbot messages (NLU), “take” next-step actions (dialogue policy), and “respond” in natural language (natural language generation, or NLG). As shown in the previous figure, all these components can be automatically produced by the generator from the bot designs and the intent utterances.

Parser: automatic exploration of conversation designs to generate NLU dialogue act maps

One of the most important design principles of BotSIM is task-agnostic bot evaluation: supporting all bots built on a given platform. It is prohibitive for users to explore such complicated designs manually in order to create test cases covering the different conversation paths. BotSIM instead offers a “black-box” testing scheme that assumes no prior knowledge of the bot under test, via a platform-specific parser. The parser automatically converts dialogue designs into a unified representation, in the form of dialogue act maps, by modeling bot designs as graphs.
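
To make this concrete, here is a minimal, hypothetical sketch of what a dialogue act map might look like, paired with a simple substring matcher; the field names and matching logic are illustrative, not BotSIM's actual schema:

```python
# A hypothetical dialogue act map for one dialog: each dialogue act is keyed
# to the bot message patterns that signal it. Illustrative only.
dialogue_act_map = {
    "request_email": ["What is your email address?"],
    "request_case_number": ["Could you give me your case number?"],
    "dialog_success_message": ["Your issue status is"],
}

def match_dialogue_act(bot_message: str, act_map: dict) -> str:
    """Match an incoming bot message to a dialogue act via substring matching."""
    for act, patterns in act_map.items():
        if any(p.lower() in bot_message.lower() for p in patterns):
            return act
    return "unknown_act"

print(match_dialogue_act("Could you give me your case number?", dialogue_act_map))
# -> request_case_number
```

In a real parser, these maps would be derived automatically by traversing the platform's conversation flow graph rather than written by hand.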

Goal generation for pre-deployment testing

Another major merit of the generator module is its data-efficient test case generation capability. The test cases are encapsulated in a “goal” structure to be completed by the simulator during pre-deployment testing. A dialogue goal contains all the information needed to fulfil the task defined by the given dialog. Such information includes dialogue acts (such as “inform/request”) and entities (like “Case Number” or “Email”) taken from the dialog’s dialogue act map. In addition, a special “intent” slot is incorporated to probe the intent model. To increase test coverage and language variation, a paraphrasing model is trained for data-efficient intent query generation from the input intent utterances.
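
As a rough illustration of intent query generation, a fine-tuned T5 paraphraser (the Impacts section notes that BotSIM's paraphrasers are T5-based) could be driven along the following lines; the checkpoint name is a placeholder, not BotSIM's actual model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "your-org/t5-paraphraser"  # placeholder; any fine-tuned T5 paraphraser
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def paraphrase(utterance: str, n: int = 10) -> list:
    """Generate n paraphrases of one intent utterance via nucleus sampling."""
    inputs = tokenizer("paraphrase: " + utterance, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling encourages lexical/syntactic variety
        top_p=0.95,
        num_return_sequences=n,
        max_new_tokens=48,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Each original utterance fans out into many paraphrase intent queries.
queries = paraphrase("I want to check the status of my order")
```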

By filling the entity slots with different values, the generator can produce a large number of goal instances to be used in dialogue simulation, testing bot performance even before the bot is deployed.
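
A toy sketch of this slot-filling step, using a hypothetical goal schema (the field names are illustrative, not BotSIM's exact structure):

```python
import itertools

# Hypothetical entity value pools and paraphrased intent queries.
entity_values = {
    "Email": ["alice@example.com", "bob@example.com"],
    "Order Number": ["A1001", "A1002"],
}
intent_queries = ["Where is my order?", "Can you track my package?"]

def generate_goals():
    """Yield one goal instance per combination of intent query and entity values."""
    for query, email, order in itertools.product(
            intent_queries, entity_values["Email"], entity_values["Order Number"]):
        yield {
            "intent": query,                    # probes the intent model
            "inform_slots": {"Email": email, "Order Number": order},
            "request_slots": ["Order Status"],  # what the simulated user asks for
        }

goals = list(generate_goals())
print(len(goals))  # 2 x 2 x 2 = 8 goal instances from a tiny value pool
```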

Simulator: Agenda-based user simulation

With the dialogue act maps (NLU), simulation goals (state manager), and response templates (NLG), BotSIM can simulate users to “chat” with the bot to complete the tasks defined in the goals.

Each goal instance is used to simulate one episode of conversation. For NLU performance evaluation, the intent queries are used to test the intent model, and all other slots are used to probe the NER models. More importantly, end-to-end dialog-level performance (e.g., goal/task completion rates) can also be measured based on whether the test goals are successfully completed. The conversations are conducted via automatic API calls, avoiding the expensive and time-consuming effort of manual bot testing.

The figure below shows an example of how a dialogue turn between the bot and BotSIM is conducted via bot APIs. BotSIM invokes APIs to retrieve bot messages. Based on the dialogue acts matched by the dialogue act map NLU, the rule-based state manager applies the corresponding rules to generate the user dialogue acts. They are then converted to natural language responses by template NLG and sent back to the bot via APIs. The conversation ends when the task has been successfully finished or an error has been captured.

[Figure: an example dialogue turn between the bot and BotSIM, conducted via bot API calls]
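
Below is a minimal sketch of such a simulation loop, reusing the match_dialogue_act helper from the parser sketch above; bot_api stands in for a hypothetical platform API client, and the slot-naming convention is made up for illustration:

```python
def simulate_episode(goal, act_map, bot_api, max_turns=20):
    """Simulate one conversation episode for a single goal instance.

    `bot_api` is a hypothetical client exposing start_session/send/receive;
    a real integration would wrap the platform's chat APIs.
    """
    bot_api.start_session()
    bot_api.send(goal["intent"])                    # open with an intent query
    for _ in range(max_turns):
        message = bot_api.receive()                 # retrieve the bot message
        act = match_dialogue_act(message, act_map)  # dialogue act map as NLU
        if act == "dialog_success_message":
            return {"goal_completed": True}
        if act.startswith("request_"):
            # Rule-based state manager: answer a request act from the goal,
            # mapping e.g. "request_case_number" to the "Case Number" slot.
            slot = act.removeprefix("request_").replace("_", " ").title()
            bot_api.send(f"It is {goal['inform_slots'].get(slot, '')}")  # template NLG
        else:
            return {"goal_completed": False, "error": f"unmatched bot message: {message}"}
    return {"goal_completed": False, "error": "max turns exceeded"}
```

Aggregating `goal_completed` over all episodes yields the goal completion rate, while the per-turn intent and entity matches feed the NLU error analysis.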

Remediator: Multi-granularity bot health dashboard with insights and recommendations

Based on the simulated conversations, the remediator performs error analysis and performance aggregation to produce a holistic multi-granularity bot health report in an interactive dashboard:

  • Historical test performance comparison of previously finished testing sessions
  • Overall test session performance
  • Detailed dialogue-specific NLU performance, including intent and NER performance

The remediator also provides actionable insights to help users resolve some identified issues.  These recommendations include:

  • To improve intent models, wrongly classified paraphrase intent queries can be filtered and added to the original training set to retrain the intent model (a minimal sketch follows this list)
  • To reduce ambiguities among intents, intent utterances whose paraphrases are all predicted as another intent should be considered for moving to the training set of the predicted intent. If there are many such intent utterances, an intent redesign may be needed.
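
Here is a minimal sketch of the first recommendation, assuming a hypothetical per-query record format for the simulation logs (not BotSIM's actual output schema):

```python
# Hypothetical per-query simulation records; real BotSIM logs will differ.
simulation_results = [
    {"query": "track my parcel", "true_intent": "check_order", "predicted_intent": "check_issue"},
    {"query": "where is my order", "true_intent": "check_order", "predicted_intent": "check_order"},
]

def augment_training_set(train_original, results):
    """Fold wrongly classified paraphrase queries back under their true intent."""
    augmented = {intent: list(utts) for intent, utts in train_original.items()}
    for record in results:
        if record["predicted_intent"] != record["true_intent"]:
            augmented[record["true_intent"]].append(record["query"])
    return augmented

train_augmented = augment_training_set(
    {"check_order": ["check order status"], "check_issue": []}, simulation_results)
# Retrain the intent model on train_augmented, then re-run dialogue simulation.
```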

The remediator is also equipped with a suite of conversation analytical tools for better explainability of the simulation results. These tools help users better understand their current bot system and prioritize remediation efforts. They include an interactive confusion matrix analysis to identify the worst-performing intents, which also surfaces potential intent clusters for further examination.
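
As a rough illustration of the confusion matrix analysis, the toy sketch below computes per-intent recall from made-up simulated intent predictions:

```python
from sklearn.metrics import confusion_matrix

labels = ["check_order", "check_issue", "report_issue"]
# Made-up (true, predicted) intent labels from simulated queries.
y_true = ["check_order", "check_order", "check_issue", "report_issue"]
y_pred = ["check_order", "check_issue", "check_issue", "check_issue"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
support = cm.sum(axis=1)

# Low per-intent recall flags the worst-performing intents; large
# off-diagonal cells point at confusable intent pairs worth re-examining.
recall = [(lab, cm[i, i] / support[i]) for i, lab in enumerate(labels) if support[i]]
print(sorted(recall, key=lambda t: t[1]))
```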

Note that the remediation suggestions are meant to be used as guidelines rather than strictly followed. They can also be extended to incorporate domain expertise and shown to users via the dashboard.

Example Use Cases

End-to-End Evaluation of Salesforce Einstein BotBuilder

The “Template Bot” is the pre-built bot of the Salesforce Einstein BotBuilder platform. It has six dialogs with hand-crafted training utterances.  We keep 150 utterances per dialogue as the training set (train-original) for training the intent model and use the rest for evaluation (eval-original).  The six intents are: “Transfer to agent (TA)”, “End chat (EC)”, “Connect with sales (CS)”, “Check issue status (CI)”, “Check order status (CO)” and “Report an issue (RI)”.

We found that BotSIM can be used to perform data-efficient evaluation through paraphrasing and dialogue user simulation.

  • Goal generation and dialogue simulation. To simulate the pre-deployment testing scenario, the paraphrasing model is applied to the 150 intent training utterances to generate paraphrase intent queries, which are then included in the simulation goal instances to test the intent model via dialogue simulation. After paraphrasing, the resulting intent query set (“train-paraphrases”) is ten times the size of the original intent utterance set (“train-original”), so as to better capture the language variation in real user intent queries.
  • Apply remediation suggestions: intent model retraining with an augmented training set. Lastly, we can apply the remediation recommendations to improve the bot intent model. Here, we add the recommended misclassified intent paraphrases from the “train-paraphrases” set to the “train-original” set to form the “train-augmented” set, and retrain the intent model. Another round of dialogue simulation is performed to test the retrained intent model. We then compare the performance before and after retraining, reporting F1 with a 95% confidence interval computed from 10K bootstrapped samples (see the sketch after this list). We observe consistent improvements across all intents of the “eval-original” set after model retraining, especially for the most challenging intents (those with lower F1 scores), such as “Report an issue (RI)” and “Connect with sales (CS)”.
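
Below is a minimal sketch of the bootstrap protocol mentioned above: macro F1 with a 95% confidence interval estimated from 10K resamples of the evaluation predictions:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def f1_with_bootstrap_ci(y_true, y_pred, n_resamples=10_000, alpha=0.05):
    """Macro F1 plus a (1 - alpha) bootstrap confidence interval."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_resamples):
        # Resample the evaluation examples with replacement.
        idx = rng.integers(0, len(y_true), size=len(y_true))
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return f1_score(y_true, y_pred, average="macro"), (lo, hi)

# Compare (f1, ci) on eval-original before vs. after retraining on train-augmented.
```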

Multi-Intent Dialogue Generation for Pre-Deployment Testing of Google DialogFlow CX Platform

Although Google DialogFlow CX offers some testing facilities, they are designed for regression testing to ensure that previously developed bot models still behave correctly after a change. Users need to explore the conversation paths and chat with the bots manually to annotate and save the dialogs as the regression test cases. In this case study, we showed how BotSIM can enable pre-deployment testing and performance analysis of DialogFlow CX bots using the built-in “Financial Services” mega-bot.

The DialogFlow CX parser invokes APIs to parse and model the conversation flows as graphs. This enables BotSIM to automatically explore the conversation paths and generate multi-intent dialogs to cover such paths. Together with the intent paraphrases, the multi-intent goal instances can be curated for end-to-end pre-deployment evaluation of DialogFlow CX bots via dialogue simulation.
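
To illustrate the idea, the sketch below enumerates conversation paths over a made-up flow graph via depth-first search; the actual parser works over pages and routes retrieved through the DialogFlow CX APIs:

```python
# A made-up conversation flow graph; nodes are dialogs/pages, edges are
# transitions. This is not the "Financial Services" bot's real design.
flow_graph = {
    "Start": ["Check Balance", "Report Lost Card"],
    "Check Balance": ["Transfer Funds", "End"],
    "Report Lost Card": ["End"],
    "Transfer Funds": ["End"],
}

def enumerate_paths(graph, node="Start", path=None):
    """Depth-first enumeration of all conversation paths from Start to End."""
    path = (path or []) + [node]
    if node == "End":
        yield path
        return
    for nxt in graph.get(node, []):
        yield from enumerate_paths(graph, nxt, path)

for p in enumerate_paths(flow_graph):
    print(" -> ".join(p))
# Each path becomes a multi-intent dialog skeleton to wrap into goal instances.
```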

Results

Through the previous case studies, we showed that BotSIM’s streamlined “generation-simulation-remediation” paradigm significantly accelerates commercial chatbot development and evaluation.

In addition to speeding up the bot testing process, BotSIM can also widen its coverage. The DialogFlow CX example is a case in point: compared to the bot’s 70+ built-in test dialogues, BotSIM generates and simulates a total of 4,935 conversations, showing how it can greatly expand the testing range for commercial chatbots.

Bot practitioners can then use the remediator dashboard together with the built-in analytics panel to gauge bot performance and identify bot design or model-related issues.

Impacts

We believe that BotSIM has many positive impacts:

  • By greatly accelerating commercial bot development and evaluation, BotSIM’s paradigm should reduce human effort, cost, and time-to-market.
  • The actionable insights and recommendations it outputs help bot practitioners troubleshoot and improve their bot models.
  • BotSIM can be readily deployed locally or as a Heroku App, with easy-to-use interfaces that flatten the learning curve for bot admins and other bot practitioners, significantly lowering the entry barrier to pre-deployment bot evaluation.

However, there is also the potential for negative impacts:

  • The pretrained-language-model-based paraphrasers (T5-based) used in this study are pretrained and fine-tuned on large text corpora scraped from the web, which may contain biases.
  • These biases may be propagated to the generated paraphrases, causing harm to the subjects of those stereotypes.
  • Although the paraphrasing models are only applied to generate test intent queries, BotSIM users are advised to take these ethical issues into consideration and may wish to manually inspect or otherwise filter the generated paraphrases.

The Bottom Line: Summary and Future Directions

  • TOD chatbots are now everywhere and interact with many business customers. But they should be tested thoroughly before being deployed, to ensure they don’t frustrate or turn off users.
  • BotSIM enables this extensive testing to be done automatically using AI, significantly reducing the human time and expense normally required, and it also produces valuable feedback to help bot practitioners improve these dialogue systems where needed. BotSIM is a modular, data-efficient dialogue generation and simulation framework targeted at large-scale end-to-end evaluation of commercial TOD systems (task-oriented chatbots).
  • Currently, BotSIM supports Salesforce Einstein BotBuilder and Google DialogFlow CX; thanks to its modular design, it can be easily extended to support new bot platforms. Multi-platform support will be one part of our future work.
  • More advanced NLU and NLG models can also be incorporated in the future to increase the robustness of the NLU model and the naturalness of the NLG model, and more analytics and recommendations can be added to the remediation dashboard as new sections/pages.
  • We welcome any constructive feedback and contributions from the open-source community to help improve BotSIM.

Explore More


Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (see links below). Connect with us on social media and our website to get regular updates on this and other research projects.

About the Authors

Guangsen Wang is a Senior Applied Scientist at Salesforce Research Asia, working on conversational AI research, including task-oriented dialogue systems and automatic speech recognition.

Steven C.H. Hoi is Managing Director of Salesforce Research Asia and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.

Glossary

  • Entity - Entities are used for identifying and extracting information from dialogue turns. For example, “Email” is an entity representing the “type” of all possible email addresses provided by users. The extracted email addresses are the “values” of the entity.
  • Dialogue act - Originating in linguistics, the term denotes the function of a dialogue turn, such as “request” or “inform”. Here we use the term in a slightly different way: the “dialogue acts” in this post refer to the combination of actions such as “inform” or “request” with entities. For example, “request_email” is deemed a dialogue act.
  • Intent - The term represents the customer’s goal when engaging with a bot. It serves as a basic conversation design unit comprising multiple turns of conversation. For example, an online shopping bot can have intents such as “check order status”, “change delivery address”, etc. “Intent” and “dialog” are often used interchangeably.
  • NER - Named-Entity Recognition (NER) is the natural language understanding task of locating named entities in text and classifying them into pre-defined classes, such as names, locations, organizations, currencies, etc. It is used primarily for the information extraction needed by many NLP tasks, such as dialog systems and question answering.
  • Dialogue goal (agenda) - We use dialogue goal and agenda interchangeably to represent all the necessary information needed to complete an intent or dialog. A goal consists of the user dialogue acts and entity values related to the intent/dialog. For example, in order to complete the goal of making changes to an online order, the goal will contain required information such as the user’s email address and order number.