BotSIM: An End-to-End Automatic Evaluation Framework for Task-Oriented Dialog Systems


TL;DR: We present BotSIM, a data-efficient end-to-end Bot SIMulation toolkit for evaluation, diagnosis, and improvement of commercial task-oriented dialogue (TOD) systems. BotSIM's “generation-simulation-remediation” paradigm can accelerate the end-to-end bot evaluation and iteration process by: (1) reducing the effort needed to create test cases; (2) enabling a better understanding of both NLU and end-to-end performance via extensive dialogue simulation; and (3) improving the bot troubleshooting process with actionable suggestions from simulation results analysis.


Task-oriented dialogue (TOD) systems are a class of chatbots that have become quite familiar to anyone who uses websites these days. These focused bots, designed to handle specific tasks (as opposed to general-purpose bots that can chat about any subject), can now handle a wide range of applications. They are often deployed by various industries to help their customers complete certain tasks, such as booking a hotel, online shopping, and so on.  

However, TOD bot adoption is a double-edged sword. While a good chatbot can help complete customer transactions effectively and efficiently, saving both time and cost, a poor one may frustrate customers and reduce their willingness to engage with chatbots. It might even alter their perception of the business. That’s why it’s important to test these chatbots before deploying them to interact with real customers.

Challenges in Large-Scale Bot Testing

A TOD bot usually comprises a set of interwoven conversations or intents interacting with each other to define various task flows. End-to-end evaluation of such complex TOD systems is highly challenging and remains a largely manual process, especially for pre-deployment testing. Manual testing, however, is time-consuming, expensive, and difficult to scale, and it inevitably fails to capture the breadth of language variation present in the real world. In addition, troubleshooting and improving bot systems is a demanding task that requires expertise from a strong bot support team. This can pose a challenge, especially for companies with limited resources.

Although some platforms offer testing facilities, most of them focus on regression testing rather than end-to-end performance evaluation. Automatic tools for large-scale end-to-end evaluation and troubleshooting of TOD systems are highly desirable, yet largely lacking.

BotSIM Tests and Troubleshoots Commercial Task-Focused Chatbots Using AI

To address the above limitations, we developed BotSIM, a ChatBot SIMulation environment for data-efficient end-to-end commercial bot evaluation and remediation via dialogue user simulation. BotSIM is an AI-powered modular framework specially developed to automate the end-to-end pre-deployment evaluation of commercial bots via dialogue simulation at scale.

BotSIM performs chatbot simulation and, during this process, can identify issues and help fix them. Note, however, that BotSIM cannot guarantee a fix for every issue, as some may require bot re-training or re-design. The remediation suggestions are provided as guidelines for bot practitioners, rather than as a means to fix all issues automatically.

BotSIM in a Nutshell

  • Generator: generates test dialogues by a process called paraphrasing (creating alternate wordings of sentences or phrases that have the same meaning). This process essentially creates synthetic data to use in the next phase.
  • Simulator: performs user simulation, using the paraphrased sentences (synthetic data) to test the bots.
  • Remediator: analyzes the simulated dialogues and produces bot health reports as well as actionable insights (conversation analytics, suggestions, recommendations) to help troubleshoot and improve the bot systems.

Note that the Generator process results in substantial savings in the time, cost, and effort normally required for test data creation and annotation. Another time-saving component is the Simulator, which removes the need to chat with the bots manually.

In short, TOD chatbots are now everywhere, and interact with many business customers. But they should be tested thoroughly before being deployed, to ensure they don’t frustrate or turn off users. BotSIM enables this extensive testing to be done automatically, significantly reducing the human time and expense normally required — and also produces valuable feedback to help bot practitioners improve these dialogue systems where needed.

Deeper Dive

BotSIM Key Features

  • [Multi-stage bot evaluation] BotSIM can be used for pre-deployment testing and, potentially, for post-deployment performance monitoring.
  • [Data-efficient dialogue generation] Equipped with a deep-network-based paraphrasing model, BotSIM can generate an extensive set of test intent queries from a limited number of input intent utterances, which can be used to evaluate the bot intent model at scale.
  • [End-to-end bot evaluation via dialogue simulation] Through automatic chatbot simulation, BotSIM can identify existing issues in the bot and evaluate both the natural language understanding (NLU) performance (for instance, intent or NER error rates) and the end-to-end dialogue performance, such as goal completion rates.
  • [Bot health report dashboard] The bot health report dashboard presents a multi-granularity top-down view of bot performance consisting of historical performance, current bot test performance, and dialogue-specific performance. Together with the analytical tools, the dashboard helps bot practitioners quickly identify the most urgent issues and properly plan their resources for troubleshooting.
  • [Easy extension to new bot platform] BotSIM was built with a modular task-agnostic design, with multiple platform support in mind, so it can be easily extended to support new bot platforms. (BotSIM currently supports Salesforce Einstein BotBuilder and Google DialogFlow CX.)

Bonus Features: What Sets BotSIM Apart

  • [Users will be “appy” (thanks to our readily deployable app)]: The whole system is deployed as an easy-to-use Web App to significantly flatten the learning curve for bot practitioners.

BotSIM Pipeline

The anatomy of BotSIM’s “generation-simulation-remediation” pipeline is shown in the figure below.  

  • The generator takes as inputs bot designs (for example, conversation flows, entities) as well as intent utterances, and automatically generates large-scale simulation goals via a paraphrasing model.
  • The simulator uses these goals to perform large-scale dialogue user simulation for end-to-end bot evaluation.
  • After dialogue simulation, the remediator outputs a dashboard containing the bot health report and a suite of conversation analytical tools to help users better comprehend, troubleshoot, and improve the current system.

BotSIM Modules


From a dialogue system perspective, BotSIM can be viewed as a counterpart to a TOD chatbot: it needs to “understand” chatbot messages (NLU), “take” next-step actions (dialogue policy), and “respond” in natural language (natural language generation, or NLG). As shown in the previous figure, all these components can be automatically produced by the generator from the bot designs and the intent utterances.

Parser: Automatic exploration of conversation designs to generate NLU dialogue act maps

One of the most important design principles of BotSIM is task-agnostic bot evaluation, so that it can support all bots on a given bot platform. Bot conversation designs can be complicated, and it is prohibitive for users to explore them manually in order to create test cases covering different conversation paths. BotSIM offers a “black-box” testing scheme and, through a platform-specific parser, assumes no prior knowledge of the bot under test. The parser automatically converts dialogue designs to a unified representation in the form of dialogue act maps by modeling bot designs as graphs.
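To make the graph view more concrete, here is a minimal sketch that models a toy bot design as a directed graph and enumerates the conversation paths a test suite would need to cover. The dialog names, the `DESIGN` dictionary, and the `conversation_paths` helper are all hypothetical; this is not BotSIM's parser or its output format, only an illustration of the idea.

```python
from typing import Dict, List

# Hypothetical bot design: each dialog maps to the dialogs it can hand off to.
DESIGN: Dict[str, List[str]] = {
    "Welcome": ["Check Order Status", "Report Issue"],
    "Check Order Status": ["Ask Order Number"],
    "Ask Order Number": ["End Chat"],
    "Report Issue": ["Ask Email", "End Chat"],
    "Ask Email": ["End Chat"],
    "End Chat": [],
}


def conversation_paths(design: Dict[str, List[str]], start: str, end: str) -> List[List[str]]:
    """Depth-first search for all simple (cycle-free) paths from start to end."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        for nxt in design.get(node, []):
            if nxt not in path:  # skip dialogs already visited to avoid loops
                stack.append((nxt, path + [nxt]))
    return sorted(paths)


paths = conversation_paths(DESIGN, "Welcome", "End Chat")
for p in paths:
    print(" -> ".join(p))
```

Even in this small example there are three distinct paths to cover, which hints at why manual exploration becomes impractical for real bot designs.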

Goal generation for pre-deployment testing

Another major merit of the generator module is its data-efficient test-case generation capability. The test cases are encapsulated in a “goal” structure to be completed by the simulator for pre-deployment testing. A dialogue goal contains all the information needed to fulfil the task defined by the given dialogue. This information includes dialogue acts (such as “inform/request”) and entities (like “Case Number”, “Email”) from the dialogue’s dialogue act maps. In addition, the special “intent” slot is incorporated to probe the intent model. To increase the test coverage and language variation, a paraphrasing model is trained for data-efficient intent query generation from the input intent utterances.

By filling the entity slots with different values, the generator can produce a large number of goal instances to be used for dialogue simulation to test the bot performance even before the bot is deployed.
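The slot-filling idea can be sketched as follows. The goal layout used here (`name`, `inform_slots`, `request_slots`), the sample values, and the `generate_goals` helper are assumptions made for illustration and may differ from BotSIM's actual goal schema; the second intent query simply stands in for a paraphrase.

```python
from itertools import product

# Hypothetical inputs: intent queries (original plus a stand-in paraphrase)
# and candidate values for each entity slot.
intent_queries = [
    "Where is my order?",
    "Can you tell me the status of my order?",
]
entity_values = {
    "Case Number": ["00123", "00456"],
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
}


def generate_goals(dialog_name, intent_queries, entity_values):
    """Build one goal per combination of intent query and entity values."""
    slot_names = list(entity_values)
    goals = []
    for query, values in product(intent_queries, product(*entity_values.values())):
        inform_slots = dict(zip(slot_names, values))
        inform_slots["intent"] = query  # special slot used to probe the intent model
        goals.append(
            {"name": dialog_name, "inform_slots": inform_slots, "request_slots": {}}
        )
    return goals


goals = generate_goals("Check Order Status", intent_queries, entity_values)
print(len(goals))  # 2 queries x 2 case numbers x 3 emails = 12
print(goals[0])
```

A handful of utterances and entity values multiplies into many goal instances, which is where the data efficiency comes from.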

Simulator: Agenda-based user simulation

With the dialogue act maps (NLU), simulation goals (state manager), and response templates (NLG), BotSIM can simulate users to “chat” with the bot to complete the tasks defined in the goals.

Each goal instance is used to simulate one conversation episode. For NLU performance evaluation, the intent queries are used to test the intent model and all other slots are used to probe the NER models. More importantly, end-to-end dialogue-level performance (e.g., goal/task completion rates) can also be obtained, based on whether the test goals have been successfully completed. The conversations are conducted via automatic API calls, avoiding expensive and time-consuming manual bot testing.
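As a rough illustration of how such metrics could be aggregated, the sketch below computes an intent error rate and a goal completion rate from a list of episode records. The record fields and the `summarize` helper are hypothetical, and the four episodes are made-up toy data rather than real simulation results.

```python
# Hypothetical per-episode records; field names are illustrative only.
episodes = [
    {"intent_expected": "Check Order Status", "intent_predicted": "Check Order Status", "goal_completed": True},
    {"intent_expected": "Check Order Status", "intent_predicted": "Report Issue", "goal_completed": False},
    {"intent_expected": "Report Issue", "intent_predicted": "Report Issue", "goal_completed": True},
    {"intent_expected": "Report Issue", "intent_predicted": "Report Issue", "goal_completed": False},
]


def summarize(episodes):
    """Aggregate NLU-level and dialogue-level metrics over simulated episodes."""
    total = len(episodes)
    if total == 0:
        raise ValueError("no episodes to summarize")
    intent_errors = sum(e["intent_expected"] != e["intent_predicted"] for e in episodes)
    completed = sum(e["goal_completed"] for e in episodes)
    return {
        "intent_error_rate": intent_errors / total,
        "goal_completion_rate": completed / total,
    }


report = summarize(episodes)
print(report)
```

Note that the last episode has a correct intent prediction but still fails its goal, which is the kind of gap that NLU metrics alone would miss.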

The figure below shows an example of how a dialogue turn between the bot and BotSIM is conducted via bot APIs. BotSIM invokes APIs to retrieve bot messages. Based on the dialogue acts matched by the dialogue act map NLU, the rule-based state manager applies the corresponding rules to generate the user dialogue acts. They are then converted to natural language responses by template NLG and sent back to the bot via APIs. The conversation ends when the task has been successfully finished or an error has been captured.
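The turn-taking loop described above can be sketched in a few lines. Everything here is a hypothetical toy: `ToyBot` stands in for a bot reached through platform APIs (and follows a fixed script that ignores user input), while `DIALOG_ACT_MAP`, `TEMPLATES`, and the helper functions are simplified stand-ins for the dialogue act map NLU, rule-based state manager, and template NLG. None of these names come from BotSIM itself.

```python
# Hypothetical dialogue act map: dialogue act -> phrases found in bot messages.
DIALOG_ACT_MAP = {
    "request_intent": ["how can i help"],
    "request_Email": ["your email"],
    "request_Case Number": ["case number"],
    "dialog_success": ["your case has been found"],
    "intent_failure": ["sorry, i didn't understand"],
}

# Hypothetical response templates, one per entity slot.
TEMPLATES = {
    "Email": "My email is {value}.",
    "Case Number": "The case number is {value}.",
}


def match_dialog_act(bot_message):
    """NLU step: map a bot message to a dialogue act by phrase matching."""
    text = bot_message.lower()
    for act, phrases in DIALOG_ACT_MAP.items():
        if any(p in text for p in phrases):
            return act
    return "unknown"


def next_user_message(act, goal):
    """State manager rules plus template NLG; returns None if no rule applies."""
    if act == "request_intent":
        return goal["intent"]
    if act.startswith("request_"):
        slot = act[len("request_"):]
        return TEMPLATES[slot].format(value=goal[slot])
    return None


class ToyBot:
    """Scripted stand-in for a real bot API client."""

    def __init__(self):
        self.script = iter([
            "Hi! How can I help you today?",
            "Sure. What is your email?",
            "Thanks. What is the case number?",
            "Your case has been found.",
        ])

    def get_message(self):
        return next(self.script)

    def send_message(self, text):
        pass  # a real client would post the user message to the bot API


def simulate(bot, goal, max_turns=10):
    """Run one episode until success, an error, or the turn limit."""
    transcript = []
    for _ in range(max_turns):
        bot_message = bot.get_message()
        transcript.append(("bot", bot_message))
        act = match_dialog_act(bot_message)
        if act == "dialog_success":
            return "success", transcript
        reply = next_user_message(act, goal)
        if reply is None:  # unknown act or failure message from the bot
            return "error", transcript
        bot.send_message(reply)
        transcript.append(("user", reply))
    return "error", transcript


goal = {"intent": "I want to check my case status", "Email": "a@example.com", "Case Number": "00123"}
outcome, transcript = simulate(ToyBot(), goal)
print(outcome)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

In this sketch an unmatched bot message or a failure act ends the episode with an error, mirroring the stopping condition described above.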