TL;DR: We present BotSIM, a data-efficient end-to-end Bot SIMulation toolkit for evaluation, diagnosis, and improvement of commercial task-oriented dialogue (TOD) systems. BotSIM's “generation-simulation-remediation” paradigm can accelerate the end-to-end bot evaluation and iteration process by: (1) reducing the effort needed to create test cases; (2) enabling a better understanding of both NLU and end-to-end performance via extensive dialogue simulation; and (3) improving the bot troubleshooting process with actionable suggestions from simulation results analysis.
Task-oriented dialogue (TOD) systems are a class of chatbots that have become familiar to anyone who browses the web these days. These focused bots, designed for specific tasks (as opposed to general-purpose bots that can chat about any subject), now power a wide range of applications. Businesses across many industries deploy them to help customers complete certain tasks, such as booking a hotel or shopping online.
However, TOD bot adoption is a double-edged sword. While a good chatbot can help customers complete transactions effectively and efficiently, saving both time and cost, a poor one may frustrate customers, reduce their willingness to engage with chatbots, and even alter their perception of the business. That’s why it’s important to test these chatbots before deploying them to interact with real customers.
A TOD bot usually comprises a set of interwoven conversations or intents interacting with each other to define various task flows. Performing automatic end-to-end evaluation of such complex TOD systems is highly challenging and is still a largely manual process, especially for pre-deployment testing. However, manual testing is time-consuming, expensive, difficult to scale — and inevitably fails to capture the breadth of language variation present in the real world. In addition, troubleshooting and improving bot systems is a demanding task and requires expertise from a strong bot support team. This can pose a challenge — especially for companies with limited resources.
Although some platforms offer testing facilities, most of them focus on regression testing rather than end-to-end performance evaluation. Automatic tools for large-scale end-to-end evaluation and troubleshooting of TOD systems are highly desirable, yet largely lacking.
To address the above limitations, we developed BotSIM, a ChatBot SIMulation environment for data-efficient end-to-end commercial bot evaluation and remediation via dialogue user simulation. BotSIM is an AI-powered modular framework specially developed to automate the end-to-end pre-deployment evaluation of commercial bots via dialogue simulation at scale.
BotSIM performs chatbot simulation and, in the process, can identify issues and help fix them. Note, however, that BotSIM cannot guarantee fixes for all issues, as some may require bot re-training or re-design. The remediation suggestions are provided as guidelines for bot practitioners, rather than as a means to fix all issues automatically.
BotSIM in a Nutshell
Note that the Generator yields substantial savings in the time, cost, and effort normally required for test data creation and annotation. Another time-saving component is the Simulator, which eliminates the need to chat with the bots manually.
In short, TOD chatbots are now everywhere, and interact with many business customers. But they should be tested thoroughly before being deployed, to ensure they don’t frustrate or turn off users. BotSIM enables this extensive testing to be done automatically, significantly reducing the human time and expense normally required — and also produces valuable feedback to help bot practitioners improve these dialogue systems where needed.
Bonus Features: What Sets BotSIM Apart
The anatomy of BotSIM’s “generation-simulation-remediation” pipeline is shown in the figure below.
Generator
From a dialogue system perspective, BotSIM can be viewed as a counterpart to a TOD chatbot: it needs to “understand” chatbot messages (NLU), “take” next-step actions (dialogue policy), and “respond” in natural language (natural language generation, or NLG). As shown in the previous figure, all of these components can be automatically produced by the generator from the bot designs and the intent utterances.
Parser: automatic exploration of conversation designs to generate NLU dialogue act maps
One of the most important design principles of BotSIM is task-agnostic bot evaluation: supporting all bots built on a given platform. It is prohibitive for users to explore such complicated designs manually in order to create test cases covering the different conversation paths. Instead, BotSIM offers a “black-box” testing scheme that assumes no prior knowledge of the bot under test, enabled by a platform-specific parser. The parser automatically converts dialogue designs into a unified representation in the form of dialogue act maps by modeling bot designs as graphs.
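To make the idea concrete, here is a minimal Python sketch of what a parsed dialogue act map might look like and how an incoming bot message could be matched to a dialogue act. The dialogue name, message strings, and act labels are illustrative assumptions, not BotSIM's actual schema.

```python
# A minimal, hypothetical sketch of a parsed dialogue act map for one dialogue.
# All names ("Check_Order_Status", "request_Case_Number", ...) are illustrative
# and do not reflect BotSIM's actual representation.
dialog_act_map = {
    "Check_Order_Status": {
        "How can I help you today?":   "intent_request",
        "What is your case number?":   "request_Case_Number",
        "Please provide your email.":  "request_Email",
        "Your order is on the way.":   "dialog_success_message",
        "Sorry, I didn't understand.": "intent_error_message",
    }
}

def match_dialog_act(bot_message: str, act_map: dict, dialog: str = "Check_Order_Status") -> str:
    """Map an incoming bot message to a dialogue act (exact match for brevity;
    a real matcher would use fuzzy or semantic matching)."""
    return act_map[dialog].get(bot_message, "unknown_act")
```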
Goal generation for pre-deployment testing
Another major merit of the generator module is its data-efficient test case generation capability. The test cases are encapsulated in a “goal” structure to be completed by the simulator for pre-deployment testing. A dialogue goal contains all the information needed to fulfill the task defined by the given dialogue, including dialogue acts (such as “inform”/“request”) and entities (like “Case Number” or “Email”) taken from the dialogue’s dialogue act maps. In addition, a special “intent” slot is incorporated to probe the intent model. To increase test coverage and language variation, a paraphrasing model is trained for data-efficient generation of intent queries from the input intent utterances.
By filling the entity slots with different values, the generator can produce a large number of goal instances to be used in dialogue simulation, testing bot performance even before the bot is deployed.
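As a rough illustration, the sketch below shows a hypothetical goal template and how its slots could be expanded into goal instances. The slot names, values, and (act, value) pairing are assumptions for illustration only.

```python
import itertools

# Hypothetical goal template for the "Check_Order_Status" dialogue.
# The special "intent" slot holds paraphrased intent queries used to probe the
# intent model; the remaining slots are entities the bot will request.
goal_template = {
    "intent":      ["Where is my order?", "I want to track my package"],
    "Case_Number": ["00123456", "00987654"],
    "Email":       ["alice@example.com", "bob@example.com"],
}

def generate_goal_instances(template: dict):
    """Expand a goal template into concrete goal instances, one per simulated episode."""
    slots = list(template)
    for values in itertools.product(*(template[s] for s in slots)):
        # Each slot is paired with an "inform" dialogue act and a concrete value.
        yield {slot: ("inform", value) for slot, value in zip(slots, values)}

goals = list(generate_goal_instances(goal_template))
print(len(goals))   # 2 x 2 x 2 = 8 goal instances
print(goals[0])
```

Each resulting instance drives one simulated conversation, so even a handful of paraphrases and entity values already yields many distinct test dialogues.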
Simulator
With the dialogue act maps (NLU), simulation goals (state manager), and response templates (NLG), BotSIM can simulate users that “chat” with the bot to complete the tasks defined in the goals.
Each goal instance is used to simulate one episode of conversation. For NLU performance evaluation, the intent queries are used to test the intent model, and all other slots are used to probe the NER models. More importantly, end-to-end dialogue-level performance (e.g., goal/task completion rates) can also be obtained based on whether the test goals have been successfully completed. The conversations are conducted via automatic API calls, saving the expensive and time-consuming effort of manual bot testing.
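For instance, the per-episode simulation records could be aggregated into NLU and end-to-end metrics along these lines (a minimal sketch with made-up field names and values):

```python
# Hypothetical per-episode records produced by the simulator; field names are illustrative.
episodes = [
    {"intent_predicted": "Check_Order_Status", "intent_expected": "Check_Order_Status",
     "goal_completed": True},
    {"intent_predicted": "Report_Issue",       "intent_expected": "Check_Order_Status",
     "goal_completed": False},
]

intent_accuracy = sum(e["intent_predicted"] == e["intent_expected"] for e in episodes) / len(episodes)
goal_completion_rate = sum(e["goal_completed"] for e in episodes) / len(episodes)
print(f"intent accuracy = {intent_accuracy:.2f}, goal completion rate = {goal_completion_rate:.2f}")
```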
The figure below shows an example of how a dialogue turn between the bot and BotSIM is conducted via the bot APIs. BotSIM invokes the APIs to retrieve bot messages. Based on the dialogue acts matched by the dialogue act map NLU, the rule-based state manager applies the corresponding rules to generate the user dialogue acts. These are then converted into natural language responses by the template NLG and sent back to the bot via the APIs. The conversation ends when the task has been successfully finished or an error has been captured.
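Below is a minimal sketch of such a turn loop for one episode, reusing the match_dialog_act helper and goal-instance format from the earlier sketches. The bot_client object, its methods, and the NLG templates are hypothetical stand-ins for the platform APIs, not BotSIM's actual interfaces.

```python
def simulate_episode(bot_client, goal, act_map, nlg_templates, max_turns=20):
    """Simulate one conversation episode for a single goal instance (illustrative only)."""
    session = bot_client.start_session()                 # hypothetical platform API wrapper
    for _ in range(max_turns):
        bot_message = bot_client.get_message(session)     # retrieve the bot message via API
        act = match_dialog_act(bot_message, act_map)      # dialogue-act-map "NLU"

        if act == "dialog_success_message":               # task finished successfully
            return {"goal_completed": True, "error": None}
        if act.endswith("error_message"):                 # e.g. an intent or NER error was captured
            return {"goal_completed": False, "error": act}

        # Rule-based state manager: choose the user dialogue act for the matched bot act.
        if act == "intent_request":                       # bot asks what the user wants
            _, user_response = goal["intent"]             # send a paraphrased intent query
        elif act.startswith("request_"):                  # bot requests an entity slot
            slot = act[len("request_"):]
            _, value = goal[slot]
            user_response = nlg_templates["inform"].format(slot=slot, value=value)
        else:
            user_response = nlg_templates["fallback"]

        bot_client.send_message(session, user_response)   # template-NLG reply sent back via API
    return {"goal_completed": False, "error": "max_turns_exceeded"}
```

In practice, each goal instance produced by the generator would be run through a loop like this, and the returned records aggregated into the NLU and task-completion metrics described above.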