TL;DR: We present BotSIM, a data-efficient end-to-end Bot SIMulation toolkit for evaluation, diagnosis, and improvement of commercial task-oriented dialogue (TOD) systems. BotSIM's “generation-simulation-remediation” paradigm can accelerate the end-to-end bot evaluation and iteration process by: (1) reducing the effort needed to create test cases; (2) enabling a better understanding of both NLU and end-to-end performance via extensive dialogue simulation; and (3) improving the bot troubleshooting process with actionable suggestions from simulation results analysis.
Task-oriented dialogue (TOD) systems are a class of chatbots now familiar to anyone who browses the web. Unlike general-purpose bots that can chat about any subject, these focused bots are designed to handle specific tasks, and they now cover a wide range of applications. Businesses across many industries deploy them to help customers complete tasks such as booking a hotel or shopping online.
However, TOD bot adoption is a double-edged sword. While a good chatbot can help complete customer transactions effectively and efficiently, saving both time and cost, a poor one may frustrate customers, reduce their willingness to engage with chatbots, and even alter their perception of the business. That’s why it’s important to test these chatbots before deploying them to interact with real customers.
A TOD bot usually comprises a set of interwoven conversations or intents interacting with each other to define various task flows. Performing automatic end-to-end evaluation of such complex TOD systems is highly challenging and is still a largely manual process, especially for pre-deployment testing. However, manual testing is time-consuming, expensive, difficult to scale — and inevitably fails to capture the breadth of language variation present in the real world. In addition, troubleshooting and improving bot systems is a demanding task and requires expertise from a strong bot support team. This can pose a challenge — especially for companies with limited resources.
Although some platforms offer testing facilities, most of them focus on regression testing rather than end-to-end performance evaluation. Automatic tools for large-scale end-to-end evaluation and troubleshooting of TOD systems are highly desirable, yet largely lacking.
To address the above limitations, we developed BotSIM, a ChatBot SIMulation environment for data-efficient end-to-end commercial bot evaluation and remediation via dialogue user simulation. BotSIM is an AI-powered modular framework specially developed to automate the end-to-end pre-deployment evaluation of commercial bots via dialogue simulation at scale.
BotSIM performs chatbot simulation and, during this process, can identify and fix issues it finds. Note, however, that BotSIM cannot guarantee to fix all issues, as some of them may require bot re-training or re-design. The remediation suggestions are provided as guidelines for bot practitioners, rather than as a means to fix all issues automatically.
BotSIM in a Nutshell
Note that the Generator results in substantial savings in the time, cost, and effort normally required for test data creation and annotation. Another time-saving component is the Simulator, which avoids having to chat with the bots manually.
In short, TOD chatbots are now everywhere, and interact with many business customers. But they should be tested thoroughly before being deployed, to ensure they don’t frustrate or turn off users. BotSIM enables this extensive testing to be done automatically, significantly reducing the human time and expense normally required — and also produces valuable feedback to help bot practitioners improve these dialogue systems where needed.
Bonus Features: What Sets BotSIM Apart
The anatomy of BotSIM’s “generation-simulation-remediation” pipeline is shown in the figure below.
Generator
From a dialogue system perspective, BotSIM can be viewed as a counterpart to a TOD chatbot: it needs to “understand” chatbot messages (NLU), “take” next-step actions (dialogue policy), and “respond” in natural language (natural language generation, or NLG). As shown in the previous figure, all these components can be automatically produced by the generator from the bot designs and the intent utterances.
Parser: automatic exploration of conversation designs to generate NLU dialogue act maps
One of the most important design principles of BotSIM is task-agnostic bot evaluation: supporting all bots on a given bot platform. It would be prohibitively expensive for users to explore such complicated designs manually in order to create test cases covering the different conversation paths. BotSIM instead offers a “black-box” testing scheme and assumes no prior knowledge of the bot under test, thanks to a platform-specific parser. The parser automatically converts dialogue designs to a unified representation in the form of dialogue act maps by modeling bot designs as graphs.
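To make the idea of a dialogue act map concrete, here is a minimal sketch of how a parsed bot design could be turned into one. The design schema, function names, and the "inform"/"dialog_success" act labels are illustrative assumptions, not BotSIM's actual API.

```python
# Hypothetical sketch: deriving a "dialogue act map" from a parsed bot
# design, linking each bot message to the dialogue act a simulated user
# should respond with. The schema below is a toy stand-in.
from collections import defaultdict

# A toy bot design: each dialog step names the bot message and the
# entity (slot) it requests from the user, if any.
bot_design = {
    "check_order_status": [
        {"message": "What is your order number?", "request": "Order Number"},
        {"message": "What email did you use?", "request": "Email"},
        {"message": "Here is your order status.", "request": None},
    ],
}

def build_dialogue_act_map(design):
    """Map each bot message to the user dialogue act that answers it."""
    act_map = defaultdict(dict)
    for dialog, steps in design.items():
        for step in steps:
            if step["request"]:
                # The bot requests an entity, so the user should "inform" it.
                act_map[dialog][step["message"]] = ("inform", step["request"])
            else:
                # No entity requested: treat as a terminal/success message.
                act_map[dialog][step["message"]] = ("dialog_success", None)
    return act_map

act_map = build_dialogue_act_map(bot_design)
print(act_map["check_order_status"]["What is your order number?"])
```

In the real toolkit this mapping is produced by the platform-specific parser from the bot's exported design, so the same simulation machinery can run against any bot on that platform.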
Goal generation for pre-deployment testing
Another major merit of the generator module is its data-efficient test case generation capability. The test cases are encapsulated in a “goal” structure to be completed by the simulator for pre-deployment testing. A dialogue goal contains all the information needed to fulfil the task as defined by the given dialog. Such information includes dialogue acts (such as “inform/request”) and entities (like “Case Number”, “Email”) from the dialogue’s dialogue act maps. In addition, a special “intent” slot is incorporated to probe the intent model. To increase test coverage and language variation, a paraphrasing model is trained for data-efficient intent query generation from the input intent utterances.
By filling the entity slots with different values, the generator can produce a large number of goal instances to be used for dialogue simulation to test the bot performance even before the bot is deployed.
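The slot-filling step above can be sketched as follows. The goal template, slot values, and paraphrases here are invented for illustration; BotSIM's actual goal schema may differ.

```python
# Illustrative sketch: instantiating many goal instances from one goal
# template by combining intent paraphrases with entity slot values.
import itertools

goal_template = {
    "intent": "Check order status",
    "inform_slots": ["Order Number", "Email"],
}

# Hypothetical value pools for each entity slot.
slot_values = {
    "Order Number": ["00123", "00456"],
    "Email": ["a@example.com", "b@example.com"],
}

# Paraphrases of the intent query (produced by the paraphrasing model).
intent_paraphrases = [
    "Where is my order?",
    "I want to track my package",
]

def generate_goals(template, values, paraphrases):
    """Cross paraphrases with every combination of slot values."""
    goals = []
    combos = itertools.product(*(values[s] for s in template["inform_slots"]))
    for query, combo in itertools.product(paraphrases, list(combos)):
        goal = {"intent_query": query}
        goal.update(dict(zip(template["inform_slots"], combo)))
        goals.append(goal)
    return goals

goals = generate_goals(goal_template, slot_values, intent_paraphrases)
print(len(goals))  # 2 paraphrases x 2 x 2 slot values = 8 goal instances
```

Each resulting goal instance drives one simulated conversation, so even this tiny template yields eight distinct test episodes.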
With the dialogue act maps (NLU), simulation goals (state manager), and response templates (NLG), BotSIM can simulate users to “chat” with the bot to complete the tasks defined in the goals.
Each goal instance is used to simulate one episode of conversation. For NLU performance evaluation, the intent queries are used to test the intent model and all other slots are used to probe the NER models. More importantly, the end-to-end dialog-level performance (e.g., goal/task completion rates) can also be obtained depending on whether the testing goals have been successfully completed. The conversations are performed via automatic API calls, saving the expensive and time-consuming effort of manual bot testing.
The figure below shows an example of how a dialogue turn between the bot and BotSIM is conducted via bot APIs. BotSIM invokes APIs to retrieve bot messages. Based on the dialogue acts matched by the dialogue act map NLU, the rule-based state manager applies the corresponding rules to generate the user dialogue acts. They are then converted to natural language responses by template NLG and sent back to the bot via APIs. The conversation ends when the task has been successfully finished or an error has been captured.
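The turn loop just described can be sketched in a few lines. The bot "API" below is a local stub standing in for the platform API calls, and the act map, templates, and rule logic are simplified assumptions.

```python
# Minimal sketch of one simulated dialogue episode: retrieve the bot
# message, match its dialogue act, apply a rule to produce the user act,
# render it with a template, and repeat until success or an error.

def bot_api(turn):
    """Stub standing in for the real chatbot platform API."""
    prompts = ["What is your order number?", "What email did you use?",
               "Here is your order status."]
    return prompts[turn]

# NLU: bot message -> (dialogue act, requested entity)
act_map = {
    "What is your order number?": ("request", "Order Number"),
    "What email did you use?": ("request", "Email"),
    "Here is your order status.": ("dialog_success", None),
}

# Template NLG: requested entity -> natural-language response template
templates = {
    "Order Number": "My order number is {value}.",
    "Email": "You can reach me at {value}.",
}

goal = {"Order Number": "00123", "Email": "a@example.com"}

def simulate_episode(goal, max_turns=10):
    transcript, success = [], False
    for turn in range(max_turns):  # cap the episode length
        bot_msg = bot_api(turn)
        act, entity = act_map[bot_msg]
        if act == "dialog_success":
            success = True
            break
        # Rule-based state manager: answer the requested entity from the goal.
        user_msg = templates[entity].format(value=goal[entity])
        transcript.append((bot_msg, user_msg))
    return success, transcript

ok, transcript = simulate_episode(goal)
print(ok, len(transcript))
```

Aggregating the `success` flag over thousands of such episodes is what yields the goal/task completion rates mentioned above.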
Based on the simulated conversations, the remediator performs error analysis and performance aggregation to produce a holistic multi-granularity bot health report in an interactive dashboard:
The remediator also provides actionable insights to help users resolve some identified issues. These recommendations include:
The remediator is also equipped with a suite of conversation analytical tools for better explainability of the simulation results. The tools can help users better understand their current bot system and prioritize remediation efforts. They include interactive confusion matrix analysis to identify the worst performing intents. The analysis also identifies some potential intent clusters for further examination.
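As a rough illustration of the confusion-matrix analysis, the snippet below computes per-intent recall from (intended, predicted) pairs and flags the worst-performing intent. The data is made up; the intent names echo the Template Bot example discussed later.

```python
# Hedged sketch of remediator-style confusion analysis over simulated
# intent queries: count (intended, predicted) pairs and rank intents
# by recall to find where remediation effort should go first.
from collections import Counter

# (intended intent, predicted intent) pairs from simulated conversations;
# these results are fabricated for illustration.
results = [
    ("Check order status", "Check order status"),
    ("Check order status", "Check issue status"),  # a confusion
    ("Check issue status", "Check issue status"),
    ("End chat", "End chat"),
    ("Check order status", "Check order status"),
]

confusion = Counter(results)
per_intent_total = Counter(intended for intended, _ in results)

def recall(intent):
    """Fraction of an intent's queries that were classified correctly."""
    return confusion[(intent, intent)] / per_intent_total[intent]

worst = min(per_intent_total, key=recall)
print(worst, round(recall(worst), 2))
```

Off-diagonal mass in the full matrix (here, "Check order status" predicted as "Check issue status") is also what surfaces the potential intent clusters worth merging or disentangling.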
Note that the remediation suggestions are meant to be used as guidelines rather than strictly followed. They can also be extended to incorporate domain expertise and shown to users via the dashboard.
The “Template Bot” is the pre-built bot of the Salesforce Einstein BotBuilder platform. It has six dialogs with hand-crafted training utterances. We keep 150 utterances per dialogue as the training set (train-original) for training the intent model and use the rest for evaluation (eval-original). The six intents are: “Transfer to agent (TA)”, “End chat (EC)”, “Connect with sales (CS)”, “Check issue status (CI)”, “Check order status (CO)” and “Report an issue (RI)”.
We found that BotSIM can be used to perform data-efficient evaluation through paraphrasing and dialogue user simulation.
Although Google DialogFlow CX offers some testing facilities, they are designed for regression testing to ensure that previously developed bot models still behave correctly after a change. Users need to explore the conversation paths and chat with the bots manually to annotate and save the dialogs as the regression test cases. In this case study, we showed how BotSIM can enable pre-deployment testing and performance analysis of DialogFlow CX bots using the built-in “Financial Services” mega-bot.
The DialogFlow CX parser invokes APIs to parse and model the conversation flows as graphs. This enables BotSIM to automatically explore the conversation paths and generate multi-intent dialogs to cover such paths. Together with the intent paraphrases, the multi-intent goal instances can be curated for end-to-end pre-deployment evaluation of DialogFlow CX bots via dialogue simulation.
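Enumerating conversation paths over the parsed flow graph can be sketched as below. The flow graph is a toy stand-in for a parsed DialogFlow CX design, not its real page/route structure.

```python
# Illustrative sketch: enumerating all start-to-end conversation paths in
# a parsed flow graph, so a multi-intent goal can be generated per path.

# Toy flow graph: node -> list of successor nodes.
flow_graph = {
    "start": ["verify_identity"],
    "verify_identity": ["check_balance", "report_fraud"],
    "check_balance": ["end"],
    "report_fraud": ["transfer_agent", "end"],
    "transfer_agent": ["end"],
}

def enumerate_paths(graph, node="start", path=None):
    """Depth-first enumeration of all paths from `node` to "end"."""
    path = (path or []) + [node]
    if node == "end":
        return [path]
    paths = []
    for successor in graph.get(node, []):
        paths.extend(enumerate_paths(graph, successor, path))
    return paths

paths = enumerate_paths(flow_graph)
print(len(paths))  # 3 distinct conversation paths to cover
```

Pairing each enumerated path with intent paraphrases and slot values is what turns a handful of flows into thousands of simulated test conversations.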
Through the previous case studies, we showed that BotSIM’s streamlined “generation-simulation-remediation” paradigm significantly accelerates commercial chatbot development and evaluation.
In addition to speeding up the bot testing process, BotSIM can also broaden its coverage. The DialogFlow CX example is a case in point: compared to the bot’s 70+ built-in test dialogues, BotSIM generated and simulated a total of 4,935 conversations, greatly expanding the testing range for commercial chatbots.
Bot practitioners can then use the remediator dashboard together with the built-in analytics panel to gauge bot performance and identify bot design or model-related issues.
We feel that BotSIM provides many positive impacts:
However, there is also the potential for negative impacts:
Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (see links below). Connect with us on social media and our website to get regular updates on this and other research projects.
Guangsen Wang is a Senior Applied Scientist at Salesforce Research Asia, working on conversational AI research, including task-oriented dialogue systems and automatic speech recognition.
Steven C.H. Hoi is Managing Director of Salesforce Research Asia and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.