Developing reliable LLM-powered insight summarization for Tableau Pulse

12 Sep 2024

Divyansh Agarwal

San Francisco

Chien-Sheng Wu

Lara Thompson

Vancouver, BC

Homer Wang

Seattle, WA

TL;DR: Salesforce AI Research and Tableau AI collaborated to build the Pulse insight summary feature, generally available (GA) to all Tableau Cloud customers starting in early 2024. The feature combines generative AI with data analytics to deliver automated, personalized insight summaries of key metric trends. This blog outlines the development process for this LLM-based summarization feature in Tableau Pulse.

Tableau Pulse marks a significant advancement in the data analytics and business intelligence landscape and has benefited Tableau Cloud customers since its launch. The Insights platform in Pulse lets business users track their most important metrics and provides automatic insights using Generative AI. The insights service leverages the power of large language models (LLMs) to generate natural language summaries of metric trends and delivers them to Tableau users. Summaries of metric changes convey actionable information to users about the data metrics most relevant to them, helping pinpoint what they need to focus on quickly and upfront.

Pulse allows users to standardize their metrics and create a unified source of truth for all data sources across the organization. It enables them to write customizable metric definitions that may refer to a specific business context. Tableau users can define their metrics to capture concepts such as ROI, Sales, Orders, Churn, etc., from their organizational data. They can specify the time dimensions along which these metrics are tracked, and indicate whether an increase in the metric would be favorable or unfavorable. We generate insights from these metrics, personalized to the user, which convey key metric trends such as Unusual Changes, Period over Period Change (PoPC), etc. The insights help users discover new opportunities, get ahead of issues, and make better decisions. Our focus was to use LLMs to generate insight summaries across different metrics that can be delivered to users in Pulse, or directly by email or Slack. This ensures that the user sees important patterns, changes, and outliers in the data first. A collaboration between Tableau AI and Salesforce AI Research has been key in successfully building the components of Pulse that interact with an LLM to generate these rich and insightful summaries. The core of our work was to ensure that these LLM-generated insight summaries are personalized, factually correct, and extend the value of Pulse metrics.

User-Centric Development

Pulse insight summarization is the first feature to bring generative AI to the Tableau platform. Our goal was to use LLMs to generate a factually correct and engaging summary of the metrics tracked by a business user in Pulse. Pulse ranks metric insights by most notable changes. Our task was to take the top 3 relevant metric insights for the user and convert those into natural language summaries using an LLM.
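As a rough illustration of the selection step described above, the sketch below picks the three most notable insights from a ranked pool. The `MetricInsight` fields and the `notability` score are hypothetical stand-ins for illustration; this post does not describe how Pulse actually scores or ranks insights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricInsight:
    # All fields are illustrative assumptions, not the Pulse schema.
    metric_name: str    # user-chosen metric name, e.g. "Sales"
    insight_type: str   # e.g. "PoPC" or "Unusual Change"
    time_grain: str     # e.g. "week" or "month"
    notability: float   # hypothetical score; higher means more notable


def top_insights(insights, k=3):
    """Return the k most notable insights, highest score first."""
    return sorted(insights, key=lambda i: i.notability, reverse=True)[:k]


pool = [
    MetricInsight("Sales", "PoPC", "week", 0.2),
    MetricInsight("Churn", "Unusual Change", "month", 0.9),
    MetricInsight("Orders", "PoPC", "month", 0.5),
    MetricInsight("ROI", "PoPC", "week", 0.7),
]
selected = top_insights(pool)
```

Only the selected insights are passed on to the summarization step; everything else in the pool is ignored for that summary.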

Tableau’s UX research team conducted focused surveys with a pilot set of users to gather specific user needs and preferences. This was crucial in determining the right set of constraints for the summary to be aligned with user expectations. The pilot studies found that users would really benefit from an insight summarization feature and that it would enrich their experience on the platform.

The feedback from the pilot study pointed towards reliability as the core requirement for a desirable summary. To be reliable, Pulse insight summaries had to accurately represent different insight types across changing time grains. Secondly, the pilot users indicated a preference for low verbosity and high ‘interestingness’ in the insight summaries. Tone preferences were variable but, generally, a professional tone was preferable, and simple formatting was deemed ideal. We identified specific parameters and set thresholds for these requirements to guide our development and evaluation process.

Users also wanted completeness in the summaries generated by Pulse: the insights service identifies the most relevant metrics for each user, and the insight summaries must then cover all facts from the metrics being summarized. A deep understanding of what our users desired from an insight summarization feature in Tableau was the cornerstone of the development process.

Key Technical Challenges

The specific task we needed to solve was getting an LLM to summarize a collection of insights from multiple unique metrics. Metric changes are represented by various facts, including multiple numerical values, an associated period of change, and a user-defined sentiment to express the change. We designed LLM prompts to generate summaries of a set of metric insights, experimenting with mixing and separating insight types and time grains. The prompt instructions guide the LLM to adhere to our established constraints, including user preferences and formatting tags. The formatting tags allow us to ensure that certain components, such as user-chosen metric names, are preserved verbatim in the insight summary, regardless of form or typos. They also give us the option to perform other post-processing actions on the summary. Our prompts include in-context examples to guide the LLM towards a desirable summary.
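To make the idea of formatting tags concrete, here is a minimal sketch of how a tag could be used to check that metric names survive verbatim and then be removed in post-processing. The `<metric>` tag name and all function names are assumptions made for illustration; the post does not specify the actual tag syntax.

```python
import re

# Hypothetical tag syntax; the real tags used in Pulse are not described here.
METRIC_TAG = re.compile(r"<metric>(.*?)</metric>", re.DOTALL)


def tag_metric(name: str) -> str:
    """Wrap a user-chosen metric name so it can be tracked through the LLM."""
    return f"<metric>{name}</metric>"


def missing_metric_names(summary: str, expected_names) -> list:
    """List expected names that do not appear verbatim inside a tag."""
    found = set(METRIC_TAG.findall(summary))
    return [name for name in expected_names if name not in found]


def strip_tags(summary: str) -> str:
    """Post-processing step: remove the tags and keep the enclosed text."""
    return METRIC_TAG.sub(lambda m: m.group(1), summary)


# A user-chosen name containing a typo should be kept exactly as entered.
llm_output = "<metric>Salse (EMEA)</metric> rose 12% this week."
problems = missing_metric_names(llm_output, ["Salse (EMEA)"])
display_text = strip_tags(llm_output)
```

Because the check compares exact strings, a summary in which the model "corrected" the name to "Sales (EMEA)" would be flagged rather than silently accepted.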

Our experiments revealed that directly translating a collection of numeric metric values and associated metadata into natural language summaries, while satisfying all task constraints, is a challenging task for LLMs. This was due to the complexity of the different groups of metric insights to be summarized: in the resulting summary, facts would often be modified incorrectly, omitted entirely, or repeated.

We found several corner cases that LLMs found especially difficult to summarize, where the resulting factual accuracy did not meet the requirements. For instance, a group of insights for the same metric name could span different time grains, each conveying a different type of insight (Unusual Change, PoPC, etc.), yet the LLM would summarize them as a single insight about the same metric. Preserving the sentiment associated with each metric insight in the resulting summary was also challenging; a group of metrics could have conflicting sentiments (‘Profits’ may be favorably up but ‘Customer Retention’ might be unfavorably low!), which need to be accurately contrasted in the LLM-generated summary.

Iterative Alignment

To tackle these challenges, we opted for an intermediate insight templating step. We convert the metric insight facts into natural language using multiple templates, depending on the fact combinations of the metrics being summarized. We found that LLMs can summarize the templated insights much more easily than the raw metric facts. This reduced factual errors, helped satisfy the other task constraints, and greatly improved the fluency of the summary.
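A minimal sketch of such a templating step is shown below, assuming one template per insight type and a small dictionary of facts per insight. The template wording, the field names, and the sentiment rule are illustrative assumptions, not the templates used in Pulse.

```python
# Hypothetical templates keyed by insight type.
TEMPLATES = {
    "PoPC": (
        "{name} {verb} {pct:.1f}% compared to the previous {grain}, "
        "from {prev:g} to {cur:g}, which is {sentiment}."
    ),
    "Unusual Change": (
        "{name} is unusually {level} this {grain} at {cur:g}, "
        "compared with {prev:g} previously, which is {sentiment}."
    ),
}


def render_insight(fact: dict) -> str:
    """Turn one set of metric facts into a templated sentence.

    Assumes the previous value is non-zero and that the value changed.
    """
    cur, prev = fact["current"], fact["previous"]
    if prev == 0:
        raise ValueError("previous value must be non-zero for a percent change")
    went_up = cur > prev
    # The change is favorable when its direction matches the user's preference.
    favorable = went_up == (fact["favorable_direction"] == "up")
    return TEMPLATES[fact["insight_type"]].format(
        name=fact["metric_name"],
        grain=fact["time_grain"],
        cur=cur,
        prev=prev,
        pct=abs(cur - prev) / abs(prev) * 100,
        verb="increased" if went_up else "decreased",
        level="high" if went_up else "low",
        sentiment="favorable" if favorable else "unfavorable",
    )


profits = render_insight({
    "metric_name": "Profits", "insight_type": "PoPC", "time_grain": "month",
    "current": 120.0, "previous": 100.0, "favorable_direction": "up",
})
retention = render_insight({
    "metric_name": "Customer Retention", "insight_type": "Unusual Change",
    "time_grain": "week", "current": 80.0, "previous": 90.0,
    "favorable_direction": "up",
})
```

Computing the direction and sentiment deterministically in the template means the LLM only has to rephrase and combine sentences, rather than derive those facts from raw numbers.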

In the LLM prompting step, we prompt a fixed LLM to generate a summary of the top 3 intermediate insights, according to the outlined requirements. Given the diverse set of constraints, designing and tuning the prompt to generate aligned insight summaries was a non-trivial task. We designed in-context examples to guide the LLM towards generating more ‘aligned’ summaries. Our in-context examples specifically teach the LLM how to summarize ‘difficult’ metric groups, such as those combining different insight types and alternating sentiments. They were carefully designed to convey the tone and verbosity expected by our users.
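The prompt structure described above could be assembled along the following lines. This is only a sketch: the instruction text, the section labels, and the `build_prompt` helper are hypothetical, and the actual prompt used in Pulse is not given in this post.

```python
# Hypothetical instruction text reflecting the constraints named in this post.
INSTRUCTIONS = (
    "Summarize the insights below in a concise, professional tone. "
    "Cover every insight, keep tagged metric names unchanged, "
    "and contrast favorable and unfavorable changes."
)


def build_prompt(insights, examples, max_insights=3):
    """Assemble instructions, in-context examples, and the insights to summarize.

    `examples` is a list of (templated_insights, reference_summary) pairs.
    """
    if not insights:
        raise ValueError("at least one insight is required")
    parts = [INSTRUCTIONS, ""]
    for number, (example_insights, example_summary) in enumerate(examples, 1):
        parts.append(f"Example {number}")
        parts.extend(f"- {sentence}" for sentence in example_insights)
        parts.append(f"Summary: {example_summary}")
        parts.append("")
    parts.append("Insights")
    parts.extend(f"- {sentence}" for sentence in insights[:max_insights])
    parts.append("Summary:")
    return "\n".join(parts)


examples = [(
    ["Profits increased 20.0% this month, which is favorable.",
     "Customer Retention is unusually low this week, which is unfavorable."],
    "Profits are up 20% this month, but Customer Retention is unusually low this week.",
)]
prompt = build_prompt(
    ["Sales decreased 5.0% this week, which is unfavorable."], examples
)
```

The example pair deliberately mixes insight types and sentiments, mirroring the kind of 'difficult' group the in-context examples are meant to teach.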

We tune two knobs in our development pipeline to satisfy the task constraints: the insight templates themselves and the prompt instructions. We evaluate the resulting summaries after each round of tuning using both human-in-the-loop and automated evaluations. Developing the insight summarization pipeline required alternating between tuning and evaluation, with the evaluations guiding updates to each knob. This iterative alignment process helped bring the summary quality up to the desired standard.

Evaluation

At the center of the iterative alignment process was our evaluation layer, which informs how we tune the LLM prompt and update the intermediate insight templates. Every round of tuning the LLM prompt and/or the insight templates involved an evaluation step. We created a comprehensive evaluation set consisting of metrics in various domains with randomized fact sets. We included combinations of metric insights that cover cases such as different insight types over the same time period, trends for the same metric with conflicting sentiment across time grains, etc. This allowed us to measure how well the task constraints are satisfied across different types of metric combinations, and to tune our knobs for specific cases.
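One way to build an evaluation set of this kind is sketched below: enumerate combinations of insight type and time grain, then fill each slot with randomized facts from a seeded generator so the set is reproducible. The value ranges, metric names, and field names are assumptions for illustration and do not reflect the real evaluation data.

```python
import itertools
import random

# Illustrative value sets; the real evaluation set is not described in detail.
INSIGHT_TYPES = ("PoPC", "Unusual Change")
TIME_GRAINS = ("day", "week", "month")
METRIC_NAMES = ("Sales", "Churn", "Orders", "ROI")


def make_eval_cases(seed=0, group_size=3):
    """Build groups of insights covering every combination of type and grain."""
    rng = random.Random(seed)  # seeded so the evaluation set is reproducible
    slots = list(itertools.product(INSIGHT_TYPES, TIME_GRAINS))
    cases = []
    for combo in itertools.combinations(slots, group_size):
        group = []
        for insight_type, grain in combo:
            previous = round(rng.uniform(50, 500), 2)
            # Relative change between 5% and 50%, up or down.
            change = rng.uniform(0.05, 0.5) * rng.choice([-1, 1])
            group.append({
                "metric_name": rng.choice(METRIC_NAMES),
                "insight_type": insight_type,
                "time_grain": grain,
                "previous": previous,
                "current": round(previous * (1 + change), 2),
                "favorable_direction": rng.choice(["up", "down"]),
            })
        cases.append(group)
    return cases


cases = make_eval_cases(seed=1)
```

Because metric names and favorable directions are drawn at random, some groups will contain the same metric at different grains or with conflicting sentiment, which are the cases highlighted above as hardest for the LLM.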

We collect both human annotations and automated metrics on this diverse evaluation set to evaluate summary quality for each round of evaluation. While human annotation looks at aspects like tone, the semantics of the summary, hallucinations, etc., automated metrics check for verbosity, format verification, and other deterministic aspects of the requirements. Each round of evaluation provided us with specific feedback about the LLM behavior on our task. After each round, we optimized both the LLM prompt and the intermediate insight templates to align with the requirements. We modify the in-context examples to guide the LLM for specific edge cases. We repeated this process to iteratively align the intermediate insight templates and the prompt until the quality metrics met our acceptability thresholds.
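The deterministic side of the evaluation lends itself to simple programmatic checks. The sketch below shows what checks for verbosity, formatting, and metric coverage might look like; the word limit, the `<metric>` tag, and the specific rules are assumed values for illustration, not the thresholds used for Pulse.

```python
import re


def automated_checks(summary: str, metric_names, max_words=60) -> dict:
    """Run deterministic checks on one generated summary.

    The word limit and tag syntax are hypothetical.
    """
    # Remove the hypothetical formatting tags before measuring the text.
    text = re.sub(r"</?metric>", "", summary)
    return {
        # Verbosity: the summary stays under a word budget.
        "within_word_limit": len(text.split()) <= max_words,
        # Format: every opening tag has a matching closing tag.
        "tags_balanced": summary.count("<metric>") == summary.count("</metric>"),
        # Coverage: every metric being summarized is mentioned by name.
        "all_metrics_mentioned": all(name in text for name in metric_names),
        # Simple formatting: no markdown markers or bullet lines.
        "plain_formatting": re.search(
            r"[*#`]|^\s*[-•]\s", summary, re.MULTILINE
        ) is None,
    }


good = (
    "<metric>Profits</metric> rose 20% this month, while "
    "<metric>Customer Retention</metric> is unusually low this week."
)
report = automated_checks(good, ["Profits", "Customer Retention"])
```

Aspects such as tone, semantics, and hallucinations are not covered by checks like these and are left to human annotation, as described above.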

Impact

Pulse insight summarization brings the power of generative AI to the analytics space by delivering intelligent, personalized, and contextual insights on the metrics that matter to you. Some notable impacts of our efforts:

First Generative AI & LLM feature in Tableau and the first collaboration between Tableau Engineering and Salesforce Research
Making splashes at TC23, DF23, World Tour NYC 23, TC24, among other strategic events
Reaching >5K customers at various scales since launch and helping tens of thousands of business users whose job is not analytics get to their relevant metrics and insights faster
Consistently getting a >70% positive feedback rate on summary contents