Recommendation systems are common in the consumer world. For example, Netflix, YouTube, and other companies use these systems to recommend items you would probably like, based on data about you - such as what items you've consumed (e.g., watched) before.
But recommendation systems are not just for consumers. Enterprises use them as well, such as to recommend apps that customers might want, based on data about what they like or actions they have taken before.
This blog presents a new framework we've developed to improve the diversity and explainability of these enterprise app recommendation systems. But before we dive into our framework, let's go over some background basics.
Salesforce AppExchange is the leading enterprise cloud-based app marketplace for developers and independent software vendor (ISV) partners to sell the apps built on the Salesforce platform, to help customers achieve measurable business goals in the cloud. In this work, we focus on an enterprise app recommendation problem with a new business use case, which aims to find a match between these three parties:
In short, ISVs have applications, customers have business problems, and the sales teams serve as a bridge between the two, recommending applications to solve customer problems.
The above process works well in theory, but in practice, it may not always lead to an optimal outcome. Various issues can arise during the process of deciding on which applications to recommend. In general, manually analyzing a customer's preferences and selecting relevant apps has a number of inherent drawbacks:
To address the issues outlined above, we believe app recommendation needs a data-driven approach: one that improves and optimizes this task and, in particular, increases the diversity and explainability of the recommendations.
We have developed a novel framework that achieves these improvements by both improving aggregate recommendation diversity and generating recommendation explanations.
Our primary goal in developing our new framework is to assist the sales team in finding apps most relevant to their customers, allowing them to interact with the system and obtain more information, such as:
The benefits of our framework include:
For a specific customer, here is how our system works (the main steps taken):
Different from other recommendation tasks, the sales team can:
Below is a wider view of the entire system, with the previous figure seen in the lower left:
The heart of our solution lies in its models, which consist of two main types:
The relevance model learns the “similarity” between a user and an item, while the DAE models aim to control aggregate diversity and generate recommendation explanations.
We designed three types of DAE models so that the sales team can judge whether the recommended apps are reasonable or not:
One of the key innovations of our framework is to train separate post-hoc explanation models for learning disentangled explanations, meaning that each explanation model only focuses on one aspect of the explanation. For example, one model for extracting popular items, one model for feature-level explanations (highlighting important features) and another model for item-based explanations (highlighting relevant installed apps).
Each DAE model has a simpler model structure than the relevance model and tries to estimate the rating scores, e.g., P(I,J), that the relevance model generates. For instance, given User I and Item J, the output D(I,J) of a DAE model estimates the distribution of the rating score P(I,J). Because the DAE models approximate the rating scores, they can also be viewed as post-hoc explanation models used to generate recommendation explanations. Suppose that we have a simple DAE model that takes only Item J as its input (no User I info) and tries to approximate P(I,J). Such a model estimates non-personalized popularity scores for the recommended items. Therefore, it is able to generate explanations such as “item J is recommended because it is popular”.
This also provides a convenient way to control recommendation diversity in real-time for exploring new apps:
In real-world applications, if one only needs recommendations generated offline without interacting with the system, one can fix the diversity parameter in our framework (that is, keep w constant, instead of adjusting recommendation diversity in real-time) to generate recommendations and the corresponding explanations.
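To make the role of the diversity parameter more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not our production method: the item-only model is reduced to a per-item mean of the relevance scores (rather than a distribution), and the rule "relevance minus w times popularity" is a hypothetical stand-in for the actual way w enters the framework. The function names and the toy scores are made up for this example.

```python
from collections import defaultdict


def fit_popularity_scores(relevance):
    """Item-only approximation of P(I, J): the mean score of each item over users.

    With squared error and no user input, the per-item mean is the best
    constant estimate, so it acts as a non-personalized popularity score.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in relevance.values():
        for item, score in scores.items():
            totals[item] += score
            counts[item] += 1
    return {item: totals[item] / counts[item] for item in totals}


def recommend(relevance, popularity, w=0.0, k=1):
    """Rank items by P(I, J) - w * popularity(J) (a hypothetical rule).

    w = 0 reproduces the relevance ranking; larger w demotes popular items.
    """
    recs = {}
    for user, scores in relevance.items():
        adjusted = {item: s - w * popularity[item] for item, s in scores.items()}
        # Sort by adjusted score (descending), breaking ties by item name.
        ranked = sorted(adjusted, key=lambda item: (-adjusted[item], item))
        recs[user] = ranked[:k]
    return recs


# Toy relevance scores P(I, J); the values are made up for illustration.
relevance = {
    "u1": {"A": 0.9, "B": 0.7, "C": 0.3},
    "u2": {"A": 0.8, "B": 0.4, "C": 0.6},
}
popularity = fit_popularity_scores(relevance)
offline_relevance_only = recommend(relevance, popularity, w=0.0)
offline_more_diverse = recommend(relevance, popularity, w=1.0)
```

On this toy data, w = 0 recommends app A to both users, while w = 1 recommends B to u1 and C to u2. Keeping w fixed, as in the offline setting described above, simply means calling the ranking step once with a constant value.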
Here is an example of our system’s output, showing the recommended app and the corresponding explanations. Note how we provide both feature-level explanations and item-based explanations:
As the above figure shows, the app “Mass Edit + Mass Update + Mass Delete” is recommended to customer X, because:
“Mass Edit + Mass Update + Mass Delete” is a developer tool for searching and mass editing multiple accounts, contacts, and custom objects, useful for cloud-based applications such as “Free Blog Core Application” and “Conga Composer” that handle large amounts of documents and contracts.
The table below shows an example of the generated feature-level explanation for the app “RingLead Field Trip – Discover Unused Fields and Analyze Data Quality”. The explanation has the template “app X is recommended because of features A, B, etc.”. We list the top 10 important categorical features learned by our model. This example shows that important features extracted by our method include CITY, COUNTRY, and REGION as well as market segment and account ID, which are reasonable for this case.
In the future, we plan to explore other types of explanations such as actionable insights, to further assist the sales team by improving explanation quality.
In addition to supporting and funding basic AI research to benefit all of society, Salesforce also likes to apply its AI team’s innovative research to improve the company’s own operations, and so our app recommendation system has been deployed as a service for our sales teams. We did a user study of our system with our sales team and got very positive feedback. Due to privacy and confidentiality concerns, we can only show parts of our internal users' feedback, but here are some highlights:
Some comments we received:
We evaluated the performance of our framework on a private enterprise app recommendation dataset, using this dataset to compare our system with other approaches in terms of accuracy, diversity, and explainability.
When we compared our relevance model on the app recommendation dataset with three methods widely applied in industrial recommender systems (logistic regression, the Wide & Deep model, and the DIN model), our performance was either comparable to or better than that of the other three methods on two accuracy metrics: hit ratio (how often an app the user actually wants appears among the recommended apps) and NDCG (normalized discounted cumulative gain, which rewards rankings that place relevant apps closer to the top). In addition, the experimental results demonstrated the importance of leveraging user installation history in our task, and verified the effectiveness of our model’s special module for learning item representation.
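For readers unfamiliar with these two metrics, here is a minimal sketch using common textbook definitions with binary relevance. The exact evaluation protocol used in our experiments is not described in this post, so the function names, the cutoff k, and the toy data below are illustrative assumptions.

```python
import math


def hit_ratio_at_k(ranked, relevant, k):
    """Return 1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    # A relevant item at 0-based position pos contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def mean_over_users(metric, rankings, ground_truth, k):
    """Average a per-user metric over all users in `rankings`."""
    values = [metric(rankings[u], ground_truth[u], k) for u in rankings]
    return sum(values) / len(values)


# Toy example: hypothetical ranked lists and held-out "wanted" apps.
rankings = {"u1": ["A", "B", "C", "D"], "u2": ["A", "B", "C", "D"]}
ground_truth = {"u1": {"B"}, "u2": {"D"}}
mean_hit = mean_over_users(hit_ratio_at_k, rankings, ground_truth, k=3)
mean_ndcg = mean_over_users(ndcg_at_k, rankings, ground_truth, k=3)
```

In this toy case, u1's wanted app sits at rank 2 (a hit, with a discounted gain of 1 / log2(3)), while u2's wanted app falls outside the top 3 and contributes zero to both metrics.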
The next experiment evaluated the ability to control recommendation diversity. We compared our approach with several re-ranking methods: Reverse Predicted Rating (RPR), Reverse Click/Installation Counts (RCC), 5D ranking, and the DIN model.
The aggregate diversity is measured by two metrics:
The above figure shows the comparison between our approach and the re-ranking methods. For the app recommendation dataset, our method performs much better: it can recommend about 1000 items without reducing the hit ratio much, while the re-ranking methods can only recommend about 800 items at most with a certain level of accuracy.
Note that we also conducted the experiments by replacing our relevance model with DIN (“DIN + Our DAE”, yellow line), and the results mirrored those of our original unaltered system (blue line), demonstrating that our framework can support other recommendation models as well.
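The "number of items recommended" in this comparison can be read as the count of distinct apps that appear in at least one user's recommendation list. The sketch below computes that count; the function name and the toy lists are assumptions for illustration, and this is only one possible reading of the measurement.

```python
def aggregate_diversity(recommendations):
    """Count the distinct items appearing in any user's recommendation list."""
    distinct = set()
    for items in recommendations.values():
        distinct.update(items)
    return len(distinct)


# Toy example: three users, four recommendation slots used by only three apps.
toy_recommendations = {
    "u1": ["A", "B"],
    "u2": ["A", "C"],
    "u3": ["A", "B"],
}
distinct_count = aggregate_diversity(toy_recommendations)
```

A system that recommends the same few popular apps to everyone scores low on this count, even when each individual list looks relevant.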
The above experiments validated the efficacy of our approach, and the diversity result was especially impactful. We saw that, for the app recommendation dataset, our method can recommend about 1000 items without an appreciable reduction in the hit ratio. The upshot is encouraging: our approach lets you increase the diversity of app recommendations without a marked decrease in accuracy.
The experiments also showed that our DAE models can be successfully combined with other recommendation models, by simply replacing our relevance model with a different one. This is also an impactful result, because it demonstrates that our framework can be applied to other generic recommender systems for both improving diversity and generating explanations.
This blog is based on a research paper (“On the Diversity and Explainability of Recommender Systems: A Practical Framework for Enterprise App Recommendation”) that appears in Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021).
Wenzhuo Yang is a Senior Research Engineer at Salesforce Research, who focuses on solving real-world problems with advanced machine learning techniques. His interests include recommender systems, explainable AI, and time series analysis.
Vena Li is a Director of Applied Research at Salesforce Research. Her research interests include recommender systems, autoML, and data-centric AI. She leads both research and cross-organizational collaborations to bring AI to production.
Steven C.H. Hoi is currently the Managing Director of Salesforce Research Asia at Salesforce and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.
Donald Rose is a Technical Writer at Salesforce AI Research. He works on writing and editing blog posts, video scripts, media/PR material, and other content, as well as helping researchers transform their work into publications geared towards a wider (less technical) audience.