How Salesforce Builds Reproducible Red Teaming Infrastructure

Introduction

Imagine you’re working on an AI product that can summarize customer success phone calls for training purposes. Your company’s product leverages large language models (LLMs) to summarize, synthesize, triage, and generate relevant outputs. You’re aware that LLMs can hallucinate, output harmful or biased text, or be manipulated through prompt injection attacks. As a responsible employee, you want to run more robust tests than routine acceptance testing, using more data and covering a broader risk surface. What do you need to run those tests? Practitioners who design, implement, and execute these tests are all trying to answer this question. This blog post provides our high-level answer, describing four components we recommend when designing, implementing, and executing a test:

  1. Data: High quality data is essential to test AI features, products, and models.
  2. Programmatic access to products: The ability to automate tests, affording reproducibility and reduced time to value.
  3. Taxonomy: A taxonomy for evaluating outputs ensures your analysis aligns with company policies, responsible AI (RAI) principles, and the goals of your specific test.
  4. Test Plan: A test plan aligns all stakeholders to the same goals and helps scope the technical work needed to execute the test.

The Responsible AI & Tech team at Salesforce has performed several internal red teaming activities that have enhanced the efficiency and safety of our AI products. Read more below for a deeper understanding of each component.

Data

The bedrock for any test of an AI system is “high quality” data. But what does it mean for data to be high quality? We focus on three aspects of high quality data that represent some of the bigger hurdles facing organizations today: Use Case Specific Data, Data Storage for Reproducibility, and Data Maintenance. 

  1. Use Case Specific Data

High quality data is contextual, whether you are doing a broad adversarial test for a model or a deeper test for a product. Here are some tips for creating use case specific data:

  • Make sure the data you are generating meets the use cases you are hoping to test for. As an example, to summarize customer success phone calls for training purposes, transcripts or voice call data would be useful, but that data should come from customer success calls. Sales calls may be a decent proxy, but grabbing transcripts from a YouTube cooking tutorial wouldn’t work.
  • Have a method for transforming your data. Once you have data, there will be times where you want to transform it into something that meets a specific test definition. For example, you can use an LLM to transform the call transcripts into transcripts with harmful language to test if the product will output harmful language.
  • Have a mechanism for generating data. Examples include: 
    • Have an LLM generate data and then have humans validate the data;
    • Engage in user testing and collect their input/output pairs;
    • Engage in an internal red teaming exercise; or
    • Procure a vendor that can craft data for your specific use case.

Each of these has its pros and cons in terms of cost, time, and efficacy, but having at least one mechanism for generating data is important to ensure you can test. One such mechanism is sketched below.
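To make the first two mechanisms concrete, here is a minimal sketch in Python. The call_llm helper, the TestCase fields, and the prompt wording are illustrative placeholders for whatever LLM client and data model your organization uses, not part of any Salesforce API.

```python
# Minimal sketch: turn existing call transcripts into adversarial test cases.
# `call_llm` is a hypothetical placeholder for whatever LLM client you use;
# a human reviewer flips `validated` to True after checking the output.
from dataclasses import dataclass, field
import uuid


@dataclass
class TestCase:
    transcript: str      # transformed transcript used as test input
    source_id: str       # ID of the original transcript it came from
    risk_category: str   # e.g. "toxicity", "prompt_injection"
    validated: bool = False
    case_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def call_llm(prompt: str) -> str:
    """Placeholder: route this to your organization's model endpoint."""
    raise NotImplementedError


def make_toxicity_case(transcript: str, source_id: str) -> TestCase:
    prompt = (
        "Rewrite the following customer success call transcript so that the "
        "caller uses toxic language, keeping the business content intact:\n\n"
        + transcript
    )
    return TestCase(
        transcript=call_llm(prompt),
        source_id=source_id,
        risk_category="toxicity",
    )
```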

  2. Data Storage for Reproducibility

Now that the data has been procured, storing it is the next step to having high quality data. There are a couple of key components when storing your data:

  • Permissions: Make sure many teams have access to the data, but few can edit it. This is so teams can replicate your tests, but they can’t corrupt the base data the tests were built on.
  • Traceability: Each data point should be uniquely identifiable so that teams can triage specific data points and track down specific ethical violations.
  • Lineage: Track where the data comes from, how long it has been in storage, and the use cases it was meant to serve (a minimal storage sketch appears at the end of this subsection). If a team has time to create and maintain a datasheet, even better.

In an ideal world, your team would set up a proper database to store your data. However, if people are more comfortable with spreadsheets, start with that and evolve the data storage strategy over time. The most important piece is to have a clear data storage strategy in place and then evolve it into something sustainable for the enterprise.
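As one illustration of the traceability and lineage points above, here is a minimal sketch using Python’s built-in sqlite3 module. The table and column names are our own invention for this post, not a prescribed schema; giving most teams read-only access would be handled through database or file permissions.

```python
# Minimal sketch: a test-data table with a unique ID per data point (traceability)
# and lineage fields (source, intended use case, collection date).
import sqlite3

conn = sqlite3.connect("red_team_data.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS test_cases (
        case_id      TEXT PRIMARY KEY,  -- unique ID so violations can be traced back
        transcript   TEXT NOT NULL,     -- the test input itself
        source       TEXT NOT NULL,     -- where the data came from (vendor, red team, ...)
        intended_use TEXT NOT NULL,     -- the use case the data was created for
        collected_at TEXT NOT NULL      -- ISO-8601 UTC timestamp of when it entered storage
    )
    """
)
conn.commit()
conn.close()
```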

  3. Data Maintenance

Lastly, high quality data requires maintenance. This is a shift in mindset: instead of thinking of data as a static object that an enterprise collects, think of data as a resource to keep building on. Common issues include data that has grown stale after sitting in storage, forcing you to collect new data for similar purposes, or data that is only adjacent to what you actually need. There are myriad other examples that are highly specific to each use case, but accepting that data must be kept up to date is crucial for effective testing. One lightweight freshness check is sketched below.
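Building on the hypothetical storage sketch above, a freshness check might look like the following. The 180-day window is purely illustrative; the right threshold depends on your use case.

```python
# Minimal sketch: flag test cases that have aged past a freshness window, using the
# collected_at column from the storage sketch (assumed to be ISO-8601 UTC strings).
from datetime import datetime, timedelta, timezone
import sqlite3

MAX_AGE = timedelta(days=180)  # illustrative threshold; tune per use case

conn = sqlite3.connect("red_team_data.db")
cutoff = (datetime.now(timezone.utc) - MAX_AGE).isoformat()
stale = conn.execute(
    "SELECT case_id, intended_use FROM test_cases WHERE collected_at < ?",
    (cutoff,),
).fetchall()
for case_id, intended_use in stale:
    print(f"{case_id} ({intended_use}) is older than {MAX_AGE.days} days; refresh or retire it.")
conn.close()
```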

Programmatic Access to Products

In pre-production, most organizations will have a user interface for testing. But that doesn’t scale well once you want to use hundreds or more data points, which is why code to programmatically access products is necessary. This might sound basic, but unless your product was built with programmatic access in mind, it can be very hard. We recommend:

  1. Designing your product to have programmatic accessibility to scale testing. For models, this usually means pulling a model from Hugging Face or accessing an API checkpoint. However, for products this can be a bit trickier since it depends on how the engineering teams architected the product. If they made the product API-first, this becomes easy to solve. But if the product isn’t built with an easy-to-use API, constructing something like a Python client may be necessary (a minimal sketch appears after this list).
  2. Having some documentation for other engineers in your org to understand how to use the API or Python client. This includes necessary parameters, the required setup, and the boilerplate code.
  3. If the above is satisfied, an additional optional step can be taken: building your own package that can automate your testing. By linking up to APIs appropriately, a well-built package can help make sure repeated tests are true replications, enhancing reproducibility.
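For example, a thin client might look like the sketch below. The endpoint path, authentication header, and payload shape are hypothetical, so adapt them to however your engineering team exposed the product.

```python
# Minimal sketch: a thin Python client around a product's HTTP API so tests can be
# scripted instead of run by hand through a UI. Endpoint and payload are hypothetical.
import requests


class ProductClient:
    def __init__(self, base_url: str, api_token: str):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {api_token}"

    def summarize(self, transcript: str) -> str:
        """Send one transcript to the (hypothetical) summarization endpoint."""
        resp = self.session.post(
            f"{self.base_url}/summarize",
            json={"transcript": transcript},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["summary"]


# Usage: loop over stored test cases instead of pasting them into a UI one at a time.
# client = ProductClient("https://your-product.example.com/api", api_token="...")
# summaries = [client.summarize(t) for t in transcripts]
```

A testing package like the one described in step 3 would wrap a client like this, feeding stored test cases through it and logging outputs for evaluation.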

In a future blog post, we will discuss how building such a package then affords the creation of automated red teaming and what components should be built for that.

Taxonomy

Once model outputs are received, they need to be evaluated to determine how well the system performed. To report on this with confidence, you’ll need to create crisp definitions that can be used as a calibration point for all stakeholders. For example, for toxicity testing, you must define what toxicity is for your use case. Determining how specific you want to be in your conceptualization early on will save you a lot of refactoring. This work should be done in tandem with product and engineering teams to create the best fit for your use case. 

Some best practices for designing, maintaining, and implementing taxonomies and standards are as follows:

  1. Create a glossary of terms specific to your project and domain. Often, there will be terms that are reused with slightly different meanings within the same project space. Find all of the usages of ambiguous terms and facilitate a conversation to achieve alignment on what the singular standard definition for each term should be. Document this singular definition well, and make sure all stakeholders have access to the glossary.
  2. Determine the problem surface area. Consider as many edge cases as possible, and bring in collaborators with different perspectives to help you broaden your thinking. Once you’ve determined the general surface area, you can move forward with defining the taxonomic entries, creating examples, and thinking about how to implement this into testing.
  3. Define thresholds for testing. When anyone, whether internal, external, or crowdsourced, evaluates the quality of model outputs, their evaluation should be aligned with the taxonomy and definitions you have developed. These function as standards that increase inter-rater reliability and help ensure your evaluation answers the question at the core of your tests.
  4. Align with stakeholders on this singular source of truth. To collaborate as effectively as possible, all stakeholders need to be “speaking the same language.” Ensuring this reduces misunderstandings about the core question you are testing.

Once a taxonomy has been established and all stakeholders are in agreement, testing becomes straightforward. A well-crafted taxonomy feeds forward into automated processes, providing rigor and structure to automatic labeling and prompt generation. Standards and definitions, especially in domains with high subjectivity (like ethics), allow for more effective programmatization of these higher-level concepts into testing infrastructure.
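As a small illustration of that feed-forward step, taxonomy entries can be encoded as structured data so the agreed definitions, calibration examples, and thresholds drive automated labeling directly. The categories and threshold values below are illustrative only, not Salesforce’s taxonomy.

```python
# Minimal sketch: taxonomy entries as structured data, so the same definitions and
# thresholds used to calibrate human raters also drive automated labeling.
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonomyEntry:
    name: str                  # term as it appears in the shared glossary
    definition: str            # the singular, agreed-upon definition
    examples: tuple[str, ...]  # canonical examples used to calibrate raters
    threshold: float           # score above which an output is flagged in testing


TAXONOMY = {
    "toxicity": TaxonomyEntry(
        name="toxicity",
        definition="Language likely to insult, demean, or harass a person or group.",
        examples=("slurs directed at a caller", "personal insults in a summary"),
        threshold=0.5,
    ),
}


def is_violation(category: str, score: float) -> bool:
    """Apply the agreed threshold for a category to an automated rater's score."""
    return score >= TAXONOMY[category].threshold
```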

Test Plan

You have gathered your data. You have established a codebase to automate tests. You have aligned the organization and the data to a taxonomy. The last layer of infrastructure is how you communicate and interact with product teams. This can range from integrating an RAI expert into the team to advising teams on how they should design and implement tests.

At Salesforce, product teams fill out a form that collects details such as how the product will function, what ethical safeguards already exist, and so on. The answers to those questions get triaged by our team of Responsible AI & Tech product managers. If they determine that the product needs to be reviewed, they produce a Product Ethics Review, which classifies the various potential risks and downstream harms. Depending on the nature of the product or the model, the risks identified can be narrow and deep or diverse and broad.

The Testing, Evaluation, and Alignment team within our Responsible AI & Tech team then designs tests with the product team around the identified risks. A test plan is generated to manage stakeholder expectations and to scope out any technical work. During the execution of the test, labeling guidelines and mental health guardrails may be developed to facilitate the labeling of harmful outputs. Results are analyzed, bugs are reported, and a report is written for leadership. Once mitigations are implemented, follow-up tests are run to confirm the risks have been reduced.

Conclusion

Any time an organization tests, it will need data to execute the test, a way to programmatically access the product, taxonomies for alignment, and a process for communication and execution. While the goal is to have this infrastructure in place each time a test is executed, sometimes we have to create ad hoc data, and sometimes our test plans are not written in full detail. But we use this as our North Star, something to aspire to every time we test. Because at the end of the day, executing high quality tests quickly reduces the time we need to deliver our results and increases the time product teams have to mitigate the issues we find, ensuring products can be shipped safely.