Code-Mixing on Sesame Street: Multilingual Adversaries for Multilingual Models

TL;DR - Today’s NLP models, for all their recent successes, have certain limitations. Case in point: they exhibit poor performance when processing multilingual code-mixed sentences (each containing multiple languages). Our new approach addresses this problem by constructing code-mixed inputs designed to degrade (or “attack”) the model, exposing the limitations of the model via these “adversarial” actions. (We created two highly effective multilingual adversarial attacks that push the ability of state-of-the-art NLP models to handle code-mixed input to the limit.) We then analyze the model to see how much performance degraded on code-mixed examples, make model adjustments to improve its performance, and retest with additional adversarial examples. This cycle repeats until the model’s performance improves to a high enough level on future code-mixed input. Compared to collecting real code-mixed data, our methods enable evaluation of NLP models on an arbitrary mixture of languages quickly and inexpensively.

NLP Refresher: Terms and Definitions

Before proceeding, let’s define a few terms that will be used in this blog.

  • NLP system: The entire text processing pipeline built to solve a specific task; takes raw text as input and, after any text preprocessing or cleaning, gives it to the NLP model, which produces output such as predictions in the form of labels (classification) or text (generation). Sample NLP system pipeline: raw text -> preprocessor -> model -> output.

  • NLP model: The key component of an NLP system that attempts to complete or solve one or more NLP subtasks. BERT, T5, and GPT-3 are examples of NLP models.

    • Analogy: if the NLP system is the whole "body," the NLP model is its "brain" (or one component of the brain).
  • Multilingual model (also referred to as Cross-lingual model): A type of NLP model that can successfully process ("understand") multiple languages. Examples of multilingual models include mBERT, XLM, and XLM-R.

  • Dataset: A (usually large) collection of data used to train models or test (evaluate) them. Example of a dataset typically used to evaluate NLP models: XNLI.

  • Test set: A dataset used to test a model (evaluate its performance).

  • Adversarial attack: A technique involving perturbing the input in order to reduce a model’s accuracy, and optimizing those perturbations to be maximally damaging to the model. We’ll explain why we’d want to “damage” the model later in our discussion.

  • Data augmentation: Adding to (augmenting) a dataset by generating new artificial data, based on real data you already have. Creating different versions of data artificially increases the size of your dataset, which can be useful in cases where real data is sparse, or where a specific linguistic phenomenon is not present in the real data (this latter case is the one that applies to our research, as you will see).
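To make the "system vs. model" distinction concrete, here is a minimal Python sketch of the sample pipeline above (raw text -> preprocessor -> model -> output). Everything in it is a hypothetical stand-in: `preprocess`, `toy_model`, and `nlp_system` are illustrative names, and the "model" is a one-line keyword rule rather than anything like BERT or XLM-R.

```python
def preprocess(raw_text):
    """Toy preprocessor: lowercase the text and split it on whitespace."""
    return raw_text.lower().split()


def toy_model(tokens):
    """Hypothetical stand-in for a real NLP model: a single keyword rule."""
    return "positive" if "good" in tokens else "negative"


def nlp_system(raw_text, model=toy_model):
    """The whole pipeline: raw text -> preprocessor -> model -> output."""
    return model(preprocess(raw_text))


print(nlp_system("This movie is GOOD"))   # positive
print(nlp_system("This movie is bagus"))  # negative
```

The second call swaps "good" for its Indonesian equivalent "bagus". The toy model only knows the English word, so its prediction flips, which is the kind of failure the rest of this post is about.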

Multilingual NLP Models: Impressive Results, But They Don’t Handle Code-Mixing

The past year has seen the creation of massive multilingual models that demonstrate impressive cross-lingual transfer abilities. For example, after pre-training them on monolingual data from many languages, simply fine-tuning them on English task data was sufficient for them to perform well on the task in other languages.

However, while these massive multilingual models have achieved some important results, they are still limited in certain areas, due to some inherent assumptions about the speakers they are modeling. To understand why these assumptions are limiting, we need to look at the unique way in which certain populations converse with each other.

In many multilingual societies (for example, Singapore and Papua New Guinea), it is common for speakers to create sentences by mixing words, phrases, and even grammatical structures from the languages they know. This is known as code-mixing, a phenomenon common in casual conversational situations such as social media posts and text messages. In regions where there has been significant historical intermingling between speakers of many different languages and dialects, this can result in extremely code-mixed sentences (one example being Colloquial Singapore English).

Which brings us back to the aforementioned NLP model limitations: these models turn out to be insufficient for the task of understanding multilingual speakers in an increasingly multilingual world – specifically in cases where code-mixing is prevalent in the population. For example, in typical scenarios, cross-lingual datasets comprise individually monolingual examples and are unable to test for code-mixing.

If NLP systems serving multilingual communities are to fully connect with their users, they must be capable of working well even on code-mixed input.

Our Approach: Solving the NLP Code-Mixing Problem

So, here’s the question at the heart of this blog (and our research): how can we improve NLP systems so they can perform well even in cases where code-mixing is prevalent in the input?

Short answer: evaluate how an NLP model’s performance degrades when given code-mixed input specially designed to “attack” the model, then adjust the model until it performs well again.

While there is no substitute for real code-mixed sentences written in the wild for definitive evaluations, such data is expensive to manually collect and annotate. The dizzying range of potential language combinations further compounds the immensity of such an effort.

However, we have developed a new strategy that can be used to address these issues, and ultimately help improve NLP results in code-mixed scenarios.

Our “Secret Sauce”: Code-Mixed Adversarial Examples

We propose that appropriately crafted adversarial examples containing highly code-mixed text can be used to widen the range (i.e., expand the distribution) of inputs that NLP models are exposed to, and accurately estimate the lower bound of a model's performance on real code-mixed input (i.e., how badly performance degrades when processing input where each sentence contains multiple languages). Note that other approaches inevitably overestimate a given system’s worst-case performance since they do not mimic the NLP system’s adversarial distribution (the distribution of adversarial cases or failure profile).
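The difference between average and worst-case performance can be sketched in a few lines of Python. This is only an illustration under toy assumptions: `toy_model` is a hypothetical keyword classifier, and the code-mixed variants are written by hand rather than produced by our attacks. An example counts toward worst-case accuracy only if the model is right on every variant of it.

```python
def toy_model(sentence):
    """Hypothetical classifier that only recognizes the English word 'good'."""
    return "pos" if "good" in sentence.split() else "neg"


def accuracy_bounds(model, examples):
    """Return (worst-case accuracy, average accuracy) over code-mixed variants.

    examples: list of (variants, gold_label), where variants are different
    code-mixed versions of the same sentence.
    """
    worst_hits, avg_hits = 0, 0.0
    for variants, gold in examples:
        correct = [model(v) == gold for v in variants]
        worst_hits += all(correct)                # robust only if every variant is right
        avg_hits += sum(correct) / len(correct)   # what random mixing would report
    n = len(examples)
    return worst_hits / n, avg_hits / n


examples = [
    (["the food is good", "the food is bagus", "makanan is good"], "pos"),
    (["the food is bad", "the food is buruk"], "neg"),
]
worst, average = accuracy_bounds(toy_model, examples)
print(f"worst-case accuracy: {worst:.2f}, average accuracy: {average:.2f}")
# worst-case accuracy: 0.50, average accuracy: 0.83
```

Averaging over variants reports 0.83 here, while the worst case is 0.50. An evaluation that does not search for the most damaging variant will tend to report the more optimistic number.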

Our new approach yields two key advantages:

  • We can mix any arbitrary combination of languages relatively quickly and inexpensively.
  • Our approach allows us to evaluate the worst-case performance of any arbitrary model.

In contrast, collecting and annotating real code-mixed data for each language combination would be prohibitively expensive.

Note that our creation of code-mixed adversarial input is an example of data augmentation. Since code-mixed input is either rare or missing entirely in most NLP datasets, we decided to augment real sentence input with code-mixed sentences that we generated. This augmented data (new code-mixed input) was not in the original dataset, yet is essential for exposing NLP models to a real-world phenomenon (code-mixing) that is prevalent in certain multilingual populations – a type of speech most NLP models are not experienced with and hence cannot process successfully.

Our Model Improvement Cycle: Attack, Analyze, Adjust, Analyze Again

The adversarial approach we use to improve NLP models can be viewed as a four-step cycle:

  • Find ways to degrade an NLP model’s performance by creating adversarial code-mixed inputs and giving them to the model to process (“Attack” phase)
  • Test the model to see how much its performance has degraded in the presence of the code-mixed examples; investigate which parts of the model were not optimally handling the adversarial examples, revealing where the model has problems (“Analyze” phase)
  • Revise the model in various ways designed to ensure that the adversarial examples will not hurt performance significantly in the future (“Adjust” phase)
  • Test the revised model to see if its performance is back up to reasonably good levels (“Analyze Again” phase).

This cycle (creating adversarial examples, testing the model on them, adjusting the model, then testing the revised model) repeats until the model’s performance sufficiently improves.

Another way to summarize this cycle:

  • Break It: attack the model with adversarial code-mixed input examples
  • Test It: measure how the model’s performance has degraded in the face of the attacks
  • Fix It: adjust the model to account for these code-mixed input examples
  • Test Again: check if the model adjustments improved performance sufficiently
  • If not, repeat steps until the model’s performance has increased to a decent level.
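The Break It / Test It / Fix It / Test Again loop can be sketched as a short Python program. This is a toy illustration, not our training code: `ToyModel`, `LEXICON`, `attack`, and `improvement_cycle` are hypothetical names, the "model" just memorizes words that appear only in positive sentences, and "fixing" means retraining on the clean data plus the adversaries found so far.

```python
# Hypothetical toy lexicon: English word -> equivalents in other languages.
LEXICON = {"good": ["bagus", "bueno"], "bad": ["buruk", "malo"]}


class ToyModel:
    """Stand-in classifier: words seen only in positive sentences are cues."""

    def __init__(self):
        self.pos_cues = set()

    def train(self, data):
        pos = {w for s, y in data if y == "pos" for w in s.split()}
        neg = {w for s, y in data if y == "neg" for w in s.split()}
        self.pos_cues = pos - neg

    def predict(self, sentence):
        return "pos" if self.pos_cues & set(sentence.split()) else "neg"


def attack(model, sentence, gold):
    """Break It: return an input the model gets wrong, or None if none is found."""
    if model.predict(sentence) != gold:
        return sentence
    words = sentence.split()
    for i, word in enumerate(words):
        for sub in LEXICON.get(word, []):
            candidate = " ".join(words[:i] + [sub] + words[i + 1:])
            if model.predict(candidate) != gold:
                return candidate
    return None


def improvement_cycle(model, clean_data, target=1.0, max_rounds=10):
    train_data = list(clean_data)
    model.train(train_data)
    history = []
    for _ in range(max_rounds):
        adversaries = [(attack(model, s, y), y) for s, y in clean_data]   # Break It
        robust_acc = sum(adv is None for adv, _ in adversaries) / len(clean_data)
        history.append(robust_acc)                                        # Test It / Test Again
        if robust_acc >= target:
            break
        train_data += [(adv, y) for adv, y in adversaries if adv is not None]
        model.train(train_data)                                           # Fix It
    return history


clean_data = [("the food is good", "pos"), ("the food is bad", "neg")]
print(improvement_cycle(ToyModel(), clean_data))  # [0.5, 0.5, 1.0]
```

In this toy run the model needs two rounds of fixing (one per foreign equivalent of "good") before no adversary can be found.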

The ultimate goal:

  • Create an enhanced NLP model that can handle a wide range of multilingual input cases
    • including those containing code-mixed input
    • with decent performance on both code-mixed and non-code-mixed inputs (ideally, the result would be a model that achieves performance on code-mixed sentences comparable to non-code-mixed input).

Our Method in Detail: Adversarial Code-Mixing with PolyGloss and Bumblebee

Now that we’ve presented a high-level look at how our approach works, let’s dive into the details. First issue to consider: how do we create the adversarial code-mixed examples?

Inspired by the proliferation of real-life code-mixing and polyglots, we propose PolyGloss and Bumblebee, two multilingual adversarial attacks that adopt the persona of an adversarial code-mixer. These are two strong blackbox adversarial attacks - one word-level (PolyGloss), and the other phrase-level (Bumblebee) - for multilingual models, designed to push the model’s ability to handle code-mixed sentences to the limit.

Our adversarial attacks largely follow the same general procedure: for each word or phrase in the sentence, we generate a list of perturbations and then select the one that maximally hurts the model's performance.
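The general procedure can be sketched as a simple greedy search in Python. Treat this as an assumption-laden simplification rather than our actual implementation: `greedy_attack`, `toy_loss`, and `candidates_for` are hypothetical names, the toy loss comes from a made-up keyword model, and the search keeps a substitution only when it strictly increases the loss. Note that it is blackbox: it only queries the loss and never looks inside the model.

```python
import math

KNOWN_POSITIVE = {"very", "good"}  # the only cues the toy model understands
LEXICON = {"very": ["sangat"], "good": ["bagus", "bueno"], "food": ["makanan"]}


def toy_loss(words, gold):
    """Negative log-likelihood of the gold label under a made-up model."""
    hits = sum(w in KNOWN_POSITIVE for w in words)
    p_pos = (1 + hits) / (2 + hits)
    return -math.log(p_pos if gold == "pos" else 1 - p_pos)


def candidates_for(word):
    """Propose perturbations for one word (here: dictionary equivalents)."""
    return LEXICON.get(word, [])


def greedy_attack(words, gold, candidates_for, loss_fn):
    """For each position, keep the substitution that most increases the loss."""
    adversary = list(words)
    best_loss = loss_fn(adversary, gold)
    for i, word in enumerate(words):
        for sub in candidates_for(word):
            trial = adversary[:i] + [sub] + adversary[i + 1:]
            trial_loss = loss_fn(trial, gold)
            if trial_loss > best_loss:
                adversary, best_loss = trial, trial_loss
    return adversary, best_loss


adversary, loss = greedy_attack(["very", "good", "food"], "pos", candidates_for, toy_loss)
print(adversary, round(loss, 4))  # ['sangat', 'bagus', 'food'] 0.6931
```

"food" is left untouched because swapping it does not raise the loss any further in this toy setup.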

(Remember: although hurting the model on purpose may seem counterintuitive, this will ultimately help us discover different scenarios in which the model may perform poorly, so we can then work to improve the model’s performance in the presence of future code-mixed input.)

Focus on Lexical Code-Mixing

For the sake of simplicity, we restrict the kind of perturbations our attacks generate. While real code-mixing can occur at the lexical, morphological, and syntactic levels, we focus purely on the lexical component of code-mixing, where some words in a sentence are substituted with their equivalents from another language in the speaker’s repertoire.

How PolyGloss and Bumblebee Work

PolyGloss and Bumblebee are the means by which we carry out adversarial attacks on the NLP model. Both algorithms start with clean sentences and turn them into adversarial examples, but they work in different ways. One key difference can be boiled down to word vs. phrase.

PolyGloss works at the word level, using combined bilingual dictionaries to propose perturbations and translations of the clean example for sense disambiguation.
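As a rough Python sketch of this idea, the function below proposes word-level candidates from several bilingual dictionaries and optionally filters them using translations of the clean sentence. The function name `polygloss_candidates`, the data layout, and the exact filtering rule (keep a candidate only if it appears in that language's translation) are assumptions made for illustration.

```python
def polygloss_candidates(word, dictionaries, translations=None):
    """Propose (language, word) substitutions for one English word.

    dictionaries: {lang: {english_word: [candidate translations]}}
    translations: optional {lang: translation of the clean sentence}; when
    given, a candidate is kept only if it occurs in that translation.
    """
    candidates = []
    for lang, lexicon in dictionaries.items():
        for candidate in lexicon.get(word, []):
            if translations is not None:
                reference_tokens = translations.get(lang, "").lower().split()
                if candidate not in reference_tokens:
                    continue  # likely the wrong sense for this sentence
            candidates.append((lang, candidate))
    return candidates


# Clean example: "she sat on the river bank"
dictionaries = {
    "es": {"bank": ["banco", "orilla"]},  # financial bank / river bank
    "de": {"bank": ["bank", "ufer"]},
}
translations = {
    "es": "ella se sentó en la orilla del río",
    "de": "sie saß am ufer des flusses",
}

print(polygloss_candidates("bank", dictionaries))
# [('es', 'banco'), ('es', 'orilla'), ('de', 'bank'), ('de', 'ufer')]
print(polygloss_candidates("bank", dictionaries, translations))
# [('es', 'orilla'), ('de', 'ufer')]
```

Without the translations, the financial sense of "bank" slips through. With them, only the river-bank sense survives, at the cost of discarding any valid candidate that the translation happens not to use.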

Bumblebee, inspired by phrase-based machine translation, uses a state-of-the-art neural word aligner, directly aligning the clean example with its translations before extracting phrases as perturbations. Bumblebee’s phrase-level substitutions lead to more natural sentences, one of several advantages of this method. (An example of how Bumblebee generates code-mixed adversaries is shown in Figure 1, while Table 1 shows other Bumblebee adversarial examples.)

Figure 1. Bumblebee generates code-mixed adversaries in three stages: (a) Align words across sentence translations (top: Indonesian, middle: English, bottom: Chinese); (b) Extract candidate perturbations (phrases from all sentences); (c) Construct final adversary by combining into a single sentence that maximizes the target model’s loss.
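Stages (a) and (b) can be approximated with the classic phrase-extraction step from phrase-based machine translation. The Python sketch below is a simplification under stated assumptions: the word alignment is written by hand (in Bumblebee it comes from a neural aligner), and `extract_phrases` and `substitute` are hypothetical names. A source span is kept only if the target words it covers are not aligned to anything outside that span.

```python
def extract_phrases(src, tgt, alignment, max_len=2):
    """Return (start, end, target_phrase) for each consistent source span.

    alignment: set of (source_index, target_index) pairs.
    """
    phrases = []
    for start in range(len(src)):
        for end in range(start, min(start + max_len, len(src))):
            tgt_idx = [j for i, j in alignment if start <= i <= end]
            if not tgt_idx:
                continue
            lo, hi = min(tgt_idx), max(tgt_idx)
            # Consistency check: no target word inside [lo, hi] may be
            # aligned to a source word outside [start, end].
            if any(lo <= j <= hi and not (start <= i <= end) for i, j in alignment):
                continue
            phrases.append((start, end, " ".join(tgt[lo:hi + 1])))
    return phrases


def substitute(src, start, end, phrase):
    """Replace src[start..end] with the extracted target phrase."""
    return src[:start] + phrase.split() + src[end + 1:]


src = ["i", "like", "red", "apples"]
tgt = ["me", "gustan", "las", "manzanas", "rojas"]
alignment = {(0, 0), (1, 1), (2, 4), (3, 3)}  # hand-written for illustration

for phrase in extract_phrases(src, tgt, alignment):
    print(phrase)
# (0, 0, 'me')
# (0, 1, 'me gustan')
# (1, 1, 'gustan')
# (2, 2, 'rojas')
# (2, 3, 'manzanas rojas')
# (3, 3, 'manzanas')

print(" ".join(substitute(src, 2, 3, "manzanas rojas")))  # i like manzanas rojas
```

The span "red apples" maps to "manzanas rojas" with the word order of the target language intact, which is why phrase-level substitutions tend to read more naturally than swapping one word at a time. Stage (c), choosing which phrases to combine, would then pick the candidates that maximize the target model's loss.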

Table 1. Bumblebee adversarial examples found for multilingual model XLM-R on XNLI test set (P=Premise; H=Hypothesis).

PolyGloss vs. Bumblebee: Comparing Advantages and Disadvantages

When it comes to adversarial attacks, different methods feature distinct sets of advantages and disadvantages. Here is a summary of some of the tradeoffs of using PolyGloss and Bumblebee:

PolyGloss (word-level):

  • Advantages:
    • Extremely fast
    • Semantic preservation is guaranteed in filtered mode
  • Disadvantages:
    • Unable to disambiguate between senses in the unfiltered mode
    • Discards many valid perturbations in the filtered mode
    • Attack's success is determined by comprehensiveness of backing dictionaries
    • Word-level perturbations may lead to unnatural sentences when languages constantly change from token to token

Bumblebee (phrase-level):

  • Advantages:
    • Phrase-level substitutions lead to more natural sentences
    • Correct sense is used since it is directly extracted from the translated sentence
    • Word alignment results in many more valid perturbations than the dictionary-based approach
    • Adversarial examples can be created directly from parallel test sets or machine-translated data
  • Disadvantages:
    • Neural aligner is slow and stochastic

Analyzing Adversarial Attacks: How Well They Worked

Our “attacks” on the NLP model are both conceptually simple and highly interpretable, yet extremely effective. Bumblebee, for example, has a success rate of 89.75% against the XLM-R-large model, bringing its accuracy of 79.85% (averaged over the 15 languages) down to 8.18% on the XNLI dataset (full tables are available in our research paper).

What this result means:

  • Bumblebee turned 89.75% of the examples that the model originally classified correctly into adversarial examples that the model misclassified.
  • An attack is “effective” in this context if it degrades the performance of the model being attacked. Here, the misclassified adversarial examples are the evidence of that degradation.
  • Accuracy fell from 79.85% to 8.18%, a drop of more than 71 percentage points, so the attack was highly successful.
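As a quick sanity check, the three reported numbers fit together if we assume the success rate is measured over the examples the model classified correctly before the attack (an assumption on our part here, stated in the code as well). A few lines of Python show the arithmetic:

```python
clean_accuracy = 79.85        # % correct on clean XNLI (averaged over 15 languages)
attack_success_rate = 89.75   # % of originally correct examples that were flipped (assumed definition)

# Only the originally correct examples that survive the attack remain correct.
accuracy_under_attack = clean_accuracy * (1 - attack_success_rate / 100)
print(round(accuracy_under_attack, 2))  # 8.18
```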

After the attacks are complete, we use what they reveal about where performance suffered to improve the model, with the aim of bringing its accuracy on code-mixed input back up to a good level.

Note that the severe degradation mentioned above (accuracy falling from 79.85% to 8.18% on the XNLI dataset) is for the scenario where up to 15 languages may be used in the code-mixing (that is, we allow the attack to choose from up to 15 languages, although this doesn't mean all 15 are actually used). When only two languages are mixed, the degradation is less extreme. Table 2 shows this for scenarios where code-mixing involves only English plus one other language. (The last two columns of numbers in Table 2 are accuracy percentages.)

Table 2. Results of Bumblebee on XNLI dataset with two mixed languages (English plus another language).

Model Behavior: CAT Produces Purr-fectly Good Results, Beats Other Methods

Measuring the performance of our code-mixed adversarial training (CAT) method on various input cases, as seen in Table 3, shows a number of beneficial results:

  • The CAT model (our method) significantly beats both the zero-shot transfer model and a stronger baseline on both old and new adversarial examples, and in the setting where there is little to no overlap between the premise and hypothesis (Table 3).

  • In addition, it maintains this lead even when tested on languages that were not part of the XNLI training data (rightmost column of Table 3).

    • This shows that our method can improve robustness, enabling the model to perform well even on languages unseen by the model during adversarial training.
  • The CAT model beats all of the other methods in all but one column (Clean) in Table 3. And even in the Clean case, it’s a very close second, nearly tied with the best method.

  • The CAT model’s performance is more consistent across cases. Every other method falls to single digit performance in at least one column; the CAT model never does.

  • CAT also qualitatively changes the way the model represents language, making it more language-agnostic.

Table 3. Results on standard XNLI test set with XLM-R-base. Clean refers to the combined test set of all languages, Clean_DL refers to the variant where the hypothesis and premise of each example are from different languages, and Adv_{lgs} refers to new adversaries from English and the subscripted languages. (Higher numbers are better. The best performing method in each column is shown in bold.)
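To illustrate what a Clean_DL-style test set looks like, here is a small Python sketch that pairs a premise in one language with a hypothesis in a different one, starting from parallel examples. It is an assumption-based illustration, not our evaluation script: `make_clean_dl`, the data layout, the random choice of language pair, and the sample sentences are all hypothetical.

```python
import random


def make_clean_dl(parallel_examples, seed=0):
    """Pair each premise with a hypothesis written in a different language.

    parallel_examples: list of {"label": str, "text": {lang: (premise, hypothesis)}}
    """
    rng = random.Random(seed)
    mixed = []
    for example in parallel_examples:
        # Sample two distinct languages: one for the premise, one for the hypothesis.
        lang_p, lang_h = rng.sample(sorted(example["text"]), 2)
        mixed.append({
            "premise": example["text"][lang_p][0],
            "hypothesis": example["text"][lang_h][1],
            "label": example["label"],
            "langs": (lang_p, lang_h),
        })
    return mixed


parallel_examples = [
    {
        "label": "entailment",
        "text": {
            "en": ("A man is eating.", "Someone is eating."),
            "es": ("Un hombre está comiendo.", "Alguien está comiendo."),
            "de": ("Ein Mann isst.", "Jemand isst."),
        },
    },
]

for row in make_clean_dl(parallel_examples):
    print(row["langs"], "|", row["premise"], "|", row["hypothesis"])
```

Because the premise and hypothesis share little or no vocabulary in this setting, a model cannot lean on word overlap between the two sentences and has to rely on a more language-agnostic representation.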

The Bigger Picture: Why This Research is Important

Our work is important in part because of its technical aspects – for example, our data augmentation techniques help make NLP models capable of handling more input types, with performance metrics that beat other methods – but there are wider positive implications as well.

By giving language models the ability to process sentence types prevalent among code-mixers, our work is opening up NLP to parts of the world population that have been underrepresented in past NLP system development and research. Enabling NLP systems to perform reliably in the presence of code-mixed input allows such communities to fully express themselves.

This serves the larger goals of making AI more responsible (building systems that embody fairness and are not biased against any segment of the population) and more reliable (increasing the range of inputs that NLP models can handle improves the robustness of those models, so they’re less likely to fail given novel inputs). Part of our research (our followup work) focuses on blocking models that are deemed unreliable, and only allowing models that meet a certain performance or robustness threshold to be deployed.

The Bottom Line

  • Speakers in multilingual communities often code-mix (blending multiple languages within each sentence). This happens especially often during casual conversation.

  • NLP systems need to perform reliably in the presence of code-mixed input to allow such communities to fully express themselves.

  • However, in typical NLP model scenarios, cross-lingual datasets comprise individually monolingual examples and are unable to test for code-mixing.

  • In order to address this limitation of NLP systems, we created the first two highly effective multilingual adversarial attacks (PolyGloss and Bumblebee) that push the ability of state-of-the-art NLP models to handle code-mixed input to the limit.

    • PolyGloss = a word-level adversarial attack
    • Bumblebee = a phrase-level adversarial attack
  • The “Big Picture” of how our approach works:

    • We put our NLP model through a cycle of incremental improvement, which can be viewed as having four phases:

      • “Attack” phase: Find ways to degrade the NLP model’s performance by creating adversarial code-mixed input and giving these examples to the model to process
      • “Analyze” phase: Measure how much the model’s performance was degraded, and figure out which parts of the model were not optimal in the face of code-mixed input (where the model has inherent problems and shows weakness)
      • “Adjust” phase: Revise the model such that new adversarial examples will no longer hurt performance significantly.
      • “Analyze Again” phase: Test the revised model to see if its performance is back up to reasonably good levels again.
    • This cycle (create adversarial examples, test the model on them, adjust the model, retest) continues until the model’s performance sufficiently improves.

    • In other words, we purposely construct extreme code-mixed examples in an effort to find various ways in which the model can perform poorly, with the goal of improving the model so it becomes more robust (less vulnerable to such attacks).

      • “Less vulnerable to attacks” in the context of our research also means the model will be more reliable in the presence of similar kinds of sociolinguistic variation.
  • The ultimate goal of our method:

    • Produce an enhanced robust model that can handle many different types of input sentences, including those that are code-mixed, with solid performance (ideally, performance comparable to non-code-mixed scenarios).
  • Results: Our code-mixed adversarial training (CAT) model produces strong measurable improvements in NLP model performance on several different types of code-mixed input, beating other methods.

  • Compared to collecting real code-mixed data, our methods enable the evaluation of NLP models on an arbitrary mixture of languages quickly and inexpensively.

  • Although real code-mixing can happen at the lexical, morphological, and syntactic levels, we focused purely on lexical code-mixing for the sake of simplicity.

    • Hence, future work might include widening the scope of the research to include code-mixing at the morphological and syntactic levels.
  • Our research is important to the world because of:

    • Its technical aspects: our data augmentation techniques help make NLP models capable of handling more input types, with performance that beats other methods

    • Its wider implications: our work ultimately helps make AI more:

      • Responsible, by making NLP models more inclusive to underrepresented groups, such as populations that code-mix

      • Reliable, by making NLP models more robust in the face of novel inputs.

Explore More

Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.

Read more about our Code-Mixing work in our research paper:

Code + data for the above paper is available at:

Feedback? Questions? Contact Samson Tan at

Follow us on Twitter:

Learn more about the projects we’re working on at our main site:

Read other content (blogs, papers, articles) covering similar topics:

About the Authors

Samson Tan (@samsontmr) is a Ph.D. candidate in the Industrial PhD Program at Salesforce Research. His research focuses on building natural language processing systems that work reliably in real-world environments, especially in environments with significant linguistic variation.

Donald Rose, Ph.D. is a Technical Writer at Salesforce AI Research. He works on writing and editing blog posts, video scripts, media/PR material, and other content, as well as helping researchers transform their work into publications geared towards a wider (less technical) audience.