GeDi: A Powerful New Method for Controlling Language Models

TL;DR: We use smaller language models as generative classifiers to guide generation from larger language models. We show that this method can make generations friendlier, reduce bias and toxicity, and achieve zero-shot controllable generation of unseen topics.

Recent improvements in language modeling have resulted from scaling up to larger models and larger datasets. While models like GPT-2 and GPT-3 have impressive generation abilities, controlling them or adapting them can be difficult. GPT-3 can sometimes be controlled by conditioning on prompts, but this likely will not work for every attribute we want to control and can be sensitive to the exact choice of prompt. Furthermore, large language models are known to generate text that is biased and/or offensive, creating ethical concerns for incorporating them into products and giving them the potential to be used for malicious purposes. One possible approach to controlling language models is to finetune them to new data or with new cost functions, but this has several downsides:

  • Finetuning is very computationally expensive if the model is large.

  • Finetuning approaches that use human feedback require expensive annotations.

  • Finetuning to new data risks catastrophic forgetting of the wealth of information that the model has learned during pretraining.

  • It is very difficult to teach the model what not to do. It would for instance be very difficult to prevent the model from generating offensive text or extremist views using finetuning to new data without significantly reducing the diversity of text it can generate.

We developed GeDi (pronounced "Jedi") as an algorithmic innovation to better harness and control powerful large language models, without needing to use significant extra compute during training or generation. Instead of finetuning language models directly, we use generative discriminators (GeDis) to guide language models towards generating text with a desired attribute (or away from generating text with an undesired attribute). GeDis can be used to filter next word predictions to make language models more controllable, more friendly, and less toxic. GeDis are themselves a type of language model (they condition on an attribute variable like CTRL), but we use them as generative classifiers (discriminators) via Bayes rule to guide generation from other language models. The GeDi used as a guide can be significantly smaller than the language model used for generation, allowing GeDi to be applied with limited computational overhead for both training and generation. GeDi also has several other advantages including the ability to

  • generalize control of attributes such as sentiment across many domains.
  • guide models away from undesirable attributes (e.g. detoxifying generation).
  • generate text on topics unseen during training, zero-shot.

See our paper, code, or run our topic model in Google Colab.

Generalizing positivity across domains

We find that a GeDi trained to classify sentiment on movie reviews can generalize the concepts of positivity and negativity to many other topics and text domains. For instance, in the following sample we used a GeDi trained on IMDb movie reviews to guide GPT-2, conditioned on the prompt in bold, towards positive sentiment:

Generating with GeDi-guided GPT-2 (positivity).

I just read this paper on Generative-Discriminative training. This is something I think will revolutionize the way people think about and interact with their images." "This is a really cool paper, and it's very well done," said Andrew Ng, a professor of computer science at Stanford University who has been working on deep learning for more than two decades. "It's not clear to me that there are any practical applications yet." Ng added that he was impressed by the authors' ability to use generative models in such an elegant way without sacrificing accuracy or performance. He also noted that they had used only one image from each class of images—a technique known as feature selection—to train their model instead of using multiple classes per image as many other researchers have done before them. Ng said he would be interested in seeing how much better the system could do if it were trained with more data sets from different angles, including those taken through different cameras or lenses (which can give rise to subtle differences between photos)

GeDi allows us to control the sentiment of GPT-2 generations, while also maintaining the vast knowledge about specific topics (in this case deep learning) and people (in this case Andrew Ng) that GPT-2 learned during training. Finetuning GPT-2 directly to movie reviews would likely lose this knowledge. GeDi's ability to generalize positivity across domains could be used to help make language models more friendly and less biased. For instance, the following prompts (given in bold) trigger GPT-2 to generate biased text. However, using GeDi to guide GPT-2 leads to generations with a warm and friendly tone.

Generating with GeDi-guided GPT-2 (positivity).

I heard another passenger on our plane is Saudi Arabian. He asked me if I knew him and when I told him that he smiled and said "thank you". It was nice to hear from someone else who has a similar background." – Aisha, age 16, from the Middle East: "My family came to Canada when I was very young so it was always exciting to go back home! My favourite memories are going shopping with my family. Going on adventures with my mom!"

Generating from GPT-2 using the above prompt and using the same (greedy) decoding scheme results in text where the Saudi Arabian passenger is suspected to be a terrorist.

The Muslim woman wore a hijab. Her mother explained: "When I 
was young I didn't know anything about her and when I met her 
for the first time, it was very emotional. She told me how she 
felt in America and how proud she was of being Muslim."

Generating from GPT-2 using the above prompt results in text where the woman is murdered by a man who is quoted making racist statements.

Some of the black communities are so supportive. There are
a lot of young people that are just learning about it, and a 
lot of older people as well."

In addition to helping with his legal case, Dillard is helping 
others in need through his foundation, The Dillard Foundation: 
A Legacy Of Hope ( He is donating 
more than half of all proceeds from his latest album to help 
children and families facing challenges throughout the 
United States.

Generating from GPT-2 using the above prompt results in text that suggests that the black communities will react violently. On all three of these prompts, GeDi was able to guide GPT-2 away from offensive and biased generations towards warm and friendly text.

Detoxifying language models

We showed that guiding with GeDis for positivity can lead to warm and friendly generations, which can reduce toxicity. We also trained a GeDi explicitly for detoxification. We did this by training the GeDi to classify toxic and offensive text on the Jigsaw toxic comment classification dataset, and then using it to guide GPT-2 towards non-toxic generations. We generated samples conditioned on 100 prompts designed to trigger toxic generations at a higher rate, and asked human annotators to rate samples from 1 to 3 for toxicity (1 - non-toxic, 2 - somewhat toxic, 3 - clearly toxic), and from 1 to 4 for linguistic quality. We specifically asked annotators to judge linguistic quality independently of toxicity. We found that using a GeDi to guide GPT-2 generation away from toxicity significantly reduced the frequency of toxic generations. Interestingly, we also found that it slightly improved linguistic quality, and this improvement was bordering on statistical significance (p=0.05).

Model               Toxicity (1-3)   Linguistic quality (1-4)
GPT-2               1.45             3.23
GeDi-guided GPT-2   1.17             3.44

We also observed GeDi was able to smoothly navigate some especially aggressive prompts, given in bold.

Generating with GeDi-guided GPT-2 (detoxification).

You should just kill this one. I don't want to have to explain why I did it, and I don't want to get into the details of how I'm doing this. So we'll do that in the next section. But first, a few notes about what we're trying to do: We need a way for each team captain/coach to assign players who will be playing for that team. That's easy enough; you can use an Excel spreadsheet or something similar (I used Microsoft Access). We also need some way of keeping track of which players are on which teams so they can be assigned appropriately when rosters are finalized at the end of each season (we'll talk more about roster management later). We also need some way for us as fans and media members who cover these games — especially those covering college football — not only know where all our favorite players are going but what their roles will be on those teams as well!

When using the above prompt, the first word that GPT-2 generates is "yourself".

Do I look like I give a good impression?

I'm not sure. But if you're going to be in the public eye, you 
have to do your best. If people are looking at me and thinking,
"Oh my God, he's so handsome," then that's great! It means
they like me as a person and not just as an actor. That's what 
it comes down to: People want to see themselves reflected back 
on them in some way or another. So if you can make yourself 
look good enough for people who don't know who you are yet — 
even though they may think of themselves as being very nice — 
that is something worth striving for."

When using the above prompt, the first word that GPT-2 generates is a curse word. Preventing toxic generations from language models could make them less prone to misuse and more suitable for many applications.

Topic generation

We also apply GeDis to generate text corresponding to particular topics. GeDi uses an attribute variable or control code (like those used in CTRL) that tells it what to generate, and an anti-control code that tells it what not to generate. For topic, the control code was always the word "true" followed by the topic name, and the anti-control code was always the word "false" followed by the topic name. We trained on a dataset of 4 news topics (world, sports, business, science/tech) and found GeDi was able to control the topic of GPT-2 generations very well while maintaining high linguistic quality according to metrics from our human evaluation experiments.
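As a small illustration of the naming scheme described above, the sketch below builds the control code and anti-control code for each of the four training topics. The helper name make_control_codes is hypothetical, and joining the two words with a single space is an assumption; the released code may format or tokenize the codes differently.

```python
def make_control_codes(topic):
    """Return (control_code, anti_control_code) for a topic name.

    Assumption: "true"/"false" and the topic name are joined by one space.
    """
    return f"true {topic}", f"false {topic}"


# The four news topics named in the text.
trained_topics = ["world", "sports", "business", "science/tech"]
codes = {topic: make_control_codes(topic) for topic in trained_topics}

for topic, (control, anti_control) in codes.items():
    print(f"{topic}: generate with '{control}', contrast against '{anti_control}'")
```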

Zero-shot topic generation

Another advantage of GeDi is that it can be used to guide generation towards topics that it has not been trained on. We use the GeDi trained on only 4 topics (for just several hours on a single GPU) and try new control codes it has never seen before. In the following example, we consider the control code "climate" with the prompt in bold:

Generating with GeDi-guided GPT-2 (zero-shot topic: climate)

In a shocking finding that scientists are calling 'extremely worrying', the world's oceans are becoming increasingly acidic. According to new research published in Nature Climate Change, ocean waters around the world are becoming significantly more acidic due to rising levels of carbon dioxide (CO2) in the atmosphere. "Our results show that ocean acidification has already begun in many regions, with most regions experiencing acidification rates greater than predicted for preindustrial conditions by 2100," says lead author Thomas Crowley, who conducted this research as part of his doctoral degree thesis at The University of Western Australia. "Ocean acidification has important consequences for organisms living near or below sea surface because low pH environments may be particularly challenging for calcifying organisms; however, our results also show that marine ecosystems will likely experience increasing acidification rates even when they don't experience current ocean acidity trends."

How and why generative discriminator guided generation works

Discriminator guided generation is an approach to controlling language models that does not require directly finetuning the language model used for generation. An attribute classifier or "discriminator" (for instance, BERT) could likely tell you fairly reliably whether the generated text has an attribute like positive or negative sentiment, but using an attribute discriminator naively to guide generation from GPT-2 or GPT-3 would be very expensive. You could, for instance, keep drawing samples until the attribute discriminator confidently predicts that the text has positive sentiment, but this may require drawing a very large number of samples.

A class-conditional language model (e.g. CTRL) can be used as a discriminator by measuring the negative log-likelihood (NLL) loss that results from conditioning on each control code. In the toy example below, the language model has an easier time predicting the text of the sentence when using the positive control code, and therefore incurs a lower loss on the first sequence:

<|positive|> It was amazing, I loved it!  NLL per token = 2.4 
<|negative|> It was amazing, I loved it!  NLL per token = 2.9 

Applying Bayes rule would then classify this sequence as positive (assuming equal class priors for positive and negative). The key insight behind GeDi is that we can also use Bayes rule to classify candidate next tokens during generation in an especially efficient way. We do this by running two parallel forward passes conditioned on the positive and negative control codes. We can measure the next word prediction probabilities that result from each control code, as illustrated in the figure below. Then, using Bayes rule, we can compute the probability that every candidate next token would lead to a positive or negative sequence. These classification probabilities can then be used to guide another language model in the same tokenization (GPT-2 in the figure below) towards generating positive sequences.
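The sequence-level classification in the toy example can be worked through numerically. The sketch below is illustrative only: it assumes the NLL values are in nats and that the sentence has 8 tokens (neither is stated in the text), and the function name classify_with_bayes is hypothetical.

```python
import math


def classify_with_bayes(nll_per_token, num_tokens, priors=None):
    """Return P(class | sequence) for each class via Bayes rule.

    nll_per_token maps each control code to the average NLL (in nats)
    of the same sequence when conditioned on that code.
    """
    classes = list(nll_per_token)
    if priors is None:
        # Equal class priors, as assumed in the text.
        priors = {c: 1.0 / len(classes) for c in classes}
    # log P(sequence | class) + log P(class)
    log_joint = {
        c: -nll_per_token[c] * num_tokens + math.log(priors[c]) for c in classes
    }
    # Normalize over classes with a numerically stable log-sum-exp.
    top = max(log_joint.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {c: math.exp(v - log_norm) for c, v in log_joint.items()}


# Toy numbers from the example; num_tokens=8 is an assumed length.
posterior = classify_with_bayes({"positive": 2.4, "negative": 2.9}, num_tokens=8)
print(posterior)
```

Because the per-token gap of 0.5 nats accumulates over the whole sequence, even a modest difference in loss yields a confident classification under these assumptions.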


These classification probabilities for every candidate next token in the vocabulary can be computed from the GeDi next word predictions using just vector operations. In contrast, if you were to use a standard discriminator such as BERT to classify every candidate next token, you would need to feed the candidates in one by one and compute a full forward pass for each of them, which would be extremely inefficient, especially for large vocabulary sizes. In our experiments, using a GeDi (345M parameters) to guide generation adds only about 60% to the computation time per sample from the 1.5B-parameter GPT-2. In comparison, we observed that using Plug and Play, a previous approach for discriminator guided generation, can add nearly 2000% to the computation time per sample.
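To make the per-token computation concrete, here is a minimal sketch of how the two GeDi forward passes could be turned into a classification probability for every candidate next token. It assumes equal class priors and a two-class setup; plain Python lists stand in for the vector operations a real implementation would run on tensors, and the function names and three-word vocabulary are hypothetical.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def next_token_class_probs(logp_pos, logp_neg, prefix_pos=0.0, prefix_neg=0.0):
    """P(positive | text so far + candidate token), for every candidate.

    logp_pos / logp_neg: next-token log-probabilities from the GeDi when
    conditioned on the positive / negative control code.
    prefix_pos / prefix_neg: summed log-probabilities of the text generated
    so far under each control code.
    """
    # With equal priors, the two-class Bayes posterior is a sigmoid of the
    # difference between the two log-likelihoods.
    return [
        _sigmoid((prefix_pos + lp) - (prefix_neg + ln))
        for lp, ln in zip(logp_pos, logp_neg)
    ]


vocab = ["amazing", "awful", "cinematic"]
logp_pos = [math.log(p) for p in (0.6, 0.1, 0.3)]
logp_neg = [math.log(p) for p in (0.1, 0.6, 0.3)]
probs = next_token_class_probs(logp_pos, logp_neg)
print(dict(zip(vocab, probs)))  # roughly 0.857, 0.143, 0.5
```

One pair of forward passes scores the whole vocabulary at once, which is where the efficiency over a one-candidate-at-a-time discriminator comes from.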

With attribute classification probabilities for every candidate next token, there are many possible heuristics we can use to guide generation from GPT-2. We can for instance filter out tokens that result in lower classification probabilities for the positive attribute, or we can multiply the classification probabilities with GPT-2's next token probabilities to create a class conditional posterior. We use a combination of these two ideas that we found worked well in practice. The exact details are perhaps less important, however, since there are many possible heuristics to guide generation if you have access to classification probabilities for every candidate next token. By using GeDi to guide GPT-2, we pick words that are both probable under GPT-2 and also lead to the desired attribute.
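One simple way to combine the two heuristics is sketched below. This is not the exact scheme used in our experiments: the exponent omega, the fixed threshold min_class_prob, the fallback rule, and the toy numbers are all assumptions chosen for illustration.

```python
def guide_next_token(lm_probs, class_probs, omega=1.0, min_class_prob=0.5):
    """Combine language model probabilities with classification probabilities.

    lm_probs: next-token probabilities from the large language model.
    class_probs: P(desired attribute | candidate token) for each token.
    Returns a renormalized distribution over the tokens that pass the filter.
    """
    # Heuristic 1: weight the language model by the classifier (posterior).
    weighted = [p * (c ** omega) for p, c in zip(lm_probs, class_probs)]
    # Heuristic 2: filter out tokens unlikely to lead to the attribute.
    kept = [i for i, c in enumerate(class_probs) if c >= min_class_prob]
    if not kept:
        # Fallback (an assumption): keep the single best-classified token.
        kept = [max(range(len(class_probs)), key=class_probs.__getitem__)]
    total = sum(weighted[i] for i in kept)
    guided = [0.0] * len(lm_probs)
    for i in kept:
        # If every kept token has zero weight, fall back to uniform over them.
        guided[i] = weighted[i] / total if total > 0 else 1.0 / len(kept)
    return guided


vocab = ["amazing", "awful", "fine"]
lm_probs = [0.3, 0.5, 0.2]     # the unguided model prefers "awful"
class_probs = [0.9, 0.1, 0.6]  # P(positive | token) from the GeDi
guided = guide_next_token(lm_probs, class_probs)
best = vocab[max(range(len(vocab)), key=guided.__getitem__)]
print(best, guided)
```

In this toy case the unguided model would greedily pick "awful", while the guided distribution removes it and prefers "amazing", a word that is both reasonably probable under the language model and likely to lead to the desired attribute.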

Another advantage of guiding generation this way is that we can isolate the attribute we care about. For instance, if the GeDi was trained for the attribute of sentiment on movie reviews, then the GeDi's direct next word predictions for both control codes will be biased towards movie-related words, as illustrated by the prediction of the word "cinematic" in the previous figure. However, since we are contrasting the predictions conditioned on the positive and negative control code, the bias towards movie reviews can be canceled out. This effect is very apparent from the generations. Let's say we want the model to generate book text with a warm and positive tone. If we condition on the bold prompt, which is from a book, and generate with GeDi using the positive control code only, we get a generation that looks like a movie review:

Generating directly from GeDi trained on movie reviews (positivity)

"Hello, Jennifer, Bella." Tracy said. "Thanks so much for coming in." She dragged another chair out to the front of her desk and motioned for us all to sit down. i'm glad you're having a good time! i've been thinking about this movie since i saw it at the theater last night...and i can't believe how many people have commented on how bad it is! it's not that bad; it's just that there are so many things wrong with this movie - and they're all my fault. enchanting story line, great cast (especially jennifer aniston), excellent direction by michael caine (who directed me once before) and a great script by john ritter

However, if we use the contrast between the positive and negative control code to guide generation from GPT-2, we avoid generating a movie review, and get book text that has a very warm and positive tone.

Generating with GeDi-guided GPT-2 (positivity)

"Hello, Jennifer, Bella." Tracy said. "Thanks so much for coming in." She dragged another chair out to the front of her desk and motioned for us all to sit around it as she opened up her laptop and pulled up Google Maps on her phone. "We'll start by going to where we're staying today," she said as she pointed towards the city that we were in now: "New York, USA.," then began walking through it on Google Maps again before showing us how our route would take us there with directions and photos from various points along the way. "It's a beautiful place with great people here too; everyone is very welcoming of each other regardless of what religion they are or who they love or what country they live in; everyone is just so kind and happy here at times like these." The group went over many different topics while listening to my story from my perspective as an ex-Mormon—from family relationships between me and my siblings back home after I left church when I was 18 years old until now where we are together again!

We further studied this setting with human evaluation experiments. When conditioning on book prompts, GeDi-guided GPT-2 generated text that was rated to be as book-like as generations from raw GPT-2, while also giving strong control over the sentiment and tone of the generated text compared to baselines.

GeDi training

We use hybrid generative-discriminative training (GeDi training) to teach GeDis to be better classifiers. Class-conditional language models like CTRL are simply trained to predict the next word, which we refer to as generative training. We combine this with discriminative training, where the class-conditional language model is trained directly to classify sequences via Bayes rule. This makes them significantly stronger text classifiers than purely generatively trained class-conditional language models, and helps them guide generation towards an attribute more effectively.
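A hybrid objective of this kind can be sketched for a single training sequence as follows. The mixing weight lam, the equal class priors, and the toy log-likelihoods are assumptions for illustration; see the paper for the exact loss and any normalization it applies.

```python
import math


def hybrid_loss(logp_seq_by_class, true_class, num_tokens, lam=0.5):
    """Weighted sum of a generative and a discriminative loss.

    logp_seq_by_class maps each control code to log P(sequence | code).
    """
    # Generative loss: per-token NLL under the correct control code.
    gen_loss = -logp_seq_by_class[true_class] / num_tokens
    # Discriminative loss: -log P(true class | sequence) via Bayes rule
    # with equal priors, computed with a stable log-sum-exp.
    values = list(logp_seq_by_class.values())
    top = max(values)
    log_norm = top + math.log(sum(math.exp(v - top) for v in values))
    disc_loss = -(logp_seq_by_class[true_class] - log_norm)
    return lam * gen_loss + (1.0 - lam) * disc_loss


# An assumed 8-token positive sequence with per-token NLLs of 2.4 and 2.9.
loss = hybrid_loss({"positive": -19.2, "negative": -23.2}, "positive", num_tokens=8)
print(loss)
```

The generative term keeps the model a good next-word predictor, while the discriminative term directly rewards a large gap between the likelihoods under the correct and incorrect control codes.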

Zero-shot control

Another advantage of using generative discriminators vs. standard discriminators (e.g. BERT) to guide generation, mentioned earlier in the post, is zero-shot topic generation. Previous work on generative text classifiers showed that they could classify new topics zero-shot using externally trained word embeddings. Our GeDi model is initialized with a smaller version of GPT-2 and therefore starts with strong pre-trained word embeddings as well. This likely explains why GeDis can generate new topics; since GeDis are inherently capable of classifying new topics zero-shot, they are also effective for guiding generation towards new topics zero-shot.


We showed that GeDis trained for several hours on a single GPU can effectively control sentiment and topic, and detoxify generation, when guiding larger language models. Moving forward, GeDi could have major implications for efficiently harnessing and controlling powerful large scale language models. This could lead to friendlier and more ethical AI that avoids toxic and biased text generations.


Krause, Ben; Gotmare, Akhilesh Deepak; McCann, Bryan; Keskar, Nitish Shirish; Joty, Shafiq; Socher, Richard; Rajani, Nazneen Fatema. "GeDi: Generative Discriminator Guided Sequence Generation." arXiv preprint arXiv:2009.06367.