In recent years, the natural language processing (NLP) community has seen the development of increasingly powerful language models [1, 2], capable of generating textual output that is indistinguishable from human-written text. This includes our own model, CTRL (Conditional Transformer Language Model) [3], which was built for controllable generation.
To prevent misuse or abuse of such models, we feel a responsibility to contribute to the body of knowledge on forensic techniques for detecting this type of content. Today we present our contribution: a detector for machine-generated text.
For context, a language model is a machine learning model that is trained to predict the next word given an input context. Such a model can therefore generate text one word at a time, feeding each predicted word back in as context for the next. These predictions can also be conditioned on human-provided input in order to control the content and style of the generated text, as with CTRL. In CTRL, special keywords called control codes let humans more explicitly influence the style, genre, entities, relationships between entities, and dates in the generated text. As a result, the model is less likely to generate random word sequences than previously released technology.
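To make the word-by-word generation loop concrete, here is a minimal toy sketch in Python. It is not CTRL: the lookup table `TOY_MODEL`, the two control codes as used here, and the function `generate` are hypothetical stand-ins for illustration only. A real model would compute a learned probability distribution over a large vocabulary at each step rather than consult a hand-written table.

```python
import random

# Hypothetical toy "model": for each control code, a table mapping the
# previous word to candidate next words. This only illustrates the loop.
TOY_MODEL = {
    "Reviews": {
        "<start>": ["this"],
        "this": ["product", "phone"],
        "product": ["works"],
        "phone": ["works"],
        "works": ["well", "fine"],
        "well": ["<end>"],
        "fine": ["<end>"],
    },
    "Politics": {
        "<start>": ["the"],
        "the": ["senate", "committee"],
        "senate": ["voted"],
        "committee": ["voted"],
        "voted": ["today", "yesterday"],
        "today": ["<end>"],
        "yesterday": ["<end>"],
    },
}


def generate(control_code, max_words=10, seed=0):
    """Generate text one word at a time, conditioned on a control code."""
    rng = random.Random(seed)
    table = TOY_MODEL[control_code]  # the control code selects the behaviour
    words = []
    prev = "<start>"
    for _ in range(max_words):
        # "Predict" the next word given the previous one, then sample.
        candidates = table.get(prev, ["<end>"])
        nxt = rng.choice(candidates)
        if nxt == "<end>":
            break
        words.append(nxt)
        prev = nxt  # the new word becomes context for the next step
    return " ".join(words)


if __name__ == "__main__":
    print(generate("Reviews"))
    print(generate("Politics"))
```

The same loop produces different styles of text depending only on the control code passed in, which is the idea behind controllable generation.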
When we released CTRL, we were acutely aware that, alongside the model’s numerous benefits, there were also potential risks. With controllable generation, malicious actors might generate texts that quickly propagate, set political agendas, influence elections, and undermine user trust. As we wrote at its launch: “We recognize the potential for misuse or abuse, including use by bad actors who could manipulate the system to act maliciously and generate text to influence decision-making in political, economic, and social settings. False attribution could also harm individuals, organizations, or other entities. To address these concerns, the model was evaluated internally as well as externally by third parties, including the Partnership on AI, prior to release.”
The following sections describe our detector for machine-generated text and how it can help mitigate harm.
Method
We fine-tuned a RoBERTa model [4] on the task of labeling a sentence as human- or machine-generated. The training dataset consists of 250K human-written documents and 250K documents generated from a language model.
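As a rough illustration of the data side of this setup, the sketch below assembles a balanced, shuffled set of labeled examples from two document collections. The function name `build_dataset` and the label convention (0 for human, 1 for machine) are assumptions made for this example, not details of the released detector. The fine-tuning step itself is not shown.

```python
import random

# Label convention is an assumption for this sketch.
HUMAN, MACHINE = 0, 1


def build_dataset(human_docs, machine_docs, seed=0):
    """Return a shuffled list of (text, label) pairs with equal class sizes."""
    rng = random.Random(seed)
    # Trim both classes to the size of the smaller one so that the
    # classifier sees as many human examples as machine examples.
    n = min(len(human_docs), len(machine_docs))
    examples = [(doc, HUMAN) for doc in rng.sample(human_docs, n)]
    examples += [(doc, MACHINE) for doc in rng.sample(machine_docs, n)]
    rng.shuffle(examples)
    return examples


if __name__ == "__main__":
    data = build_dataset(["h1", "h2", "h3"], ["m1", "m2"])
    print(data)
```

A balanced set like this mirrors the equal 250K/250K split described above and keeps plain accuracy a meaningful metric, since a detector that always predicts one class would score only 50%.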
Findings
Length-specific models
We found that the detection rate drops as test sentences get shorter. Accuracy improves when the detector’s training texts are similar in word length to the test data.
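One way to observe this kind of effect is to break accuracy down by text length. The following is a minimal sketch, assuming a detector is available as a plain function from text to label. The name `accuracy_by_length`, the `predict` argument, and the bucket edges are hypothetical choices for illustration, and length is measured by simple whitespace splitting.

```python
from bisect import bisect_right
from collections import defaultdict


def accuracy_by_length(examples, predict, edges=(32, 64, 128)):
    """Compute accuracy per word-length bucket.

    examples: iterable of (text, label) pairs.
    predict:  function mapping text to a predicted label.
    edges:    sorted bucket boundaries in words; bucket 0 holds texts
              shorter than edges[0], and bucket i holds texts with
              edges[i-1] <= length < edges[i].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, label in examples:
        bucket = bisect_right(edges, len(text.split()))
        total[bucket] += 1
        correct[bucket] += int(predict(text) == label)
    return {b: correct[b] / total[b] for b in sorted(total)}


if __name__ == "__main__":
    # Stand-in detector that calls anything of three or more words "machine".
    toy_predict = lambda text: 1 if len(text.split()) >= 3 else 0
    toy_examples = [("a b", 1), ("x", 0), ("a b c d", 1)]
    print(accuracy_by_length(toy_examples, toy_predict, edges=(3,)))
```

The same bucketing could be used to route a test text to a detector trained on texts of similar length, which is the idea behind length-specific models.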
Hybrid Models
Media platform moderators that deploy language detectors often don’t have access to the adversary’s generator model, so it is important for researchers to examine how well a detector transfers to text from generators it was not trained on. Our research showed that a detector trained on GPT-2 output generalized relatively well to CTRL-generated texts, whereas a detector trained on CTRL output generalized poorly to GPT-2-generated texts.
Alternatively, we might know which generator an adversary is using, but the adversary may have fine-tuned the model to generate text with a different style and content than was present in the original training corpus, which makes the generated text harder to detect. To examine how detectors generalize across fine-tuned models, we conducted experiments on two fine-tuning domains: one general domain (RealNews) and one specific domain (excerpts from speeches by U.S. President Donald Trump). We observed that the detection rate drops when the detector is applied to texts generated by a fine-tuned generator. That said, the detector’s accuracy still ranged from 80.70% to 99.86%.
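A transferability check of this kind can be organized as a cross-evaluation grid: every detector is scored on test text from every generator. The sketch below assumes detectors are plain functions from text to label. The names `accuracy` and `transfer_matrix` and the toy detectors and data are hypothetical, and no figures from our experiments are reproduced here.

```python
def accuracy(predict, examples):
    """Fraction of (text, label) pairs that predict labels correctly."""
    examples = list(examples)
    hits = sum(predict(text) == label for text, label in examples)
    return hits / len(examples)


def transfer_matrix(detectors, test_sets):
    """Score each detector on each generator's test set.

    detectors: dict mapping the generator a detector was trained on
               to its predict function.
    test_sets: dict mapping a generator name to (text, label) pairs.
    Returns a dict keyed by (trained_on, tested_on).
    """
    return {
        (train_name, test_name): accuracy(predict, examples)
        for train_name, predict in detectors.items()
        for test_name, examples in test_sets.items()
    }


if __name__ == "__main__":
    # Toy stand-ins: one detector always says "machine" (1), one "human" (0).
    detectors = {"A": lambda text: 1, "B": lambda text: 0}
    test_sets = {
        "A": [("x", 1), ("y", 1)],
        "B": [("x", 1), ("z", 0)],
    }
    for pair, score in transfer_matrix(detectors, test_sets).items():
        print(pair, score)
```

Entries where the two names match measure in-distribution accuracy, and the remaining entries measure transfer. Comparing the two off-diagonal entries for a pair of generators shows whether transfer is asymmetric.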
An important observation is that training on multiple generators helps detection on sentences generated by a fine-tuned model, likely because it encourages the detector to learn properties of generated text that are not specific to the training environment of a particular generator.
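A simple way to set up such multi-generator training is to draw an equal share of machine-generated documents from each generator. The sketch below is one possible sampling scheme, not a description of our exact data mix. The function name `pool_generators`, the equal-share policy, and the toy inputs are assumptions.

```python
import random


def pool_generators(docs_by_generator, total, seed=0):
    """Sample roughly `total` documents, split evenly across generators.

    docs_by_generator: dict mapping a generator name to a list of texts.
    Returns a shuffled list of (text, generator_name) pairs.
    """
    rng = random.Random(seed)
    names = sorted(docs_by_generator)
    share = total // len(names)  # equal quota per generator
    pooled = []
    for name in names:
        docs = docs_by_generator[name]
        # Never request more documents than a generator actually has.
        picked = rng.sample(docs, min(share, len(docs)))
        pooled += [(doc, name) for doc in picked]
    rng.shuffle(pooled)
    return pooled


if __name__ == "__main__":
    docs = {
        "gpt2": [f"g{i}" for i in range(10)],
        "ctrl": [f"c{i}" for i in range(10)],
    }
    print(pool_generators(docs, total=6))
```

Keeping the generator name alongside each text makes it possible to check later that no single generator dominates the machine-generated class.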
Additional information about the research and its findings can be found in the Model Card.
Conclusion
The language detector is able to detect generated text from CTRL and GPT-2 reasonably well under some circumstances. However, this performance does not necessarily carry over to text generated by other language models, and the detector has not been evaluated against the variety of generated text that might appear in a production system.
We hope this research will help advance experimentation in automatic detection of generated texts. We demonstrate the capabilities and limits of a detector of this kind, and we also show that training a detector on data from multiple generators can help with transferability.
Acknowledgements
We would like to thank the following individuals for their work:
• Xuanyu Zhou as the primary contributor to this project,
• Jesse Vig for his leadership and stewardship of the research,
• Bryan McCann who initiated the work on the detector and subsequent research effort, and
• Kathy Baxter for her assistance in providing ethical oversight of the project.
Resources
Code and demo available at https://github.com/salesforce/ctrl-detector
Additional details provided in the Model Card: https://github.com/salesforce/ctrl-detector/blob/master/ModelCard.pdf
Citations
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners, 2020.
[3] Nitish Shirish Keskar, Bryan McCann, Lav Varshney, Caiming Xiong, and Richard Socher. CTRL - A Conditional Transformer Language Model for Controllable Generation. arXiv preprint arXiv:1909.05858, 2019.
[4] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019.