Given the goal of improving software development productivity with machine learning, software intelligence research has attracted increasing attention in both academia and industry over the last decade. Code intelligence techniques can help developers reduce tedious, repetitive workloads, improve programming quality, and boost overall development productivity, cutting the time spent writing software as well as computational and operational costs. In this work, we focus on the fundamental challenge of code pre-training, which has great potential to benefit a wide spectrum of downstream applications across the software development lifecycle.
However, existing code pre-training methods have two major limitations. First, they often rely on either an encoder-only model similar to BERT or a decoder-only model like GPT, which are suboptimal for generation and understanding tasks, respectively. For example, CodeBERT [2] requires an additional decoder for the code summarization task, and this decoder cannot benefit from the pre-training. Second, most current methods simply adopt conventional NLP pre-training techniques on source code, treating it as a sequence of tokens like natural language (NL). This largely ignores the rich structural information in programming languages (PL), which is vital for fully comprehending code semantics.
To address these limitations, we developed CodeT5, an identifier-aware, unified pre-trained encoder-decoder model. CodeT5 achieves state-of-the-art performance on multiple code-related downstream tasks, including understanding tasks such as defect detection and clone detection, and generation tasks in the PL-NL, NL-PL, and PL-PL directions. In what follows, we explain how CodeT5 works.
CodeT5 builds on an encoder-decoder architecture similar to T5 but incorporates code-specific knowledge to endow the model with better code understanding. It takes code and its accompanying comments as a sequence input. As illustrated in the figure above, we pre-train CodeT5 by alternately optimizing objectives (a)-(c) and then objective (d); a small sketch of how such training examples can be constructed follows the list of objectives:
Objective (a): Masked Span Prediction (MSP) randomly masks spans of arbitrary lengths and requires the decoder to recover the masked spans. It captures the syntactic information of the NL-PL input and learns robust cross-lingual representations, as we pre-train on multiple PLs with a shared model.
Objective (b): Identifier Tagging (IT) is applied only to the encoder and trains it to distinguish whether each code token is an identifier (e.g., a variable or function name) or not. It works like the syntax highlighting feature in some developer tools.
Objective (c): Masked Identifier Prediction (MIP), in contrast to MSP, masks only identifiers and employs the same mask placeholder for all occurrences of one unique identifier. It works like deobfuscation in software engineering and is a more challenging task that requires the model to comprehend the code semantics from the obfuscated code.
Objective (d): Bimodal Dual Generation (dual-gen) jointly optimizes the conversion from code to its comments and vice versa. It encourages a better alignment between the NL and PL counterparts.
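To make these objectives more concrete, here is a minimal, hypothetical sketch of how input/target pairs for MSP, IT, and MIP could be built from a toy, whitespace-tokenized function. This is not the authors' pre-processing code: the real pipeline operates on subword tokens, uses T5-style sentinel tokens, and chooses spans and masking rates differently.

```python
import random

# Toy example: a whitespace-tokenized Python function and its identifiers.
code = "def add ( a , b ) : return a + b".split()
identifiers = {"add", "a", "b"}

def masked_span_prediction(tokens, span_len=2, seed=0):
    """(a) MSP: mask one random contiguous span; the target is the
    sentinel token followed by the tokens it replaced (T5-style)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    start = rng.randrange(0, len(tokens) - span_len)
    sentinel = "<extra_id_0>"
    target = [sentinel] + tokens[start:start + span_len]
    tokens[start:start + span_len] = [sentinel]
    return tokens, target

def identifier_tagging(tokens, identifiers):
    """(b) IT: a binary label per input token for the encoder,
    1 if the token is an identifier and 0 otherwise."""
    return [1 if tok in identifiers else 0 for tok in tokens]

def masked_identifier_prediction(tokens, identifiers):
    """(c) MIP: replace every occurrence of each unique identifier with the
    same placeholder; the target maps placeholders back to the names."""
    placeholder = {name: f"<MASK{i}>" for i, name in enumerate(sorted(identifiers))}
    masked = [placeholder.get(tok, tok) for tok in tokens]
    target = []
    for name, ph in placeholder.items():
        target += [ph, name]
    return masked, target

print(masked_span_prediction(code))
print(identifier_tagging(code, identifiers))
print(masked_identifier_prediction(code, identifiers))
```

Objective (d), bimodal dual generation, needs no masking at all: the same code/comment pair is used in both directions (comment to code and code to comment) as paired sequence-to-sequence examples.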
CodeT5 achieves state-of-the-art (SOTA) performance on fourteen subtasks of the CodeXGLUE code intelligence benchmark [3], as shown in the following tables. It significantly outperforms the previous SOTA model PLBART [4] on all generation tasks, including code summarization, text-to-code generation, code-to-code translation, and code refinement. On understanding tasks, it yields better accuracy on defect detection and comparable results on clone detection. In addition, we observe that bimodal dual generation primarily boosts NL-PL tasks such as code summarization and text-to-code generation. Please check out our paper [1] for more details.
You might be wondering how pre-trained code intelligence models like CodeT5 can improve developer productivity in real-world scenarios. At Salesforce, we can use CodeT5 to build an AI-powered coding assistant for Apex developers. Here we demonstrate an example of a CodeT5-powered coding assistant with three code intelligence capabilities: text-to-code generation (generate code from a natural language description), code autocompletion (complete a function given its signature), and code summarization (summarize a function as a natural language comment).
For the first two capabilities, developers simply type a natural language description or a function signature to specify their intent, and the AI coding assistant generates or completes the target function for them. This accelerates implementation and reduces reliance on external resources. For code summarization, the assistant automatically summarizes a function into code comments, which enables faster documentation and easier software maintenance.
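As a rough illustration of the summarization capability, the sketch below calls the publicly released CodeT5 summarization checkpoint (Salesforce/codet5-base-multi-sum) through Hugging Face Transformers. It is only a stand-in for the assistant described above, which serves Apex developers through a dedicated interface rather than a raw model call; the example code snippet being summarized is made up for illustration.

```python
# Minimal sketch using the public CodeT5 checkpoint on Hugging Face;
# the in-house Apex assistant wraps fine-tuned models behind its own service.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

checkpoint = "Salesforce/codet5-base-multi-sum"  # multilingual code summarization checkpoint
tokenizer = RobertaTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

code = """def greet(name):
    if not name:
        return "Hello, world!"
    return f"Hello, {name}!"
"""

input_ids = tokenizer(code, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_length=32, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```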
We discuss four ethical risks of our work in the following:
Yue Wang is an applied scientist at Salesforce Research Asia, which he joined after earning his PhD from The Chinese University of Hong Kong in 2020. His research focuses on deep learning applications in natural language processing and its intersections with programming language processing and computer vision.
Steven Hoi is the Managing Director of Salesforce Research Asia and oversees Salesforce’s AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.
This blog is based on a research paper [1] authored by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. We thank Kathy Baxter for the ethical review, Amal Thannuvelil Surendran for help with the AI coding assistant, and Denise Perez for refining this post. We thank Michael Jones, Caiming Xiong, and Silvio Savarese for their support, and Akhilesh Deepak Gotmare, Amrita Saha, Junnan Li, and Chen Xing for valuable discussions.