TL;DR: CodeT5+ is a new family of open code large language models (LLMs) with improved model architectures and training techniques. CodeT5+ achieves state-of-the-art performance among open-source LLMs on many challenging code intelligence tasks, including zero-shot evaluation on the code generation benchmark HumanEval.
Large language models (LLMs) pretrained on vast amounts of source code (“code LLMs”) have achieved remarkable progress in code intelligence. For instance, with the help of AI generative tools, software developers can now write and maintain their code more easily, significantly improving their productivity. However, existing code LLMs still have two major limitations.
First, they often adopt a single, fixed architecture, which substantially limits how efficiently the models can adapt to downstream tasks. For instance, decoder-only models such as GPT-based LLMs do not perform well on understanding tasks such as defect detection and code retrieval. Quite often, the models require major changes to their architectures or additional tuning to suit downstream applications.
Second, current models often employ a limited set of pretraining objectives that might not be ideal for some downstream applications. For instance, T5-based models trained with a span denoising objective are not well suited to auto-regressive generation tasks like code completion. This discrepancy between pretraining and inference leads to significant performance degradation.
Meet CodeT5+, a family of code LLMs with substantially improved flexibility in terms of model architecture and learning objectives. CodeT5+ models can easily adapt to both code generation and understanding tasks while delivering competitive or even superior performance compared with many other LLMs.
The following are our main innovations and achievements:
The diagram below provides a brief, high-level overview of CodeT5+:
Despite being an encoder-decoder based model, our CodeT5+ can flexibly operate in encoder-only, decoder-only, and encoder-decoder modes to suit different downstream applications. In CodeT5+:
We combine different types of learning tasks, including span denoising, causal language modeling (CLM), text-code contrastive learning, and text-code matching. We found that such a wide set of pretraining tasks helps the models learn rich representations from both code and text data and bridges the pretrain-finetune gap across applications.
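To make the bimodal objectives more concrete, below is a minimal sketch of a symmetric text-code contrastive loss over a batch of paired embeddings. The InfoNCE-style formulation and the temperature value are common choices used here for illustration, not the exact CodeT5+ implementation.

```python
import torch
import torch.nn.functional as F

def text_code_contrastive_loss(text_emb: torch.Tensor,
                               code_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (text, code) pairs.

    text_emb, code_emb: (batch, dim) encoder embeddings, where row i of
    each tensor comes from the same text-code pair.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = text_emb @ code_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; score retrieval in both directions.
    loss_t2c = F.cross_entropy(logits, targets)
    loss_c2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2c + loss_c2t)

# Example with random tensors standing in for encoder outputs.
loss = text_code_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```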
In the diagram below, we illustrate our model components and how data inputs and outputs are processed.
Model Scaling

To efficiently scale up our models, we adopt a strategy of compute-efficient training with frozen code LLMs (such as CodeGen, as used in our work, or any other GPT-style LLM). In this strategy, we employ a “shallow encoder and deep decoder” architecture and keep only the small encoder and the cross-attention layers trainable, while freezing the deep decoder LLM. This design follows the intuition that the decoder handles most of the complexity of generation tasks and therefore requires the larger share of parameters.
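As a rough illustration of this freezing scheme, the sketch below marks only encoder and cross-attention parameters as trainable. The parameter-name checks are assumptions for illustration; the actual module names depend on the frozen LLM being plugged in.

```python
import torch.nn as nn

def freeze_for_compute_efficient_tuning(model: nn.Module) -> None:
    """Keep only the shallow encoder and cross-attention layers trainable.

    The substring checks below are illustrative placeholders; substitute
    the parameter names used by the actual frozen decoder (e.g., a
    CodeGen-style LLM).
    """
    for name, param in model.named_parameters():
        is_encoder = name.startswith("encoder")
        is_cross_attention = "crossattention" in name or "cross_attn" in name
        # The deep decoder LLM stays frozen; encoder + cross-attention train.
        param.requires_grad = is_encoder or is_cross_attention

def trainable_fraction(model: nn.Module) -> float:
    """Sanity check: fraction of parameters left trainable after freezing."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total
```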
We combine two major types of data modalities to train better code LLMs: unimodal code data and bimodal code-text data. There are two stages of pretraining:
We found that this stage-wise training approach can efficiently expose our models to more diverse data to learn rich contextual representations. The figure below shows some examples of how our models are trained on code data.
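Schematically, the stage-wise schedule amounts to two consecutive passes over different dataloaders. The sketch below is purely illustrative, with the stage-specific update steps supplied by the caller rather than spelled out here.

```python
def pretrain(model, stage1_code_batches, stage2_text_code_batches,
             stage1_step, stage2_step):
    """Run the two pretraining stages back to back (illustrative only).

    stage1_step / stage2_step are hypothetical callables that apply the
    stage-specific objectives to one batch of data.
    """
    # Stage 1: unimodal code data with generative objectives
    # (span denoising + causal LM).
    for batch in stage1_code_batches:
        stage1_step(model, batch)

    # Stage 2: bimodal text-code pairs, adding the contrastive and
    # matching objectives on top of the generative ones.
    for batch in stage2_text_code_batches:
        stage2_step(model, batch)
```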
Recent work from the NLP domain inspired us to further improve CodeT5+ by tuning the models on synthetic instruction-following tasks. The figure below shows a few examples of instruction data. The data is generated by prompting a pretrained LLM (i.e., text-davinci-003) to generate novel tasks, including task instructions, inputs (if any), and expected outputs.
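For illustration, one such instruction-following record could look like the dictionary below and be rendered into a single training prompt. The field names and the sample content are hypothetical stand-ins for the synthetic data, not actual records from our dataset.

```python
# Hypothetical instruction-tuning record (field names are illustrative).
example = {
    "instruction": "Write a Python function that checks whether a string is a palindrome.",
    "input": "",
    "output": (
        "def is_palindrome(s: str) -> bool:\n"
        "    s = s.lower()\n"
        "    return s == s[::-1]\n"
    ),
}

def render_prompt(record: dict) -> str:
    """Concatenate instruction, optional input, and a response marker."""
    parts = [f"Instruction: {record['instruction']}"]
    if record["input"]:
        parts.append(f"Input: {record['input']}")
    parts.append("Response:")
    return "\n".join(parts)

# The model is trained to produce record["output"] given this prompt.
print(render_prompt(example))
```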
CodeT5+ can flexibly operate in various modes to support different types of tasks:
We evaluated CodeT5+ on a set of over 20 benchmarks of diverse code generation and code understanding tasks. We found that CodeT5+ models achieve state-of-the-art (SoTA) performance on code generation and completion, math programming, and text-to-code retrieval tasks. The following are a few highlights of CodeT5+ performance results:
One of the most popular benchmarks for code generation is HumanEval, which challenges LLMs to generate Python functions given function signatures and docstrings as input. We found that our instruction-tuned CodeT5+ 16B achieves a new SoTA result of 35.0% pass@1 on HumanEval among open code LLMs, even surpassing the OpenAI code-cushman-001 model. Another highlight is its competitive performance even at small model scales. For instance, CodeT5+ 770M is on par with much larger LLMs such as InCoder 6B and PaLM 62B.
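For reference, pass@k on HumanEval is commonly computed with the unbiased estimator introduced with the benchmark: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of samples that pass the unit tests
    k: sample budget being evaluated
    """
    if n - c < k:
        return 1.0  # fewer than k failures: at least one success is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 15 passing -> estimated pass@1 of 0.075
print(pass_at_k(200, 15, 1))
```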
We further explore its capability to solve grade-school math problems on two benchmarks, MathQA and GSM8K. The task is to generate Python programs that solve math problems described in natural language. We found that CodeT5+ 770M achieves new SoTA results of 87.4% pass@80 on MathQA-Python and 73.8% pass@100 on GSM8K-Python under the finetuning evaluation setup. Notably, CodeT5+ models require far less compute to adapt to this math programming domain than the billion-parameter LLM baselines.
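As a usage sketch, a CodeT5+ checkpoint can be prompted to write such a program through the Hugging Face Transformers API. The checkpoint name and prompt format below are assumptions for illustration, and in our evaluation the models are first finetuned on the math programming data.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Checkpoint name is an illustrative assumption; substitute any released CodeT5+ size.
checkpoint = "Salesforce/codet5p-770m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# A GSM8K-style word problem, framed as a program-completion prompt.
prompt = (
    "# Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did she sell altogether?\n"
    "def solution():\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```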
In the figure below, we show an example of a math problem and how CodeT5+ solves it by generating Python code. On the right, we show that, compared to the original CodeT5, our model achieves better performance with improved reasoning capability. As the number of reasoning steps required to solve a problem increases, the CodeT5+ model remains more robust and consistently outperforms the baseline.
Another interesting aspect of CodeT5+ is that it can naturally be used as an end-to-end retrieval-augmented code generation system. The encoder first retrieves relevant code snippets, which are then included in the input to the decoder to improve code generation. On a code generation benchmark, we found that CodeT5+ outperforms similar approaches such as REDCODER, which require separate retriever and generator models.
In the figure below, we demonstrate an example where the retrieved code from CodeT5+ provides crucial context (e.g., use “urllib3” for an HTTP request) that guides the generation process toward a correct prediction. By contrast, the generation-only approach produces an incorrect prediction that only captures the concepts of “download” and “compress”.
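A minimal sketch of this retrieve-then-generate flow is shown below. In the real system the embeddings come from the CodeT5+ encoder and the decoder consumes the augmented input; here, random tensors and a plain string prompt stand in for those pieces.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb: torch.Tensor,
                   corpus_embs: torch.Tensor,
                   k: int = 1) -> torch.Tensor:
    """Return indices of the k most similar snippets by cosine similarity.

    query_emb: (dim,) embedding of the natural-language query.
    corpus_embs: (num_snippets, dim) precomputed snippet embeddings.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), corpus_embs, dim=-1)
    return sims.topk(k).indices

def retrieval_augmented_prompt(query: str, snippets: list, top_idx: torch.Tensor) -> str:
    """Prepend the retrieved snippets to the query before decoding."""
    retrieved = "\n".join(snippets[i] for i in top_idx.tolist())
    return f"# Retrieved context:\n{retrieved}\n# Task: {query}\n"

# Toy corpus and random embeddings standing in for encoder outputs.
corpus = ["def http_get(url): ...", "def gzip_file(path): ..."]
corpus_embs = torch.randn(len(corpus), 256)
query_emb = torch.randn(256)
top_idx = retrieve_top_k(query_emb, corpus_embs, k=1)
print(retrieval_augmented_prompt("download a file and compress it", corpus, top_idx))
```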
CodeT5+ is a new family of open code LLMs trained with a flexible model architecture and a diverse set of learning objectives. Operating in encoder-only, decoder-only, or encoder-decoder mode, CodeT5+ can be easily adapted to many downstream tasks, spanning both code understanding and generation. In addition, it employs a compute-efficient strategy to scale up the model by leveraging off-the-shelf LLMs, and uses instruction tuning to align with natural language instructions. Our models achieve the best performance among open-source LLMs on challenging benchmarks such as HumanEval, outperforming other LLMs such as LLaMA, StarCoder, and even OpenAI’s code-cushman-001 model.
CodeT5+ can be extended and improved in many ways. For instance, our model-scaling approach can integrate with any open-source LLM: CodeT5+ could be combined with the recent StarCoder or LLaMA models to take advantage of the different contextual representations they have learned.
By open-sourcing our models and sharing our research in depth, we hope CodeT5+ will spark more innovations and applications in code LLMs and encourage more open-source efforts in this line of research.
Yue Wang is a Research Scientist at Salesforce Research Asia with a focus on code LLMs. His research interests include language model pretraining, code understanding and generation, and multimodality. He is the main contributor to CodeT5, a family of open code LLMs that facilitates a wide range of code intelligence tasks.
Henry Hung Le is a Research Scientist at Salesforce Research Asia, focusing on AI for software research and machine learning applications. His research interests include code generation, program synthesis, language models, dialogue systems, and multimodality.
Akhilesh Deepak Gotmare is a Research Scientist at Salesforce Research Asia, where he works on deep learning and its natural language processing applications like text generation, code generation, and code search. He leads applied projects aimed at identifying optimizations in Apex code using deep learning.
Nghi D.Q. Bui is a Research Scientist at Salesforce Research Asia. He specializes in applying AI to software engineering problems, including code understanding, program repair, anti-pattern detection, and automated testing. His work helps push the boundaries of how AI can simplify and enhance various aspects of software development.
Junnan Li is a Research Scientist at Salesforce Research Asia. His research mainly focuses on building multimodal foundation models, including vision-language models and code models.
Steven C.H. Hoi is the Managing Director of Salesforce Research Asia and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.