CodeT5+: Open Code Large Language Models


TL;DR: CodeT5+ is a new family of open code large language models (LLMs) with improved model architectures and training techniques. CodeT5+ achieves state-of-the-art performance among open-source LLMs on many challenging code intelligence tasks, including zero-shot evaluation on the code generation benchmark HumanEval.


Background: Code LLMs

Large language models (LLMs) pretrained on vast amounts of source code ("code LLMs") have driven remarkable progress in code intelligence. With the help of such generative AI tools, software developers can create and maintain code more easily and thus significantly improve their productivity. However, existing code LLMs still have two major limitations.

First, they often adopt a single, fixed architecture, which substantially limits how efficiently the models can adapt to downstream tasks. For instance, decoder-only models such as GPT-based LLMs do not perform well on understanding tasks such as defect detection and code retrieval. Quite often, these models require major architectural changes or additional tuning to suit downstream applications.

Second, current models often employ a limited set of pretraining objectives that might not be ideal for some downstream applications. For instance, T5-based models trained with a span-denoising objective are not well suited to auto-regressive generation tasks such as code completion. This discrepancy between pretraining and inference leads to significant performance degradation.

Introducing CodeT5+, a State-of-the-art Open-source Code LLM

Meet CodeT5+, a family of code LLMs with substantially improved flexibility in model architecture and learning objectives. CodeT5+ models can easily adapt to both code generation and understanding tasks while delivering competitive or even superior performance compared with many other LLMs.
The following are our main innovations and achievements:

  • We design a flexible encoder-decoder architecture whose component modules can be combined to suit a wide range of downstream code tasks. This flexibility is enabled by our proposed mixture of pretraining objectives, which mitigates the pretrain-finetune discrepancy.
  • We efficiently scale up CodeT5+ by building on frozen off-the-shelf LLMs instead of training from scratch, and we explore instruction tuning to align our models with natural language instructions.
  • We extensively evaluate CodeT5+ on over 20 code-related benchmarks. We observe state-of-the-art (SoTA) model performance on various tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. On the HumanEval benchmark, our 16B model even outperforms OpenAI’s code-cushman-001 model.

The diagram below provides a brief, high-level overview of CodeT5+:

Model Architecture and Training

Although CodeT5+ is an encoder-decoder model, it can flexibly operate in encoder-only, decoder-only, and encoder-decoder modes to suit different downstream applications. In CodeT5+:

  • The encoder learns to encode contextual representations from code/text sequences, whether complete, partial, or span-masked.
  • The decoder is trained to generate different types of outputs, depending on the pretraining task.
  • A mixture of pretraining tasks enables the models to learn meaningful representations of code contexts and recover missing information at different levels: code spans, partial programs, and complete programs.

We combine different types of learning tasks, including span denoising, causal language modeling (CLM), and text-code contrastive learning and matching. We found that this broad set of pretraining tasks helps the models learn rich representations from both code and text data and bridges the pretrain-finetune gap across various applications.
The diagram below illustrates our model components and how data inputs and outputs are processed.
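
To make the mixture of objectives more concrete, below is a minimal sketch of how span-denoising and causal-LM training pairs could be derived from the same tokenized code sequence. It is an illustration only: the sentinel-token convention, masking rate, and span length are assumptions, not the exact CodeT5+ preprocessing.

```python
import random

SENTINELS = [f"<extra_id_{i}>" for i in range(100)]  # T5-style sentinel tokens (assumed convention)

def span_denoising_example(tokens, mask_rate=0.15, mean_span=3):
    """Mask random spans; the decoder must reconstruct them (span denoising)."""
    src, tgt, i, sid = [], [], 0, 0
    while i < len(tokens):
        if sid < len(SENTINELS) and random.random() < mask_rate / mean_span:
            span = tokens[i:i + mean_span]
            src.append(SENTINELS[sid])          # replace the span with a sentinel
            tgt += [SENTINELS[sid]] + span      # target recovers the masked span
            i += len(span)
            sid += 1
        else:
            src.append(tokens[i])
            i += 1
    return src, tgt

def causal_lm_example(tokens, min_prefix=1):
    """Split at a random pivot; the decoder autoregressively generates the suffix (CLM)."""
    pivot = random.randint(min_prefix, len(tokens) - 1)
    return tokens[:pivot], tokens[pivot:]

code = "def add ( a , b ) : return a + b".split()
print(span_denoising_example(code))
print(causal_lm_example(code))
```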

Model Scaling

To efficiently scale up our models, we adopt a compute-efficient training strategy with frozen code LLMs (such as CodeGen, as used in our work, or any other GPT-style LLM). In this strategy, we employ a "shallow encoder and deep decoder" architecture and keep only the small encoder and the cross-attention layers trainable while freezing the deep decoder LLM. This design follows the intuition that the decoder handles most of the complexity in generation tasks and therefore requires a larger number of parameters.
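
As a rough illustration of this recipe, the sketch below freezes all decoder parameters except its cross-attention layers while leaving the small encoder trainable. The encoder/decoder attribute names and the "crossattention" substring are assumptions about a generic encoder-decoder implementation, not the actual CodeT5+ training code.

```python
import torch.nn as nn

def freeze_for_efficient_scaling(model: nn.Module):
    """Shallow-encoder / deep-decoder setup: freeze the large decoder LLM,
    keep the small encoder and the cross-attention layers trainable.
    Assumes `model.encoder`, `model.decoder`, and cross-attention parameter
    names containing "crossattention" (naming varies across implementations)."""
    for p in model.decoder.parameters():
        p.requires_grad = False                 # freeze the deep decoder LLM
    for name, p in model.decoder.named_parameters():
        if "crossattention" in name.lower():
            p.requires_grad = True              # re-enable cross-attention layers
    for p in model.encoder.parameters():
        p.requires_grad = True                  # the shallow encoder stays trainable

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable:,} / {total:,} ({trainable / total:.1%})")
```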

Stage-wise Pretraining

We combine two major types of data modalities to train better code LLMs: unimodal code data and bimodal code-text data. There are two stages of pretraining:

  • In the first stage, we pretrain the model on massive unimodal code data. We obtained this data from open-source platforms such as GitHub, keeping only permissively licensed code, and used multilingual training data covering nine programming languages in total.
  • In the second stage, we continue pretraining the model on bimodal code-text data at the function level. Each sample is a text-code pair consisting of a code function and its corresponding docstring describing the function's semantics.

We found that this stage-wise training approach efficiently exposes our models to more diverse data and helps them learn rich contextual representations. The figure below shows examples of how our models are trained on code data.
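
To give a feel for what the second-stage bimodal data looks like, here is a simplified sketch of extracting function-level text-code pairs from a Python file. The actual CodeT5+ pipeline involves more filtering and covers multiple languages; the field names below are illustrative assumptions.

```python
import ast

def extract_text_code_pairs(source: str):
    """Collect (docstring, function) pairs from a Python file, roughly mirroring
    the function-level bimodal data described above (requires Python 3.9+ for ast.unparse)."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc:  # keep only documented functions
                pairs.append({"text": doc, "code": ast.unparse(node)})
    return pairs

example = '''
def area(radius):
    """Return the area of a circle with the given radius."""
    import math
    return math.pi * radius ** 2
'''
print(extract_text_code_pairs(example))
```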

Instruction Tuning

Recent work in NLP inspired us to further improve CodeT5+ by tuning the models on synthetic instruction-following tasks. The figure below shows a few examples of instruction data. The data is generated by prompting a pretrained LLM (i.e., text-davinci-003) to generate novel tasks, including task instructions, inputs (if any), and expected outputs.
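
For illustration, a synthetic instruction-tuning record in this self-instruct style might look like the following; the field names and the example content are hypothetical, not drawn from the actual dataset.

```python
# A hypothetical synthetic instruction-following record (illustrative only).
sample = {
    "instruction": "Write a Python function that checks whether a string is a palindrome.",
    "input": "",
    "output": (
        "def is_palindrome(s: str) -> bool:\n"
        "    s = s.lower()\n"
        "    return s == s[::-1]"
    ),
}

# One possible way to flatten the record into a training prompt/target pair.
prompt = f"Instruction: {sample['instruction']}\nInput: {sample['input']}\nOutput:"
target = sample["output"]
```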

Supported Downstream Tasks

CodeT5+ can flexibly operate in various modes to support different types of tasks:

  • As an encoder-decoder model, CodeT5+ can support code generation and code summarization. We can also adapt the model as an end-to-end retrieval-augmented generation model to improve the quality of model outputs.
  • Through our comprehensive set of learning tasks, we can easily use CodeT5+ as a decoder-only model for autoregressive generation tasks such as code completion.
  • Finally, as an encoder-only model, we can employ CodeT5+ for code understanding tasks such as defect detection or code retrieval (a minimal usage sketch follows this list).
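
As a quick usage sketch, the snippet below loads a CodeT5+ checkpoint with Hugging Face Transformers and generates code in the encoder-decoder mode. The checkpoint name is assumed to match the released models (e.g., "Salesforce/codet5p-770m"); larger checkpoints may require trust_remote_code=True and different loading options.

```python
# Minimal generation sketch with Hugging Face Transformers (checkpoint name assumed).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Salesforce/codet5p-770m"  # assumed released checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("def bubble_sort(arr):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```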

Evaluation

We evaluated CodeT5+ on a set of over 20 benchmarks of diverse code generation and code understanding tasks. We found that CodeT5+ models achieve state-of-the-art (SoTA) performance on code generation and completion, math programming, and text-to-code retrieval tasks. The following are a few highlights of CodeT5+ performance results:

Zero-shot Code Generation on HumanEval

One of the most popular benchmarks for code generation is HumanEval, which challenges LLMs to generate Python functions from function signatures and docstrings. Our instruction-tuned CodeT5+ 16B achieves a new SoTA result of 35.0% pass@1 on HumanEval among open code LLMs, even surpassing OpenAI's code-cushman-001 model. Another highlight is its competitive performance even at small model scales: CodeT5+ 770M performs on par with much larger LLMs such as InCoder 6B and PaLM 62B.
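
For reference, pass@k is typically computed with the unbiased estimator introduced with HumanEval (Chen et al., 2021): generate n samples per problem, count the c samples that pass the unit tests, and average the estimate over problems. The numbers in the example call below are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples per problem, c = passing samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples for one problem, 70 of which pass the tests.
print(round(pass_at_k(n=200, c=70, k=1), 3))  # 0.35
```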

Adaptation to Solve Mathematical Problems

We further explore the model's capability to solve grade-school math problems on two benchmarks, MathQA and GSM8K. The task is to generate Python programs that solve mathematical problems described in natural language. CodeT5+ 770M achieves new SoTA results of 87.4% pass@80 on MathQA-Python and 73.8% pass@100 on GSM8K-Python under the finetuning evaluation setup. Notably, CodeT5+ models require far less compute to adapt to this new math programming domain than the billion-parameter LLM baselines.

In the figure below, we show an example of a mathematical problem and how CodeT5+ solves it by generating Python code. On the right, we show that our model outperforms the original CodeT5 thanks to improved reasoning capability: as the number of reasoning steps required to solve a problem increases, CodeT5+ remains more robust and consistently beats the baseline model.
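
To illustrate the task format, here is a hypothetical GSM8K-style word problem (not taken from the benchmark) together with the kind of Python program a model is expected to generate; a prediction counts as correct if executing the program yields the reference answer.

```python
# Hypothetical problem: "A baker makes 24 muffins. She sells 3 boxes of 4 muffins
# each and gives away 5 muffins. How many muffins does she have left?"
# An expected model output in the math-programming setup might be:

def solution():
    muffins = 24
    sold = 3 * 4        # 3 boxes of 4 muffins
    given_away = 5
    return muffins - sold - given_away

print(solution())  # 7
```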

Retrieval-augmented Code Generation

Another interesting aspect of CodeT5+ is that it can naturally serve as an end-to-end retrieval-augmented code generation system. The encoder first retrieves relevant code snippets, which are then added to the decoder's input to improve code generation. On a code generation benchmark, CodeT5+ models outperform similar approaches such as REDCODER, which requires separate retriever and generator models.
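
A minimal sketch of this retrieve-then-generate loop is shown below. The embed and generate callables stand in for the CodeT5+ encoder (used as a retriever) and the full encoder-decoder (used as a generator); they and the prompt format are assumptions, not the exact system.

```python
import torch
import torch.nn.functional as F

def retrieve_then_generate(query, code_pool, embed, generate, top_k=1):
    """Retrieval-augmented generation sketch.
    embed(text) -> 1D tensor  : stand-in for the encoder used as a retriever.
    generate(prompt) -> str   : stand-in for the encoder-decoder generator."""
    q = embed(query)
    scores = torch.stack([F.cosine_similarity(q, embed(c), dim=0) for c in code_pool])
    top = scores.topk(min(top_k, len(code_pool))).indices.tolist()
    retrieved = [code_pool[i] for i in top]
    # Condition generation on the retrieved snippets plus the original query.
    prompt = "\n\n".join(retrieved + [query])
    return generate(prompt)
```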

In the figure below, we demonstrate an example where the retrieved code from CodeT5+ provides crucial contexts (e.g., use “urllib3” for an HTTP request) to guide the generative process for more correct predictions. By contrast, the generative-only approach gives an incorrect prediction that only captures the concepts of “download” and “compress”.

The Bottom Line

CodeT5+ is a new family of open code LLMs trained with a flexible model architecture and diverse learning objectives. Operating as an encoder-only, decoder-only, or encoder-decoder model, CodeT5+ can be easily adapted to many downstream tasks, including both code understanding and generation. In addition, it employs a compute-efficient strategy to scale up the model by leveraging off-the-shelf LLMs, and it uses instruction tuning to align with natural language instructions. Our models achieve the best performance among open-source LLMs on challenging benchmarks such as HumanEval, outperforming other LLMs such as LLaMA and StarCoder, and even OpenAI's code-cushman-001 model.

Next Steps

CodeT5+ can be extended and improved in many ways. For instance, our model-scaling approach can be applied to integrate any open-source LLM: CodeT5+ could be combined with the recent StarCoder or LLaMA to leverage the different contextual representations learned by these models.

By open-sourcing our models and sharing our research in depth, we hope CodeT5+ will spark more innovations and applications in code LLMs and encourage more open-source efforts in this line of research.  


About the Authors

Yue Wang is a Research Scientist at Salesforce Research Asia with a focus on code LLMs. His research interests include language model pretraining, code understanding and generation, and multimodality. He is the main contributor to CodeT5, a family of open code LLMs that facilitates a wide range of code intelligence tasks.

Henry Hung Le is a Research Scientist at Salesforce Research Asia, focusing on AI for software research and machine learning applications. His research interests include code generation, program synthesis, language models, dialogue systems, and multimodality.

Akhilesh Deepak Gotmare is a Research Scientist at Salesforce Research Asia, where he works on deep learning and its natural language processing applications like text generation, code generation, and code search. He leads applied projects aimed at identifying optimizations in Apex code using deep learning.

Nghi D.Q. Bui is a Research Scientist at Salesforce Research Asia. He specializes in applying AI to software engineering problems, including code understanding, program repair, anti-pattern detection, and automated testing. His work helps push the boundaries of how AI can simplify and enhance various aspects of software development.

Junnan Li  is a Research Scientist at Salesforce Research Asia. His research mainly focuses on building multimodal foundation models, including vision-language models and code models.

Steven C.H. Hoi is the Managing Director of Salesforce Research Asia and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.