The family of Salesforce CodeGen models is growing with CodeGen2.5, a small but mighty model! While the recent trend has been toward ever-larger large language models (LLMs), we show that a small model can achieve surprisingly good performance when trained well.
The key contributions toward productization of these models are data-efficient multi-epoch training, robust infill sampling, fast sampling, and local deployment, each discussed below.
In 2022, Salesforce Research released CodeGen [1,2], one of the first LLMs for program synthesis, with 16B parameters. CodeGen allows users to "translate" natural language, such as English, into programming languages, such as Python. Since the discovery of scaling laws, power laws that relate the size of a model and its dataset, the dominant trend for such models has been to scale LLMs to ever larger sizes.
Data bottleneck. While dataset size has been scaled according to these laws, this scaling is limited by the amount of available data: a dataset may contain, say, at most one million documents, and one training iteration over those documents is called "one epoch". The prevailing belief has been that a model should observe each document only once during training, which implies training for at most one epoch. Since the world's data is finite, this limitation quickly exhausts it. We question this belief and train for more than one epoch under a specialized recipe, which may allow one to train a smaller model on more data, rather than a larger model that is costly to serve and maintain in production environments. The claim that "seeing an observation only once" is optimal may still hold, but in our setting we create variants (or alterations) of these observations so that the model can be trained for multiple epochs. While we do not claim that this data augmentation is strictly required to lift the data constraint, the empirical results below indicate that a small but very powerful model can indeed be trained under such a recipe.
Infill sampling. Authoring source code is a unique process that involves continuously editing and improving the code over time, fixing bugs and enhancing its quality. A code generation model therefore needs to consider not only the code before the edit location but also the code after it, allowing it to infill within existing code. Infilling is accomplished by formatting the input sequence so that the span to be filled is predicted from the prefix and suffix contexts [2,3,4]. However, we have observed that the "fill-in-the-middle" approach can sometimes struggle to determine when to stop infilling, an ability that is crucial in practical use cases. CodeGen2.5 instead employs span corruption [2,4], which does not suffer from this tendency.
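The span-corruption formatting described above can be sketched as follows. This is a minimal illustration, assuming hypothetical sentinel token names (`<mask_1>`, `<eom>`); the exact tokens used by CodeGen2.5 may differ.

```python
# Illustrative sketch of span-corruption-style infill formatting.
# The sentinel names <mask_1> and <eom> are assumptions for illustration,
# not necessarily the exact tokens used by CodeGen2.5.

def make_infill_example(prefix: str, middle: str, suffix: str) -> tuple:
    """Build (model_input, expected_completion) for span-corruption infill.

    The masked span is replaced by a sentinel in the input; the model is
    trained to emit the span followed by an explicit end-of-mask token,
    which gives sampling a natural stopping point.
    """
    model_input = f"{prefix}<mask_1>{suffix}<mask_1>"
    expected_completion = f"{middle}<eom>"
    return model_input, expected_completion

def reconstruct(prefix: str, suffix: str, completion: str) -> str:
    """Splice a sampled completion back into the surrounding code,
    truncating everything after the end-of-mask sentinel."""
    middle = completion.split("<eom>")[0]
    return prefix + middle + suffix
```

The explicit end-of-mask sentinel is what lets the sampler know when the infill is complete, addressing the stopping problem noted above.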
Fast sampling. In products such as code-completion assistants, end users expect to see completions within, say, two seconds. Internally, the completions have to be sampled from the LLM. This process is slow because each token (or "word") has to be sampled sequentially: generating a completion of 32 tokens requires 32 calls to the model, which may incur significant latency. To address this challenge, CodeGen2.5 was specifically optimized for fast sampling with FlashAttention and compatibility with NVIDIA Triton. The model is not only small, which reduces sampling latency, but also benefits from internal optimizations for further gains.
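The linear relationship between completion length and latency can be made concrete with a back-of-envelope sketch. The per-token millisecond figures below are illustrative assumptions, not measured numbers for any particular model.

```python
# Back-of-envelope sketch: autoregressive sampling is sequential, so
# completion latency scales linearly with the number of generated tokens.
# The 40 ms / 20 ms per-token figures are illustrative assumptions only.

def completion_latency_ms(n_tokens: int, per_token_ms: float) -> float:
    """Each of the n_tokens requires one forward pass through the model,
    so total latency is roughly n_tokens * per-token latency."""
    return n_tokens * per_token_ms

baseline = completion_latency_ms(32, 40.0)   # hypothetical larger model
optimized = completion_latency_ms(32, 20.0)  # hypothetical smaller, optimized model
```

Halving the per-token cost, whether via a smaller model or a faster attention kernel, halves the end-to-end completion latency, which is why both levers matter for interactive products.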
Local deployment. In the future, these assistants will run on local machines, such as MacBooks with M2 chips. Local deployment not only allows for high personalization of the models for each user but also ensures data privacy. Under recent frameworks such as llama.cpp, which is optimized for local deployment of small LLMs, CodeGen2.5 achieves throughput similar to LLaMA-7B on a MacBook equipped with an M1 chip. This makes the deployment of CodeGen-based assistants on local machines feasible today.
Table 1: HumanEval pass-rates with n=200. HumanEval is a benchmark that measures whether programs generated by a model are functionally correct. Note that CodeGen2.5, with only 7B parameters, outperforms earlier models more than twice its size. Multi-lingual models are trained on a variety of programming languages; mono-lingual models are fine-tuned only on Python.
Table 2: HumanEval single-line infill pass-rates with n=40. The benchmark captures how well a model can "fill in the middle" of a piece of code for which "the middle" has been masked out. Note that, for productization, CodeGen2.5 introduces a specialized sentinel token for truncation and features very high infill performance.
Table 3: HumanEval pass-rates with n=200 for instruction-tuned models. Our instruction-tuned models are fine-tuned on specific instruction datasets to improve their capability to generate code from English instructions.
Model quality. As outlined previously, LLMs are hungry for data. Training for one epoch, that is, a single iteration over the data, may be efficient, but this luxury is infeasible in data-constrained settings such as program synthesis, where only a finite amount of code exists in the world. Instead, we hypothesize that training with the following ingredients may lead to a competitive model of smaller size: (1) multiple epochs, (2) data augmentation, and (3) high learning rates. For multiple epochs, we cycle through the data iterator multiple times in identical order during training.
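The multi-epoch cycling described above, repeating the data in identical order, amounts to a very simple iterator wrapper. A minimal sketch:

```python
# Minimal sketch of multi-epoch cycling: the dataset is replayed
# n_epochs times in identical order, as described in the recipe above.

def multi_epoch_stream(dataset, n_epochs: int):
    """Yield every example of `dataset` n_epochs times, preserving order."""
    for _ in range(n_epochs):
        yield from dataset
```

In a real training pipeline the examples would additionally pass through the infill/span-corruption formatting, so the token sequences the model sees differ between epochs even though the underlying order is identical.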
We trained the model on StarCoderData, a programming-language dataset developed by BigCode. As Figure 1 shows, an epoch constitutes about 300B tokens, while the model is pre-trained for 1.4T tokens, reaching more than 4 epochs. Further, we incorporate our specific infill format into the objective function, which may serve as a form of data augmentation in multi-epoch training: similar to span corruption, some tokens of an observed sequence are moved to different positions, so that the same sequence is never observed twice. While still a hypothesis, this form of corruption or augmentation of the data may mitigate the data constraint. Lastly, the learning-rate decay function is scheduled for a significantly larger token budget, which results in the model being trained under high learning rates for a longer period. While one may argue that a model performs best when trained until convergence, we observe very competitive or even matching performance early in training, so longer training under a much larger token budget seems to be a reasonable recipe. Specifically, the pass-rate in Figure 1 increases consistently under both high learning rates and the multi-epoch training regime.
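The epoch count above follows directly from the two token budgets stated in the text:

```python
# Epoch arithmetic from the figures stated above: ~300B tokens per epoch
# over StarCoderData, and a 1.4T-token pre-training budget.

TOKENS_PER_EPOCH = 300e9    # approximate tokens per pass over the data
TRAINING_BUDGET = 1.4e12    # total pre-training tokens

epochs = TRAINING_BUDGET / TOKENS_PER_EPOCH  # more than 4 full passes
```

At roughly 4.7 passes over the data, each sequence is observed multiple times, which is exactly the regime the infill-based augmentation is meant to support.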
The final performance on the HumanEval benchmark, which captures how well a model can generate a functionally correct program, is summarized in Table 1. While CodeGen2.5 contains only 7B parameters, it significantly outperforms the 16B-parameter CodeGen1.0 and CodeGen2.0 models, more than twice its size, and is on par with the recent StarCoder 15.5B model. CodeGen2.5 is small, but mighty!
Figure 1: HumanEval pass@1 with n=40 over billions of training tokens. The benchmark captures how well a model can generate functionally correct programs or snippets of code. One epoch constitutes about 300B tokens, so the model was trained for more than 4 epochs. Note that performance increases consistently up to 1.2T tokens despite training under high learning rates and multiple epochs.
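The pass-rates reported in these tables and figures follow the standard unbiased pass@k estimator commonly used for HumanEval-style evaluation: for each problem, n samples are drawn, c of them pass the unit tests, and pass@k is the probability that at least one of k samples is correct.

```python
from math import comb

# Standard unbiased pass@k estimator used for HumanEval-style evaluation:
# n samples per problem, c of which pass the tests.

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn (without
    replacement) from the n generated samples is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this per-problem estimate over the benchmark gives the pass-rates in Tables 1–3; n=200 (or n=40) is the number of samples drawn per problem.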
Infill Sampling. CodeGen2.5 adopts the same training objective as CodeGen2, which enables infill sampling, i.e., sampling text conditioned on both the preceding and the following context. We also observe an improvement in pass-rates on HumanEval-SingleLineInfill, a single-line Python code-completion task. Table 2 summarizes the comparison with other models with infill capability.
Instruction Tuning. We further take CodeGen2.5-7B-mono and fine-tune it on public instruction datasets to improve its capability to generate code from English instructions. The results are summarized in Table 3. We observe continued improvement over the mono-lingual model.
Sampling Latency. CodeGen2.5 adopts the LLaMA forward pass, which allows us to leverage the NVIDIA Triton inference server, a fast inference framework for NVIDIA GPUs. With FlashAttention enabled, we achieve inference twice as fast as CodeGen2.0-16B while achieving better performance, as shown in Table 4.
Table 4: Sampling latency in milliseconds under various inference frameworks supporting FlashAttention under NVIDIA Triton. The context length is 2,000 tokens and the batch size is set to 2. The varied numbers of generated tokens represent realistic settings for code-assistant products. CodeGen2.5 features lower latency, which allows for an improved user experience.
The family of CodeGen models welcomes a new member, CodeGen2.5, small but mighty. We show that multi-epoch training can mitigate data constraints and lead to small but powerful models. Besides its relatively small size, CodeGen2.5 features robust infill sampling and fast sampling optimized for serving with FlashAttention, both of which enable the productization of these models for coding assistants. In the future, we will push the boundaries of such models further.
Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.
[1] CodeGen1 (https://arxiv.org/abs/2203.13474)
[2] CodeGen2 (https://arxiv.org/abs/2305.02309)
[3] FIM OpenAI (https://arxiv.org/abs/2207.14255)
[4] InCoder (https://arxiv.org/abs/2204.05999)
[5] Flash Attention (https://arxiv.org/abs/2205.14135)
[6] NVIDIA Triton (https://developer.nvidia.com/triton-inference-server)
[7] abacaj/code-eval (https://github.com/abacaj/code-eval)
[8] StarCoder (https://arxiv.org/abs/2305.06161)
[9] MPT-30B (https://www.mosaicml.com/blog/mpt-30b)
[10] StarCoderData (https://huggingface.co/datasets/bigcode/starcoderdata)