CodeChain: Towards Modular Code Generation through Chain of Self-revisions and Representative Sub-modules

11 min read
TL;DR: With CodeChain, a pretrained large language model (LLM) can solve challenging coding problems by integrating modularity in generation samples and self-improve by employing a chain of self-revisions on representative sub-modules. CodeChain can achieve state-of-the-art results with both OpenAI GPT models and open-source LLMs on challenging coding benchmarks like APPS and CodeContests.

Limitations of code generation approaches with code LLMs

Large Language Models (LLMs) have already become quite proficient at solving simpler programming tasks like those in HumanEval or MBPP benchmarks. However, solving more complex and competitive programming tasks is still quite challenging for these models - possibly due to their tendency to generate solutions as monolithic code blocks. Another limit of this generation approach is that the models would simply generate a large number (several thousand) of solutions independently, with the hope that one of the solutions would pass all the private test cases.

On the other hand, in today’s agile development environment, experienced developers often embrace the concept of modularity in programming. Given a problem, they would instinctively write solutions that are highly modularised - each program is divided into high-level logical sub-tasks and sub-modules. The developers would then keep testing and analyzing their implementations, altering modular components from their previously developed solutions to efficiently improve their final solutions (see the figure below).

Top: An example of a code generation task from the CodeContests benchmark where the problem description and public test cases are provided as inputs to the model. Down: We illustrate a typical problem-solving process in which a developer attempts to solve the problem iteratively, revising and reusing parts of their previously developed codes until satisfied.

CodeChain - a new framework for modular and self-improved code generation

Inspired by the above problem-solving process, we propose CodeChain, a novel inference framework to improve code generation in LLMs through a chain of self-revisions with representative sub-modules. See the figure below for an illustration of how CodeChain works.

An overview of CodeChain: a pretrained LLM is first instructed with chain-of-thought prompting to generate a set of modularised solutions. Generated sub-modules are then extracted from potentially correct solutions and grouped into different semantic clusters. The cluster centroids are selected as representative sub-modules to condition the next self-revision round. The model is instructed to reuse or adapt these modules into its revised solutions.

CodeChain consists of the following steps:

  • To incorporate modularity in code generation, we first adapt the technique of chain-of-thought (CoT) prompting into code generation tasks (see figure below). In this step, we instruct LLMs to decompose their solutions into modular segments. Each modular segment represents an abstract function that is intended for a high-level logical sub-task.
  • The generated modularized samples are iteratively improved by a chain of self-revisions. In each self-revision, we perform the following:
    • We first extract the sub-modules found in all generated programs. Each extracted sub-module contains high-level information (including intended use and input/output docstrings) and the corresponding code implementation.
    • We then transform these modules into an embedding space and group them into semantic clusters with k-means clustering.
    • Within each cluster, we sample the centroid sub-modules - defined as ones that are the closest to the true centroid point in a cluster.
    • We treat these centroid sub-modules as representative and reusable code parts. This process is inspired by how a developer would often reuse the more modularized and generalized parts of their code.
    • We then augment the original chain-of-thought prompt with these selected sub-modules. With this augmented prompt, we instruct the LLM to reuse/adapt the selected sub-modules and regenerate a new set of solutions.
  • We keep repeating this generating-revising process, creating a chain of self-revisions, each conditioned by reusable modular code parts.

With CodeChain, LLMs can receive the collective insights from modular components of all past generation samples to improve their future generations, imitating the problem-solving process of an experienced developer.

An example of modularised code generation output: first, the model is required to outline sub-modules needed, each of which consists of a function header and docstring describing the intended use. Subsequently, the model implements each module fully in code and integrates them as parts of the complete final solution.

CodeChain boosts the performance of LLMs, achieving SoTA results on APPS and CodeContests

We find that by naturally encouraging the LLM to reuse the previously developed and verified sub-modules, CodeChain can significantly boost both modularity as well as correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is shown to be effective on both OpenAI LLMs as well as open-sourced LLMs like WizardCoder.

Model Size Introductory Interview Competition All
Codex 12B 4.14 0.14 0.02 0.92
CodeT5 770M 6.60 1.03 0.30 2.00
CodeRL+CodeT5 770M 7.08 1.86 0.75 2.69
text-davinci-002 - - - - 7.48
Self-edit+text-davinci-002 - - - - 7.94
code-davinci-002 - 29.30 6.40 2.50 10.20
WizardCoder 15B 26.04 4.21 0.81 7.90
CodeChain+WizardCoder 15B 26.29 7.49 3.75 10.50
GPT3.5 - 48.00 19.42 5.42 22.33
CodeChain+GPT3.5 - 54.50 28.11 12.38 30.24
APPS results by pass@1 (%)
Model Size Val pass@1 Val pass@5 Test pass@1 Test pass@5
code-davinci-002 - - - 1.00 -
WizardCoder 15B 1.11 3.18 1.98 3.27
+ CodeChain 15B 2.35 3.29 2.48 3.30
GPT3.5 - 6.81 16.23 5.82 11.16
+ CodeChain - 12.86 16.91 10.27 14.11
CodeContests results by pass@1 (%)

When compared with related approaches such as Self-repair, we observed significant relative performance gains when using CodeChain. Specifically, we evaluated our approach over a test subset of 20 samples from APPS, using both GPT3.5 and GPT4 as the base models. We observed that CodeChain can improve the performance with both models, with more significant gains using GPT4. CodeChain+GPT4 can achieve a SoTA result of 61.50% pass@1 on average, even outperforming Self-repair+GPT4 with human feedback.

Model Feedback source Introductory Interview Competition All
Self-repair+GPT4 GPT4 42.64 19.33 3.67 33.30
Self-repair+GPT4 Human 62.21 45.67 14.67 52.60
GPT3.5 - 30.00 18.33 0.00 23.75
CodeChain+GPT3.5 Sub-modules 31.67 27.86 0.00 26.35
GPT4 - 42.86 18.33 13.33 34.75
CodeChain+GPT4 Sub-modules 71.07 55.00 23.33 61.50
Comparison with Self-repair: we reported the results on the same subset of 20 samples on APPS test split using GPT3.5 and GPT4 as the base models.

In the figure below, we found significant improvements in all levels of problem difficulties in APPS, with optimal performance gain obtained in revision round 4. At a closer look, we observed that on different levels of problem difficulties, CodeChain has different rates of performance improvement: more challenging problems (i.e. competition and interview level) tend to benefit more from CodeChain than basic problems (i.e. introductory level). Finally, compared with related approaches (self-revision using feedback from test outcomes with natural language explanations like Self-debug or self-reflection like Reflexion), CodeChain can achieve better performance, using modularised code parts as a form of feedback for self-improving outputs.

APPS validation results by pass@1 (%): we tested CodeChain+GPT3.5 for 5 self-revision rounds and reported pass@1 in each problem difficulty level. Using GPT3.5 as the base model, we compared with related approaches, including Self-debug (with unit test (UT) feedback or explanation (expl)) and Reflexion.

From the figure below, similar observations can be seen on open-sourced models WizardCoder, with clearer performance trends on models of larger sizes, including 7B, 15B, and 34B parameters. This is consistent with recent findings on the scaling law of LLMs whereby some features such as instruction-following only emerge when model size is sufficiently large.

To understand the modularity and reusability of CodeChain generation, we conducted experiments to evaluate these qualities on randomly sampled generated programs. We observed that when using CodeChain, GPT3.5 is more likely to generate programs with high levels of modularity and reusability, with the majority of outputs rated 3 to 5 on the Likert scale. This is significantly higher than the conventional direct generation approach, with about 80% of the time generating non-modular or non-reusable codes (i.e. score 0).

Example Generation Outputs

Problem description

Your friend Edvin loves languages. A personal favourite of his is speaking the Robber Language. For those of you not familiar with it, it is a language transformation where every consonant is doubled, with an ’o’ put in-between. The rest is kept the same. For example, the word "car" would be transformed into "cocaror".

Your friend Edvin was going to tell you his password to his favourite website, so that you could check out how awesome it is. Edvin is sure that nobody but him and you speak the Robber Language, so he encrypted it using this simple technique, and wrote it down on a note. After trying the password and realising it does not work, you found out that Edvin was drunk during the transformation. Now you want to find out in how many ways the password might have looked originally.

You are given an encrypted password which Edvin has attempted to translate into the Robber Language. Your task is to determine how many passwords might produce the encrypted password. You may assume that the only thing Edvin did wrong, was when scanning through the word with his eyes, he sometimes missed to transform some consonants using the rule above. Edvin is not deterministic in his failures, i.e. if he missed to transform a consonant once, it is not necessary that he would miss it if it occured again later in the string.

A vowel is considered to be one of the letters "a", "e", "i", "o" and "u". A consonant is any other letter.

Input consists of one encrypted password, the resulting word from Edvins translation. The word can be as long as $1000000$ characters. It will solely consist of lower case letters a-z.

Output should consist of one number, the number of ways the password might have looked before Edvin translated it. Since this number may be huge, print the remainder modulo $1000009$.

Sample Input 1:
Sample Output 1:

Sample Input 2:
Sample Output 2:

Example generation output (direct generation)

s = input()
n = len(s)
vow = set(["a", "e", "i", "o", "u"])
dp = [0]*(n+1)
mod = 10**6+9

dp[0] = 1
dp[1] = 1

for i in range(2, n + 1):
    if s[i-1] in vow or s[i-2] in vow or i - 2 < 0 or s[i-1] != s[i-2] or s[i-1] == 'o':
        dp[i] = dp[i-1]
        dp[i] = dp[i-1] + dp[i-2]
        dp[i] %= mod


Example generation output (CodeChain round 0)


def is_vowel(ch):
    Description: This function returns True if the specified character is a vowel, False otherwise.
    ch (char): A character.
    (bool): True if the character is a vowel, False otherwise.

def count_possible_original_words(word):
    Description: This function counts the possible original words that could have transformed into the given Robber Language word.
    word (str): A Robber Language word.
    count (int): A count of possible original words.


MOD = 1000009

def is_vowel(ch):
    vowels = ['a', 'e', 'i', 'o', 'u']
    return ch in vowels

def count_possible_original_words(word):
    N = len(word)
    dp = [0]*(N+1)
    dp[0] = 1
    for i in range(1, N+1):
        dp[i] = dp[i-1] 
        if not is_vowel(word[i-1]) and i >= 2 and word[i-1] == word[i-2] and word[i-3] != 'o':
            dp[i] += dp[i-2]
        dp[i] %= MOD
    return dp[N]

word = input().strip()

Example generation output (CodeChain round 3)


Since we're given two very relevant functions "split_word" and "count_possible_passwords", we can keep these functions in our solution. The function "initialize_vowels_and_modulo" can be useful to initialize global variables.

def split_word(word):
    Description: This function splits the word into a list of elements that can be translated back into Robber language.
    word (str): The word that needs to be split.
    split_word_list (list): The list of strings, where each string represents a possible original letter
def initialize_vowels_and_modulo():
    Description: This function initializes the vowels and modulo variables.
    vowels (set): A set of vowels
    modulo (int): The modulo constant
def count_possible_passwords(word_list):
    Description: This function counts the number of possible password that might have produced the original word.
    word_list (list): A list with elements from the word
    (int): The number of possible passwords modulo 1000009


def split_word(word):
    i = 0
    split_word_list = []
    while i < len(word):
        if i + 2 < len(word) and word[i + 1] == 'o' and word[i + 2] != 'o':
            i += 3
            i += 1
    return split_word_list

def count_possible_passwords(word_list, vowels, modulo):
    dp = [0] * len(word_list)
    dp[0] = 1 if len(word_list[0]) == 1 or word_list[0][0] in vowels else 2
    for i in range(1, len(word_list)):
        if len(word_list[i]) == 1 or word_list[i][0] in vowels:
            dp[i] = dp[i - 1]
            dp[i] = (dp[i - 1] * 2) % modulo
    return dp[-1]

def initialize_vowels_and_modulo():
    return set(['a', 'e', 'i', 'o', 'u']), 1000009

def solve_task(word):
    split_word_list = split_word(word)
    vowels, modulo = initialize_vowels_and_modulo()
    return count_possible_passwords(split_word_list, vowels, modulo)

word = input()

The Bottom Line

In CodeChain, we introduce chain-of-thought prompting to generate more modularised programs, which creates natural boundaries for the models to sample parts of the solutions for reuse and revision. In each revision step, we iterate between selecting representative sub-modules and augmenting chain-of-thought prompting with these selected sub-modules. Our experiments indicate the significant performance improvement of CodeChain when using OpenAI GPT or open-sourced WizardCoder as the base models, achieving new SoTA results on APPS and CodeContests benchmarks. Note that CodeChain can be complemented with other self-revision approaches such as Self-debug and Reflexion by combining different feedback types (e.g. natural explanation and reflection) and selecting more diverse and representative sub-modules.


To cite this work, please use the following BibTeX:

      title={CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules}, 
      author={Hung Le and Hailin Chen and Amrita Saha and Akash Gokul and Doyen Sahoo and Shafiq Joty},

Explore More