INDICT: Towards Better Code Generation for Both Security and Helpfulness

TL;DR: We introduce INDICT, a novel framework that empowers Large Language Models (LLMs) with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic, each equipped with relevant knowledge from external tools.

LLMs are prone to generating insecure or harmful code

Extending from the natural language domain, Large Language Models (LLMs) have shown great potential in code generation tasks. However, when instructed with tasks containing malicious intentions or ambiguous requirements, LLMs are prone to generating code that could facilitate harmful attacks or that contains subtle security problems. Note that code itself is often not inherently malicious. For example, as noted in related work, a program implementing an encryption method can be very useful for building a secure personal file system, but it can also be exploited in a ransomware attack. It is therefore important to develop an efficient method for LLMs to strike the intricate balance between helpfulness and safety in the code domain.

Recent research in the NLP domain addresses the safety issues of LLMs via finetuning on preference data, potentially combined with RL-based reward optimization. However, these methods are expensive in the code domain, as they require programming experts with cybersecurity experience to create large-scale, high-quality datasets. In this blog, we introduce INDICT, a new approach that efficiently steers LLMs toward generating more secure and helpful code. See below for an example.


INDICT (Internal Dialogues of Critiques) enables two different critics to interact with each other autonomously and collaboratively, improving code generation in terms of both security and helpfulness. In this example, INDICT iteratively resolves the security weakness CWE-78 (improper neutralization of special elements used in an OS command, i.e., OS command injection) and improves the code functionality with relevant supporting modules.
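For intuition, here is a hypothetical Python illustration (not taken from the video above) of the kind of CWE-78 weakness involved and the safer pattern a fix typically converges on:

```python
import subprocess

def compress_file_insecure(filename: str) -> None:
    # CWE-78: the filename is interpolated into a shell string, so an input
    # like "notes.txt; rm -rf ~" would inject a second command.
    subprocess.run(f"gzip {filename}", shell=True, check=True)

def compress_file_safer(filename: str) -> None:
    # Safer: pass arguments as a list so no shell ever parses the input.
    subprocess.run(["gzip", filename], check=True)
```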


INDICT: Internal Dialogues of Critiques for Code Generation

INDICT is essentially a multi-agent framework consisting of an actor LLM for code generation and two critic LLMs that provide feedback to the actor. The goal of the framework is to improve LLMs on code generation tasks so that their outputs are both safer and more helpful. INDICT has three important properties, as follows:

Helpfulness and Safety Critics

💡
First, we consider both a helpfulness-driven critic and a safety-driven critic and position them in an autonomous agent system. Instead of activating these critics independently, we let the agents interact with each other in a dialogue setup to collaboratively and simultaneously optimize the generated responses. Our experiments show that this interaction scheme produces more useful and sophisticated critic feedback for the actor, with a balanced focus on both safety and helpfulness. This feedback, in turn, leads to more secure and helpful generation outputs.
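Below is a minimal sketch of how one round of such a critic dialogue could be orchestrated. The `chat` helper and the prompts are illustrative assumptions, not the exact prompts used in INDICT:

```python
from typing import Callable

# `chat` is an assumed helper: it sends a prompt to a chat LLM and returns the reply text.
Chat = Callable[[str], str]

def critic_dialogue(chat: Chat, task: str, code: str, rounds: int = 2) -> str:
    safety_note, helpful_note = "", ""
    for _ in range(rounds):
        # Safety critic reviews the code, conditioned on the helpfulness critic's last message.
        safety_note = chat(
            f"You are a security reviewer.\nTask: {task}\nCode:\n{code}\n"
            f"Helpfulness critic said: {helpful_note or 'N/A'}\n"
            "Point out vulnerabilities or potentially malicious behaviour."
        )
        # Helpfulness critic responds, conditioned on the safety critique.
        helpful_note = chat(
            f"You are a functionality reviewer.\nTask: {task}\nCode:\n{code}\n"
            f"Safety critic said: {safety_note}\n"
            "Point out correctness or usefulness issues, keeping the safety concerns in mind."
        )
    # The merged critiques are handed back to the actor LLM for revision.
    return f"Safety feedback:\n{safety_note}\n\nHelpfulness feedback:\n{helpful_note}"
```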

Critics Grounded by External Tools

💡
The quality of LLM feedback depends on how well the models can perceive and resurface relevant knowledge from their pretraining data; without grounding, they may still hallucinate and generate factually incorrect responses. We address this issue in the critic agents by equipping them with appropriate external tools so that they can produce more reliable feedback. Specifically, we let the critics generate novel queries, each consisting of a code snippet and a text query, to call relevant tools such as web search and code interpreters. The tool outputs are then used by the critics to generate more knowledge-grounded critiques.
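The following sketch shows one way such tool grounding could look. Here `chat`, `web_search`, and `run_snippet` are assumed placeholder helpers rather than a real API, and the JSON query format is an illustrative assumption:

```python
import json

def grounded_critique(chat, web_search, run_snippet, task: str, code: str) -> str:
    # Ask the critic what evidence it needs, expressed as a small JSON query
    # containing a code snippet and a text question.
    query = json.loads(chat(
        'Return JSON {"tool": "search" or "interpreter", "snippet": str, "text": str} '
        f"describing the evidence you need to review this code.\nTask: {task}\nCode:\n{code}"
    ))
    if query["tool"] == "search":
        evidence = web_search(query["text"])      # e.g., look up a CWE description
    else:
        evidence = run_snippet(query["snippet"])  # e.g., observe runtime behaviour
    # Generate the critique grounded in the retrieved evidence.
    return chat(
        f"Task: {task}\nCode:\n{code}\nEvidence from {query['tool']}:\n{evidence}\n"
        "Write a critique grounded in this evidence."
    )
```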

Preemptive and Post-hoc Feedback

💡
Unlike the text domain, code generation outputs can additionally be observed and interpreted in coding environments through a code executor such as a Python interpreter. The executor provides post-hoc observations, which the critics can use to give more informative feedback. In INDICT, we provide two types of feedback: (1) preemptive critic feedback, obtained during the initial code generation stage; and (2) post-hoc critic feedback, activated after the code is observed in an execution environment. This strategy enables a more holistic critic framework that reduces the potential damage of insecure or malicious output code on the coding environment.
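A rough sketch of the resulting two-stage loop, where `generate`, `revise`, `critic_dialogue`, and `execute_sandboxed` are assumed placeholders for the actor, the revision step, the critic dialogue, and a sandboxed executor:

```python
def indict_style_loop(generate, revise, critic_dialogue, execute_sandboxed, task: str) -> str:
    # Stage 1: preemptive feedback on the initial draft, before any execution.
    code = generate(task)
    preemptive = critic_dialogue(task, code, observation=None)
    code = revise(task, code, preemptive)

    # Stage 2: post-hoc feedback after observing the code in a sandboxed environment.
    observation = execute_sandboxed(code)  # e.g., stdout, stderr, exit status
    post_hoc = critic_dialogue(task, code, observation=observation)
    code = revise(task, code, post_hoc)
    return code
```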

INDICT (Internal Dialogues of Critiques) is a framework for generating code that is both safe and helpful. It introduces dialogues between knowledge-grounded safety-driven and helpfulness-driven AI critics, enabling the pair of critics to collaboratively and autonomously support the LLM code generator. We apply the critic system to both preemptive and post-hoc critic feedback, providing a proactive, additional layer of protection on security-sensitive tasks.

Use INDICT to boost the safety and helpfulness of LLM outputs on coding tasks

We conducted a comprehensive evaluation of INDICT on 8 diverse tasks across 8 programming languages from 5 benchmarks. While evaluating qualities like helpfulness and safety is still an open question, we adopt evaluation strategies from prior related work wherever possible. Specifically, we used a mixture of static-analysis tools (e.g., [1]) to scan for security and vulnerability issues in the generated code and AI-based evaluation (e.g., [2]) to determine the helpfulness and maliciousness of the model outputs. See below for more details of the experimental results.
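As a rough illustration of how these two signals can be aggregated into the reported rates, here is a small sketch in which the scanner and the judge are generic callables standing in for the actual tooling, not the exact evaluators used in the paper:

```python
from typing import Callable, List

def evaluate(outputs: List[str],
             scan: Callable[[str], list],           # wraps a static security scanner; returns findings
             judge_helpful: Callable[[str], bool],  # wraps an AI judge; True if the output is helpful
             ) -> dict:
    n = len(outputs)
    safe = sum(1 for code in outputs if not scan(code))  # no findings => counted as safe
    helpful = sum(1 for code in outputs if judge_helpful(code))
    return {"safety_rate": safe / n, "helpfulness_rate": helpful / n}
```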

A common observation is that more powerful models (larger models, or those with additional data tuning) are generally more helpful, but they are also more prone to generating insecure code. We applied INDICT to LLMs ranging from 7B to 70B parameters and observed consistent improvements in both the safety and the helpfulness of the generation outputs.

Insecure coding practice tasks

We first evaluated our approach on insecure code generation tasks, in which existing LLMs were found to generate outputs with significant security concerns (the CyberSecEval-1 and CVS benchmarks). As observed here, more powerful models such as GPT and code-specialized LLMs tend to be more helpful and generate working solutions for highly complex input problems. However, these models are also more likely to generate insecure code, possibly due to imperfect training data containing hidden vulnerabilities and security issues.

When applying INDICT to these LLMs, we observed consistent performance improvements not just in safety but also in helpfulness, outperforming strong LLMs such as the Llama and GPT models. Using CommandR or Llama as the base model, INDICT boosts performance significantly: for example, >80% of output code is considered safe and up to about 70% of output code is considered more helpful than the prior state of the art or the ground-truth code. The results also show consistent gains from INDICT on code outputs in different programming languages, including C, Java, JavaScript, PHP, Python, and Rust.

Test results of CyberSecEval-1 - Insecure Coding Practice (Autocomplete). Notations: CR: CommandR, JV: Java, JS: JavaScript, Py: Python. Results of baseline models on Llama and GPT models are as reported in the benchmark paper.

Test results of CyberSecEval-1 - Insecure Coding Practice (Instruction). Notations: CR: CommandR, JV: Java, JS: JavaScript, Py: Python. Results of baseline models on Llama and GPT models are as reported in the benchmark paper.

Test results of the CVS benchmark. Notations: CR: CommandR, L3: Llama3, JV: Java, JS: JavaScript

Security attack tasks

We also evaluated our approach on malicious coding tasks, in which the instruction prompts contain obscure yet dangerous intentions to perform security attacks. We conducted experiments on three types of attacks from CyberSecEval-1 and -2: cyber attacks, interpreter abuse, and prompt injection. These tasks contain test samples with attack tactics classified by the industry-standard MITRE ATT&CK framework, as well as attacks commonly seen in the code domain, such as abusing code interpreters to carry out unauthorized actions.

On the baseline models, we observe that larger models are not necessarily better safeguarded against security attacks. For instance, the Llama3-70b model can be more vulnerable to some types of attacks than Llama3-8b. This raises the need for efficient methods to protect current LLMs from increasingly complex attacks. In our experiments, using CommandR or Llama-based models with INDICT, we observed significant improvements in safety on all three types of security attacks. Notably, despite being a weaker base model, CommandR enhanced with INDICT achieves significant boosts and becomes considerably more secure against harmful task instructions. Our results also demonstrate the benefits of INDICT across model sizes, from 8B to 70B parameters.

We evaluated INDICT against three major types of security attacks from the CyberSecEval-1 and -2 benchmarks. The metric is the % of outputs that are considered benign given a malicious input instruction. Notations: CL: CodeLlama, L2: Llama2, L3: Llama3, CR: CommandR. Results of baseline models on Llama, CodeLlama, and GPT models are as reported in the benchmarks (CyberSecEval-1 and -2).

Open-ended generation tasks

Our approach also generalizes well to open-ended tasks, demonstrating the broader potential of a cooperative autonomous critic system for helpful yet responsible AI models. Specifically, we evaluated INDICT on the HarmBench benchmark, which covers diverse domains such as social engineering, harassment, and bio-weapons. Each test sample is augmented with different red-teaming optimization methods, including ZS, PAP, JB, TAP, and PAIR. These red-teaming methods are designed to optimize malicious instruction prompts, ultimately tricking LLMs into complying and assisting with harmful downstream tasks.

We report the safety measure as the percentage of outputs classified as benign by the AI evaluator provided with HarmBench. Consistent with our observations in prior experiments, although CommandR is a weaker model in terms of safety, CommandR+INDICT still improves significantly across all red-teaming optimization methods. And although the Llama3 models are already finetuned with safety alignment, they still benefit from INDICT, generating more benign outputs (up to 82% of outputs are safe on average).

| Model | Direct | ZS | PAP | JB | TAP | PAIR | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CommandR | 33.1 | 23.4 | 25.0 | 23.1 | 18.4 | 18.4 | 23.6 |
| CommandR+INDICT | 65.3 | 52.5 | 63.1 | 37.5 | 46.9 | 43.4 | 51.5 |
| Llama3-8b-instruct | 77.5 | 63.4 | 67.8 | 83.1 | 60.6 | 58.1 | 68.4 |
| Llama3-8b-instruct+INDICT | 90.6 | 79.4 | 81.9 | 89.1 | 75.9 | 77.8 | 82.4 |
| Llama3-70b-instruct | 68.4 | 60.0 | 68.1 | 90.9 | 61.9 | 57.5 | 67.8 |
| Llama3-70b-instruct+INDICT | 85.9 | 75.3 | 74.7 | 90.0 | 75.9 | 75.3 | 79.5 |

For more experimental results and analysis, please refer to our technical paper.


The Bottom Line

INDICT essentially facilitates an autonomous agent system between two critic models, each of which focuses on either the safety or the helpfulness of outputs from the "actor" code generation LLM. Given access to external tools, the two critics interact with each other autonomously to generate grounded critiques, collaboratively improving the model outputs. Our results demonstrate the benefits of INDICT on code-related tasks and beyond, highlighting the promise of autonomous, tool-enhanced multi-critic systems.

Citation

@misc{le2024indictcodegenerationinternal,
title={INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness},
author={Hung Le and Yingbo Zhou and Caiming Xiong and Silvio Savarese and Doyen Sahoo},
year={2024},
eprint={2407.02518},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2407.02518},
}
