“Using Language Models to Identify Vulnerable Code: An Experiment with Zero-Shot and Few-Shot Learning”
A recent experiment explores whether large language models can identify vulnerable code. The experiment calls OpenAI's chat completions API with the model "gpt-3.5-turbo", and the model can be switched to "gpt-4" or "gpt-4-32k" as requirements dictate.
The model is queried by sending a request with parameters like these:
{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": inputQuestion}],
"max_tokens": 4000,
"temperature": 0.7
}
Here, ‘max_tokens’ caps the length of the response and can be adjusted as needed, while ‘temperature’ controls how random the output is. For instance, a lower temperature such as 0.2 can be used for more deterministic output.
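From Python, such a request might be issued roughly as follows. This is a minimal sketch using the current openai client; the experiment's own code, and the placeholder prompt in inputQuestion, are assumptions rather than reproductions of the original.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

inputQuestion = "..."  # placeholder for the vulnerability-detection prompt

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": inputQuestion}],
    max_tokens=4000,
    temperature=0.7,  # drop to e.g. 0.2 for more deterministic output
)

print(response.choices[0].message.content)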
Starting with a zero-shot baseline run, the experiment adds complexity to the prompt through strategies such as few-shot in-context learning. The model is prompted to identify vulnerable code without being told which Common Weakness Enumeration (CWE) it might be looking for.
In the zero-shot prompt, the model is asked to make a prediction without any examples or additional information. The prompt, which is inspired by a prior paper, includes a role, a code delimiter, and a request to respond only in JSON format. The model is also instructed to ‘think step-by-step’, and the code snippet under test is inserted at the {code} placeholder.
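A prompt matching that description might look roughly like this; the wording is an illustrative sketch, not the exact template from the experiment or the paper it draws on.

prompt_template = """You are a security researcher reviewing Python code for vulnerabilities.
Think step-by-step, then answer.
Respond only in JSON with the keys "vulnerable" (true or false) and "cwe" (the CWE identifier, or null).
The code to review is enclosed between the markers below.
---BEGIN CODE---
{code}
---END CODE---"""

inputQuestion = prompt_template.format(code=snippet_under_test)  # snippet_under_test is a placeholder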
The zero-shot baseline yielded an accuracy of 0.67, precision of 0.60, recall of 0.86, and an F1 score of 0.71.
The next experiment incorporates in-context, or ‘few-shot’, learning: a few worked code-and-answer examples are included before the model is asked to perform the same task on the unseen code, as sketched below.
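Expressed as chat messages, the few-shot setup could look roughly like this. It is a sketch only; the actual examples, labels, and number of shots used in the experiment are not reproduced here, and the example_* variables are placeholders.

messages = [
    # worked examples: a snippet followed by the expected JSON answer
    {"role": "user", "content": prompt_template.format(code=example_vulnerable_snippet)},
    {"role": "assistant", "content": '{"vulnerable": true, "cwe": "CWE-89"}'},
    {"role": "user", "content": prompt_template.format(code=example_safe_snippet)},
    {"role": "assistant", "content": '{"vulnerable": false, "cwe": null}'},
    # finally, the unseen code to classify
    {"role": "user", "content": prompt_template.format(code=snippet_under_test)},
]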
The results further improved, with an accuracy of 0.76, precision of 0.71, recall of 0.81, and an F1 score of 0.76.
The next experiment uses KNN-based few-shot example selection, a technique described in a Microsoft blog post; the prompt template is unchanged from the second experiment. This run yielded an accuracy of 0.73, precision of 0.70, recall of 0.76, and an F1 score of 0.73.
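One plausible way to implement that selection is to embed the candidate examples and the snippet under test, then keep the k nearest neighbours by cosine similarity as the few-shot demonstrations. The embedding model and the details below are assumptions for illustration, not the experiment's actual implementation.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # any embedding model would do; text-embedding-ada-002 is assumed here
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

def knn_examples(target_code, candidate_snippets, k=3):
    vectors = embed(candidate_snippets + [target_code])
    candidates, target = vectors[:-1], vectors[-1]
    # cosine similarity between the target snippet and every candidate
    sims = candidates @ target / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(target))
    top = np.argsort(sims)[::-1][:k]
    return [candidate_snippets[i] for i in top]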
In a final variation of the prompt, the model was also asked to provide a fixed version of the code whenever a CWE was found. This approach led to an accuracy of 0.80, precision of 0.73, recall of 0.90, and an F1 score of 0.81.
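That variation largely amounts to extending the JSON schema requested in the prompt, for example (again an illustrative sketch rather than the experiment's exact wording):

Respond only in JSON with the keys "vulnerable" (true or false), "cwe" (the CWE identifier, or null),
and "fixed_code" (a corrected version of the snippet if a CWE was found, otherwise null).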
These prompting strategies show promising results for identifying vulnerable Python code. However, more investigation is required across different models, datasets, and prompts. If you’re interested in exploring the experiment, you can check out the pull request in OpenAI’s Cookbook on GitHub.