A New Threat - Package Hallucination
Written by Sue Gee   
Wednesday, 07 May 2025

The rise and rise of reliance on LLMs for code generation has resulted in a new threat to software supply chains. Dubbed "package hallucination", it occurs when LLMs generate references to non-existent packages.

Package hallucination is a novel phenomenon explored in a paper to be presented at the 2025 USENIX Security Symposium.

The tendency of LLMs to "hallucinate", that is to invent bogus information that goes beyond their training data, is already well known. Now a study, led by Joseph Spracklen of the University of Texas at San Antonio, identifies a specific type of misinformation that could seriously compromise AI-generated code.

Akin to package confusion, or dependency confusion, attacks, "package hallucination" occurs when an LLM generates code that recommends or contains a reference to a package that does not actually exist. As the paper explains, an adversary can exploit package hallucinations, especially if they are repeated, by publishing a package containing malicious code or functionality to an open-source repository under the same name as the hallucinated package.
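
To make the mechanism concrete, here is a purely illustrative Python sketch; the package name quickparse_utils is hypothetical and stands in for any dependency an LLM might invent.

# Purely illustrative: "quickparse_utils" is a hypothetical package name
# standing in for any dependency an LLM might hallucinate.

# 1. The LLM suggests code with a plausible-looking but non-existent import:
generated_snippet = """
import quickparse_utils

settings = quickparse_utils.parse_file("config.yaml")
"""

# 2. To make the snippet run, a developer would typically install the missing
#    dependency by name:
install_command = "pip install quickparse_utils"

# 3. pip resolves that name simply by looking it up on the registry, so if an
#    attacker has meanwhile published a malicious package called
#    "quickparse_utils", that is what gets installed and later imported.
print(generated_snippet)
print(install_command)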

Exploiting Package Hallucination

The study used 16 of the most widely used large language models to generate 576,000 code samples in Python and JavaScript and found that over 440,000 of the package dependencies they contained, almost 20% of all dependencies, were non-existent.

  • Open-source models had a hallucination rate of 21.7% compared to 5.2% for OpenAI's GPT models.
  • GPT-4 Turbo had the lowest hallucination rate at 3.59%.
  • Python code had a lower hallucination rate (15.8%) than JavaScript (21.3%).
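
Spotting such hallucinations amounts to checking each generated dependency against the relevant registry. The following sketch is not the researchers' tooling, just a minimal illustration for Python using PyPI's public JSON API; note that an import name does not always match the PyPI project name (sklearn versus scikit-learn, for example), so a real check needs more care.

# Minimal sketch: flag imports in LLM-generated Python that have no matching
# project on PyPI. The generated snippet and the name "fastjsonlite" are
# hypothetical examples.
import ast
import urllib.error
import urllib.request


def top_level_imports(source: str) -> set[str]:
    """Collect the top-level module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names


def exists_on_pypi(name: str) -> bool:
    """Return True if PyPI has a project registered under this name."""
    try:
        with urllib.request.urlopen(f"https://pypi.org/pypi/{name}/json", timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # 404 means no such project


generated_code = "import fastjsonlite\n"  # hypothetical LLM output
for pkg in top_level_imports(generated_code):
    if not exists_on_pypi(pkg):
        print(f"Possible package hallucination: {pkg!r} is not on PyPI")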

[Chart: hallucination rates for the models tested]

A total of 205,474 unique hallucinated package names were identified and the researchers looked into how different model settings, such as temperature and decoding strategies, affect the rate of package hallucinations.

With regard to LLMs, the temperature parameter is a numerical value used to adjust the degree of randomness in the generated responses - a lower temperature results in more predictable and deterministic outputs, while a higher temperature increases creativity and diversity. The range is generally 0 to 2 for commercial LLMs, although Anthropic (not included in this study) limits it to between 0 and 1. For a task such as generating package names the recommended range is 0 to 0.3, with higher values intended for storytelling, poetry and brainstorming, where creativity is to be encouraged. For the purposes of this study, temperature was varied between the minimum and maximum allowed, i.e. between 0 and 2 for the GPT models and between 0 and 5 for the open source LLMs.
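
For anyone wanting to experiment, this is a minimal sketch of setting the temperature parameter with the OpenAI Python SDK; the model name and prompt are illustrative rather than those used in the study.

# Minimal sketch of requesting code at a low temperature via the OpenAI
# Python SDK. The model name and prompt are illustrative examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user",
               "content": "Write Python code that parses a YAML config file."}],
    temperature=0.2,  # conservative setting suited to code generation
)
print(response.choices[0].message.content)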

[Charts: hallucination rate against temperature for the GPT and open source models]

As shown in these graphs, lower temperatures produced lower hallucination rates on package names, while higher temperatures significantly increased them. Notice how the range of values on the y-axis is much wider for the two open source models than for the OpenAI models and, whereas at maximum temperature GPT-3.5 had a hallucination rate of 31.8%, GPT-4's rate was only 8.9%.

The researchers also employed the top-p and top-k decoding strategies, which are intended to reduce the chance of a low probability token being selected as a potential package name, on the assumption that lower probability tokens correspond to a higher probability of hallucination. Compared to varying the temperature, these settings had little effect, producing only a slight increase in hallucination rates (1.16% on average).
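
For illustration, this is what the top-k and top-p sampling settings look like with the Hugging Face transformers generate() API; the small stand-in model is not one of those evaluated in the paper.

# Sketch of sampling with top-k and top-p (nucleus) filtering in transformers.
# "gpt2" is a small stand-in model used purely for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("import ", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,          # keep only the 50 most likely next tokens...
    top_p=0.9,         # ...then the smallest set covering 90% of the probability
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))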

The other factor that affected hallucination rates was the recency of the data. To evaluate whether the hallucination rate was correlated with topics and packages that emerged after a model was trained, the coding prompts were divided into two temporal datasets. A smaller difference between the rates for recent and all-time prompts would indicate better performance on questions falling outside the model's pre-training data and therefore a more generalizable model. The models tested were more likely to generate a package hallucination when responding to prompts dealing with more recent topics, producing a hallucination rate around 10% higher on average for recent prompts than for all-time prompts.

There was also a clear correlation between the number of unique package names generated during testing and the rate of hallucination, demonstrating that the more verbose a model, the greater the incidence of invented package names. The chart below also reveals the superiority of the GPT models, which sit in a cluster well below the regression line.

[Chart: unique package names generated against hallucination rate, with regression line]

 

More Information

We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs by Joseph Spracklen, Raveen Wijewickrama and AHM Nazmus Sakib (University of Texas at San Antonio), Anindya Maiti (University of Oklahoma), Bimal Viswanath (Virginia Tech) and Murtuza Jadliwala (University of Texas at San Antonio)

 





Last Updated ( Wednesday, 07 May 2025 )