A New Threat - Package Hallucination
Written by Sue Gee   
Wednesday, 07 May 2025

The rise and rise of reliance on LLMs for code generation has resulted in a new threat to software supply chains. Dubbed "package hallucination", this occurs when LLMs generate references to non-existent packages.

Package hallucination is a novel phenomenon explored in a paper to be presented at the 2025 USENIX Security Symposium.

The tendency of LLMs to "hallucinate", that is to invent bogus information that goes beyond their training data, is already well known. Now a study, led by Joseph Spracklen of the University of Texas, identifies a specific type of misinformation that could seriously compromise AI-generated code.

Akin to package confusion, also known as dependency confusion, "package hallucination" occurs when an LLM generates code that recommends or contains a reference to a package that does not actually exist. As the paper explains, an adversary can exploit package hallucinations, especially if they are repeated, by publishing a package containing malicious code or functionality to an open-source repository under the same name as the hallucinated package.
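One practical takeaway from this description is that dependencies suggested by an LLM can be verified before anything is installed. The following is a minimal, illustrative Python sketch, not taken from the paper, which checks whether each suggested package name is actually registered on PyPI using the public PyPI JSON API; the suggested package names in the example are hypothetical.

```python
import requests

def package_exists_on_pypi(name: str) -> bool:
    """Return True if the package name is registered on PyPI (HTTP 200 from the JSON API)."""
    resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    return resp.status_code == 200

# Hypothetical list of dependencies suggested by an LLM
suggested = ["requests", "numpy", "totally-made-up-helper"]

for pkg in suggested:
    if package_exists_on_pypi(pkg):
        print(f"{pkg}: registered on PyPI")
    else:
        print(f"{pkg}: NOT FOUND - likely hallucinated, do not install")
```

Note that a name being registered is no guarantee of safety: as the attack described above shows, an adversary may already have published a malicious package under a previously hallucinated name, so provenance checks still matter.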

Exploiting Package Hallucination

The study used 16 of the most widely used large language models to generate 576,000 code samples in Python and JavaScript and found that over 440,000 of the package dependencies they contained, almost 20% of all dependencies, were non-existent.

  • Open-source models had a hallucination rate of 21.7% compared to 5.2% for OpenAI's GPT models.
  • GPT-4 Turbo had the lowest hallucination rate at 3.59%. ​
  • Python code had a lower hallucination rate (15.8%) than JavaScript (21.3%). ​

Hallucination Rate Chart

A total of 205,474 unique hallucinated package names were identified and the researchers also looked into how different model settings, such as temperature and decoding strategies, affect the rate of package hallucinations.

With regard to LLMs, the temperature parameter is a numerical value used to adjust the degree of randomness in the generated responses: a lower temperature results in more predictable and deterministic outputs, while a higher temperature increases creativity and diversity. The range for the temperature parameter is generally between 0 and 2 for commercial LLMs, although Anthropic (not included in this study) limits it to between 0 and 1. The recommended range for a task such as generating package names is considered to be 0 to 0.3, while higher values are intended for storytelling, poetry and brainstorming, where creativity is to be encouraged. For the purposes of this study temperature was varied between the minimum and maximum allowed, i.e. between 0 and 2 for the GPT models and between 0 and 5 for the open source LLMs.
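To make the effect of temperature concrete, here is a small illustrative Python sketch, not code from the study, showing how temperature rescales a model's logits before sampling: dividing by a low temperature sharpens the distribution towards the most likely token, while a high temperature flattens it so unlikely tokens are picked more often. The logits shown are hypothetical.

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    """Softmax-sample one token after scaling the logits by 1/temperature."""
    scaled = [v / temperature for v in logits.values()]
    max_s = max(scaled)                                # subtract max for numerical stability
    weights = [math.exp(v - max_s) for v in scaled]
    return random.choices(list(logits), weights=weights, k=1)[0]

# Hypothetical next-token logits when the model is completing a package name
logits = {"requests": 3.0, "request-utils": 1.5, "reqwstz": 0.2}

print(sample_with_temperature(logits, temperature=0.1))  # almost always "requests"
print(sample_with_temperature(logits, temperature=2.0))  # low-probability names appear far more often
```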

Charts: hallucination rate against temperature

As shown in these graphs, lower temperatures produced lower hallucination rates for package names, while higher temperatures significantly increased them. Notice how the range of values on the y-axis is much wider for the two open source models than for the OpenAI models: at maximum temperature GPT-3.5 had a hallucination rate of 31.8%, whereas GPT-4's rate was only 8.9%.

Compared to temperature, employing the decoding strategies of top-p and top-k, which are intended to reduce the chances of a low-probability token being selected as part of a package name, on the assumption that lower probability tokens correspond to a higher likelihood of hallucination, made little difference, producing only a slight increase in hallucination rates (1.16% on average).
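For readers unfamiliar with these decoding strategies, the following illustrative Python sketch, again not code from the paper, shows the basic idea: top-k keeps only the k most probable tokens, while top-p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches p, and the model then samples only from the filtered set. The probabilities used are hypothetical.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k highest-probability tokens."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, total = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        total += prob
        if total >= p:
            break
    return kept

# Hypothetical next-token probabilities for a package-name completion
probs = {"requests": 0.70, "request-utils": 0.20, "reqwstz": 0.07, "rqsts": 0.03}

print(top_k_filter(probs, k=2))    # {'requests': 0.7, 'request-utils': 0.2}
print(top_p_filter(probs, p=0.9))  # same two tokens: cumulative probability reaches 0.9
```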

The other factor that affected hallucination rates was the recency of the data. To evaluate whether the hallucination rate was correlated with topics and packages that emerged after a model was trained, the coding prompts were divided into two temporal datasets. A smaller difference between the rates for recent and all-time prompts would indicate better performance in handling questions falling outside the model's pre-training data and therefore a more generalizable model. The models tested were more likely to generate a package hallucination when responding to prompts that dealt with more recent topics, with hallucination rates on average 10% higher for recent prompts than for older ones.

There was also a clear correlation between the number of unique package names generated during testing and the rate of hallucination, demonstrating that the more verbose a model, the greater the incidence of invented package names. The chart below also reveals the superiority of the GPT models, which sit in a cluster well below the regression line.

Chart: unique package names against hallucination rate

 

More Information

We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs by Joseph Spracklen, University of Texas at San Antonio; Raveen Wijewickrama, University of Texas at San Antonio; AHM Nazmus Sakib, University of Texas at San Antonio; Anindya Maiti, University of Oklahoma; Bimal Viswanath, Virginia Tech; Murtuza Jadliwala, University of Texas at San Antonio

 

Related Articles

Programming In The Age of AI

Does AI Help or Hinder?

GitHub Copilot Provides Productivity Boost

GitHub Sees Exponential Rise In AI

GitHub Announces AI-Powered Changes 

  

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Facebook or Linkedin.


 


Last Updated ( Wednesday, 07 May 2025 )