OpenAI’s o3 and o4-mini show higher hallucination rates

In a surprising finding, internal tests conducted by OpenAI have revealed that its new o3 and o4-mini models, designed specifically for reasoning tasks, hallucinate more often than previous models.

New model generations are normally expected to hallucinate less than the versions they replace, but that is not the case here.

This raises questions about these models' ability to provide accurate information, particularly in fields that demand high reliability, such as law and medicine.

Increased Hallucination Rates in OpenAI's New Models

According to OpenAI's technical report, the o3 and o4-mini models showed notably high hallucination rates when evaluated on PersonQA, the company's internal benchmark for measuring the accuracy of a model's knowledge about people.

The "o3" model recorded a hallucination rate of 33%, while this rate climbed to 48% for the "o4-mini" model.

For comparison, the older reasoning models o1 and o3-mini registered significantly lower hallucination rates of 16% and 14.8%, respectively.
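
OpenAI has not published PersonQA's exact scoring method, but a hallucination rate on such a benchmark can be understood as the share of attempted answers that contain fabricated claims. A minimal sketch, with purely illustrative labels and data:

```python
# Minimal sketch of a PersonQA-style hallucination rate (the real
# benchmark's scoring is not public; labels and data are illustrative).

def hallucination_rate(graded_answers: list[str]) -> float:
    """Share of attempted answers graded as hallucinated.

    graded_answers holds one label per question: "correct",
    "hallucinated", or "abstained" (illustrative labels).
    """
    attempted = [g for g in graded_answers if g != "abstained"]
    hallucinated = sum(1 for g in attempted if g == "hallucinated")
    return hallucinated / len(attempted) if attempted else 0.0

grades = ["correct", "hallucinated", "correct",
          "abstained", "correct", "hallucinated"]
print(f"{hallucination_rate(grades):.0%}")  # 40% of attempted answers
```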

This increase may be partly due to the new models making more claims overall: the more claims an answer contains, the greater the chance that some of them are inaccurate.
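
To see why simply making more claims can push the measured rate up, consider a toy independence model (an illustrative assumption, not how OpenAI evaluates its models): if each claim is wrong with some fixed probability, the chance that an answer contains at least one false claim grows quickly with the number of claims.

```python
# Toy model (illustrative assumption): each claim is independently wrong
# with probability p, so an answer with n claims contains at least one
# false claim with probability 1 - (1 - p) ** n.

def p_at_least_one_error(p: float, n: int) -> float:
    """Probability that at least one of n independent claims is wrong."""
    return 1 - (1 - p) ** n

for n in (1, 3, 5, 10):
    print(f"{n:2d} claims -> {p_at_least_one_error(0.05, n):5.1%} chance of an error")
# Output:
#  1 claims ->  5.0% chance of an error
#  3 claims -> 14.3% chance of an error
#  5 claims -> 22.6% chance of an error
# 10 claims -> 40.1% chance of an error
```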

Furthermore, tests conducted by Transluce, a non-profit AI research lab, found that the o3 model sometimes invents details about how it reached its answers, such as claiming to have run code on a MacBook Pro outside the ChatGPT environment, something it cannot actually do.

Challenges in Understanding the Causes of Hallucination

OpenAI noted that further research is needed to understand why hallucination rates are higher in the new reasoning models.

One hypothesis is that the reinforcement learning techniques used to train these models amplify issues that were partially mitigated in earlier versions.

This highlights the persistent challenges in developing AI models capable of delivering accurate and dependable information.

Additionally, the elevated hallucination rates in the new models prompt concerns regarding their deployment in applications demanding high precision, such as drafting legal contracts or providing medical advice.

In these contexts, inaccurate information could lead to severe consequences, thereby diminishing the trustworthiness of these models in professional settings.

Attempts to Mitigate Hallucinations

Meanwhile, OpenAI is working to improve the accuracy of its models by integrating web search. GPT-4o with search, for example, achieves up to 90% accuracy on the SimpleQA benchmark.
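
For context, here is a minimal sketch of how a developer might ground answers in live web results using OpenAI's Responses API and its web search tool (the prompt is illustrative):

```python
# Minimal sketch: grounding an answer in live web results via OpenAI's
# Responses API web search tool. Requires `pip install openai` and the
# OPENAI_API_KEY environment variable; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # let the model consult the web
    input="When was the SimpleQA benchmark released? Cite a source.",
)

print(response.output_text)  # answer grounded in retrieved web pages
```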

Nevertheless, hallucination remains a significant challenge, particularly in models designed for reasoning tasks.

Unexpected New Behaviors in ChatGPT

On another note, some ChatGPT users recently noticed that the model had begun addressing them by name without ever having been told it, a behavior that surprised and unsettled some of them.

Although OpenAI has not issued an official statement on the behavior, some users have speculated that it is linked to ChatGPT's new Memory feature, which personalizes interactions based on past conversations.

However, many users expressed reservations about this behavior, viewing it as potentially intrusive and uncomfortable.

In any case, this behavior does not fall into the category of hallucination. Instead, it might be an attempt by the company to make its models friendlier.

Ultimately, hallucination and unexpected model behaviors will remain a central focus for researchers and developers as part of the ongoing effort to improve AI capabilities.
