
Recent results have revealed that the base version of Meta's "Maverick" model performs markedly worse on the popular LM Arena AI benchmark than competitors such as GPT-4o and Claude 3.5.
The results sparked controversy over Meta's earlier use of an optimized pre-release version that had scored highly, prompting the benchmark's organizers to tighten their testing policies and re-evaluate the unmodified release version of Maverick (Llama-4-Maverick-17B-128E-Instruct).
Maverick is one of four models in Meta's latest AI generation, Llama 4.
According to data published by the LM Arena platform on April 12, 2025, the base version trailed by a clear 15-25% on complex tasks such as reasoning and critical thinking, and it also ranked well below models released months earlier, including DeepSeek v2.5 and Gemini 1.5 Pro.
The release version of Llama 4 has been added to LMArena after it was found out they cheated, but you probably didn't see it because you have to scroll down to 32nd place which is where is ranks
— ρ:ɡeσn (@pigeon__s) April 11, 2025
However, Meta defended its strategy of providing a customizable open-source model, rather than focusing solely on excelling in standardized benchmarks.
Analysts noted that benchmark scores do not necessarily reflect a model's real-world performance, especially since a model can be tuned to score highly under specific test conditions.
For its part, Meta explained that the earlier pre-release version had been heavily optimized for conversational quality, optimizations that may not suit every practical use case.
This controversy reflects a broader challenge in the AI industry: balancing transparency and competitiveness.
While companies like OpenAI focus on highly efficient closed-source models, Meta takes a different approach, empowering developers to modify its models to fit their needs, even if out-of-the-box performance is more modest.
Meta is expected to continue developing Maverick, incorporating developer feedback to improve its core capabilities in the coming months.
It is worth noting that LM Arena is a leading platform for evaluating conversational models, but the debate surrounding the accuracy of its results is growing as companies increasingly rely on benchmark tests for marketing.
Ultimately, developers are best served by choosing models based on their practical applications, not just benchmark results.