Meta Faces Criticism Over Llama 4 Benchmarking, Company Responds

Meta is facing intense criticism from AI researchers after it allegedly submitted a modified version of its latest Llama 4 models for performance benchmarks, misleading developers about the models' true capabilities.

Rumors have circulated that the company specifically trained an enhanced version of the Maverick model, one of the Llama 4 variants, to perform better in benchmark tests and mask its weaknesses.

Inconsistent Performance Across Versions

According to a report published by TechCrunch, the Maverick model ranked second on the LM Arena platform, which relies on human evaluations to score the performance of AI models.

However, the version tested was not the same as the one Meta later released to developers.

In an official blog post, Meta clarified that the model evaluated on LM Arena was an experimental chat-optimized version, which differs from the general release version made available to the public.

Supporting documents on the official Llama website confirmed that the tested model was a “chat-tuned” variant, raising concerns about fairness and transparency in AI performance evaluation.

Typically, companies provide unaltered models for benchmarking to ensure the results reflect real-world usage.

By using a custom-tuned version for testing and then releasing a different one, Meta risks misleading developers and undermining the validity of model comparisons.

Researchers have pointed out clear differences between the publicly available version and the one tested on LM Arena.

The benchmarked model reportedly used excessive emojis and produced unusually long responses.

This discrepancy deepens doubts about whether benchmark results accurately represent how the model would perform in real-world applications.

A rumor also circulated from someone claiming to be a former Meta employee, who said they had resigned in protest over the company's benchmarking practices.

The individual accused Meta of manipulating test results.

Meta Responds

In response, Ahmad Al-Dahle, Vice President of Generative AI at Meta, denied the circulating claims.

In a post on X, Al-Dahle denied that Meta had trained its models on test sets, calling such claims "completely false."

He acknowledged that there were reports of inconsistent performance between the Maverick and Scout models across different cloud providers hosting these systems.

He attributed this to the rapid deployment of the models once ready, noting that it might take a few days for all public-facing applications to be fully optimized.

Al-Dahle also reaffirmed Meta’s commitment to fixing bugs and supporting its partners in the field.

In conclusion, this incident highlights the urgent need to improve benchmarking standards and ensure transparency in AI performance evaluations, enabling developers to make informed decisions when adopting new technologies.
