
The AI world is once again stirred by controversy. This time, attention turns to Chinese AI lab DeepSeek and questions about whether it may have trained its latest language model using output data from Google’s Gemini.
Although no definitive evidence has surfaced, several AI experts believe there are enough indicators to warrant deeper scrutiny.
Accusations: Did DeepSeek R1 Use Gemini Outputs?
Last week, DeepSeek unveiled an updated version of its R1 reasoning model, which reportedly delivers strong performance on mathematics and programming tasks.
However, the company remained vague about the data sources used during training. That ambiguity sparked speculation among AI researchers.
Some suspect that part of the training set came from Google’s Gemini family of models.
One such claim came from Sam Paech, a Melbourne-based developer who works on evaluating emotional intelligence in AI.
In a post on X, he suggested that DeepSeek’s latest version may have shifted from training on synthetic OpenAI data to using synthetic Gemini outputs.
If you're wondering why new deepseek r1 sounds a bit different, I think they probably switched from training on synthetic openai to synthetic gemini outputs. pic.twitter.com/Oex9roapNv
— Sam Paech (May 29, 2025)
He pointed out that DeepSeek’s R1-0528 model appears to favor expressions and language structures similar to those generated by Gemini 2.5 Pro.
While that’s not solid proof, another developer—the anonymous creator of the AI tool SpeechMap—noted that DeepSeek’s reasoning patterns often resemble those of Gemini.
According to him, the "traces of thought" generated by R1 read like the reasoning traces typically produced by Google's models.
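How do analysts ground claims like this? One common approach is to compare the frequency profiles of words and phrases across large samples of output from each model: unrelated models tend to have distinct verbal tics, while a model trained on another's outputs often inherits them. The Python sketch below illustrates the idea; the sample responses are hypothetical, and this is a simplified stand-in, not Paech's or SpeechMap's actual methodology.

from collections import Counter
import math
import re

def word_profile(texts):
    """Normalized word-frequency vector over a list of model responses."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Hypothetical response samples; a real study would compare thousands of
# responses to identical prompts from each model.
r1_samples = [
    "Let's delve into the heart of the problem and unpack each step.",
    "At its core, this question hinges on a subtle trade-off.",
]
gemini_samples = [
    "Let's delve into the underlying structure of the question.",
    "At its core, the task is a trade-off between speed and accuracy.",
]

score = cosine_similarity(word_profile(r1_samples), word_profile(gemini_samples))
print(f"Stylistic overlap: {score:.3f}")  # closer to 1.0 means more similar

A high overlap score is suggestive rather than conclusive, which is why even Paech frames his observation as an inference, not proof.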
Not the First Time
This isn’t DeepSeek’s first brush with allegations of training on proprietary AI outputs.
Back in December, developers observed that DeepSeek’s V3 model occasionally referred to itself as ChatGPT, suggesting possible exposure to OpenAI conversation logs during training.
Earlier this year, OpenAI told the Financial Times it had found evidence linking DeepSeek to distillation, a technique in which a smaller model is trained on the outputs of a larger, more capable one.
According to Bloomberg, Microsoft (an OpenAI partner and investor) discovered unusual download patterns on some OpenAI developer accounts in late 2024, believed to be operated by DeepSeek.
While distillation isn’t illegal or uncommon in itself, OpenAI’s terms of service prohibit clients from using its model outputs to create competing AI systems.
Could It All Be Coincidence?
Some experts argue that similarities in output styles between AI models aren’t necessarily suspicious.
As AI-generated content continues to saturate the internet, it becomes increasingly difficult to isolate purely “organic” training data: Reddit bots and AI-written articles flood the web, seep into training corpora, and influence model behavior.
Given this backdrop, researchers like Nathan Lambert of the nonprofit Allen Institute for AI (AI2) don't rule out the idea that DeepSeek trained its models on Gemini outputs, whether directly or indirectly.
In a post on X, Lambert wrote:
“If I were DeepSeek, I would absolutely be generating tons of synthetic data from the best API-accessible model available.”
He added that DeepSeek is short on GPUs but flush with cash, so generating synthetic data from the best external model is, in effect, a way to buy extra compute.
Countermeasures and Security Tactics
To combat such practices, AI companies are increasing security measures.
In April, OpenAI began requiring organizations to verify their identity with a government-issued ID from a country where its API is supported before they can access its most advanced models. Notably, China is not on that list.
Meanwhile, Google started “summarizing” the raw reasoning traces of models available through its AI Studio, making it harder to train a competing model on Gemini's full chain of thought.
And in May, Anthropic announced it would start summarizing its model outputs too, citing the need to protect "competitive advantages."
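In API terms, the countermeasure is straightforward: keep the full chain of thought on the server and return only a lossy summary alongside the answer. The sketch below illustrates the idea; the class and function names are hypothetical and do not reflect Google's or Anthropic's actual implementations.

class StubReasoningModel:
    """Hypothetical stand-in for a production reasoning model."""
    def generate_with_reasoning(self, prompt):
        raw_trace = "Step 1: restate the problem. Step 2: try small cases. " * 10
        return raw_trace, "42"

def summarize(raw_trace, max_words=25):
    """Toy summarizer; a production system would use another model here."""
    words = raw_trace.split()
    suffix = " ..." if len(words) > max_words else ""
    return " ".join(words[:max_words]) + suffix

def serve(model, prompt):
    raw_trace, answer = model.generate_with_reasoning(prompt)
    # The raw trace never leaves the server; a scraper collecting API
    # responses gets only the summary, which is far less useful as
    # distillation training data.
    return {"answer": answer, "reasoning_summary": summarize(raw_trace)}

print(serve(StubReasoningModel(), "What is 6 * 7?"))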
This growing tension highlights how fierce the AI arms race has become. Training sophisticated models like Gemini or GPT-4 takes years.
So, when a newcomer like DeepSeek suddenly makes rapid progress, questions inevitably arise:
Did they build it from scratch—or borrow from someone else's foundation?
Regardless of whether DeepSeek actually used Gemini outputs, one thing is clear:
AI labs are increasingly protective of their methods, and in a market worth billions, the fight over intellectual property is only just beginning.