Redis Creator Debunks Chinese AI Distillation Claims: Is China Really Just Distilling?
The release of DeepSeek-R1 in January 2025 triggered one of the most peculiar episodes in the recent history of artificial intelligence. Within days of the model becoming widely available, commentary across Western media swung between two poles of disbelief. On one side, observers marveled at the model’s performance, which appeared to match or exceed leading American systems on a range of reasoning benchmarks, and at the reported training cost, allegedly a small fraction of what frontier labs in the United States were spending. On the other side, a louder chorus insisted that this could not be real, that the achievement was impossible, and that the only plausible explanation was theft. Distillation, the technique of training a smaller model to mimic a larger one, became the central accusation. American AI companies, including OpenAI and Microsoft, suggested publicly that DeepSeek had improperly harvested outputs from their systems to train competing models. The narrative was irresistible to many in the technology press: a Chinese firm had reached parity with the American frontier not through its own ingenuity but through some combination of industrial espionage, API abuse, and intellectual property theft. It is a compelling story, and like many compelling stories, it is mostly wrong. Understanding why requires looking carefully at what distillation actually is, what evidence has been presented to support the accusations, and what Chinese AI researchers have genuinely built on their own merits.
Distillation, in the most straightforward sense, is a method of compressing knowledge from a large, expensive model into a smaller, cheaper one. The technique has been part of machine learning under that name for about a decade, and the underlying idea of model compression is older still. The basic idea is intuitive. Imagine a teacher who knows everything about a subject, including the reasoning behind every answer. Now imagine a student who, rather than learning only the final answers on a homework set, also gets to see the teacher’s worked solutions. The student can learn the patterns of reasoning itself, not just the conclusions, and can sometimes even generalize better than the teacher because the worked examples reveal a more abstract form of understanding. In machine learning, the “teacher” is typically a large neural network, sometimes containing hundreds of billions of parameters, and the “student” is a much smaller network that is trained to match not only the teacher’s top answer but the full probability distribution the teacher assigns across possible answers, usually softened with a temperature parameter so that the relative likelihood of the wrong answers is visible too. This is where the term “distillation” comes from: the rich, probabilistic knowledge of the teacher is distilled into the more compact representation of the student.
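The idea of matching the teacher’s softened probability distribution can be made concrete with a short sketch. The code below is a minimal illustration in plain Python, not any lab’s actual training code: the function names, the three-class logits, and the temperature of 2.0 are all assumptions chosen for the example. It computes the KL divergence between temperature-softened teacher and student distributions, scaled by the square of the temperature as in the classic formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's softened prediction
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over three classes (illustrative values only).
teacher = [4.0, 1.5, 0.2]
good_student = [3.8, 1.4, 0.3]   # roughly agrees with the teacher
poor_student = [0.2, 1.5, 4.0]   # ranks the classes in the opposite order

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
print(f"close student: {loss_good:.4f}, distant student: {loss_poor:.4f}")
```

In a real pipeline this term would typically be combined with an ordinary loss on ground-truth labels and minimized by gradient descent; the sketch only shows why a student that reproduces the teacher’s whole distribution scores better than one that does not.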
The 2015 paper by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean gave the technique its name, and the idea has since been extended in many directions. One common variant is feature-based distillation, where the student is trained to match the internal representations of the teacher, not just the final outputs. Another is self-distillation, where a model is trained on its own predictions or on those of an earlier version of itself. A third is dataset distillation, where the goal is to compress an entire training dataset into a small set of synthetic examples. The technique is used by essentially every major AI lab in the world, including OpenAI itself, which has published research on distillation and uses it to produce smaller, more efficient versions of its flagship models. The smaller members of the GPT-4 family are widely believed to be distilled variants, and the o-series reasoning models appear to be the result of a process that resembles distillation in many respects, in that they are reportedly trained on outputs generated by larger and more capable internal systems, although OpenAI has not published the details.
This last point is important because it highlights an uncomfortable truth for those making the distillation accusations against Chinese labs. The practice of training one model on the outputs of another is not a violation of any widely accepted norm in the AI research community. It is standard practice. Every major AI lab in the United States, China, and Europe uses some form of synthetic data generation or distillation as part of their training pipelines. The reasons are practical. Human-generated training data is finite and expensive. As models become more capable, their outputs become a valuable resource for training the next generation. A model that has learned to reason well about mathematics can generate millions of high-quality examples for training a smaller model on mathematical reasoning. A model that has learned to write code in many languages can produce training data for code generation in styles and languages for which human-written examples are scarce. The line between legitimate and illegitimate use of synthetic data is genuinely unclear, and the field has not yet developed a consensus on where that line should be drawn.
This brings us to the specific accusations against DeepSeek and other Chinese AI firms. The most concrete piece of evidence cited by those making the accusations is that DeepSeek’s models sometimes, when prompted in certain ways, claim to be ChatGPT or to have been made by OpenAI. This is taken as proof that the models were trained on outputs from OpenAI’s systems. The evidence is weak, and understanding why requires a brief discussion of how language models work. When a language model generates text, it samples a continuation that is statistically likely given the prompt and its training data. If a model has been trained on a large corpus of internet text, which since late 2022 has included many ChatGPT transcripts and references to ChatGPT’s behavior, the model will have learned patterns associated with being an AI assistant. Without specific training to the contrary, a fine-tuned model might default to identifying itself as ChatGPT, because that is the most common identity associated with high-quality AI responses in its training distribution. This is a known phenomenon and not, on its own, evidence of distillation. It can be fixed with a few hundred examples in fine-tuning, and the fact that the DeepSeek team had not done this is a minor oversight, not a smoking gun.
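The statistical argument above can be illustrated with a deliberately tiny model. The sketch below is a counting model over a made-up corpus, not a neural network, and every sentence, count, and name in it (including “ExampleBot”) is hypothetical. It shows only the narrow point at issue: the most likely completion of “I am” reflects whatever identity dominates the text the model has seen, and adding identity-specific examples shifts it.

```python
from collections import Counter

def next_token_distribution(corpus, prefix):
    """Estimate P(next token | prefix) by counting continuations in the corpus."""
    n = len(prefix)
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == prefix:
                counts[tokens[i + n]] += 1
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def most_likely(dist):
    """Return the token with the highest estimated probability."""
    return max(dist, key=dist.get)

# Hypothetical web-like corpus in which one assistant identity dominates.
web_text = (
    ["I am ChatGPT , a language model"] * 6
    + ["I am an assistant"] * 3
    + ["I am ExampleBot"] * 1
)
prefix = ["I", "am"]

before = next_token_distribution(web_text, prefix)
print(most_likely(before), before)

# Adding a handful of identity examples changes the most likely completion.
identity_examples = ["I am ExampleBot"] * 10
after = next_token_distribution(web_text + identity_examples, prefix)
print(most_likely(after), after)
```

Real fine-tuning adjusts model weights rather than raw counts, so the analogy is loose, but the direction of the effect is the same.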
The other form of evidence is even more circumstantial. Analysts have pointed to the fact that DeepSeek’s models perform particularly well on benchmarks that are popular in the West, suggesting that they must have been trained on these specific benchmarks. But this argument proves too much. Any well-trained model will perform well on widely used benchmarks, since these benchmarks are designed to capture general capabilities that are valued in the field. A more telling point, and one that was largely absent from the initial wave of accusations, is that DeepSeek published detailed technical papers describing their methodology in considerable depth. The DeepSeek-V3 technical report, for example, runs to more than fifty pages and describes novel architectural choices, training procedures, and engineering optimizations. If the model were simply a distilled copy of an existing system, there would be no need to publish this material, and certainly no way to make it convincing. The level of detail in the technical reports, including the specific design choices that differ from those used by American labs, is itself strong evidence of independent innovation.
To understand what Chinese AI researchers have actually built, it is worth examining the specific contributions of DeepSeek and other Chinese labs. DeepSeek’s most significant architectural innovation is probably the Multi-head Latent Attention mechanism, or MLA, introduced with DeepSeek-V2. Standard transformer models use a technique called key-value (KV) caching to speed up inference: the keys and values computed for earlier tokens are stored so they do not have to be recomputed for each new token. This cache grows linearly with the length of the context and becomes a memory bottleneck for long-context applications. MLA addresses this bottleneck by compressing the keys and values into a much smaller latent vector per token, and caching that instead, dramatically reducing the memory required for inference while preserving model quality. This is not a minor engineering trick. It is a fundamental architectural contribution that has implications for how large language models can be deployed efficiently. Other labs, including some American ones, have subsequently adopted similar approaches, which is a typical pattern in AI research: a clever idea is introduced, and the field collectively benefits from it.
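A rough back-of-the-envelope calculation shows why the size of the cache matters. The sketch below compares the per-sequence memory of a conventional KV cache with one that stores a single compressed latent vector per token. All of the dimensions (60 layers, 128 heads of size 128, a 512-dimensional latent plus a 64-dimensional positional key, a 32,768-token context, 16-bit values) are illustrative assumptions for this example rather than a statement of any particular model’s configuration.

```python
def kv_cache_bytes(n_layers, seq_len, elems_per_token_per_layer, bytes_per_elem=2):
    """Total cache size for one sequence, assuming 16-bit values by default."""
    return n_layers * seq_len * elems_per_token_per_layer * bytes_per_elem

# Illustrative dimensions (assumptions for this sketch).
n_layers, n_heads, head_dim = 60, 128, 128
latent_dim = 512   # size of the compressed key/value latent
rope_dim = 64      # small extra positional key cached alongside the latent
seq_len = 32_768

# Conventional attention caches a full key and a full value for every head.
mha_elems = 2 * n_heads * head_dim
# A latent-attention cache stores only the compressed vector (plus positional key).
mla_elems = latent_dim + rope_dim

mha_bytes = kv_cache_bytes(n_layers, seq_len, mha_elems)
mla_bytes = kv_cache_bytes(n_layers, seq_len, mla_elems)

GIB = 2 ** 30
print(f"standard cache: {mha_bytes / GIB:.1f} GiB")
print(f"latent cache:   {mla_bytes / GIB:.2f} GiB")
print(f"reduction:      {mha_elems / mla_elems:.1f}x")
```

Under these assumed numbers the conventional cache needs about 120 GiB for a single long sequence while the latent cache needs about 2 GiB, which is the difference between needing several accelerators for one request and fitting many requests on one.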
DeepSeek’s mixture-of-experts architecture, used in their V3 model, is another significant contribution. Each mixture-of-experts layer contains 256 routed expert sub-networks plus one shared expert, but for any given token only 8 of the routed experts are activated. This allows the model to have a very large total parameter count, about 671 billion, while only about 37 billion parameters are used for any single token, which keeps the computational cost of inference low. The approach itself is not new, but DeepSeek’s implementation introduced several novel elements, including a fine-grained expert segmentation and a shared expert structure that improved both training stability and inference efficiency. The training of this model was reportedly completed for a small fraction of the cost typically associated with frontier models. The figure of roughly six million dollars cited in some reports covers only the final training run and almost certainly understates the true cost by ignoring prior research, failed experiments, and infrastructure investment, but the underlying point is correct: DeepSeek achieved competitive performance with substantially less compute than its American competitors.
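The routing idea can be sketched in a few lines. The toy below uses scalar “experts” that merely scale their input, a stand-in router score in place of a learned one, and an assumed choice of 8 active experts out of 256; none of it is DeepSeek’s code. It shows the general mechanism: score every expert, run only the top few, and add their weighted outputs to that of an always-on shared expert.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, routed_experts, shared_expert, k):
    """Run the shared expert plus only the k highest-scoring routed experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    out = shared_expert(x)             # the shared expert sees every input
    for i in top:
        out += (probs[i] / norm) * routed_experts[i](x)
    return out, top

NUM_EXPERTS, TOP_K = 256, 8  # TOP_K is an illustrative choice

# Toy scalar experts: expert i scales its input by a fixed factor in (0, 1].
routed_experts = [lambda x, w=(i + 1) / NUM_EXPERTS: w * x for i in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x

# Stand-in router scores; a real router computes these from the token's hidden state.
router_logits = [math.sin(i) for i in range(NUM_EXPERTS)]

output, active = moe_forward(2.0, router_logits, routed_experts, shared_expert, TOP_K)
active_fraction = len(active) / NUM_EXPERTS
print(f"active experts: {sorted(active)}")
print(f"fraction of routed experts evaluated: {active_fraction:.4f}")
```

Because only the selected experts are ever called, the cost per input scales with the number of active experts rather than with the total, which is why total parameter count and per-token compute can diverge so sharply.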
Beyond DeepSeek, the broader Chinese AI ecosystem has produced a stream of genuine innovations. Alibaba’s Qwen team has released a series of models that have consistently been among the top performers on public benchmarks and have been widely adopted in the open source community. The Qwen models have introduced their own architectural improvements, including work on long-context handling and multilingual capabilities. Baidu’s ERNIE series, while less prominent in Western discourse, has explored different architectural choices and has been deployed at massive scale within Chinese internet applications. Smaller Chinese labs have contributed work on efficient fine-tuning, on alignment techniques, and on specialized models for mathematics and code. The cumulative effect of this work is that China has built a genuinely competitive AI research community, not through any single dramatic breakthrough but through the kind of incremental, distributed progress that has characterized successful technology ecosystems throughout history.
The most telling aspect of the distillation accusations, however, is what they reveal about the accusers. The American AI industry has spent the last several years building a narrative in which the United States holds a substantial and durable lead in artificial intelligence, a lead justified by the country’s superior research institutions, its deep pool of talent, and its vibrant startup ecosystem. The emergence of competitive Chinese models threatens this narrative, and the distillation accusations function as a way to preserve it. If Chinese models are merely distilled copies of American ones, then the American lead remains intact, and the impressive technical achievements of Chinese researchers can be dismissed as derivative. This is a comforting story for American policymakers, investors, and technologists, but it is not a story that the evidence supports.
The deeper problem with the distillation narrative is that it assumes a very particular relationship between training data and model capabilities. If one believes that a model is essentially a compressed version of its training data, then distillation would explain the performance of Chinese models as a kind of intellectual theft. But this view of how language models work is incorrect. Modern language models are not databases that regurgitate their training data. They are systems that learn statistical patterns and develop internal representations that allow them to generalize in ways that go far beyond the specific examples they were trained on. A model trained purely on the outputs of another model, even with extensive distillation, would tend to inherit the limitations of the teacher, and imitation alone is unlikely to produce the novel capabilities that DeepSeek’s models demonstrate. The fact that DeepSeek’s models sometimes produce outputs that are clearly different in style, structure, and content from OpenAI’s models is evidence against the distillation hypothesis, not for it.
None of this is to say that intellectual property concerns in AI are illegitimate. They are real, and they will become more important as AI systems become more capable. But the appropriate response to these concerns is not to dismiss the achievements of an entire national research community as theft. The appropriate response is to develop clearer norms around the use of synthetic training data, to invest in techniques for protecting proprietary model outputs, and to engage in serious international dialogue about the governance of AI. The American AI industry has been largely uninterested in any of these conversations, preferring instead to assert its dominance and to lobby for export controls that would prevent Chinese researchers from accessing the hardware they need to train frontier models. The result has been a kind of techno-nationalism that serves no one well, least of all the researchers in both countries who would benefit from greater collaboration.
My position on this question is clear. The accusations of distillation against DeepSeek and other Chinese AI labs are not supported by the available technical evidence. The circumstantial evidence cited by accusers has plausible alternative explanations, and the substantial body of published technical work from Chinese researchers demonstrates genuine innovation. The narrative of Chinese AI as a derivative of American AI is comforting to those who want to believe in American technological supremacy, but it is not consistent with the reality of what Chinese researchers have built. This does not mean that all uses of distillation are legitimate, or that there are no legitimate concerns about how training data is sourced. It means that the specific accusations that have been made in the past year are largely unfounded, and that the AI community, in both the East and the West, would benefit from a more honest and more technically informed discussion of the real issues at stake. The future of artificial intelligence is too important to be shaped by nationalist mythology on either side of the Pacific, and the work of researchers everywhere deserves to be evaluated on its actual merits rather than on the geopolitical anxieties of the moment.