Behind Closed Doors: China's AI Pioneers Gather for an Exclusive Discourse on DeepSeek
January 26: A High-Level Closed-Door Panel Discussion Among China's Top AI Minds, as DeepSeek Takes the World by Storm
A closed-door discussion about DeepSeek has generated an unexpectedly passionate response within the global AI community. On January 26, 2025, Li Guangmi, founder and CEO of Shixiang, convened a closed-door discussion about DeepSeek with dozens of top AI researchers, investors, and practitioners. They explored DeepSeek's technical details, organizational culture, and its short- and long-term impact after going viral.
As famed Silicon Valley investor Marc Andreessen commented on DeepSeek-R1: "As open source, a profound gift to the world." In this spirit of openness, the participants shared their collective insights from this closed-door meeting.
Here are the key points from the discussion:
The Mysterious DeepSeek
"DeepSeek's primary mission is pushing the boundaries of intelligence"
Founder and CEO Liang Wenfeng is DeepSeek's core figure. Unlike Sam Altman, he is deeply technical.
DeepSeek earned its reputation by being the first to replicate and openly release work such as MoE and o1-style models. Moving early was key, though whether they can remain the best is yet to be seen. The future challenge is limited resources, which forces them to concentrate on the most impactful areas. The team's research capability and culture are excellent; with another 100,000-200,000 GPUs, they could achieve even more.
From the preview to the official release, DeepSeek's long-context capability improved rapidly; its 10K context length was achieved using unconventional methods.
Scale AI's CEO claimed DeepSeek has 50,000 GPUs, but the actual number is much lower. Public information suggests DeepSeek has 10,000 older A100 cards and possibly 3,000 pre-ban H800s. DeepSeek strictly complies with regulations and hasn't purchased any non-compliant GPUs. By comparison, GPU usage in U.S. labs is much less efficient.
DeepSeek focused all its energy on a very narrow point, abandoning many other areas such as multimodality. Rather than simply serving humans, they focus on advancing intelligence itself; this may be key to their success.
In a sense, quantitative trading can be seen as DeepSeek's business model. High-Flyer (幻方, Liang's quantitative investment firm) was a product of the previous machine-learning wave. DeepSeek's primary mission is pushing intelligence forward; money and commercialization are lower priorities. China needs several leading AI labs exploring ways to beat OpenAI. The road to intelligence is long, differentiation is beginning this year, and new innovations will emerge.
From a purely technical perspective, DeepSeek serves as a "West Point" of AI, with significant impact on talent diffusion.
U.S. AI labs also lack good business models. AI currently has no proven business model and may need to figure this out later. Liang is ambitious: DeepSeek doesn't care about form; they're simply pursuing AGI.
Reading DeepSeek's papers, many of the techniques focus on reducing hardware costs; along several major scaling directions, DeepSeek's techniques can lower cost.
Long-term computing power demand won't be affected, but in the short term everyone will focus on making AI more efficient. Demand remains strong; everyone is short of computing power.
Regarding DeepSeek's organization:
In investing, people usually pick the most senior talent combinations, but DeepSeek's model shows that a team composed mostly of smart young graduates from domestic universities, with good team chemistry, can gradually grow its capabilities together. Whether poaching a single person can break that combination is questionable; the current impact on DeepSeek may not be significant.
There's plenty of money in the market, but DeepSeek's core is its culture and organization. DeepSeek's and ByteDance's research cultures are similar: both are fundamentals-oriented. One measure of a good culture is whether it has sufficient funding and a long-term horizon, and cultural sustainability requires a solid business model; both companies have very good business models.
How could DeepSeek catch up so quickly?
Reasoning models need higher-quality data and training. For long text and multimodal tasks, catching up to a closed-source model from scratch would be harder, but the architecture of pure reasoning models hasn't changed much, so reasoning is an easier direction to pursue.
R1's quick catch-up may be because the task wasn't particularly difficult. RL simply helps the model choose more accurately. R1 hasn't broken the efficiency frontier of Consensus 32 (majority voting over 32 samples): it spends roughly 32 samples' worth of compute, essentially converting parallel exploration into serial reasoning. It hasn't raised the boundaries of intelligence, just made reaching them easier.
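For intuition, here is a minimal sketch of what a Consensus-32 baseline looks like; `sample_answer` is a hypothetical stand-in for one model generation, not any lab's actual setup.

```python
# Minimal sketch of Consensus@32 (majority voting): draw 32 independent
# samples in parallel and return the most common final answer. The point
# above is that R1's single serial chain roughly matches this
# parallel-exploration baseline rather than surpassing it.
import random
from collections import Counter

def sample_answer(prompt: str) -> str:
    """Toy model that answers correctly about 40% of the time."""
    return "42" if random.random() < 0.4 else random.choice(["41", "43", "44"])

def consensus(prompt: str, n: int = 32) -> str:
    votes = Counter(sample_answer(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]  # majority answer across n samples

print(consensus("what is 6 * 7?"))  # usually "42"
```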
Explorer vs Follower
"AI is like a step function - followers need 10x less computing power"
AI progress resembles a step function: followers now need 10x less computing power. Follower costs stay low, but explorers still need to train many models, and exploration of algorithms and architectures won't stop. Behind each step lies significant investment from many people, so computing-power investment will continue, and many will also invest in products. Beyond reasoning, many directions remain compute-intensive.
When exploring new directions, 1,000 GPUs aren't necessarily better than 100, but there is likely a threshold: with only 100 GPUs, success is unlikely because each iteration takes too long.
Advancing physics requires both academic researchers who explore many directions without any requirement of returns, and industrial labs that focus on efficiency improvements.
From the explorer-versus-follower perspective, small companies with few GPUs must focus on efficiency, while large companies prioritize getting models out faster. Many methods that improve efficiency on 2,000-GPU clusters don't work at 10,000 GPUs, where stability matters more.
CUDA's ecosystem advantage lies in its comprehensive operator coverage; Chinese companies such as Huawei break through by focusing on the most commonly used operators, with a late-mover advantage. At 100,000 GPUs, the cost of leading is high while following is more efficient: how should one choose? And what is China's next catch-up direction, for example multimodality, given that GPT-5 keeps getting delayed?
Technical Detail 1: SFT (Supervised Fine-Tuning)
"No need for SFT at inference level"
DeepSeek's most shocking contribution isn't open source or low cost, but that SFT is no longer needed. (Note: SFT, Supervised Fine-Tuning, is an important model-optimization technique that uses labeled data to improve performance on specific tasks.) However, this applies only to reasoning tasks; tasks beyond reasoning may still need SFT.
DeepSeek-R1 demonstrates, to some extent, that using SFT for distillation brings significant benefits. R1 doesn't avoid SFT completely; it uses SFT only in the third stage, followed by RLHF (Reinforcement Learning from Human Feedback) for final alignment.
R1 is essentially trained through SFT; what's special is that its SFT data is generated by a model trained with RLHF. This shows that with a sufficiently good method, SFT distillation alone is sufficient.
GRPO's essence lies in having a sufficiently strong base model: a single prompt yields 16 generations, and after several attempts the correct answer appears with high probability. A good base model plus verifiability is R1's approach. Math and coding fit well because these tasks are easy to verify, but in theory a similar process could be applied to other tasks, ultimately yielding a general-purpose RL model.
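As a rough illustration (not DeepSeek's actual implementation), GRPO's group-relative advantage can be sketched like this, assuming a binary verifiable reward:

```python
# Sketch of GRPO-style group-relative advantages: instead of a learned
# value critic, each sample is scored against its sibling generations
# for the same prompt. Assumes a binary verifiable reward (1 = correct).
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Example: 16 generations for one prompt, 3 of which verify as correct.
rewards = [1.0] * 3 + [0.0] * 13
print(group_relative_advantages(rewards))  # correct samples get positive advantage
```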
R1-Zero developed CoT processes without using SFT, and the CoT grew increasingly long. This emergent process is very meaningful. SFT is more like an auxiliary means: the model can produce CoT without it, but with SFT it can generate CoT faster.
This shows that many small-model companies can use SFT to distill large models with good results, though SFT was never completely abandoned in R1's process.
In theory, an LLM with infinite-length CoT can be viewed as a Turing machine: infinite CoT could in principle solve extremely complex computational problems. But CoT is essentially an intermediate search result: the model keeps sampling potential outputs in an optimized way, sometimes producing correct results, which then guide it toward more credible directions.
Although DeepSeek's paper doesn't mention long context, the context window noticeably improves between R1-preview and R1. It's speculated they made Long2Short-style CoT improvements, including trimming the CoT used in third-stage SFT, with the final version possibly trained on cleaner CoT data.
There are several types of SFT data. One is cold-start data, which is more like giving the model a good policy and initialization for better exploration (in RL, one optimization objective is to stay close to the original policy). The other is data generated in bulk after RL and combined with other data for SFT of the base model. Essentially, each domain has its own data-processing pipeline. The resulting capability comes from the base model; distillation is nearly lossless, and mixing multiple domains together may yield generalization.
R1's data efficiency is uncertain. OpenAI likely did similar things for data efficiency, such as fine-tuning. R1's third stage didn't train on top of the RL-trained model; instead, that model generated data for SFT to obtain R1, comprising 600K reasoning samples and 200K non-reasoning samples. The second-stage model likely shows problem-solving ability even in scenarios outside the example domains, as long as reasoning is required, which is how the reasoning data was obtained. The non-reasoning data is part of the V3 SFT data, with V3 used to imagine a CoT. 800K samples is quite small and efficient.
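A hedged sketch of what such a rejection-sampling data-generation step might look like; `generate` and `verify` below are toy stand-ins, not DeepSeek's actual components:

```python
# Minimal sketch of the data pipeline described above: sample CoT
# completions from an RL-trained checkpoint, keep only those a verifier
# accepts, and emit them as SFT records.
import random

def generate(prompt: str) -> str:
    """Toy stand-in for sampling one CoT + answer from the RL checkpoint."""
    answer = random.choice(["42", "41"])
    return f"<think>work through {prompt} ...</think> Answer: {answer}"

def verify(prompt: str, completion: str) -> bool:
    """Toy verifier: accept only completions whose final answer matches."""
    return completion.endswith("Answer: 42")

def build_reasoning_sft_data(prompts, samples_per_prompt=4):
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if verify(prompt, completion):       # rejection sampling
                dataset.append({"prompt": prompt, "completion": completion})
                break                            # keep one verified sample
    return dataset

print(len(build_reasoning_sft_data(["what is 6 * 7?"] * 10)))
```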
Technical Detail 2: Data
"DeepSeek pays extremely high attention to data annotation"
Scale AI won't necessarily fail. RL is now needed across various domains, math and coding in particular, and still requires expert annotation. Annotation may become more complex, but the market will continue to exist.
In training, multimodal data shows almost no effect, or the cost is too high. There is no evidence of its usefulness yet, though future opportunities might be significant.
DeepSeek places extremely high value on data annotation; even Liang himself participates in labeling. Besides algorithms and techniques, data precision is crucial. Tesla's annotation costs are almost 20 times those of Chinese autonomous-driving companies. Chinese autonomous-driving data went through stages from quantity to quality before teams realized they needed annotators with extensive, skilled driving experience, something Tesla did from the start. Tesla's robot movements are annotated by people with exceptionally healthy cerebellums, yielding smoother motions, while Chinese annotations lacked this smoothness. DeepSeek's investment in data annotation is likewise key to its model's efficiency.
Technical Detail 3: Distillation
"The downside of distillation is reduced model diversity"
If you don't understand the biggest technical pain points in model training, and instead sidestep them by relying on distillation, you may fall into a trap when the next generation of technology emerges.
The capabilities of large and small models don't match: distilling from a large model to a small model is true teacher-to-student distillation. Trying to distill lots of Chinese data into a model that doesn't understand Chinese at all might even hurt performance. In practice, however, distilling small models shows clear gains, and applying RL after R1 distillation brings significant further growth, because the model is trained on data that exceeds its own native capability.
The downside of distillation is reduced model diversity, which caps the model's upper limit: it cannot surpass the strongest model it learns from. In the short term, though, distillation remains a viable path.
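For contrast, below is a minimal toy sketch of classic soft-label distillation in PyTorch; note that R1-style distillation is instead sequence-level SFT on teacher-generated samples, but either way the student is pulled toward the teacher's distribution, which is one reason diversity collapses.

```python
# Toy sketch of classic teacher-to-student distillation: the student is
# trained to match the teacher's output distribution via KL divergence.
# Tiny linear layers stand in for real LLMs; this is illustrative only.
import torch
import torch.nn.functional as F

vocab, dim = 100, 32
teacher = torch.nn.Linear(dim, vocab)  # frozen stand-in for a large model
student = torch.nn.Linear(dim, vocab)  # trainable stand-in for a small model
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(16, dim)  # toy batch of hidden states
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x), dim=-1)
    loss = F.kl_div(F.log_softmax(student(x), dim=-1),
                    soft_targets, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final distillation loss: {loss.item():.4f}")
```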
Distillation invites certain hacks. Early on, instruction-tuned models trained with RL show a characteristic behavior: first generating useless thoughts, then suddenly landing on the right answer. The reason is that many RL hacks are very subtle: the model may have memorized many problems during pre-training, so it appears to be thinking while actually just converging on a memorized answer. This is distillation's hidden danger. If you distill without annotation, then when doing RLVR (Reinforcement Learning with Verifiable Rewards) the model will solve problems by simpler shortcuts rather than actually thinking about them; OpenAI hasn't solved this either. It may be a defect of this generation of technology.
In the long term, taking shortcuts by replicating others instead of developing your own technical vision could lead to unknown pitfalls. Without a qualitative change in long context under this generation of technology, the ceiling on problem-solving may stay restricted. R1-Zero might be the right direction: starting from scratch like R1-Zero, without o1-like data for initialization, might be better. Simply following others' technical solutions isn't ideal; more exploration is needed.
Other models can achieve fairly good results through distillation. In the future model ecosystem, clear teacher and student roles may emerge; being a good student could itself be a viable business model.
In terms of distillation and technical approach, R1's impact isn't as shocking as AlphaGo's, but commercially its ability to break out is much greater than AlphaGo's.
Distillation happens in two stages. If you only distill o1 or R1 without building your own system and verifiable rewards, you grow increasingly dependent on distillation. And distillation in general domains is impossible because rewards cannot be obtained there; it's also unclear how to obtain the special CoT during distillation.
It's hard to believe a model trained on pure internet data without annealing could exhibit such behavior, because there is almost no high-quality data on the internet.
Currently, perhaps only a few top labs are exploring exactly how much annealing-stage data is needed and what the data ratios should be. Distilled or not, these are all forms of RL algorithms: SFT is behavior imitation, a form of infinite reinforcement learning, but SFT alone has a very low ceiling and damages diversity.
Primary-market startups are very excited about DeepSeek. If DeepSeek keeps iterating, companies outside the tech giants will gain great flexibility in using AI. DeepSeek has also distilled several small versions that can run on mobile phones; if this direction is validated, it will raise the ceiling for many AI applications.
For distillation, the most important thing is to be clear about the goal. OpenAI doesn't do data distillation; to surpass OpenAI, distillation definitely isn't the way.
In the future, models may need to learn to skip steps when answering, as humans do; the question is whether the upper limit of model performance can be raised within a fixed context length.
Technical Detail 4: Process Rewards
"Process supervision has human limits, result supervision determines model limits"
Process Reward isn't necessarily useless, but it is easily subject to reward hacking: the model learns nothing yet achieves high reward. For a math problem, if 1,000 generations all fail to get anywhere near the correct answer, RLVR-style methods can't train anything; a decent process reward, however, might steer the model toward the right direction, so process scores can help. It depends on how difficult the problem is and how reliable the process reward is.
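As a concrete illustration of the RLVR-style result supervision under discussion, a verifiable outcome reward for math might look like the sketch below (a simplified assumption; real verifiers normalize and symbolically compare answers):

```python
# Minimal sketch of a verifiable outcome reward: extract the final boxed
# answer and compare it to ground truth. This is why easily checkable
# domains like math and coding suit RL: the reward cannot lie, whereas a
# learned process reward can be hacked.
import re

def outcome_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0                       # no parseable final answer
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(outcome_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(outcome_reward("I think it's 42", "42"))                   # 0.0
```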
Process scores estimated by a PRM are easy to hack once they deviate from reality. Process supervision is theoretically feasible; the issues are the granularity of the process and how to assign rewards at that granularity. Currently, result supervision also relies on matching extracted answers, and no one has a mature, hack-free solution for model-based scoring; model self-iteration is the easiest to hack. Process labeling isn't hard and can be enumerated; people just haven't done it, so it may be a promising direction.
Process supervision is limited by humans, who often can't think of the solution themselves; result supervision determines the model's upper limit.
AlphaZero works because the end of a chess game can be judged win or lose, and the entire reward can be computed from win rates. But with LLMs, we don't know whether continued generation will eventually produce the answer; it's somewhat like genetic algorithms, with a possibly higher ceiling, but it may also prove impossible to hack through.
One advantage in going from AlphaGo to AlphaZero was that Go's rules are fixed. Starting with math and coding now is because they're easy to verify, and the quality of the verification method determines the final quality of RL. The rules must be sufficiently complete, otherwise the model will hack them: it may satisfy the rules yet generate unwanted results.
Why didn't other companies use DeepSeek's approach?
"Big companies' models have to keep a low profile"
The reason OpenAI and Anthropic hadn't pursued DeepSeek's direction earlier is a matter of company focus: they may have felt that investing their existing computing power elsewhere would be more valuable.
Compared with the large companies, DeepSeek may have achieved results precisely because it skipped multimodality and focused on language. Large companies' models aren't weak, but they have to keep a low profile and can't release too much. Multimodality isn't critical right now: intelligence mainly comes from language, and multimodality doesn't help improve it.
The Divergence and Betting of Technology in 2025
"Can we find other architectures besides Transformer"
Models will diverge in 2025. The most enticing vision is to continuously push the boundaries of intelligence. There may be many breakthrough paths, and methods may change, such as synthetic data and different architectures.
In 2025, the focus will first be on new architectures: can alternatives to the Transformer be found? Some explorations already reduce costs while probing the boundaries of intelligence. Second, the full potential of RL has yet to be realized. On the product side, people care about agents, but they have not yet been widely applied.
In 2025, multimodal products that can challenge the form of ChatGPT may emerge.
The low cost and high efficiency of R1 and V3 point to one direction, which doesn't conflict with the other direction of buying more hardware and increasing parameters. Domestic companies are compute-constrained and can only pursue the former.
First, DeepSeek's base model was forced into existence by compute constraints and still follows the Scaling Law. Second, from the perspective of distillation, DeepSeek distills from large to small, which favors closed-source models that keep growing larger. Third, no metric has yet emerged that argues against scale; if one appears, it could be a major blow to the Scaling Law. Moreover, everything open-source models can do, closed-source models can do too, while simultaneously cutting their costs, which also benefits closed source.
Meta is understood to be reproducing DeepSeek, but so far this hasn't particularly affected its infra or long-term roadmap. In the long run, beyond exploring boundaries, cost must also be considered: only with lower cost are there more ways to play.
Will developers migrate from closed-source models to DeepSeek?
"Not yet at the moment"
Will developers migrate from closed-source models to DeepSeek? So far there has been no large-scale migration, because the leading models remain ahead in coding instruction-following; whether this advantage will be overcome in the future is uncertain.
From a developer's perspective, Claude-3.5-Sonnet has been specially trained for tool use, which is very helpful for building agents; DeepSeek's models don't offer this capability yet, but the potential DeepSeek brings is large.
For large-model users, DeepSeek-V2 already met all their needs. R1 increased speed but hasn't brought especially significant additional value; moreover, with deep thinking enabled, questions that used to be answered correctly are now sometimes answered incorrectly.
When choosing models, users will simplify problems with engineering methods. 2025 may be the year of applications, with various industries using existing capabilities. A bottleneck may gradually emerge, because everyday use may not require such intelligent models.
So far, RL has solved problems that have standard answers; it has not achieved breakthroughs beyond AlphaZero, and is in fact simpler. Distillation likewise addresses tasks with standard answers. Once standard answers exist, RL-style training delivers very good results, which is why distillation and RL can make quick breakthroughs now.
Human demand for intelligence is vastly underestimated: for example, cancer and SpaceX's heat-shielding materials remain unsolved problems. Today's tasks are mostly automation problems, and many issues remain. The coming incremental explosion is a very optimistic prospect; intelligence cannot be stopped.
OpenAI's Stargate $500B narrative and changes in computing power demand
The emergence of DeepSeek has led people to question the latest $500B narrative from NVIDIA and OpenAI. There is no clear judgment yet on the training-resource question; OpenAI's $500B narrative is a way of giving itself a lifeline.
There are doubts about OpenAI's $500B infrastructure investment because OpenAI is a commercial company; if borrowing is involved, there may be risk.
$500B is an exaggerated number and may be executed over 4-5 years. The leading roles are SoftBank and OpenAI: the former provides funding, the latter technology. SoftBank's current funds aren't enough to support $500B, so it is using its assets as collateral, and OpenAI's own funds are not abundant either. The other participants contribute technology rather than funding, so fully realizing $500B will be challenging.
OpenAI's $500B of computing power makes sense for exploration: the cost of trial and error at the frontier is very high, in both labor and investment. Because the path is unclear, even going from o1 to R1 wasn't necessarily easy; but at least the end result is known and intermediate features can be observed, so followers can aim at the leaders' final form from day one, which gives a sense of direction. If Google or Anthropic succeed in their exploration, they may become the most cutting-edge company.
Anthropic may replace all inference with TPU or AWS Chip in the future.
Domestic companies were previously constrained by computing power, but it has now been demonstrated that the potential technical headroom is very large. Sufficiently efficient models may not need especially large GPUs; relatively customized chips can serve them, adapted to AMD and ASIC chips. From an investment perspective, NVIDIA's moat is very high, but ASICs will also have greater opportunities.
DeepSeek's achievement has little to do with computing power per se; rather, it made the U.S. realize how capable and efficient China is. DeepSeek is not NVIDIA's weakness: as long as AI keeps developing, NVIDIA can keep developing. NVIDIA's advantage is its ecosystem, accumulated over a long time; when technology develops rapidly, the ecosystem matters greatly. The real crisis comes when technology matures, like electricity: it becomes a commodity, everyone focuses on products, and many ASIC chips emerge to optimize specific scenarios.
The impact on the market
"Short-term sentiment is under pressure, long-term narrative continues"
In the short term, DeepSeek has had a big impact on the U.S. AI circle and on stock prices: pre-training demand growth is slowing, post-training and inference scaling haven't ramped fast enough, and a gap has opened in related companies' narratives, which does weigh on short-term trading.
DeepSeek relies more on FP8, while U.S. labs use FP16. DeepSeek's greatest highlight is achieving efficient use of limited computing power through engineering capability. Last Friday, DeepSeek set off an enormous reaction in North America: Zuckerberg raised expectations for Meta's capital expenditure, yet NVIDIA and TSMC both fell, with only Broadcom rising.
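To make the FP8-versus-FP16 point concrete, here is a back-of-envelope sketch using DeepSeek-V3's published 671B total parameter count; the calculation is just bytes-per-parameter arithmetic, not a claim about the full training setup.

```python
# Back-of-envelope: weight memory at FP16 vs FP8. FP8 halves weight
# storage (and raises matmul throughput on hardware that supports it),
# which is one lever behind DeepSeek's compute efficiency.
params = 671e9  # DeepSeek-V3 total parameters (MoE; ~37B active per token)
print(f"FP16 weights: {params * 2 / 1e12:.2f} TB")  # 2 bytes/param -> 1.34 TB
print(f"FP8  weights: {params * 1 / 1e12:.2f} TB")  # 1 byte/param  -> 0.67 TB
```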
In the short term, DeepSeek's impact on sentiment pressures the stock prices and valuations of compute-related and even energy companies in the secondary market, but the long-term narrative will continue.
Secondary-market practitioners worry that NVIDIA may hit an air pocket in the transition from H-series to B-series GPUs, compounded by pressure from DeepSeek; stock prices will be under pressure in the short term, but the long run may offer a better opportunity.
The short-term impact reflects sentiment about DeepSeek's low training cost, with NVIDIA's stock price the most direct example. But AI is an incremental market with great potential; in the long run, AI is just beginning, and if CUDA remains the preferred choice, there is still a lot of room for hardware growth.
Open source VS closed source
"If the capabilities are similar, it's a challenge for closed source"
DeepSeek has drawn attention largely because of the contention between the open-source and closed-source routes.
It may have caused OpenAI and others to hide their better models; the leading models currently haven't been released. But now that DeepSeek has emerged, other AI companies' better models may no longer stay hidden.
DeepSeek has made many cost optimizations. Amazon and others haven't changed course because of it and are still following their established plans; open source and closed source currently coexist and don't contradict each other. Universities and small labs are likely to prefer DeepSeek, which won't compete with cloud vendors, since clouds support both open and closed source; the ecosystem won't change. DeepSeek is not yet as mature as Anthropic on tool use, and Anthropic has spent a lot of time on AI safety; if DeepSeek wants long-term recognition in European and American markets, it will need to take this into account.
Open source sets the margin for the whole market. If open source can reach 95% of closed-source capability and closed source stays too expensive, everything can be done with open source instead. If open-source and closed-source capabilities converge, that is a big challenge for closed source.
The impact of DeepSeek's popularity
"Vision is more important than technology"
DeepSeek's popularity has made the outside world realize China's AI strength. Previously, the perception was that China's AI progress lagged the United States by two years, but DeepSeek shows the gap is actually 3-9 months, and in some respects China is even stronger.
Historically, whatever the U.S. has blockaded China on, if it can be broken through, eventually becomes highly competitive. AI may be the same; DeepSeek's breakout is proof.
DeepSeek didn't explode overnight, but R1's results are so impressive that they touched the core circles of the United States from top to bottom.
DeepSeek stands on the shoulders of giants, but frontier exploration still requires far more time and labor; R1 does not mean future training costs will fall in step.
AI explorers will certainly need more computing power; China, as a pursuer, can leverage its engineering strengths. How Chinese large-model teams achieve results with less computing power, gaining resilience and perhaps even doing better, may shape the future Sino-US AI landscape.
Today, China is still reproducing others' technical solutions: reasoning was proposed by OpenAI with o1, so what will distinguish AI labs in the future is who proposes the next "reasoning". Infinite-length reasoning may be one such vision.
The core difference between models of different AI labs lies in the next vision of the AI labs themselves, not technology.
After all, vision is more important than technology.