In recent weeks there has been a lot of discussion about DeepSeek, the Chinese AI firm, and its family of Large Language Models. Keystone’s Advanced Technology Services (K.ATS) team has received many questions from clients and colleagues. Below is a short, semi-technical write-up addressing some of the key misconceptions and our thoughts. We cover DeepSeek specifically, but more generally we discuss the progress we’re seeing in this space and what to expect.
Overall, DeepSeek’s models are an impressive feat of engineering and algorithmic innovation, and a gift to the open-source community. However, in the research community, DeepSeek was a “known quantity,” and the cost and quality gains were largely expected on the curve of improvements we’ve observed for the past few years in Generative AI. We believe access to compute and data will continue to be hugely important, but we’re excited for algorithmic innovation to drive even more progress. Finally, DeepSeek and the many other frontier models released in the past two weeks illustrate competition and dynamism in foundation model development, which is beneficial for the ecosystem and consumers.
To inform our technical review and analysis, we generated a detailed memo using OpenAI’s Deep Research, a recently released agentic capability that conducts multi-step research tasks within ChatGPT. We asked Deep Research to synthesize findings on DeepSeek-V3 and -R1 and write a detailed research memo on progress in LLM development and implications for competitive dynamics — the result was an incredibly detailed 24-page memo which was generated in 15 minutes. You can read the full memo here.
On Monday, January 27, 2025, more than 70% of the 69 stocks in the SPDR® Technology Select Sector Index dropped in market value. With nearly $1 trillion of market value wiped out in a single day (and nearly $600 billion from Nvidia’s market capitalization alone — the largest single-day market-cap drop to date for any company), it goes without saying that markets were rattled.
Some have speculated this was due, in part, to a blog post titled "The Short Case for Nvidia Stock” published on Saturday, January 25, 2025, by Jeffrey Emanuel, a blogger with a background in finance and technology. His thesis was simple: US technology companies were nowhere near as smart and efficient as Wall Street was hyping them to be. In particular, he noted the following:
“… a small Chinese startup called DeepSeek released two new models that have basically world-competitive performance levels on par with the best models from OpenAI and Anthropic (blowing past the Meta Llama3 models and other smaller open source model players such as Mistral).
[…]
By being extremely close to the hardware and by layering together a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible models using GPUs in a dramatically more efficient way. By some measurements, over ~45x more efficiently than other leading-edge models. DeepSeek claims that the complete cost to train DeepSeek-V3 was just over $5mm. That is absolutely nothing by the standards of OpenAI, Anthropic, etc., which were well into the $100mm+ level for training costs for a single model as early as 2024.”
DeepSeek is a Chinese AI company founded in 2023 by Liang Wenfeng, an engineer/computer scientist with a background in quantitative finance. With reportedly fewer than 200 employees, the DeepSeek team applied their significant math and engineering expertise to AI/LLM research. DeepSeek is also the name given to its free AI-powered “chatbot,” which, for all practical purposes, looks and feels very much like ChatGPT.
There were two recent DeepSeek releases over December and January — DeepSeek-V3 (December 27, 2024) and DeepSeek-R1 (January 22, 2025), which we cover briefly below. For those interested in the technical details, we highly recommend reading the DeepSeek-V3 technical report and DeepSeek-R1 paper, reviewing the memo compiled by OpenAI Deep Research, or reviewing these other great resources by Hugging Face and Alexander Rush.
DeepSeek has quickly established itself as a company that is able to develop frontier-level foundation models, pushing boundaries in model performance and cost efficiency with impressive achievements in both engineering and algorithmic innovation. DeepSeek-R1 appears to be on par with OpenAI-o1 in certain reasoning and math-focused benchmarks, such as the American Invitational Mathematics Examination (AIME). But what truly sets DeepSeek apart is its affordability, offering inference at a fraction of the cost of other frontier models (up to 30 times cheaper), though with some access restrictions within China.
Beyond cost and performance, DeepSeek is pioneering new approaches in model training. Unlike conventional pipelines that rely on supervised fine-tuning (SFT) before reinforcement learning (RL) for alignment, DeepSeek-R1-Zero demonstrates that a model can develop emergent reasoning capabilities by skipping SFT entirely and applying RL directly to a base pre-trained model.
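The R1 paper describes doing this RL with GRPO (Group Relative Policy Optimization). As a toy illustration only, the sketch below shows the group-relative advantage step at the heart of that approach: sample several answers to one prompt, score them with a simple rule-based reward, and normalize each reward against the group. The reward values here are invented, and the real method of course involves the full policy-gradient update on a large model.

```python
# Toy sketch of GRPO-style group-relative advantages (illustrative only).
# Rewards are hypothetical rule-based scores: 1.0 if a sampled answer
# passes a correctness check, 0.0 otherwise.

def group_relative_advantages(rewards):
    """Normalize each sampled completion's reward against its group:
    A_i = (r_i - mean(r)) / std(r). Completions scoring above the group
    mean get positive advantage (reinforced); those below get negative."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against all rewards being identical
    return [(r - mean) / std for r in rewards]

# Eight sampled answers to one math prompt, scored by a rule-based checker.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

The appeal of this setup is that the reward signal needs no learned reward model or human preference data for verifiable tasks — correctness alone is enough to push the policy toward better reasoning traces.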
Another standout feature is the low training cost — quoted at $5.6 million for DeepSeek-V3 (note that R1 and R1-Zero are built on top of V3), which is significantly lower than competing models like Claude Sonnet 3.5 (which cost in the tens of millions) and GPT-4o (over $100M). However, it’s important to note the DeepSeek-V3 paper clearly states that the $5.6 million figure covers only the final training run. Moreover, because V3 replicates capabilities that frontier labs reached months earlier, comparing its training cost to theirs isn’t an apples-to-apples comparison — read on for more details.
Additionally, DeepSeek has leveraged its flagship R1 model to distill smaller, highly capable open-source models (such as Llama3 and Qwen), outperforming larger models at a fraction of the cost. Crucially, DeepSeek’s R1 paper suggests that (1) RL can be applied directly on capable base models which results in emergent reasoning capabilities, and (2) sophisticated reasoning capabilities from larger models (e.g., DeepSeek R1) can be “distilled” into smaller models effectively.
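In spirit, the distillation step is surprisingly simple: sample reasoning traces from the large teacher, keep only the ones that end in a verified answer, and use them as ordinary supervised fine-tuning data for the smaller model. A minimal sketch of that data-collection step is below; `teacher_generate` and `is_correct` are hypothetical stand-ins, not DeepSeek's actual tooling.

```python
# Minimal sketch of building a distillation dataset (illustrative only).
# The teacher's sampled reasoning traces are filtered by a correctness
# check, and survivors become SFT pairs for a smaller student model.

def build_distillation_set(prompts, teacher_generate, is_correct):
    """Return (prompt, completion) pairs where the teacher's sampled
    reasoning trace ends in a verified answer; rejected samples are dropped."""
    dataset = []
    for prompt in prompts:
        completion = teacher_generate(prompt)  # full reasoning + final answer
        if is_correct(prompt, completion):
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```

The student then never runs RL at all — it simply imitates the teacher's filtered reasoning traces, which is why distilled Llama and Qwen variants are so cheap to produce relative to training a reasoner from scratch.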
DeepSeek-V3 and -R1 are both major engineering and algorithmic achievements. However, they have also generated a flurry of media and market commentary, which we believe at times has been overstated or misconstrued. Below we argue that DeepSeek was a “known quantity,” and V3 and R1 are on an “expected” curve of improvements.
First, the DeepSeek team is not a new team that appeared out of nowhere. They are a sophisticated team with an impressive historical track record, have been building in the open for quite some time, and had been known to the research community since early-to-mid 2024. This is shown in the below graph. Each DeepSeek model in the graph shipped with a detailed technical paper, along with model weights and code (dating back to DeepSeek LLM, DeepSeek-Coder, and DeepSeekMoE, all published in January 2024).
Second, DeepSeek’s compute efficiency gains and quality improvements should not have been a “shock” to the markets — they were to be expected given the previous rate of improvement in Generative AI. Foundation model development has been characterized by significant and sustained progress in both performance and cost. For example, see the analysis below from a16z, which shows 10x year-over-year cost declines over the past three years.
Currently, DeepSeek-R1 is priced at $0.14 per million input tokens and $0.55 per million output tokens, getting us closer to “intelligence too cheap to meter.” As of a few days ago, this was 96% cheaper than OpenAI’s o1-mini, and 99% cheaper than o1. However, note that o1 was released in September 2024. On January 31, 2025, OpenAI released a new set of more capable reasoning models, o3-mini, with the full o3 model expected to be released shortly.
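To make the pricing gap concrete, the back-of-the-envelope calculation below compares a single request under the R1 prices quoted above against OpenAI's published o1 list prices at the time ($15 input / $60 output per million tokens); the token counts are an arbitrary example.

```python
# Back-of-the-envelope API cost comparison using per-million-token prices.
# o1 prices ($15 in / $60 out per 1M tokens) were OpenAI's list prices
# for o1 at the time of writing; token counts below are illustrative.

def request_cost(input_tokens, output_tokens, price_in, price_out):
    """USD cost of one request, given per-million-token rates."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# A request with a 10K-token prompt and a 2K-token response:
r1 = request_cost(10_000, 2_000, 0.14, 0.55)    # DeepSeek-R1
o1 = request_cost(10_000, 2_000, 15.00, 60.00)  # OpenAI o1
savings = 1 - r1 / o1  # fraction cheaper
```

On these numbers the R1 request costs a quarter of a cent versus 27 cents for o1 — consistent with the roughly 99% figure above.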
In summary, DeepSeek-R1 is an impressive model that replicates o1-like reasoning capabilities and is the first capable open-source reasoning model to be released. DeepSeek-V3 incorporated an assortment of extremely smart engineering and efficiency improvements in both training and inference and was the base off which DeepSeek-R1 was built. However, both models are largely still catching up to frontier models in terms of capabilities while improving significantly on costs, which is on the curve of improvements we’ve come to expect in foundation model development. In many ways, perhaps the most surprising part for the media and markets was that this improvement came from a Chinese firm rather than Silicon Valley.
On compute costs: The DeepSeek V3 paper very clearly states that the $5.6 million figure is only for the final training run. It does not include capital expenditures or total R&D/experimentation costs. Any comparison of the $5.6 million against billion-dollar budgets by US tech firms is not a fair comparison — it’s comparing one training run against total capital and operating expenditures required to build and operate a data center.
SemiAnalysis notes that total server CapEx for DeepSeek is ~$1.6 billion, with ~$944 million associated with operating costs. There is debate over the precise number and setup of the GPU cluster, and no one outside of DeepSeek will know these details precisely, but it provides a ballpark approximation that is significantly higher than the $5.6 million single training run budget and a fairer comparison.
Next, applying pure RL to base models isn’t, by itself, a novel idea. Why did it start working now? The most plausible explanation is that base models have become much more sophisticated in their underlying capabilities. An interesting analysis would be to compare reasoning-improvement curves across base models trained with different amounts of compute and token counts, to see how the gains scale. Ultimately, compute still matters hugely, as does the volume and diversity of the training data needed to build these models.
“I will say that Deep Learning has a legendary ravenous appetite for compute, like no other algorithm that has ever been developed in AI. You may not always be utilizing it fully but I would never bet against compute as the upper bound for achievable intelligence in the long run. Not just for an individual final training run, but also for the entire innovation / experimentation engine that silently underlies all the algorithmic innovations.”
In parallel to compute, algorithmic innovation has always driven progress in ML. DeepSeek’s R1-Zero is a nice example of this (even if the idea itself is not novel, the application and engineering certainly is), as it demonstrates emergent reasoning from pure RL with minimal reward signals. There are three inputs to progress in ML: data, compute, and algorithmic innovation. Much has been said about data and compute, but in some ways, it seems there is less excitement and recognition for algorithmic innovation in the “scaling laws” era of Generative AI.
Intrepid Growth Partners has a fantastic podcast on DeepSeek, in which Richard Sutton argues that future progress will come from new algorithmic ideas and innovation, probably more so than from access to energy or compute. Another way to think about this is a “shifting of the curve,” as Dario Amodei, co-founder and CEO of Anthropic, notes in his recent post “On DeepSeek and Export Controls”:
“Shifting the curve. The field is constantly coming up with ideas, large and small, that make things more effective or efficient: it could be an improvement to the architecture of the model (a tweak to the basic Transformer architecture that all of today's models use) or simply a way of running the model more efficiently on the underlying hardware. New generations of hardware also have the same effect. What this typically does is shift the curve: if the innovation is a 2x "compute multiplier" (CM), then it allows you to get 40% on a coding task for $5M instead of $10M; or 60% for $50M instead of $100M, etc.”
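Taking the quote's arithmetic literally, a compute multiplier simply divides the cost of reaching a given capability level — a minimal sketch:

```python
# Literal reading of the "compute multiplier" (CM) arithmetic in the quote:
# an innovation with multiplier CM lets you match a capability level
# at cost / CM, shifting the whole cost-capability curve downward.

def shifted_cost(original_cost_musd, compute_multiplier):
    """Cost (in $M) to match a capability level after a CM-fold efficiency gain."""
    return original_cost_musd / compute_multiplier

# The quote's example: a 2x CM turns a $10M run into a $5M run,
# and a $100M run into a $50M run, at unchanged capability.
```

The strategic implication is that such multipliers compound: several independent 2x gains stacked together can plausibly account for order-of-magnitude cost reductions without any single breakthrough.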
Wherever progress comes from, it’s clear that quality improvements and drastic reductions in cost will continue. Artificial Analysis’s fantastic 2024 highlights report shows the extent of competition in this space, even in frontier models.
Despite this being a very recent report, it is telling how outdated it feels in February 2025. It does not capture DeepSeek-V3 or DeepSeek-R1, Qwen 2.5 Max (January 28, 2025), Tülu 3 (January 30, 2025), Mistral Small 3 (January 30, 2025), or Llama 4, which was also announced recently. The result is intense competition at the foundation model level (even at the “frontier” between state-of-the-art models) and a “closing of the gap” between open-source and proprietary models, which will only continue to expand the open-source ecosystem and the use cases that can be built on top of open-weight models.
As of now, it does not appear to us that there will be a single “moat” that will differentiate one AI firm from another. As described above, while compute is likely going to be important, it isn’t an insurmountable barrier to entry, and companies will need to stay at the frontier of algorithmic innovation to compete.
Notably, as seen in the below survey by Artificial Analysis, there is significant multi-homing, where users and developers readily switch between multiple models to meet varying requirements for their use cases.
One recent example of this is OpenAI’s Deep Research, a new agentic capability released by OpenAI on February 2, 2025, which was trained using “end-to-end reinforcement learning on hard browsing and reasoning tasks across a range of domains.” The result is an immensely capable agent that can browse the web for you and compile a detailed research memo on any topic you want (which is similar to Gemini Deep Research in terms of overall capability, but our initial testing shows that OpenAI’s version is significantly better for detailed and nuanced research).
OpenAI Deep Research is a nice example of (1) applying new algorithmic techniques in novel ways (end-to-end RL for agentic tasks), (2) translating this into a clearly valuable product for end users, and (3) a product that probably required significant amounts of compute, both for training the underlying model and for “test-time compute,” as the model reasons through web pages and puts together the memo. Finally, while it remains to be seen as of writing, we would imagine (4) that competition will emerge within weeks to months, including open-source replications of similar agentic capabilities, which will further drive the need to innovate and compete.