DeepSeek-R1’s bold bet on reinforcement learning: How it outpaced OpenAI at 3% of the cost
(Updated Monday, 1/27, 8 a.m.)

DeepSeek-R1’s release last Monday has sent shockwaves through the AI community, disrupting assumptions about what’s required to achieve cutting-edge AI performance. Matching OpenAI’s o1 at just 3%-5% of the cost, this open-source model has not only captivated developers but also challenged enterprises to rethink their AI strategies.

The model has rocketed to become the top-trending model on Hugging Face, downloaded 109,000 times as of this writing, as developers rush to try it out and to understand what it means for their AI development. Users are commenting that DeepSeek’s accompanying search feature (available on DeepSeek’s site) is now superior to competitors like OpenAI and Perplexity, and is rivaled only by Google’s Gemini Deep Research.

(Update as of Monday, 1/27, 8 a.m.: DeepSeek has also shot to the top of the iPhone app store, and caused a selloff on Wall Street this morning as investors reexamine the capital-expenditure efficiency of leading U.S. AI companies.)

The implications for enterprise AI strategies are profound: With reduced costs and open access, enterprises now have an alternative to costly proprietary models like OpenAI’s. DeepSeek’s release could democratize access to cutting-edge AI capabilities, enabling smaller organizations to compete effectively in the AI arms race.

This story focuses on exactly how DeepSeek managed this feat, and what it means for the vast number of users of AI models. For enterprises developing AI-driven solutions, DeepSeek’s breakthrough challenges assumptions of OpenAI’s dominance and offers a blueprint for cost-efficient innovation. It’s the “how” of what DeepSeek did that should be the most educational here.
DeepSeek-R1’s breakthrough #1: Moving to pure reinforcement learning

In November, DeepSeek made headlines with its announcement that it had achieved performance surpassing OpenAI’s o1, but at the time it offered only a limited R1-lite-preview model. With Monday’s full release of R1 and the accompanying technical paper, the company revealed a surprising innovation: a deliberate departure from the conventional supervised fine-tuning (SFT) process widely used in training large language models (LLMs).

SFT, a standard step in AI development, involves training models on curated datasets to teach step-by-step reasoning, often referred to as chain-of-thought (CoT). It is considered essential for improving reasoning capabilities. DeepSeek challenged this assumption by skipping SFT entirely, opting instead to rely on reinforcement learning (RL) to train the model.

This bold move forced DeepSeek-R1 to develop independent reasoning abilities, avoiding the brittleness often introduced by prescriptive datasets. While some flaws emerged, leading the team to reintroduce a limited amount of SFT during the final stages of building the model, the results confirmed the fundamental breakthrough: Reinforcement learning alone could drive substantial performance gains.

The company got much of the way using open source, a conventional and unsurprising path

First, some background on how DeepSeek got to where it did. DeepSeek, a 2023 spinoff of Chinese hedge fund High-Flyer Quant, began by developing AI models for its proprietary chatbot before releasing them for public use. Little is known about the company’s exact approach, but it quickly open-sourced its models, and it’s extremely likely that it built upon open projects produced by Meta, such as the Llama model and the ML library PyTorch.

To train its models, High-Flyer Quant secured over 10,000 Nvidia GPUs before U.S. export restrictions kicked in, and reportedly expanded to 50,000 GPUs through alternative supply routes despite trade barriers. (In truth, no one knows; these extras may have been Nvidia H800s, which are compliant with the restrictions but have reduced chip-to-chip transfer speeds.) Either way, this pales in comparison to leading AI labs like OpenAI, Google and Anthropic, which operate with more than 500,000 GPUs each. DeepSeek’s ability to achieve competitive results with limited resources highlights how ingenuity and resourcefulness can challenge the high-cost paradigm of training state-of-the-art LLMs.

Despite speculation, DeepSeek’s full budget is unknown

DeepSeek reportedly trained its base model, called V3, on a $5.58 million budget over two months, according to Nvidia engineer Jim Fan. While the company hasn’t divulged the exact training data it used (side note: critics say this means DeepSeek isn’t truly open source), modern techniques make training on web and open datasets increasingly accessible.

Estimating the total cost of training DeepSeek-R1 is challenging. While running 50,000 GPUs suggests significant expenditures (potentially hundreds of millions of dollars), precise figures remain speculative. But the cost was certainly more than the $6 million budget often quoted in the media. (Update: A good analysis just released by Ben Thompson goes into more detail on cost and the significant innovations the company made at the GPU and infrastructure levels.)

What’s clear, though, is that DeepSeek has been very innovative from the get-go. Last year, reports emerged about some of its initial innovations, around things like mixture-of-experts and multi-head latent attention. (Update: Here is a very detailed report just published about DeepSeek’s various infrastructure innovations by Jeffrey Emanuel, a former quant investor and now entrepreneur. It’s long but very good. See the “Theoretical Threat” section about three other innovations worth mentioning: (1) mixed-precision training, which allowed DeepSeek to use 8-bit floating-point numbers throughout training instead of 32-bit, dramatically reducing memory requirements per GPU and translating into needing fewer GPUs; (2) multi-token prediction during inference; and (3) advances in GPU communication efficiency through its DualPipe algorithm, resulting in higher GPU utilization.)

How DeepSeek-R1 got to the “aha moment”

The journey to DeepSeek-R1’s final iteration began with an intermediate model, DeepSeek-R1-Zero, which was trained using pure reinforcement learning. By relying solely on RL, DeepSeek incentivized this model to think independently, rewarding both correct answers and the logical processes used to arrive at them.

This approach led to an unexpected phenomenon: The model began allocating additional processing time to more complex problems, demonstrating an ability to prioritize tasks based on their difficulty. DeepSeek’s researchers described this as an “aha moment,” where the model itself identified
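To make the pure-RL idea concrete: rather than imitating curated reasoning examples, the model is scored by simple rules that reward a correct final answer and a legible reasoning trace. The sketch below is an illustration of that kind of rule-based reward, not DeepSeek’s actual code; the `<think>` tag convention and the point values are simplifying assumptions for this example.

```python
import re

def reward(response: str, reference_answer: str) -> float:
    """Toy rule-based reward combining answer accuracy and reasoning format.

    DeepSeek-R1-Zero's training rewarded correct answers and the reasoning
    process used to reach them; the exact rules and weights here are
    illustrative assumptions, not the published recipe.
    """
    score = 0.0

    # Format reward: the response should contain a visible reasoning
    # block before the final answer (hypothetical tag convention).
    if re.search(r"<think>.+?</think>", response, re.DOTALL):
        score += 0.5

    # Accuracy reward: strip the reasoning block and compare what
    # remains against a known reference answer.
    final = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    if final == reference_answer.strip():
        score += 1.0

    return score

# A response with both visible reasoning and the right answer scores highest.
print(reward("<think>2 + 2 is 4</think>4", "4"))   # 1.5
print(reward("4", "4"))                            # 1.0: correct, no reasoning
print(reward("<think>hmm</think>5", "4"))          # 0.5: reasoning, wrong answer
```

Because the reward is computed mechanically from the output, no hand-labeled chain-of-thought dataset is needed, which is what lets this approach sidestep SFT entirely.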

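The memory savings behind mixed-precision training, item (1) above, follow from simple arithmetic: an 8-bit float occupies one byte per parameter versus four bytes for a 32-bit float. The back-of-the-envelope sketch below assumes DeepSeek-V3’s published 671-billion-parameter count, an 80 GiB-per-GPU card, and counts only the weights (real training also stores gradients, optimizer state and activations, so these figures understate total memory, though the 4x ratio still holds).

```python
import math

params = 671e9            # DeepSeek-V3's reported parameter count
gib = 1024**3

fp32_bytes = params * 4   # 32-bit floats: 4 bytes per parameter
fp8_bytes = params * 1    # 8-bit floats: 1 byte per parameter

print(f"FP32 weights: {fp32_bytes / gib:,.0f} GiB")
print(f"FP8 weights:  {fp8_bytes / gib:,.0f} GiB")

# Assuming an 80 GiB card (H800-class), far fewer GPUs are needed
# just to hold the weights in FP8.
per_gpu = 80 * gib
print(f"GPUs to hold FP32 weights: {math.ceil(fp32_bytes / per_gpu)}")
print(f"GPUs to hold FP8 weights:  {math.ceil(fp8_bytes / per_gpu)}")
```

Shrinking the per-parameter footprint by 4x is exactly the “fewer GPUs” lever the report describes, and it compounds with the communication and utilization gains from DualPipe.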










