
DeepSeek and the AI arms race

The release of the DeepSeek-r1 model by the Chinese AI lab DeepSeek, along with its availability on the App Store, marks a pivotal moment in the AI arms race between the US and China.

A lot has been written about DeepSeek already, including this DeepSeek FAQ at Stratechery and this write-up by the Anthropic CEO.

To perhaps understate the case, DeepSeek is a shock to the US AI industry, whose lasting superiority is now an open question. Previously, China appeared to be asleep and our edge appeared sizable. DeepSeek is a splash of cold water on both notions.

Assorted remarks

DeepSeek’s claim is they spent $6 million on their final pre-training run.

Without knowing what frontier labs have spent on their final pre-training runs, it’s hard to know how impressive this is. My napkin math for what it takes to look at the whole internet and build a next-token-prediction model from it put me at about $100 million in compute, so their number is potentially pretty impressive!
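If you’re curious where a number like that comes from, here is a rough sketch of the napkin math using the standard ~6·N·D estimate for training FLOPs. Every specific figure below (parameter count, token count, utilization, price per GPU-hour) is an illustrative assumption of mine, not a number reported by DeepSeek or any other lab:

```python
# Napkin math for a frontier-scale pre-training run.
# All figures are illustrative assumptions, not reported numbers.

params = 500e9                # assumed dense-equivalent parameter count
tokens = 15e12                # assumed training tokens ("the whole internet", filtered)
train_flops = 6 * params * tokens   # standard ~6*N*D estimate of training FLOPs

gpu_peak_flops = 1e15         # ~1 PFLOP/s for an H100-class GPU at low precision
utilization = 0.35            # assumed model FLOPs utilization (MFU)
gpu_hours = train_flops / (gpu_peak_flops * utilization) / 3600

dollars_per_gpu_hour = 2.0    # assumed all-in cost per GPU-hour
total_cost = gpu_hours * dollars_per_gpu_hour

print(f"{train_flops:.1e} training FLOPs")
print(f"{gpu_hours / 1e6:.0f}M GPU-hours")
print(f"~${total_cost / 1e6:.0f}M in compute")  # lands on the order of $100M
```

Shrink the parameters activated per token (as a mixture-of-experts model does), push utilization up, and pay less per GPU-hour, and it’s not hard to see how a total like this could fall by an order of magnitude.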

I should say, frontier AI labs have been cagey about releasing these figures. In the last week, Dario, the Anthropic CEO, finally came out with it:

DeepSeek does not “do for $6M what cost US AI companies billions”. I can only speak for Anthropic, but Claude 3.5 Sonnet is a mid-sized model that cost a few $10M’s to train (I won’t give an exact number).

Basically, DeepSeek did for $6 million what Anthropic spent somewhere between $10 million and $100 million to do. If you take the numbers at face value (that is, assume they aren’t propaganda or lies), that’s genuinely impressive!

There are caveats. DeepSeek’s runs came many months after Anthropic’s. That’s a huge opportunity to take advantage of a new, better baseline in a field where the state of the art advances quickly (both in hardware and in methodology), and the two are somewhat different architectures, so it’s not a true apples-to-apples comparison.

They also did a bunch of serious low-level hacking, for example dropping below CUDA to hand-written PTX to get finer-grained control over cross-chip communication than CUDA exposes, which is exactly what you’d expect from an AI lab that pivoted out of quant finance.

One reason the DeepSeek paper stands out as impressive, though, is that frontier labs have mostly stopped publishing their research. The charitable reading is that they believe releasing open models and open weights is unsafe, and that revealing their research both gives competitors an edge and makes it harder to develop AI safely. So the world has not seen, for several years, the vanguard of what cracked AI teams are doing laid out this openly.

Indeed, OpenAI has not even acknowledged that GPT-4 is a Mixture of Experts (MoE) model like DeepSeek’s, though other AI experts appear to believe it is. All communications so far describe it as a “dense” model, where the entire model must be activated to answer queries, driving up inference costs. This has led people to conclude that DeepSeek-r1 must have 1/20th the inference costs of ChatGPT, but again, that only holds if GPT-4 is dense and not MoE, which is unlikely.

(You can probably tell a similar story for Multi-head Latent Attention, mentioned in the DeepSeek paper.)
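To make the dense-versus-MoE arithmetic concrete, here is a toy comparison of compute per token. The dense parameter count is a made-up round number; the MoE figures are the headline numbers reported in the DeepSeek-V3 paper (671B total parameters, ~37B activated per token):

```python
# Toy comparison of inference compute per token: dense vs. mixture-of-experts.
# A forward pass costs roughly 2 FLOPs per *activated* parameter per token.

dense_params = 700e9        # hypothetical dense model: every parameter active on every token
moe_total_params = 671e9    # DeepSeek-V3's reported total parameters (stored, not all used per token)
moe_active_params = 37e9    # DeepSeek-V3's reported activated parameters per token

dense_flops_per_token = 2 * dense_params
moe_flops_per_token = 2 * moe_active_params

print(f"dense: {dense_flops_per_token:.1e} FLOPs/token")
print(f"MoE:   {moe_flops_per_token:.1e} FLOPs/token")
print(f"ratio: ~{dense_flops_per_token / moe_flops_per_token:.0f}x")
# ~19x -- the kind of gap behind "1/20th the inference cost" claims, which
# evaporates if the model you're comparing against is also a MoE.
```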

At minimum, nobody should deny that a Chinese AI lab can roll with its US counterparts.

What about Facebook’s Llama?

Facebook, which had the reigning most advanced open-source model, looks pretty embarrassed here. DeepSeek’s models are superior in almost every way from a capability perspective. Does this mean Facebook’s internal proprietary models are as bad as their open Llama models? Not necessarily, but why else would they be so panicked?

The perfect psyop

DeepSeek’s claims, whether they’re true or not (I’m leaning toward true), certainly make for an amazing psyop.

The fact that they were able to do this despite a US chip export regime that prohibits sending state-of-the-art, and even prior-generation, chips to China (as of 2023, when the regime was tightened further) should certainly demoralize advocates of increasing chip export restrictions, because what’s the point? Ban all the chips you want; Chinese brilliance will prevail!

Last week, OpenAI announced a $500 billion(!) commitment to building new datacenters. Certainly, after DeepSeek, everyone who invested in this must have at least thought: wait, why do we need this much investment if a Chinese lab can reproduce the state of the art in a matter of months with orders of magnitude less money?

Though compelling at first blush, neither of these arguments really stands up.

DeepSeek likely struggled to get the chips they did train on, and paid premiums due to the smuggling required. The export restrictions have already had an effect, and they will still probably need orders of magnitude more chips to get to AGI. OpenAI’s still-unreleased o3 model, which posted some of the most mind-bogglingly awesome benchmark performances to date, was consuming around $4,000 per task(!) in resources when told to think really hard. We’re nowhere near the end of needing state-of-the-art chips, and tons of them. Efforts to restrict chip exports should intensify, not be relaxed.

Also, another way to look at OpenAI’s $500 billion commitment is this: if you really believe DeepSeek reduced inference costs to 1/20th (and maybe training costs as well), then measured by last week’s cost per unit of capability, OpenAI’s $500 billion in capacity investment is now more like $10 trillion worth of capacity investment. Unless you believe the world will run out of uses for AGI rather than find even more uses as supply increases, that is an even more valuable prize to build.
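In arithmetic terms, taking the claimed 20x cost reduction at face value (a back-of-the-envelope sketch, not a model of datacenter economics):

```python
# If the same hardware now serves ~20x the tokens per dollar, the same capex
# buys ~20x the capacity as priced at last week's cost per token.
capex = 500e9           # announced datacenter commitment, USD
cost_reduction = 20     # the claimed "1/20th the inference cost", taken at face value

effective_value = capex * cost_reduction
print(f"~${effective_value / 1e12:.0f} trillion of capacity at last week's prices")  # ~$10 trillion
```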

SemiAnalysis confirms some numbers

SemiAnalysis writes in their latest report, DeepSeek Debates: Chinese Leadership On Cost, True Training Cost, Closed Model Margin Impacts:

We believe [DeepSeek] have access to around 50,000 Hopper GPUs, which is not the same as 50,000 H100, as some [notably, the ScaleAI CEO] have claimed. There are different variations of the H100 that Nvidia made in compliance to different regulations (H800, H20), with only the H20 being currently available to Chinese model providers today. Note that H800s have the same computational power as H100s, but lower network bandwidth. We believe DeepSeek has access to around 10,000 of these H800s and about 10,000 H100s. Furthermore they have orders for many more H20’s, with Nvidia having produced over 1 million of the China specific GPU in the last 9 months.

Our analysis shows that the total server CapEx for DeepSeek is almost $1.3B, with a considerable cost of $715M associated with operating such clusters.

They add that DeepSeek also hires like they mean business and pays jaw-dropping comp packages:

DeepSeek has sourced talent exclusively from China, with no regard to previous credentials, placing a heavy focus on capability and curiosity. DeepSeek regularly runs recruitment events at top universities like PKU and Zhejiang, where many of the staff graduated from. Roles are not necessarily pre-defined and hires are given flexibility, with job ads even boasting of access to 10,000s of GPUs with no usage limitations. They are extremely competitive, and allegedly offer salaries of over $1.3 million dollars USD for promising candidates, well above big Chinese tech companies. They have ~150 employees, but are growing rapidly.

The arms race intensifies

DeepSeek has caught the attention of the Premier of China (second in rank only to Chairman Xi), who met with DeepSeek’s CEO.

China has announced a 1 trillion yuan (about $140 billion USD) plan to build AI infrastructure of its own.

It’s worth noting that China is no dummy at AI. They actually lead the US in discriminative AI (surveillance technology) and are at least approximately as good as we are at autonomous systems (self-driving, robotics, drones). The impressiveness of DeepSeek’s play and China’s commitment to the massive capital investments needed show they are taking this seriously.

It’s also worth noting that China is much better at building power capacity and marshalling resources than we are in the US. They bring enormous amounts of electrical power online every year, about 500 gigawatts of new generation, while the US mostly remains flat at about 10 gigawatts added annually. OpenAI has plans to build 1-gigawatt and 10-gigawatt datacenters. For scale, the entire state of New York peaks at about 33 gigawatts in the height of summer. Even if we do become AGI-pilled, we’re probably not going to be able to add the generation necessary to compete with them; though there has been promising news recently about resurgent interest in building nuclear plants, we’re nowhere near positioned to have it pay off fast enough.

The US will lag in the electrical arms race, but we have some edge with the chip export regime. Whether we can parlay this into a durable advantage, or instead slip to parity or fall behind, remains to be seen. On a promising note, the US appears to be taking this seriously. The Anthropic CEO is certainly saying all of the things I would hope for him to say about maintaining our lead over China, and the new US administration no longer appears to be in denial about AI.
