Sunday

Another simple DeepSeek R1 reproduction — This reproduction of GRPO has one distinct feature: it is exceedingly simple and quite elegant. To run it on the Mac, I only needed to make a few minor changes, such as removing the bitsandbytes quantization, which only works with CUDA. I also used the following pyproject.toml:

[project]
name = "grpo"
version = "0.1.0"
description = "DeepSeek R1 reproduction using small models"
readme = "README.md"
requires-python = ">3.11, <=3.12"
dependencies = [
    "torch",
    "accelerate",
    "transformers",
    "datasets",
    "tqdm",
    "wandb"
]

and the following command:

uv run R1ZeroTrain.py
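
The quantization change is the only code edit of note: bitsandbytes is CUDA-only, so on Apple Silicon the model has to be loaded without it. A minimal sketch of that kind of edit, assuming the model is loaded through transformers (the model name here is a placeholder, not necessarily what the repo uses):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model

# Original (CUDA-only) path: quantization_config=BitsAndBytesConfig(load_in_4bit=True, ...)
# Mac path: skip quantization entirely and load half-precision weights onto MPS (or CPU).
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)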

Out of the several DeepSeek R1 reproductions, this is my favourite. Not only is it simple and free of any dependency on external RL libraries (such as TRL or veRL), it also shows off some of GRPO's nice features. Obviously, due to its simplicity, the GRPO implementation is not complete and may need more work. But this is an educational codebase, and the author even posted a YouTube video, which I will try to find time to watch.
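
To make "nice features" concrete: GRPO needs no separate value/critic model; it samples a group of completions per prompt and uses the group-normalized reward as the advantage. A minimal sketch of that advantage computation (my own illustration, not this repo's code):

import torch

def group_relative_advantages(rewards, eps=1e-4):
    # rewards: (num_prompts, group_size) raw reward of each sampled completion
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # prompt 1: two of four samples correct
                        [0.0, 0.0, 0.0, 1.0]])  # prompt 2: one of four samples correct
print(group_relative_advantages(rewards))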


OpenAI releases Deep Research — ChatGPT Pro users who pay $200 a month get 100 Deep Research questions per month. There are no coding examples in the introduction.


DeepSeek R1 reproduction now runs on my Mac — With a slight modification to train.py to turn off flash attention 2, I got the DeepSeek R1 GRPO reproduction on small models with GSM8K running on my Mac, with the following pyproject.toml:

[project]
name = "grpo"
version = "0.1.0"
description = "DeepSeek R1 reproduction using small models"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
  "peft>=0.14.0",
  "torch>=2.6.0",
  "torchvision>=0.21.0",
  "transformers>=4.48.2",
  "trl>=0.14.0",
  "wandb>=0.19.5",
]

and the command:

uv run train.py
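
The train.py modification itself is tiny: flash attention 2 requires CUDA (and the flash-attn package), so the model has to fall back to PyTorch's SDPA attention on the Mac. A sketch of the kind of change, assuming the model is loaded with transformers (the model name is a placeholder, not necessarily the one the repo trains):

import torch
from transformers import AutoModelForCausalLM

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder small model

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else ("mps" if torch.backends.mps.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    # flash_attention_2 is CUDA-only; use the built-in SDPA kernel everywhere else
    attn_implementation="flash_attention_2" if use_cuda else "sdpa",
).to(device)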

On my late-2021 M1 Max MacBook Pro with 64 GB of memory, it runs around 8.6 times slower than an NVIDIA RTX 4090, completing each RL step in about 403 seconds versus 47 seconds on the 4090. Memory usage goes up to 58 GB.

Interestingly, on my server with 3 NVIDIA RTX A4500 GPUs (each with 20 GB of CUDA memory), each step takes around 193 seconds, about 4x slower than the 4090. Out of a total of 60 GB of CUDA memory, 23 GB is utilized.¹ At least for this training session, the M1 Max (without flash attention 2) is only roughly 2x slower than the 3 A4500s.
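
For reference, the slowdown figures above come straight from the per-step times:

step_seconds = {"RTX 4090": 47, "3x A4500": 193, "M1 Max": 403}
for name, t in step_seconds.items():
    print(f"{name}: {t / step_seconds['RTX 4090']:.1f}x the 4090 step time")
print(f"M1 Max vs 3x A4500: {step_seconds['M1 Max'] / step_seconds['3x A4500']:.1f}x slower")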


DeepSeek FAQ — I have long admired the clarity of Ben Thompson’s writing, and this article on DeepSeek is no exception. It is indeed a long read, but worth the time. I particularly enjoyed the part on DeepSeek-V2, which few others have mentioned:

Let’s work backwards: what was the V2 model, and why was it important?

The DeepSeek-V2 model introduced two important breakthroughs: DeepSeekMoE and DeepSeekMLA. The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each.

DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek’s approach made training more efficient as well.

DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you both need to load the model into memory and also load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.
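
The MoE part of the quote is easy to make concrete with a toy example. The sketch below is my own illustration, not DeepSeek's code, and far simpler than DeepSeekMoE's fine-grained and shared experts; it only shows the core idea that a router activates the top-k experts per token, so most parameters stay idle on any given input:

import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # one routing score per expert
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)       # mix only the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e           # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = ToyMoE()
print(moe(torch.randn(5, 64)).shape)            # torch.Size([5, 64])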

And it concludes with an upbeat note on competition:

China is also a big winner, in ways that I suspect will only become apparent over time. Not only does the country have access to DeepSeek, but I suspect that DeepSeek’s relative success to America’s leading AI labs will result in a further unleashing of Chinese innovation as they realize they can compete.

That leaves America, and a choice we have to make. We could, for very logical reasons, double down on defensive measures, like massively expanding the chip ban and imposing a permission-based regulatory regime on chips and semiconductor equipment that mirrors the E.U.’s approach to tech; alternatively, we could realize that we have real competition, and actually give ourselves permission to compete. Stop wringing our hands, stop campaigning for regulations — indeed, go the other way, and cut out all of the cruft in our companies that has nothing to do with winning. If we choose to compete we can still win, and, if we do, we will have a Chinese company to thank.

Footnotes

  1. This is with vLLM turned off. With it turned on, the server with 3 A4500s always ran out of CUDA memory, for reasons that are, at this point, still unknown to me.