<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Baochun Li — blog</title><description>What I’ve Been Reading</description><link>https://baochun.org/</link><language>en-us</language><item><title>Defensible Moat and OpenAI</title><link>https://baochun.org/2026-04-08/</link><guid isPermaLink="true">https://baochun.org/2026-04-08/</guid><description>Does OpenAI have a moat, and is it defensible?</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I have read, in its entirety and with interest, an article written more than a year ago in November 2024, titled “&lt;a href=&quot;https://calpaterson.com/porter.html&quot;&gt;Building LLMs is probably not going be a brilliant business&lt;/a&gt;,” by Cal Paterson. Paterson argued that, just like airlines, OpenAI may not have the moat it needs to justify its $800+ billion valuation, not to mention the kind of &lt;a href=&quot;https://finance.yahoo.com/news/warren-buffett-explains-moat-principle-164442359.html&quot;&gt;defensible moat&lt;/a&gt; that Buffett was looking for. Apparently, John Gruber of Daring Fireball &lt;a href=&quot;https://daringfireball.net/linked/2024/11/29/cal-paterson-llms-as-businesses&quot;&gt;agreed&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And Google agreed, too. In fact, in its &lt;a href=&quot;https://newsletter.semianalysis.com/p/google-we-have-no-moat-and-neither&quot;&gt;leaked internal document&lt;/a&gt;, Google claimed that open source AI would outcompete OpenAI (and Google itself). Written in May 2023, the document holds quite a bit of truth today, given how &lt;a href=&quot;https://z.ai/blog/glm-5.1&quot;&gt;Z.AI’s GLM 5.1&lt;/a&gt; performs compared to Opus 4.6 and GPT 5.4, as well as the fact that this 745B model can be &lt;a href=&quot;https://x.com/UnslothAI/status/2041552121259249850?s=20&quot;&gt;deployed locally&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>2026</category><author>Baochun Li</author></item><item><title>Autoresearch</title><link>https://baochun.org/2026-03-22/</link><guid isPermaLink="true">https://baochun.org/2026-03-22/</guid><description>Two weeks since Andrej Karpathy released Autoresearch, here are some noteworthy projects to keep an eye on.</description><pubDate>Sun, 22 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It has been two weeks since Andrej Karpathy released &lt;a href=&quot;https://github.com/karpathy/autoresearch&quot;&gt;Autoresearch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It has a simple idea: give an AI agent an environment where it knows which benchmark it should run and optimize for, and ask it to repeatedly take actions that optimize the project for this particular benchmark. Experiment runs on the benchmark decide whether each optimization is kept or discarded, and the optimizations that are kept accumulate over time.&lt;/p&gt;
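&lt;p&gt;In rough Python, the loop looks something like the sketch below. This is a hypothetical illustration of the idea rather than Karpathy’s implementation: &lt;code&gt;run_benchmark()&lt;/code&gt;, &lt;code&gt;agent_propose_change()&lt;/code&gt;, and the &lt;code&gt;bench.py&lt;/code&gt; script are placeholders for whatever benchmark and agent you plug in.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import shutil
import subprocess

def run_benchmark(workdir):
    # Placeholder: assume the project exposes a bench.py that prints a single score.
    out = subprocess.run([&apos;python&apos;, &apos;bench.py&apos;], cwd=workdir,
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def agent_propose_change(workdir, history):
    # Placeholder: ask a coding agent to edit the files in workdir,
    # given the scores of all earlier attempts.
    raise NotImplementedError

best = run_benchmark(&apos;project&apos;)
history = []
for step in range(100):
    shutil.rmtree(&apos;attempt&apos;, ignore_errors=True)
    shutil.copytree(&apos;project&apos;, &apos;attempt&apos;)     # work on a scratch copy
    agent_propose_change(&apos;attempt&apos;, history)
    score = run_benchmark(&apos;attempt&apos;)
    if score &amp;gt; best:                          # keep optimizations that help...
        best = score
        shutil.rmtree(&apos;project&apos;)
        shutil.copytree(&apos;attempt&apos;, &apos;project&apos;)
    history.append((step, score, best))        # ...and accumulate them over time
&lt;/code&gt;&lt;/pre&gt;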
&lt;p&gt;Surprisingly, such a simple idea turns out to be extremely effective. As Karpathy proclaimed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;…in any case no one could tell if that’s right or wrong as the “code” is now a self-modifying binary that has grown beyond human comprehension.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The key is to define a precise benchmark that can be used to evaluate any solutions to a problem, so that an AI agent — or multiple collaborating agents — can run this benchmark to decide whether an idea should be kept or discarded. Naturally, since this requirement is not too exacting, quite a large number of projects have spun up, including my own experiments trying the idea on the &lt;a href=&quot;https://days.sh&quot;&gt;Days&lt;/a&gt; discrete-event network simulator, improving performance by over 25%. Autoresearch doesn’t really care about what you wish to optimize, as long as some precise benchmark is defined.&lt;/p&gt;
&lt;p&gt;This requirement, however, is not really satisfied by many academic research papers. Often, it is difficult, even reading between the lines, to see what a paper is trying to optimize for. A paper can go on for 10 pages, yet there is not a single prescribed benchmark that precisely captures the problem it wishes to solve and how the paper advances the state of the art on that benchmark. In my own words, these papers are &lt;em&gt;not autoresearch-friendly&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Here are some noteworthy autoresearch projects over the past two weeks:&lt;/p&gt;
&lt;p&gt;—&lt;/p&gt;
&lt;p&gt;Shopify’s CEO, Tobi Lütke, announced that David Cortés and he implemented Autoresearch as a &lt;a href=&quot;https://pi.dev&quot;&gt;Pi&lt;/a&gt; extension, &lt;a href=&quot;https://github.com/davebcn87/pi-autoresearch&quot;&gt;pi-autoresearch&lt;/a&gt;, in about 2500 lines of TypeScript code.&lt;/p&gt;
&lt;p&gt;My own experiments in &lt;a href=&quot;https://days.sh&quot;&gt;Days&lt;/a&gt; used this extension, and it worked extremely well. Without any prompts and with only &lt;code&gt;/autoresearch&lt;/code&gt;, it would automatically dig into the codebase to find the most suitable benchmark to optimize for. Once I provided a specific benchmark in an explicit prompt, it would switch to the one I asked for. For the initial benchmark that included a routing protocol implementation, the agent got a bit too eager and coded a custom routing implementation for FatTree topologies only, which broke the routing mechanism when the topology is not a FatTree. Overall, however, autoresearch improved runtime performance by about 25% on this particular benchmark, which is quite a bit given that the codebase has already gone through many rounds of optimizations in the past.&lt;/p&gt;
&lt;p&gt;Since its inception, the &lt;code&gt;pi-autoresearch&lt;/code&gt; extension has been evolving. Two noteworthy improvements have been added in the past three days: a &lt;a href=&quot;https://github.com/davebcn87/pi-autoresearch/pull/22&quot;&gt;confidence score&lt;/a&gt;, and additional &lt;a href=&quot;https://github.com/davebcn87/pi-autoresearch/pull/26&quot;&gt;actionable side information&lt;/a&gt; recording why an optimization is discarded.&lt;/p&gt;
&lt;p&gt;—&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://nousresearch.com/&quot;&gt;Nous Research&lt;/a&gt; used its open-source &lt;a href=&quot;https://hermes-agent.nousresearch.com/&quot;&gt;Hermes&lt;/a&gt; agent to &lt;a href=&quot;https://github.com/NousResearch/autonovel&quot;&gt;write a novel using autoresearch&lt;/a&gt;. The benchmark, in the context of autoresearch, is &lt;a href=&quot;https://github.com/NousResearch/autonovel/blob/master/reader_panel.py&quot;&gt;&lt;code&gt;reader_panel.py&lt;/code&gt;&lt;/a&gt;, which uses four different personas from Claude Opus 4.6 — the editor, the genre reader, the writer, and the first reader — to review the novel. It also runs &lt;a href=&quot;https://github.com/NousResearch/autonovel/blob/master/review.py&quot;&gt;&lt;code&gt;review.py&lt;/code&gt;&lt;/a&gt;, which also uses Claude Opus 4.6 with the following dual-persona prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Read the below novel, “{title}”. Review it first as a literary critic (like a newspaper book review) and then as a professor of fiction. In the later review, give specific, actionable suggestions for any defects you find. Be fair but honest. You don’t &lt;em&gt;have&lt;/em&gt; to find defects.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I like the inclusion of &lt;em&gt;you don’t have to find defects&lt;/em&gt; in the prompt. Strictly speaking, these reviews are not really a &lt;em&gt;precise&lt;/em&gt; benchmark, as Karpathy &lt;a href=&quot;https://x.com/karpathy/status/2034770453219484078?s=20&quot;&gt;mentioned&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Not exactly verifiable but might still work quite well given some effort.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Though these reviews may be helpful for &lt;em&gt;editing&lt;/em&gt; the writeup, much as failing experiments are discarded, the loop can certainly iterate confidently towards mediocre results. Still, this is a worthy experiment towards writing &lt;em&gt;anything&lt;/em&gt;, not just fiction.&lt;/p&gt;
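&lt;p&gt;As a rough illustration, a single persona review in such a loop boils down to one model call and a score pulled out of the reply. The sketch below is hypothetical and is not the autonovel code: the persona wording, the SCORE convention, and the model name are placeholders, and it assumes the Anthropic Python SDK.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import re
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def persona_review(novel_text, persona):
    # Hypothetical persona prompt; each reviewer ends with a line such as SCORE: 7.
    prompt = (f&apos;You are {persona}. Review the novel below, give specific, &apos;
              f&apos;actionable suggestions for any defects you find, and end &apos;
              f&apos;with a final line of the form SCORE: n, from 1 to 10.\n\n&apos;
              f&apos;{novel_text}&apos;)
    reply = client.messages.create(
        model=&apos;claude-opus-4-6&apos;,   # placeholder model name
        max_tokens=2048,
        messages=[{&apos;role&apos;: &apos;user&apos;, &apos;content&apos;: prompt}],
    )
    text = reply.content[0].text
    match = re.search(r&apos;SCORE:\s*(\d+)&apos;, text)
    score = int(match.group(1)) if match else None
    return text, score
&lt;/code&gt;&lt;/pre&gt;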
&lt;p&gt;—&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://x.com/danveloper/status/2034353876753592372?s=20&quot;&gt;Autoresearching Apple’s LLM in a Flash to run Qwen 397B locally&lt;/a&gt;, by &lt;a href=&quot;https://x.com/danveloper&quot;&gt;Dan Woods&lt;/a&gt;, is a mind-boggling autoresearched advance towards running large models off SSDs on Macs. With freshly coded Objective-C, the AI agent can improve the performance of running a Qwen 3.5 397B MoE model on a MacBook Pro to around 6 tokens/second, which is extremely impressive. This also showcases the immense power of autoresearch and of AI agents in general, given the right context for them to get started working.&lt;/p&gt;
</content:encoded><category>2026</category><category>agents</category><category>workflows</category><author>Baochun Li</author></item><item><title>Converting PDFs with Apple Silicon GPU Acceleration</title><link>https://baochun.org/2026-03-21/</link><guid isPermaLink="true">https://baochun.org/2026-03-21/</guid><description>I discovered a better way of converting PDFs to Markdown, with all mathematical formulas converted to LaTeX, on Apple silicon.</description><pubDate>Sat, 21 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Typically, PDF to Markdown converters either do not do a very good job converting mathematical formulas to LaTeX, or require an NVIDIA GPU to run a Transformer model. After quite a bit of work, I have discovered a way of converting PDF files, with all mathematical formulas converted to LaTeX, using Apple silicon GPUs for acceleration.&lt;/p&gt;
&lt;p&gt;First, create a Python virtual environment and install &lt;code&gt;docling&lt;/code&gt; and &lt;code&gt;docling[vlm]&lt;/code&gt;. One way to do it is to quickly create a new file &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[project]
name = &amp;quot;pdf-convert&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
description = &amp;quot;Setting up the virtual environment for converting PDFs with Apple Silicon GPUs.&amp;quot;
requires-python = &amp;quot;&amp;gt;=3.13&amp;quot;
dependencies = [
    &amp;quot;docling&amp;quot;,
    &amp;quot;docling[vlm]&amp;quot;,
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and then run &lt;code&gt;uv sync&lt;/code&gt; and &lt;code&gt;source .venv/bin/activate&lt;/code&gt;. After setting up the environment, the launch command I used was:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docling --enrich-formula --pipeline vlm --vlm-model granite_docling file.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This runs the &lt;a href=&quot;https://www.ibm.com/granite/docs/models/docling&quot;&gt;Granite Docling model&lt;/a&gt;, with 258M parameters, on the Apple Silicon GPUs with MLX. The conversion process may take a while, but the results look excellent. I have added my setup above to &lt;a href=&quot;https://github.com/baochunli/convert-pdf&quot;&gt;a git repository&lt;/a&gt; so that I can use it more easily.&lt;/p&gt;
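&lt;p&gt;If a script is preferable to the CLI, docling also exposes a Python API. Below is a minimal sketch of the same conversion using the default pipeline; the &lt;code&gt;--pipeline vlm&lt;/code&gt; and &lt;code&gt;--enrich-formula&lt;/code&gt; flags above map onto pipeline options that this sketch deliberately leaves out.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from docling.document_converter import DocumentConverter

# Default docling pipeline; the VLM pipeline and formula enrichment are
# configured separately through pipeline options and are not enabled here.
converter = DocumentConverter()
result = converter.convert(&apos;file.pdf&apos;)

with open(&apos;file.md&apos;, &apos;w&apos;, encoding=&apos;utf-8&apos;) as f:
    f.write(result.document.export_to_markdown())
&lt;/code&gt;&lt;/pre&gt;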
</content:encoded><category>2026</category><category>workflows</category><author>Baochun Li</author></item><item><title>Successfully running Qwen 3.5 27B on my NVIDIA RTX 4090 (using 21 GB of CUDA memory)</title><link>https://baochun.org/2026-03-20/</link><guid isPermaLink="true">https://baochun.org/2026-03-20/</guid><description>Running Qwen 3.5 27B Q4KM on an RTX 4090 with llama-server and Hermes.</description><pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I have successfully started the Qwen 3.5 27B model, with 4-bit quantization, on my GPU server with an NVIDIA RTX 4090.&lt;/p&gt;
&lt;p&gt;The launch command I used was:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;./build/bin/llama-server -m Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To use the model, I am running it with the &lt;a href=&quot;https://hermes-agent.nousresearch.com/&quot;&gt;Hermes&lt;/a&gt; agent. My setup was inspired by &lt;a href=&quot;https://x.com/sudoingX/status/2035000411342659979?s=20&quot;&gt;Sudo Su’s blog&lt;/a&gt;, which showed how capable the Qwen 3.5 9B model is when paired with the Hermes agent.&lt;/p&gt;
&lt;p&gt;With the 27B model, I am using around 21 GB of CUDA memory. It is hard to imagine having access to GPT-5 level intelligence on such modest GPU hardware! I have switched my &lt;a href=&quot;https://baochun.org/2026-02-26&quot;&gt;email triage system&lt;/a&gt; to Hermes and Qwen 3.5 27B as well.&lt;/p&gt;
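&lt;p&gt;Beyond Hermes, anything that speaks the OpenAI API can use the same server, since llama-server exposes an OpenAI-compatible endpoint. Here is a minimal sketch with the openai Python package, assuming the default port 8080 and a placeholder hostname for my GPU server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from openai import OpenAI

# llama-server listens on port 8080 by default and serves /v1 endpoints;
# the API key is unused locally, but the client library requires one.
client = OpenAI(base_url=&apos;http://gpu-server:8080/v1&apos;, api_key=&apos;none&apos;)

reply = client.chat.completions.create(
    model=&apos;Qwen3.5-27B-Q4_K_M&apos;,   # llama-server serves the single loaded model
    messages=[{&apos;role&apos;: &apos;user&apos;, &apos;content&apos;: &apos;Say hello in one short sentence.&apos;}],
)
print(reply.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;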
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>After the Prompt, Who Still Learns to Program?</title><link>https://baochun.org/2026-03-14/</link><guid isPermaLink="true">https://baochun.org/2026-03-14/</guid><description>A reflection on a New York Times Magazine story about AI coding tools, software labor, and what future programmers may stop learning by hand.</description><pubDate>Sat, 14 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;During my jog today, I listened to the New York Times Magazine article &lt;a href=&quot;https://www.nytimes.com/2026/03/12/magazine/ai-coding-programming-jobs-claude-chatgpt.html?unlocked_article_code=1.SlA.DBan.wbQDi-hptjj6&quot;&gt;Coding After Coders: The End of Computer Programming as We Know It&lt;/a&gt;. Read by James Patrick Cronin, it was an engaging 38-minute story about how coding evolved, from writing assembly to prompting Claude Code. Clive Thompson interviewed over 70 software developers, some optimistic, and some worried that A.I. is atomizing the work force.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Don’t get uppity at work — we could replace you with a bot.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I feel that A.I., as we know it, will forever change not only how software is to be developed, but also how computer science talent is to be educated. Fewer students will have any software development experience beyond prompting an A.I. agent, and when something is not optimally designed or breaks, no one will be able to fix it by hand.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://nicholas.carlini.com/writing/2026/how-to-win-a-best-paper-award.html&quot;&gt;How to win a best paper award&lt;/a&gt;, written by Nicholas Carlini, is worth a quick read. Most of the ideas here were great, for example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The majority of my papers that have received best paper awards were rejected at least once before they got in. In one case, a paper of mine was rejected four times first.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And also:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Once you’ve read everything, the second step is to forget it all. The reason is simple: everything that’s already been done has already been done. If you constrain yourself to thinking only about what’s been done, you’ll never come up with something clever and new.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In a similar spirit, I recall Professor &lt;a href=&quot;https://en.wikipedia.org/wiki/Jane_Liu&quot;&gt;Jane W.S. Liu&lt;/a&gt; once said to her Ph.D. students: “Do not read more than 15 papers in your Ph.D. — you want to become a world-class researcher, not a mediocre one.”&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If I were to briefly summarize the best writing advice I’ve received, it would be to listen to how your writing sounds spoken out loud, and try to make it understandable. I used to do this by reading my papers out loud to force myself to hear every word; I still do this sometimes, but now I also use text-to-speech systems to read the words back to me. You’ll notice things you’d never have caught yourself.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This reminds me of the NSF panel summary sessions, where panelists are required to read their panel summaries out loud to the entire panel. Apparently, many issues in writing can be caught by just listening to spoken words.&lt;/p&gt;
&lt;p&gt;But some of the other ideas seem to be quite dated in the A.I. era. For example, one doesn’t really need to proofread the work — the agents will gladly take over the job. Conducting many experiments is no longer harder than conducting a few: the agents will conduct them for you automatically.&lt;/p&gt;
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>My Multi-Agent Setup using Linear</title><link>https://baochun.org/2026-03-12/</link><guid isPermaLink="true">https://baochun.org/2026-03-12/</guid><description>I started to use Linear to track the tasks and their dependencies when I implemented new features with multiple agents in Codex.</description><pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I installed OpenAI’s &lt;a href=&quot;https://github.com/openai/symphony&quot;&gt;Symphony&lt;/a&gt;, by simply directing Codex to its &lt;a href=&quot;https://github.com/openai/symphony/tree/main/elixir&quot;&gt;GitHub repository&lt;/a&gt;, as recommended by Symphony’s README:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Launch codex in your repo, give it the URL to the Symphony repo, and ask it to set things up for you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Before running Symphony, I first needed to create the issues corresponding to the task I wanted to complete. I used the following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Right now, there is no UI for a reviewer/TPC member to enter a review (or a meta review or an Area Chair review). The review form can already be configured by the chair under administration, but it does not surface to the reviewers or TPC members or Area Chairs. Each entry in the table in their list of “My Assignments” does not really have a button that they can use to view (and to review) the paper.&lt;/p&gt;
&lt;p&gt;Use your frontend design skills, design such a UI for reviewers/Area Chairs/TPC members to enter reviews/meta reviews/Area Chair reviews, based on the configured templates from the ‘review form’ configuration under Administration.&lt;/p&gt;
&lt;p&gt;Write a comprehensive plan carefully and break this into issues in project ‘reviewsdue’ in Linear. Scope each issue to one atomic task that can be completed with one agent and scoped to one reviewable PR. Include acceptance criteria in each issue. For each issue, add a follow-up ‘review’ issue that has acceptance criteria like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reviewer reads the full diff, not just the summary&lt;/li&gt;
&lt;li&gt;verifies behavior against the acceptance criteria of the issue&lt;/li&gt;
&lt;li&gt;checks tests are adequate and not just passing narrowly&lt;/li&gt;
&lt;li&gt;records concrete findings or explicitly states no findings&lt;/li&gt;
&lt;li&gt;only closes after review comments are addressed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This provides a deliberate “implementation issue” followed by a separate “review issue” workflow, which is better than trusting the implementation issue alone. Set blocking relationships where order matters; later issues should be blocked by both their upstream implementation issue and its review gate. Use red/green test-driven development (TDD) when implementing each issue.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The first two paragraphs described the task I wished Codex to work on, and the later paragraphs instructed Codex to use Linear, and to create both implementation issues and their review gates.&lt;/p&gt;
&lt;p&gt;Symphony ran correctly, but I found that it used a lot of tokens for simple features, and did not allow me to see what was going on in each of the active sessions it launched. For one issue, it couldn’t finish running the agent after over an hour — something must have been broken inside the agent.&lt;/p&gt;
&lt;p&gt;So instead of depending on Symphony to pull the issues, I entered the following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now that you have the Linear issues, pull each of these issues, respect dependency requirements, and then spawn subagents (in parallel, if needed) to resolve these issues. Once each agent is finished, commit with a detailed commit message following the PR instructions when pushing (but don’t need to PR, just commit). Respect instructions in &lt;code&gt;WORKFLOW.md&lt;/code&gt; when pulling Linear issues. When all the issues have been completed, PR the entire feature with a detailed description.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This did the trick: it worked in a faster and perhaps more token-efficient way than using Symphony; but even more importantly, there is more transparency: I can read what the agents did in their sessions.&lt;/p&gt;
&lt;p&gt;There is one major advantage of using this workflow and Linear: after each issue is completed, Codex would add results of running the agents as a &lt;em&gt;comment&lt;/em&gt; in the issue itself, providing me with a central repository to log the history of all the work completed by the agents in the project. I think this is more important than the speed of running multiple agents, since I can read these comments and get a sense of what’s going on, reducing the cognitive debt when managing the project.&lt;/p&gt;
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>Computer Engineering in the Next 10 Years</title><link>https://baochun.org/2026-03-09/</link><guid isPermaLink="true">https://baochun.org/2026-03-09/</guid><description>A short answer to a student&apos;s question about AI agents, hardware progress, and why software creativity still matters.</description><pubDate>Mon, 09 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;My student asked me a question over Telegram: “How would computer engineering evolve over the next 10 years?”&lt;/p&gt;
&lt;p&gt;Here is my answer, with thinking set to “low”:&lt;/p&gt;
&lt;p&gt;The immense power of AI agents, distributed on billions of user devices, will be a norm, not an exception. Personalizing these agents will become cheaper and more accessible than ever, making it feel like living in prehistoric times if these agents are suddenly unavailable. Hardware advances will make these agents faster than ever, in both the cloud and user devices. Agents will be used to improve themselves, and to improve all disciplines in computer engineering in general. Though these agents will still not be very creative, they can help us create new ideas with accelerated speeds. Software will still be as relevant as ever, given that its design requires a lot of creativity.&lt;/p&gt;
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>Pages, Window Chrome, and Liquid Glass</title><link>https://baochun.org/2026-03-08/</link><guid isPermaLink="true">https://baochun.org/2026-03-08/</guid><description>A short note on why Pages&apos; older, more colorful chrome still feels preferable, and why staying on macOS 14.8 is a useful guardrail.</description><pubDate>Sun, 08 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://pxlnv.com/blog/window-chrome-of-our-discontent/&quot;&gt;The Window Chrome of Our Discontent&lt;/a&gt; is a great, well-crafted piece. I have been using Pages since its inception (around 2005), and in terms of chrome, I prefer having some vibrant colours in the first and second design, rather than the plain, greyscale design in the later versions. The worst UI design is, of course, the &lt;em&gt;Liquid Glass&lt;/em&gt; design in macOS Tahoe, which I refuse to upgrade to. I am currently still on macOS 14.8. Interestingly, because the latest release of Pages requires at least macOS 15, I am not allowed to upgrade to it. I consider this a feature as it serves as a guardrail against inadvertent upgrades. Oh, and I love it when Nick Heer, the author of this article, used “A.I.” for artificial intelligence, rather than “AI”.&lt;/p&gt;
</content:encoded><category>2026</category><author>Baochun Li</author></item><item><title>Rechecking the Days Codebase with GPT 5.4 and GPT 5.4 Pro</title><link>https://baochun.org/2026-03-06/</link><guid isPermaLink="true">https://baochun.org/2026-03-06/</guid><description>I rechecked the Days codebase with GPT 5.4 xhigh and GPT 5.4 Pro, and the pair of models has found serious issues in one aspect that I asked it to focus on in the current implementation.</description><pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Here’s my past experience: it is a good idea to ask a new model, such as the just-released GPT 5.4, to review the codebase again. So I fired up Codex, and asked it to review the ring all-reduce implementation in &lt;a href=&quot;https://days.sh&quot;&gt;Days&lt;/a&gt;, my Rust-powered discrete-event network simulator. It quickly wrote the following review:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Findings&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High: ring flow IDs are reserved incorrectly during config parsing. The parser advances by the configured flow_count in src/flows/collective.rs:316 and src/flows/collective.rs:372, but runtime expansion emits 2 * n * (n - 1) flows and only then rewrites collective.flow_count in src/topos/topo.rs:676. A 4-rank ring reserves 4 IDs and later consumes 24, so any later flow or collective can collide.&lt;/li&gt;
&lt;li&gt;High: the dependency graph is not a real ring all-reduce schedule. Expansion only serializes each rank’s own outgoing steps in src/topos/topo.rs:605 and src/topos/topo.rs:643. But the chunk mapping in src/topos/topo.rs:1269 means later steps should depend on receiving a chunk from the previous rank, not on the same sender’s previous transmit finishing. The model captures local serialization, not cross-rank data availability.&lt;/li&gt;
&lt;li&gt;High: the TCP path does not actually propagate or reduce data around the ring. Each directed link gets its own fresh zero-filled AppDataSource in src/topos/topo.rs:1281 backed by src/flows/app_source.rs:275, and flows just read byte slices from that local buffer in src/topos/topo.rs:1289. No receive-side state feeds the next hop, and no reduction happens anywhere. This is not an all-reduce implementation; it is a pre-sliced traffic generator.&lt;/li&gt;
&lt;li&gt;High: handle-backed TCP ring flows silently drop any chunk smaller than the fixed 512-byte MSS. pull_from_appsource() can buffer a short chunk in src/flows/tcp_source.rs:321, but send_packet() only transmits when a full MSS fits in src/flows/tcp_source.rs:770 and src/flows/tcp_source.rs:804. stop_run() then treats that flow as complete in src/flows/source.rs:439. The bundled example at configs/collective_tcp.toml:39 uses size = 512 with 4 ranks, so each ring chunk is 128 bytes and every hop can complete without sending a packet.&lt;/li&gt;
&lt;li&gt;High: mixed broadcast + ring TCP collectives can lose the broadcast app-source actor because temporary ownership maps mix collective IDs and host IDs. Broadcast stores its source under &lt;code&gt;collective.id&lt;/code&gt; at src/topos/topo.rs:1246, ring later stores under src at src/topos/topo.rs:1307, and collective IDs start from zero in src/lib.rs:80. In the bundled config, the first broadcast and host 0 coexist in configs/collective_tcp.toml:21 and configs/collective_tcp.toml:39, so the broadcast source can be overwritten and dropped before registration.&lt;/li&gt;
&lt;li&gt;Medium: explicit ring paths are validated but ignored. Parsing accepts them in src/flows/collective.rs:201, but ring expansion hardcodes dst from sources in src/topos/topo.rs:609 and builds every ring flow with path: None in src/topos/topo.rs:619. So a user can believe they pinned ring routes while runtime silently falls back to normal routing.&lt;/li&gt;
&lt;li&gt;Medium: coverage is far too narrow for this feature. The only ring-specific test is src/topos/topo.rs:1434, and it only checks same-rank serialization. It does not exercise config parsing, ID allocation, runtime app-source wiring, chunk sizes below MSS, or explicit paths.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then summarized with the following verdict:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The main assumption behind the review is that RingAllReduce is intended to model an actual reduce-scatter + allgather collective, as the docs claim, rather
than just generate ring-shaped traffic. Under that assumption, the implementation is currently incorrect in both semantics and some concrete runtime behavior.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That’s a pretty thorough and critical review! I then asked GPT 5.4 Pro whether it could confirm that these findings were correct and whether it had anything to add. GPT 5.4 Pro confirmed that everything was correct, and added three more suggestions. Then I asked Codex to fix these issues one by one, and reviewed the results.&lt;/p&gt;
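&lt;p&gt;For reference, the cross-rank dependency the review points at is easy to see in the textbook ring all-reduce schedule. The sketch below is a hypothetical illustration, not Days code: for each global step it lists which chunk each rank sends and which upstream receive that send depends on.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def ring_allreduce_schedule(n):
    # Textbook ring all-reduce: n - 1 reduce-scatter steps followed by
    # n - 1 allgather steps. Every rank sends one chunk per step, so the
    # collective emits 2 * n * (n - 1) point-to-point flows in total.
    flows = []
    for s in range(2 * (n - 1)):
        for r in range(n):
            flows.append({
                &apos;step&apos;: s,
                &apos;src&apos;: r,
                &apos;dst&apos;: (r + 1) % n,
                &apos;chunk&apos;: (r - s) % n,
                # A send at step s is gated on the chunk received from the
                # previous rank at step s - 1, not on this sender&apos;s own
                # previous transmit finishing.
                &apos;after_recv_from&apos;: (r - 1) % n if s &amp;gt; 0 else None,
            })
    return flows

print(len(ring_allreduce_schedule(4)))   # 24 flows for 4 ranks: 2 * 4 * 3
&lt;/code&gt;&lt;/pre&gt;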
</content:encoded><category>2026</category><category>workflows</category><author>Baochun Li</author></item><item><title>Harmony and Harness Engineering</title><link>https://baochun.org/2026-03-05/</link><guid isPermaLink="true">https://baochun.org/2026-03-05/</guid><description>A few weeks ago, OpenAI posted a blog post on harness engineering. Yesterday, it also released a component of its workflow as open-source, called Symphony.</description><pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;On February 11, OpenAI posted a new blog post titled &lt;a href=&quot;https://openai.com/index/harness-engineering/&quot;&gt;Harness engineering: leveraging Codex in an agent-first world&lt;/a&gt;. I did not post it on this website at the time, because I felt that it was pretty difficult to reproduce.&lt;/p&gt;
&lt;p&gt;Yesterday, OpenAI open-sourced &lt;a href=&quot;https://github.com/openai/symphony&quot;&gt;Symphony&lt;/a&gt; — OpenAI must really love music, as it has another open-source repository called &lt;a href=&quot;https://github.com/openai/harmony&quot;&gt;Harmony&lt;/a&gt;. Symphony includes a reference implementation for managing the work that agents need to get done, implemented with &lt;a href=&quot;https://elixir-lang.org/&quot;&gt;Elixir&lt;/a&gt;, a dynamic, functional language for building scalable and maintainable applications.&lt;/p&gt;
&lt;p&gt;I should allocate a bit of time to study both the blog and Symphony.&lt;/p&gt;
</content:encoded><category>2026</category><category>workflows</category><author>Baochun Li</author></item><item><title>Donald Knuth on Claude</title><link>https://baochun.org/2026-03-04/</link><guid isPermaLink="true">https://baochun.org/2026-03-04/</guid><description>Prof. Donald Knuth, at age 88, said: “Shock! Shock! I learned yesterday that an open problem I’d been working on for several weeks had just been solved by Claude Opus 4.6 — Anthropic’s hybrid reasoning model that had been released three weeks earlier! It seems that I’ll have to revise my opinions about “generative AI” one of these days.”</description><pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In his article titled &lt;a href=&quot;https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf&quot;&gt;Claude’s Cycles&lt;/a&gt;, Prof. Donald Knuth, at age 88, has carefully documented how Claude Opus 4.6 solved a problem that he had been working on for several weeks.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Shock! Shock! I learned yesterday that an open problem I’d been working on for several weeks had just
been solved by Claude Opus 4.6 — Anthropic’s hybrid reasoning model that had been released three weeks
earlier! It seems that I’ll have to revise my opinions about “generative AI” one of these days. What a joy
it is to learn not only that my conjecture has a nice solution but also to celebrate this dramatic advance in
automatic deduction and creative problem solving.&lt;/p&gt;
&lt;/blockquote&gt;
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>Open-source PDF to Markdown with Marker</title><link>https://baochun.org/2026-03-02/</link><guid isPermaLink="true">https://baochun.org/2026-03-02/</guid><description>GPU-accelerated PDF-to-Markdown workflow with Marker that produces high-quality output quickly on an RTX 4090.</description><pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I discovered a best-in-class open-source PDF to Markdown converter: &lt;a href=&quot;https://github.com/datalab-to/marker&quot;&gt;Marker&lt;/a&gt;. On my NVIDIA RTX 4090 server, it converts a PDF in about 30 seconds to a minute, and the results are spectacular. I used &lt;code&gt;uv venv&lt;/code&gt; to create a virtual environment, activated it, and installed Marker within the environment using &lt;code&gt;pip install&lt;/code&gt;. It does require an NVIDIA GPU server, but the output quality is worth it.&lt;/p&gt;
</content:encoded><category>2026</category><category>CLI</category><author>Baochun Li</author></item><item><title>Email Triage System with Codex</title><link>https://baochun.org/2026-02-26/</link><guid isPermaLink="true">https://baochun.org/2026-02-26/</guid><description>An email triage system for Fastmail that auto-sorts messages by priority and drafts replies for high-priority emails.</description><pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I just used Codex to implement a new email triage system. It uses JMAP to access my Fastmail account via an API token, automatically triages inbound emails into high, medium, or low priority levels, and archives the medium- and low-priority emails. For high-priority emails, it will also automatically use Codex to draft responses. It runs on my Linux server every 15 minutes, and everything is configurable.&lt;/p&gt;
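&lt;p&gt;The JMAP side of it is straightforward. Below is a minimal, hypothetical sketch of the access pattern with the requests library and a Fastmail API token; the actual triage script, the priority rules, and the Codex prompts are more involved.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os
import requests

TOKEN = os.environ[&apos;FASTMAIL_API_TOKEN&apos;]
HEADERS = {&apos;Authorization&apos;: f&apos;Bearer {TOKEN}&apos;}

# The JMAP session object reveals the API endpoint and the mail account ID.
session = requests.get(&apos;https://api.fastmail.com/jmap/session&apos;, headers=HEADERS).json()
api_url = session[&apos;apiUrl&apos;]
account_id = session[&apos;primaryAccounts&apos;][&apos;urn:ietf:params:jmap:mail&apos;]

# Fetch the newest messages (subject, sender, preview) to hand to the triage prompt.
resp = requests.post(api_url, headers=HEADERS, json={
    &apos;using&apos;: [&apos;urn:ietf:params:jmap:core&apos;, &apos;urn:ietf:params:jmap:mail&apos;],
    &apos;methodCalls&apos;: [
        [&apos;Email/query&apos;, {&apos;accountId&apos;: account_id,
                         &apos;sort&apos;: [{&apos;property&apos;: &apos;receivedAt&apos;, &apos;isAscending&apos;: False}],
                         &apos;limit&apos;: 20}, &apos;q&apos;],
        [&apos;Email/get&apos;, {&apos;accountId&apos;: account_id,
                       &apos;#ids&apos;: {&apos;resultOf&apos;: &apos;q&apos;, &apos;name&apos;: &apos;Email/query&apos;, &apos;path&apos;: &apos;/ids&apos;},
                       &apos;properties&apos;: [&apos;subject&apos;, &apos;from&apos;, &apos;preview&apos;]}, &apos;g&apos;],
    ],
}).json()

for email in resp[&apos;methodResponses&apos;][1][1][&apos;list&apos;]:
    print(email[&apos;subject&apos;])
&lt;/code&gt;&lt;/pre&gt;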
&lt;hr&gt;
&lt;p&gt;I find the following prompt useful for reviewing a large codebase:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want you to sort of randomly explore the code files in this project, choosing code files to deeply investigate and understand and trace their functionality and execution flows through the related code files which they import or which they are imported by. Once you understand the purpose of the code in the larger context of the workflows, I want you to do a super careful, methodical, and critical check with “fresh eyes” to find any obvious bugs, problems, errors, issues, silly mistakes, etc. and then systematically and meticulously and intelligently correct them. Be sure to comply with ALL rules in &lt;code&gt;AGENTS.md&lt;/code&gt; and ensure that any code you write or revise conforms to the best practice guides referenced in the &lt;code&gt;AGENTS.md&lt;/code&gt; file.&lt;/p&gt;
&lt;/blockquote&gt;
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>Building a Linear Walkthrough of a Codebase</title><link>https://baochun.org/2026-02-24/</link><guid isPermaLink="true">https://baochun.org/2026-02-24/</guid><description>I tried Simon Willison&apos;s prompt to build a linear walkthrough of Nextmini. Codex unsurprisingly launched several subagents as scouts to explore different parts of the codebase.</description><pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I used the following prompt excerpted from Simon Willison’s &lt;a href=&quot;https://simonwillison.net/guides/agentic-engineering-patterns/linear-walkthroughs/&quot;&gt;excellent commentary&lt;/a&gt; to build a linear walkthrough of &lt;a href=&quot;https://nextmini.org&quot;&gt;Nextmini&lt;/a&gt;, a fairly complex codebase built in Rust. Unsurprisingly, Codex launched several subagents to scout different parts of the codebase without any additional hints on spawning subagents.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Read the source and then plan a linear walkthrough of the code that explains how it all works in detail.&lt;/p&gt;
&lt;p&gt;Then run “uvx showboat --help” to learn showboat - use showboat to create a &lt;code&gt;walkthrough.md&lt;/code&gt; file in the repo and build the walkthrough in there, using showboat note for commentary and showboat exec plus sed or grep or cat or whatever you need to include snippets of code you are talking about.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The command-line utility, &lt;code&gt;showboat&lt;/code&gt;, doesn’t need a skill, since &lt;code&gt;showboat --help&lt;/code&gt; is so comprehensive that the agent can understand how to use it just by reading the help.&lt;/p&gt;
&lt;p&gt;The entire session took 9 minutes 41 seconds to complete with GPT 5.3 codex xhigh, which is surprisingly fast to me.&lt;/p&gt;
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>Agentic Engineering Patterns</title><link>https://baochun.org/2026-02-23/</link><guid isPermaLink="true">https://baochun.org/2026-02-23/</guid><description>I have read Simon Willison&apos;s Agentic Engineering Patterns, and red/green TDD, which I had not previously heard of, seems to be so effective that I must give it a try.</description><pubDate>Mon, 23 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I have read Simon Willison’s &lt;a href=&quot;https://simonwillison.net/2026/Feb/23/agentic-engineering-patterns/&quot;&gt;Agentic Engineering Patterns&lt;/a&gt;, and started to wonder why good content like this can be read for free over the Internet. I had never previously heard of red/green test-driven development (TDD), but it feels so powerful that I must give it a try soon on one of my open-source projects.&lt;/p&gt;
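&lt;p&gt;For those who, like me, have not tried it: the pattern is to write a failing test first (red), confirm that it fails, and only then write the minimal code that makes it pass (green). A tiny, hypothetical example in Python:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Red: write the test first and run pytest; it fails because slugify() does not exist yet.
def test_slugify_lowercases_and_hyphenates():
    assert slugify(&apos;Agentic Engineering Patterns&apos;) == &apos;agentic-engineering-patterns&apos;

# Green: add the minimal implementation, re-run pytest, and the test passes.
def slugify(title):
    return &apos;-&apos;.join(title.lower().split())
&lt;/code&gt;&lt;/pre&gt;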
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>My Own Extension for the Pi Coding Agent</title><link>https://baochun.org/2026-02-17/</link><guid isPermaLink="true">https://baochun.org/2026-02-17/</guid><description>I wrote my own extension for the Pi coding agent to allow me to start multiple agents that collaborate with one another by sending and receiving messages.</description><pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today, I wrote &lt;a href=&quot;https://github.com/baochunli/pi-collaborating-agents&quot;&gt;my own extension&lt;/a&gt;, called &lt;em&gt;collaborating agents&lt;/em&gt;, for the &lt;a href=&quot;https://pi.dev/&quot;&gt;Pi coding agent&lt;/a&gt; — which the famous (or infamous) &lt;a href=&quot;https://openclaw.ai/&quot;&gt;OpenClaw&lt;/a&gt; is based upon. It allows me to easily work with multiple agents that can collaborate with one another by sending and receiving messages, and to allow an orchestrator agent to spawn multiple subagents.&lt;/p&gt;
&lt;p&gt;My idea is inspired by Jeffrey Emanuel’s &lt;a href=&quot;https://github.com/Dicklesworthstone&quot;&gt;Agentic Coding Flywheel&lt;/a&gt;, and in particular its &lt;a href=&quot;https://github.com/Dicklesworthstone/mcp_agent_mail&quot;&gt;MCP Agent Mail&lt;/a&gt;. Spawning multiple agents is definitely helpful from a context engineering point of view, but they need to be able to communicate with one another, and to reserve and release files so that conflicts can be avoided.&lt;/p&gt;
&lt;p&gt;Jeffrey Emanuel’s system can be effective, but it is way too complex for me to use. In contrast, my new Pi extension is designed for agents to be easily spawned and to talk to one another, but with only around 3400 lines of TypeScript code. I wrote the extension in less than a day of work, working with Pi itself. It may need a bit more fine-tuning to be battle-tested, but it is very usable already.&lt;/p&gt;
&lt;p&gt;To install this &lt;em&gt;collaborating agents&lt;/em&gt; extension and its included skill, install Pi first:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;bun install -g @mariozechner/pi-coding-agent
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then choose one of the installation options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To install the extension as an &lt;a href=&quot;https://npmjs.com&quot;&gt;npm&lt;/a&gt; package, run:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pi install npm:@baochunli/pi-collaborating-agents
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;To install it from the git repository, run:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pi install https://github.com/baochunli/pi-collaborating-agents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To update the extension and skill to the latest release:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pi update
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To remove the extension and skill:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pi remove npm:@baochunli/pi-collaborating-agents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;or (if installed from a git repository):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pi remove https://github.com/baochunli/pi-collaborating-agents
&lt;/code&gt;&lt;/pre&gt;
</content:encoded><category>2026</category><category>agents</category><category>CLI</category><author>Baochun Li</author></item><item><title>Token Anxiety and Cognitive Debt</title><link>https://baochun.org/2026-02-16/</link><guid isPermaLink="true">https://baochun.org/2026-02-16/</guid><description>More of us are replacing Netflix with Codex and spinning up a new agentic session before falling asleep.</description><pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I read two pieces today that somehow connect: &lt;a href=&quot;https://writing.nikunjk.com/p/token-anxiety&quot;&gt;token anxiety&lt;/a&gt; and &lt;a href=&quot;https://simonwillison.net/2026/Feb/15/cognitive-debt/&quot;&gt;cognitive debt&lt;/a&gt;. I enjoyed reading the following paragraph:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I replaced Netflix with Claude Code. I lie in bed thinking about what I can spin up before I fall asleep, what can run while I’m unconscious. Reading a novel feels indulgent now. Watching a movie without a laptop open feels wasteful. This voice in my head that says “something could be running right now” just doesn’t shut off. I’m not even building a company. I’m just addicted to building my random ideas.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As tokens become less expensive ($1 an hour for the new &lt;a href=&quot;https://www.minimax.io/news/minimax-m25&quot;&gt;MiniMax 2.5&lt;/a&gt;), I think this will become more and more addictive, and cognitive debt will become widespread — not only in code, but also in research papers, as more and more of them are written by AI.&lt;/p&gt;
</content:encoded><category>2026</category><category>agents</category><category>CLI</category><author>Baochun Li</author></item><item><title>Codex is finally able to draw reasonably well</title><link>https://baochun.org/2026-02-15/</link><guid isPermaLink="true">https://baochun.org/2026-02-15/</guid><description>I have been looking for a way to get Codex to draw figures reasonably well. I think I finally found a way.</description><pubDate>Sun, 15 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I discovered an effective tool called &lt;a href=&quot;https://github.com/yctimlin/mcp_excalidraw&quot;&gt;mcp_excalidraw&lt;/a&gt; — which combines the powers of an MCP server and a skill — to get codex to draw figures using &lt;a href=&quot;https://excalidraw.com&quot;&gt;Excalidraw&lt;/a&gt; reasonably well.&lt;/p&gt;
&lt;p&gt;The difference between this tool and other alternatives is that Codex can see the figures it draws by capturing screenshots through MCP calls. It is a bit of additional work to set up, but well worth it. I followed these steps.&lt;/p&gt;
&lt;p&gt;First, I cloned the git repo:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone git@github.com:yctimlin/mcp_excalidraw.git
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I copied the skill to my own skills folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cp -R mcp_excalidraw/skills/excalidraw-skill ~/.agents/skills
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I installed the MCP server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;codex mcp add excalidraw \
  --env EXPRESS_SERVER_URL=http://localhost:3000 \
  --env ENABLE_CANVAS_SYNC=true \
  -- node /Users/bli/Playground/mcp_excalidraw/dist/index.js
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, I started the local web server (which the git repo did not mention):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd mcp_excalidraw
npm install
npm run build
HOST=0.0.0.0 PORT=3000 npm run canvas
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In Codex, I just need to say: “Use the Excalidraw skill, draw…”. &lt;a href=&quot;https://nextmini.org/docs/design/architecture&quot;&gt;Here&lt;/a&gt; is an example figure drawn by Codex (with only minor adjustments by me).&lt;/p&gt;
&lt;p&gt;When I don’t need to draw figures (which is most of the time), I would just remove it for better efficiency:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;codex mcp remove excalidraw
&lt;/code&gt;&lt;/pre&gt;
</content:encoded><category>2026</category><category>agents</category><category>CLI</category><author>Baochun Li</author></item><item><title>Migrated Days and Nextmini websites to TanStack Start</title><link>https://baochun.org/2026-02-14/</link><guid isPermaLink="true">https://baochun.org/2026-02-14/</guid><description>It is surprisingly straightforward to migrate a website from Next.js to TanStack Start.</description><pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today, I have migrated both &lt;a href=&quot;https://days.sh&quot;&gt;Days&lt;/a&gt; and &lt;a href=&quot;https://nextmini.org&quot;&gt;Nextmini&lt;/a&gt; websites from &lt;a href=&quot;https://nextjs.org/&quot;&gt;Next.js&lt;/a&gt; to &lt;a href=&quot;https://tanstack.com/&quot;&gt;TanStack Start&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a surprise that Codex needed only a few minutes to migrate each project. Since their respective documentation websites are already using &lt;a href=&quot;https://www.fumadocs.dev/&quot;&gt;Fumadocs&lt;/a&gt; and TanStack Start, this is a natural transition and a more seamless fit. As a new web framework, &lt;a href=&quot;https://tanstack.com/&quot;&gt;TanStack Start&lt;/a&gt; feels simpler and much faster than &lt;a href=&quot;https://nextjs.org/&quot;&gt;Next.js&lt;/a&gt;, and will be my choice for new projects going forward.&lt;/p&gt;
</content:encoded><category>2026</category><category>agents</category><category>CLI</category><author>Baochun Li</author></item><item><title>Subagent-Friendly Planning Rules</title><link>https://baochun.org/2026-02-13/</link><guid isPermaLink="true">https://baochun.org/2026-02-13/</guid><description>A handy AGENTS.md addition that makes sure that codex writes better plans and uses subagents proactively.</description><pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Following &lt;a href=&quot;https://x.com/LLMJunky/status/2021422988969799879?s=20&quot;&gt;this suggestion&lt;/a&gt;, I added the following sections to &lt;code&gt;AGENTS.md&lt;/code&gt; to make sure that codex always writes subagent-friendly plans and uses subagents more proactively. They worked well.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Additional Agent Operating Rules

### Context7

- ALWAYS proactively use Context7 when I need library/API documentation, code generation, setup or configuration steps without me having to explicitly ask.
- External libraries/docs/frameworks should be guided by Context7.

### Planning

- All plans MUST include a dependency graph.
- Every task in a plan must declare `depends_on: []` using explicit task IDs such as `T1`, `T2`.

### Execution

- Complete all tasks from a plan without stopping for permission between steps. Use best judgment, keep moving.
- Only stop to ask when a step is destructive/irreversible or there is a genuine blocker.

### Subagents

- Spawn subagents automatically when:
  - Parallelizable work (e.g., install + verify, npm test + typecheck, unblocked tasks from plan)
  - Long-running or blocking tasks where a worker can run independently.
  - Isolation for risky changes or checks
  - Code review would be helpful
- If you&apos;re launching subagents for parallelization, add this robust context to your prompt:
  - **Context**: Share plan file location and info if available
  - **Dependencies**: What work/files are completed? Any dependencies?
  - **Related tasks**: Any adjacent tasks, files, or agents?
  - **Exact task**: Description, file paths/names, acceptance criteria
  - **Validation**: How to validate work if possible.
  - **Constraints**: Risks, gotchas, things to avoid
  - **Be thorough**: Provide ANY/ALL context that will aid success.
- ALWAYS wait for all subagents to complete before yielding.

### Bugs

- Add a regression test when it is appropriate for bug-related changes.
&lt;/code&gt;&lt;/pre&gt;
</content:encoded><category>2026</category><category>agents</category><category>CLI</category><author>Baochun Li</author></item><item><title>iOS Codex Workflow with Moshi and GPT 5.3 Codex Spark</title><link>https://baochun.org/2026-02-12/</link><guid isPermaLink="true">https://baochun.org/2026-02-12/</guid><description>The iOS codex workflow has been streamlined again: now with the Moshi iOS app to ssh into my computer via the Tailscale network. Also, GPT 5.3 Codex Spark is super fast.</description><pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Only after a day, I discovered a better way to use codex from my iPhone: use the &lt;a href=&quot;https://apps.apple.com/us/app/moshi-ssh-mosh-terminal/id6757859949&quot;&gt;Moshi&lt;/a&gt; app to connect into my computer via the Tailscale network. It is simpler, faster, and more secure than &lt;a href=&quot;https://github.com/gbasin/agentboard&quot;&gt;Agentboard&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://openai.com/index/introducing-gpt-5-3-codex-spark/&quot;&gt;GPT 5.3 Codex Spark&lt;/a&gt; has been released and it is super fast — much faster than the regular 5.3 codex. This will be good enough for, say, fixing the “must-fix” items quickly after a code review.&lt;/p&gt;
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>Oracle Testing, Something Big, and Something Small</title><link>https://baochun.org/2026-02-11/</link><guid isPermaLink="true">https://baochun.org/2026-02-11/</guid><description>Electric&apos;s Configurancy argues that when code is cheap, specs and oracle testing matter more than unit tests alone. And something big is happening.</description><pubDate>Wed, 11 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://electric-sql.com/blog/2026/02/02/configurancy&quot;&gt;Configurancy&lt;/a&gt;, written by Electric, emphasized what we should do when writing code is cheap.&lt;/p&gt;
&lt;p&gt;What I find interesting is the concept of &lt;em&gt;oracle testing&lt;/em&gt;, where the oracle (Postgres in their example) is the spec that the codebase needs to satisfy. The moral of the story is that we need specs and conformance suites, not just simple unit test cases.&lt;/p&gt;
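&lt;p&gt;The idea is easy to state in code: feed the same inputs to the implementation under test and to the oracle, and assert that the outputs agree. Below is a generic, hypothetical sketch; Electric’s oracle is Postgres, while here the oracle is just a trusted reference function.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import random

def oracle_sort(xs):
    # The trusted reference implementation stands in for the oracle
    # (Postgres, in Electric&apos;s example).
    return sorted(xs)

def my_sort(xs):
    # The implementation under test; imagine a hand-rolled quicksort here.
    return sorted(xs)

def test_against_oracle(trials=1000):
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 50))]
        assert my_sort(xs) == oracle_sort(xs), f&apos;diverged on {xs}&apos;

test_against_oracle()
&lt;/code&gt;&lt;/pre&gt;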
&lt;hr&gt;
&lt;h2&gt;Something Big Is Happening&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://shumer.dev/something-big-is-happening&quot;&gt;Something Big Is Happening&lt;/a&gt;, written by Matt Shumer, is fascinating and long, but a must read.&lt;/p&gt;
&lt;p&gt;The era of manual coding is over, and perhaps soon, “vibe coding” will become “vibe research” in general, where not only programming, but also research, will be produced with almost 100% assistance by AI agents. This will allow us to try out new ideas, and to find out which works and which doesn’t, with unprecedented velocity.&lt;/p&gt;
&lt;p&gt;I like one piece of advice from this article:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Spend one hour a day experimenting with AI. Not passively reading about it. Using it.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h3&gt;243-Line MicroGPT by Karpathy&lt;/h3&gt;
&lt;p&gt;In sharp contrast with his &lt;a href=&quot;https://github.com/karpathy/nanochat&quot;&gt;NanoChat&lt;/a&gt; project and its &lt;a href=&quot;https://deepwiki.com/karpathy/nanochat&quot;&gt;DeepWiki&lt;/a&gt; documentation, the 243 lines of pure, dependency-free Python, &lt;a href=&quot;https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95&quot;&gt;MicroGPT&lt;/a&gt;, reminds me of his online videos on building an autograd engine. Surely someone will post a detailed tutorial soon, explaining these 243 lines line-by-line.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On a side note, the &lt;a href=&quot;https://docs.devin.ai/work-with-devin/deepwiki-mcp&quot;&gt;DeepWiki MCP&lt;/a&gt; definitely sounds very interesting and may be better than Context7. I have added it to my codex MCP setup with the command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;codex mcp add deepwiki --url https://mcp.deepwiki.com/mcp
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
</content:encoded><category>2026</category><category>agents</category><author>Baochun Li</author></item><item><title>A Language for Agents</title><link>https://baochun.org/2026-02-10/</link><guid isPermaLink="true">https://baochun.org/2026-02-10/</guid><description>A quick iOS Codex access tip with Agentboard, plus a strong Rust-over-Python essay for agentic programming.</description><pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Quick tip on how to get the current model served by codex:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;RUST_LOG=&apos;codex_api::sse::responses=trace&apos; codex exec --sandbox read-only --model gpt-5.3-codex &apos;ping&apos; 2&amp;gt;&amp;amp;1 | grep -m1 &apos;SSE event: {&amp;quot;type&amp;quot;:&amp;quot;response.created&amp;quot;&apos; | sed &apos;s/^.*SSE event: //&apos; | jq -r &apos;.response.model&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before running this command, I need to enter a trusted directory.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Agentboard and iOS Codex Access&lt;/h3&gt;
&lt;p&gt;There are quite a few projects that are designed to help access agentic CLI tools, such as codex, from an iOS device. I have researched several of them, and the best is &lt;a href=&quot;https://github.com/gbasin/agentboard&quot;&gt;Agentboard&lt;/a&gt;. It uses Tailscale to seamlessly reach a home computer or a server from just a web browser on the phone, and to connect to any live codex sessions there. So far, &lt;a href=&quot;https://github.com/gbasin/agentboard&quot;&gt;Agentboard&lt;/a&gt; offers the best experience of connecting to a codex session from my phone, and is much better than other alternatives, such as &lt;a href=&quot;https://happy.engineering/&quot;&gt;happy.engineering&lt;/a&gt;, &lt;a href=&quot;https://github.com/antirez/tgterm&quot;&gt;tgterm&lt;/a&gt;, or directly using &lt;a href=&quot;https://openclaw.ai/&quot;&gt;Openclaw&lt;/a&gt; with Telegram.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;A Language for Agents&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://lucumr.pocoo.org/2026/2/9/a-language-for-agents/&quot;&gt;A Language for Agents&lt;/a&gt;, by Armin Ronacher, is an excellent (but long) essay about the future of programming languages in an agentic world.&lt;/p&gt;
&lt;p&gt;Though it lobbied for a new language only for agents, it also made a strong case for using Rust and TypeScript (and perhaps also Go) as “the agent’s language,” but not nearly as much for Python, which is not statically typed.&lt;/p&gt;
&lt;p&gt;In my personal opinion, we can just settle for TypeScript and Rust as the programming languages of choice going forward when starting greenfield projects, and only use Python for machine learning. To reduce the cognitive load, we shouldn’t be learning programming languages beyond TypeScript (as the first language) and Rust (as the advanced, performance-oriented alternative) for general-purpose programming, and perhaps Python for its ecosystem in machine learning.&lt;/p&gt;
</content:encoded><category>2026</category><category>workflows</category><category>CLI</category><author>Baochun Li</author></item><item><title>Redesigned Personal Website with a Minimal Writing Workflow</title><link>https://baochun.org/2026-02-08/</link><guid isPermaLink="true">https://baochun.org/2026-02-08/</guid><description>I redesigned my personal website, featuring not only a simple, minimalist design, but also a streamlined process of writing and publishing new entries via CLI tools.</description><pubDate>Sun, 08 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I redesigned my personal website on the flight from Doha back to Toronto. Last year’s design took me a day and used a content management framework called &lt;a href=&quot;https://quartz.jzhao.xyz/&quot;&gt;Quartz&lt;/a&gt;, while the new design took me only about two hours in codex, and used only &lt;a href=&quot;https://astro.build/&quot;&gt;Astro&lt;/a&gt; as a lightweight framework. I used the plan mode in codex to produce a plan first before implementation. I have to say, codex completely changed how a website is to be designed: the old days of manually coding websites are gone.&lt;/p&gt;
&lt;p&gt;The new website allows me to create a new post with a CLI tool, &lt;code&gt;post&lt;/code&gt;, for which I asked codex to create a skill for its own use. To post a new entry, all I need to do is activate the skill and tell codex what I wish to say.&lt;/p&gt;
</content:encoded><category>2026</category><category>workflows</category><category>CLI</category><author>Baochun Li</author></item><item><title>tiny-llm and Practical PyTorch Learning Prerequisites</title><link>https://baochun.org/2025-04-29/</link><guid isPermaLink="true">https://baochun.org/2025-04-29/</guid><description>tiny-llm is exactly what I wished for. It also contains links to two existing PyTorch related courses to machine learning from Carnegie Mellon University.</description><pubDate>Tue, 29 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://github.com/skyzh/tiny-llm&quot;&gt;tiny-llm&lt;/a&gt; — Exactly what I wished for. It also links to two existing PyTorch-related machine learning courses from Carnegie Mellon University, to be used as prerequisites for this course.&lt;/p&gt;
</content:encoded><category>2025</category><category>frameworks</category><category>papers</category><author>Baochun Li</author></item><item><title>Arc Browser and the Modern shadcn/ui Tooling Stack</title><link>https://baochun.org/2025-04-02/</link><guid isPermaLink="true">https://baochun.org/2025-04-02/</guid><description>Arc — My new browser of choice. I love the fact that bookmarks are organized on the side panel, rather than clustered at the top of the window.</description><pubDate>Wed, 02 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://thebrowser.company/&quot;&gt;Arc&lt;/a&gt; — My new browser of choice. I love the fact that bookmarks are organized on the side panel, rather than clustered at the top of the window. Split windows and spaces are also quite nice, and it’s cool to read about using Swift to build the UI. The “Little Arc” windows are kind-of cute, too.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://vaul.emilkowal.ski/&quot;&gt;Vaul&lt;/a&gt; — What a beautiful and simple design! Enjoyed reading the designer &lt;a href=&quot;https://emilkowal.ski/&quot;&gt;Emil Kowalski&lt;/a&gt;’s website. His other creation, &lt;a href=&quot;https://sonner.emilkowal.ski/&quot;&gt;Sonner&lt;/a&gt;, is also something I use — its &lt;a href=&quot;https://sonner.emilkowal.ski/getting-started&quot;&gt;documentation&lt;/a&gt; is a thing of beauty, including its use of the marvellous &lt;a href=&quot;https://usgraphics.com/products/berkeley-mono&quot;&gt;Berkeley Mono&lt;/a&gt; typeface.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://ui.shadcn.com/&quot;&gt;shadcn/ui&lt;/a&gt; — The best UI component distribution mechanism and library out there. It’s fully compatible with &lt;a href=&quot;https://tailwindcss.com/&quot;&gt;Tailwind CSS&lt;/a&gt; 4.1, and &lt;a href=&quot;https://tweakcn.com/&quot;&gt;tweakcn&lt;/a&gt; can be used to customize it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://ui.bazza.dev/&quot;&gt;bazza/ui&lt;/a&gt; — Best data table filters, based on shadcn/ui.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ln-dev7/circle&quot;&gt;Circle&lt;/a&gt; — A dashboard template based on shadcn/ui, that allows components to be dragged and dropped across columns, as in &lt;a href=&quot;https://www.diceui.com/docs/components/kanban&quot;&gt;Kanban&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://shadcnuikit.com/dashboard/default&quot;&gt;Dashboards in shadcn UI Kit&lt;/a&gt; — A pretty good dashboard, but not without minor issues when adapting to narrower windows.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.shadcnblocks.com/&quot;&gt;shadcnblocks&lt;/a&gt; — Hundreds of useful blocks (for a flat fee). Very useful to have such a large selection.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://supabase.com/ui&quot;&gt;Supabase UI Library&lt;/a&gt; — Distributed using the shadcn CLI.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/sadmann7/diceui&quot;&gt;Dice UI&lt;/a&gt; — Unstyled UI component library based on the latest Tailwind CSS 4 and distributed using the shadcn CLI. It includes a nice &lt;a href=&quot;https://linear.app/homepage&quot;&gt;Linear&lt;/a&gt;-like &lt;a href=&quot;https://github.com/sadmann7/shadcn-table&quot;&gt;table filter and sorting&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://silkhq.co/&quot;&gt;Silk&lt;/a&gt; — Native‑like swipeable sheets on the web, and more advanced than &lt;a href=&quot;https://vaul.emilkowal.ski/&quot;&gt;Vaul&lt;/a&gt;. 299 Euro for small businesses with fewer than 5 employees.&lt;/p&gt;
</content:encoded><category>2025</category><category>frameworks</category><category>workflows</category><category>CLI</category><author>Baochun Li</author></item><item><title>Evaluating Eleventy as a Lightweight Static Site Generator</title><link>https://baochun.org/2025-04-01/</link><guid isPermaLink="true">https://baochun.org/2025-04-01/</guid><description>Eleventy appears to be a pretty simple static website generator that is worth exploring. A competitor to Hugo.</description><pubDate>Tue, 01 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://www.11ty.dev&quot;&gt;Eleventy&lt;/a&gt; appears to be a pretty simple static website generator that is worth exploring. A competitor to Hugo.&lt;/p&gt;
</content:encoded><category>2025</category><category>frameworks</category><author>Baochun Li</author></item><item><title>How I Use LLMs: Key Notes from Andrej Karpathy</title><link>https://baochun.org/2025-03-11/</link><guid isPermaLink="true">https://baochun.org/2025-03-11/</guid><description>How I use LLMs by Andrej Karpathy — A must watch.</description><pubDate>Tue, 11 Mar 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=EWvNQjAaOHw&quot;&gt;How I use LLMs&lt;/a&gt; by Andrej Karpathy — A must watch.&lt;/p&gt;
</content:encoded><category>2025</category><category>workflows</category><category>agents</category><author>Baochun Li</author></item><item><title>Panasonic S1R II and Early Claude Code Impressions</title><link>https://baochun.org/2025-02-25/</link><guid isPermaLink="true">https://baochun.org/2025-02-25/</guid><description>Panasonic S1R II — With the Sigma 28-105 f/2.8, this would be my dream camera. It is just slightly heavier than my Panasonic S5 IIx (1.57 lb vs. 1.45 lb body only).</description><pubDate>Tue, 25 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://www.thephoblographer.com/2025/02/25/panasonic-s1r-ii-review-its-time-to-get-excited/&quot;&gt;Panasonic S1R II&lt;/a&gt; — With the Sigma 28-105 f/2.8, this would be my dream camera. It is just slightly heavier than my Panasonic S5 IIx (1.57 lb vs. 1.45 lb body only).&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://ai-claude.net/code/&quot;&gt;Claude Code&lt;/a&gt; — I joined the waitlist last night and received the invitation today.&lt;/p&gt;
&lt;p&gt;I gave it a try on one of my ongoing projects and it was pretty costly ($0.40 simply to set up a basic understanding in &lt;code&gt;CLAUDE.md&lt;/code&gt;). I also tried &lt;a href=&quot;https://www.codebuff.com/&quot;&gt;CodeBuff&lt;/a&gt;, and it does seem to fare better, but it did not magically solve the issues I experienced.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Moral of the story:&lt;/em&gt; Know what you are doing and do not try to use AI blindly. It can mess up the codebase to the point where rescuing the code takes longer than working on the project manually, without AI, from the beginning.&lt;/p&gt;
</content:encoded><category>2025</category><category>agents</category><category>CLI</category><author>Baochun Li</author></item><item><title>Ultra-Scale LLM Training Playbook and Streaming DiLoCo</title><link>https://baochun.org/2025-02-19/</link><guid isPermaLink="true">https://baochun.org/2025-02-19/</guid><description>The Ultra-Scale Playbook: Training LLMs on GPU Clusters — Amazing, and finally we have a 100-page open-source online book on how models are trained with multiple GPUs.</description><pubDate>Wed, 19 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://huggingface.co/spaces/nanotron/ultrascale-playbook&quot;&gt;The Ultra-Scale Playbook: Training LLMs on GPU Clusters&lt;/a&gt; — Amazing, and finally we have a 100-page open-source online book on how models are trained with multiple GPUs, with reproducible source code.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2501.18512v1&quot;&gt;Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch&lt;/a&gt; — Latest paper from DeepMind about efficient geographically distributed training with overlapped communication.&lt;/p&gt;
</content:encoded><category>2025</category><category>papers</category><author>Baochun Li</author></item><item><title>Crafted UI, Fumadocs, and Design System References</title><link>https://baochun.org/2025-02-18/</link><guid isPermaLink="true">https://baochun.org/2025-02-18/</guid><description>Crafted — What a great looking set of open-source, hand-crafted UI templates based on shadcn/ui!</description><pubDate>Tue, 18 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://crafted.is/&quot;&gt;Crafted&lt;/a&gt; — What a great looking set of open-source, hand-crafted UI templates based on  &lt;a href=&quot;https://ui.shadcn.com/&quot;&gt;shadcn/ui&lt;/a&gt;!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://fumadocs.vercel.app/&quot;&gt;Fumadocs&lt;/a&gt; — &lt;a href=&quot;https://better-auth.com&quot;&gt;Better Auth&lt;/a&gt;’s &lt;a href=&quot;https://docs.better-auth.com&quot;&gt;documentation&lt;/a&gt; is built with this excellent documentation framework based on Next.js.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://gwern.net/&quot;&gt;The website of Gwern Branwen&lt;/a&gt; — Beautiful &lt;a href=&quot;https://gwern.net/design&quot;&gt;design&lt;/a&gt;, with &lt;a href=&quot;https://github.com/adobe-fonts/source-serif&quot;&gt;Adobe Source Serif Pro&lt;/a&gt; as the main serif font choice. I couldn’t believe that the entire site infrastructure is &lt;a href=&quot;https://github.com/gwern/gwern.net&quot;&gt;open source&lt;/a&gt;, and constantly being updated.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://tailwindui.com/templates/syntax&quot;&gt;Syntax: Tailwind’s documentation template&lt;/a&gt; — A bit pricy, but good quality.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://remixicon.com/&quot;&gt;Remix Icons&lt;/a&gt; — Used by the &lt;a href=&quot;https://console.x.ai/&quot;&gt;Grok&lt;/a&gt; website.&lt;/p&gt;
</content:encoded><category>2025</category><category>frameworks</category><author>Baochun Li</author></item><item><title>Better Auth, Origin UI, and Open Research Data Tools</title><link>https://baochun.org/2025-02-16/</link><guid isPermaLink="true">https://baochun.org/2025-02-16/</guid><description>Better Auth — A new authentication library that is feature-complete and easy-to-use. Compared to Lucia, which advocates a copy-and-paste approach.</description><pubDate>Sun, 16 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://better-auth.com/&quot;&gt;Better Auth&lt;/a&gt; — A new authentication library that is feature-complete and easy to use. Compared to Lucia, which advocates a copy-and-paste approach, this library requires less intimate knowledge about authentication, and its plug-in system means it doesn’t sacrifice extensibility. It feels more like an automatic with manual overrides, rather than a manual transmission. My choice going forward.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://originui.com&quot;&gt;Origin UI&lt;/a&gt; — What an excellent set of UI components based on &lt;a href=&quot;https://ui.shadcn.com/&quot;&gt;shadcn/ui&lt;/a&gt;!  The number of variants for each UI category is mind-boggling.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://resend.com/blog/top-10-email-deliverability-tips&quot;&gt;Top 10 Email Deliverability Tips&lt;/a&gt; — Resend’s tips on improving the deliverability of outbound emails.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://simonwillison.net/2025/Feb/15/llm-mlx/&quot;&gt;Simon Willison’s take on MLX&lt;/a&gt; — Simon Willison (finally) added MLX as a new plugin to his LLM CLI utility, &lt;code&gt;llm&lt;/code&gt;. His experiences with MLX were very positive:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is &lt;em&gt;really&lt;/em&gt; good software. This small team at Apple appear to be almost single-handedly giving NVIDIA’s CUDA a run for their money!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://www.semanticscholar.org/&quot;&gt;Semantic Scholar&lt;/a&gt; — Unlike Google Scholar, Semantic Scholar provides an open REST API to obtain metadata about papers and their authors, forming an &lt;em&gt;academic graph&lt;/em&gt;. Pretty cool and I didn’t know about it before.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://zed.dev/blog/edit-prediction&quot;&gt;Zed supports edit prediction with its open-source Zeta model&lt;/a&gt; — The blog post that introduces Zeta is pretty comprehensive and covers a lot of ground, including their deployment on &lt;a href=&quot;https://www.baseten.co/&quot;&gt;Baseten&lt;/a&gt; to minimize latency.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.openalex.org/&quot;&gt;OpenAlex&lt;/a&gt; — A fully open catalog of the global research system. The world becomes a better place with the dedication and hard work of people behind these efforts at &lt;a href=&quot;https://x.com/Dorialexander/status/1889299316780519462&quot;&gt;OurResearch&lt;/a&gt;. It is also part of the recently released &lt;a href=&quot;https://x.com/Dorialexander/status/1889299316780519462&quot;&gt;Common Corpus 2&lt;/a&gt;, a second version of the &lt;a href=&quot;https://simonwillison.net/2024/Nov/14/releasing-the-largest-multilingual-open-pretraining-dataset/&quot;&gt;Common Corpus&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>2025</category><category>frameworks</category><author>Baochun Li</author></item><item><title>A Minimal GRPO Implementation from First Principles</title><link>https://baochun.org/2025-02-15/</link><guid isPermaLink="true">https://baochun.org/2025-02-15/</guid><description>Andriy Burkov’s minimalist implementation of GRPO from scratch — Rather than using a library such as Hugging Face’s TRL.</description><pubDate>Sat, 15 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://x.com/burkov/status/1890566690058170708&quot;&gt;Andriy Burkov’s minimalist implementation of GRPO from scratch&lt;/a&gt; — Rather than using a library such as Hugging Face’s TRL, it would always be a good idea to read a minimalist, back-to-square-one implementation of the GRPO reinforcement learning algorithm.&lt;/p&gt;
</content:encoded><category>2025</category><category>workflows</category><author>Baochun Li</author></item><item><title>Transformer Lab: MLX Fine-Tuning Workspace on Mac</title><link>https://baochun.org/2025-02-14/</link><guid isPermaLink="true">https://baochun.org/2025-02-14/</guid><description>Transformer Lab — a free, open-source LLM workspace that prepares a custom dataset and fine-tunes a model using MLX on the Mac.</description><pubDate>Fri, 14 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://transformerlab.ai/&quot;&gt;Transformer Lab&lt;/a&gt; — a free, open-source LLM workspace that prepares a custom dataset and fine-tunes a model using MLX on the Mac (or, of course, using a GPU-powered computer or the cloud). Deep Gandhi offered a &lt;a href=&quot;https://x.com/deepgandhi_07/status/1890465271934034266&quot;&gt;quick step-by-step guide&lt;/a&gt; for using MLX to fine-tune a model. It’s open source under the MIT license, and the tech stack for its UI appears to be Electron and React. This is the first UI I have found that can fine-tune models using MLX.&lt;/p&gt;
</content:encoded><category>2025</category><category>workflows</category><category>frameworks</category><author>Baochun Li</author></item><item><title>Lucia’s New Authentication Design and Practical Tradeoffs</title><link>https://baochun.org/2025-02-11/</link><guid isPermaLink="true">https://baochun.org/2025-02-11/</guid><description>Lucia — Lucia, the authentication library, has adopted the design of cutting and pasting code, just like shadcn/ui, rather than implementing a library.</description><pubDate>Tue, 11 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://lucia-auth.com/&quot;&gt;Lucia&lt;/a&gt; — Lucia, the authentication library, has adopted the design of cutting and pasting code, just like &lt;a href=&quot;https://ui.shadcn.com/&quot;&gt;shadcn/ui&lt;/a&gt;, rather than implementing a library to encapsulate the details. This should work well with authentication, and reflects the design principle of working with simpler libraries rather than all-in-ones. In this case, the new Lucia design uses &lt;a href=&quot;https://arcticjs.dev/&quot;&gt;Arctic&lt;/a&gt; and &lt;a href=&quot;https://oslojs.dev/&quot;&gt;Oslo&lt;/a&gt;, but all session and cookie management code needs to be written (cut and pasted).&lt;/p&gt;
</content:encoded><category>2025</category><category>frameworks</category><author>Baochun Li</author></item><item><title>From 0 to Production: Notes on Theo’s Modern React Tutorial</title><link>https://baochun.org/2025-02-09/</link><guid isPermaLink="true">https://baochun.org/2025-02-09/</guid><description>From 0 to Production — The Modern React Tutorial — Theo released it last year, and I always wanted to learn from this marathon tutorial.</description><pubDate>Sun, 09 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=d5x0JCZbAJs&quot;&gt;From 0 to Production — The Modern React Tutorial&lt;/a&gt; — Theo released it last year, and I always wanted to learn from this marathon tutorial. It covers all of the modern frameworks, &lt;a href=&quot;https://nextjs.org/&quot;&gt;Next.js&lt;/a&gt;, &lt;a href=&quot;https://ui.shadcn.com/&quot;&gt;shadcn/ui&lt;/a&gt;, and &lt;a href=&quot;https://www.typescriptlang.org/&quot;&gt;TypeScript&lt;/a&gt;. I will find some time to finish it.&lt;/p&gt;
</content:encoded><category>2025</category><category>workflows</category><category>frameworks</category><author>Baochun Li</author></item><item><title>Unsloth GRPO, S1-Style Scaling, and RL Learning Resources</title><link>https://baochun.org/2025-02-08/</link><guid isPermaLink="true">https://baochun.org/2025-02-08/</guid><description>Unsloth.ai’s GRPO — it seems that the Unsloth implementation of GRPO uses less GPU memory, and it supports QLoRA and LoRA.</description><pubDate>Sat, 08 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://unsloth.ai/blog/r1-reasoning&quot;&gt;Unsloth.ai’s GRPO&lt;/a&gt; — it seems that the Unsloth implementation of GRPO uses less GPU memory, and it supports QLoRA and LoRA.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://gist.github.com/awni/9d8b35ef9c983563cfaad449f867c0f1&quot;&gt;S1-style test-time scaling with MLX&lt;/a&gt; — Awni Hannun, the primary architect of MLX, posted a simple implementation of &lt;a href=&quot;https://arxiv.org/abs/2501.19393&quot;&gt;S1&lt;/a&gt;-style test-time scaling using DeepSeek R1 distilled models locally, with only 138 lines of Python code. Simplicity at its best.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://spinningup.openai.com/en/latest/&quot;&gt;Spinning Up in Deep RL&lt;/a&gt; — Excellent introduction to deep reinforcement learning, with a sufficient amount of math but skips unnecessary formalism. It comes with PyTorch implementations for the algorithms. As it stated in its introduction:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;However, while there are many resources to help people quickly ramp up on deep learning, deep reinforcement learning is more challenging to break into. To begin with, a student of deep RL needs to have some background in math, coding, and regular deep learning. Beyond that, they need both a high-level view of the field—an awareness of what topics are studied in it, why they matter, and what’s been done already—and careful instruction on how to connect algorithm theory to algorithm code.&lt;/p&gt;
&lt;p&gt;The high-level view is hard to come by because of how new the field is. There is not yet a standard deep RL textbook, so most of the knowledge is locked up in either papers or lecture series, which can take a long time to parse and digest. And learning to implement deep RL algorithms is typically painful, because either&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the paper that publishes an algorithm omits or inadvertently obscures key design details,&lt;/li&gt;
&lt;li&gt;or widely-public implementations of an algorithm are hard to read, hiding how the code lines up with the algorithm.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Connecting algorithm theory to algorithm code is what’s sorely missing in many other online books and resources, especially in reinforcement learning. Many use Jupyter notebooks, which make for a horrible way of learning from source code.&lt;/p&gt;
</content:encoded><category>2025</category><category>workflows</category><category>papers</category><author>Baochun Li</author></item><item><title>AI Peer Review with LLMs and S1 Test-Time Scaling</title><link>https://baochun.org/2025-02-06/</link><guid isPermaLink="true">https://baochun.org/2025-02-06/</guid><description>DOGE: Make AI Conferences Great Again — Zeyuan (Allen) Zhu wrote a very interesting piece on using LLMs as arbitrators in the reviewer-author discussions.</description><pubDate>Thu, 06 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://doge.allen-zhu.com/&quot;&gt;DOGE: Make AI Conferences Great Again&lt;/a&gt; — Zeyuan (Allen) Zhu wrote a very interesting piece on using LLMs as arbitrators in the reviewer-author discussions and the paper review process. Zhu is one of the co-authors of the 2021 LoRA paper, which with over 11000 citations became the &lt;em&gt;de facto&lt;/em&gt; standard in parameter-efficient fine-tuning, and widely used throughout the entire machine learning community.&lt;/p&gt;
&lt;p&gt;One surprising fact mentioned in the piece is that the 2021 LoRA paper was initially rejected by NeurIPS 2021, even after the author rebuttal. I believe this is clear evidence that the paper review system is broken, at least in the ML/AI community, which makes Zhu’s proposal of using LLMs to improve the fairness of the review process all the more interesting.&lt;/p&gt;
&lt;p&gt;P.S. It &lt;a href=&quot;https://x.com/OriolVinyalsML/status/1887594344183701814&quot;&gt;looks like&lt;/a&gt; the widely cited paper, “&lt;a href=&quot;https://arxiv.org/pdf/1503.02531&quot;&gt;Distilling the Knowledge in a Neural Network&lt;/a&gt;”, co-authored by Geoffrey Hinton and Jeff Dean, was also rejected by NeurIPS 2014, and later appeared in the NeurIPS 2014 Deep Learning Workshop. It has since received over 23000 citations.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2501.19393&quot;&gt;s1: Simple test-time scaling&lt;/a&gt; — Stanford University showed in this paper that, by fine-tuning the &lt;code&gt;Qwen2.5-32B-Instruct&lt;/code&gt; model with a curated high-quality dataset of only 1,000 samples, and by appending &lt;code&gt;wait&lt;/code&gt; at test time to force the model to think longer, a 32B model can perform as well as o1-preview. It is perhaps the simplest way to scale test-time compute over the total number of thinking tokens, yet it appears to work well.&lt;/p&gt;
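&lt;p&gt;To make the mechanism concrete, here is a minimal sketch of this kind of “budget forcing” at decode time (my own paraphrase of the idea, not the paper’s code). It assumes a &lt;code&gt;generate()&lt;/code&gt; callable that decodes until a stop string, and a chat template whose thinking section ends with a delimiter; both are placeholders.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;END_THINK = &apos;&amp;lt;/think&amp;gt;&apos;  # hypothetical delimiter; s1 uses its own template

def think_with_budget(generate, prompt, min_thinking_tokens=2000):
    thinking = &apos;&apos;
    while True:
        # Decode until the model tries to close its thinking section.
        thinking += generate(prompt + thinking, stop=END_THINK)
        # Crude word count as a stand-in for the token count.
        if len(thinking.split()) &amp;gt;= min_thinking_tokens:
            break
        # Budget not spent yet: suppress the end-of-thinking delimiter and
        # append &apos;Wait&apos;, nudging the model to keep reasoning.
        thinking += &apos; Wait&apos;
    return thinking + END_THINK
&lt;/code&gt;&lt;/pre&gt;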
&lt;p&gt;Interestingly, the paper stated:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The concurrently released r1-32B shows stronger performance than s1-32B while also only using SFT (DeepSeek-AI et al., 2025). However, it is trained on 800 × more reasoning samples. It is an open question whether one can achieve their performance with just 1,000 samples.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While DeepSeek r1-32B was indeed trained with far more reasoning samples, once the training is complete it doesn’t need to perform extra test-time compute, which is what degrades the user experience in terms of waiting time.&lt;/p&gt;
</content:encoded><category>2025</category><category>papers</category><author>Baochun Li</author></item><item><title>Karpathy’s LLM Deep Dive and MLX Rust Ecosystem Links</title><link>https://baochun.org/2025-02-05/</link><guid isPermaLink="true">https://baochun.org/2025-02-05/</guid><description>Deep Dive into LLMs like ChatGPT — Andrej Karpathy continues his top-notch hours-long education on large language models with a new episode today.</description><pubDate>Wed, 05 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://youtu.be/7xTGNNLPyMI&quot;&gt;Deep Dive into LLMs like ChatGPT&lt;/a&gt; — &lt;a href=&quot;https://www.youtube.com/@AndrejKarpathy&quot;&gt;Andrej Karpathy&lt;/a&gt; continues his top-notch hours-long education on large language models with a new episode today. I am also keeping an eye on his new venture, &lt;a href=&quot;https://eurekalabs.ai/&quot;&gt;Eureka Labs&lt;/a&gt;, which hopefully will eventually arrive with genuinely helpful educational content on all things machine learning.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/playlist?list=PLgPbN3w-ia_PeT1_c5jiLW3RJdR7853b9&quot;&gt;Deep Learning&lt;/a&gt; — a long list of 26 whiteboard lectures on deep learning, taught by &lt;a href=&quot;https://www.youtube.com/@csprof&quot;&gt;Professor Bryce&lt;/a&gt; of Davidson College.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/oxideai/mlx-rs&quot;&gt;mlx-rs&lt;/a&gt; — Rust bindings for Apple’s MLX machine learning library on Apple silicon. Two of my favourite technologies are Rust and MLX, and this one has a bit of both.&lt;/p&gt;
</content:encoded><category>2025</category><category>frameworks</category><author>Baochun Li</author></item><item><title>GRPO on Apple MLX and Minimal-R1 Scaling Insights</title><link>https://baochun.org/2025-02-03/</link><guid isPermaLink="true">https://baochun.org/2025-02-03/</guid><description>GRPO will soon be added to Apple MLX — The PR now works, using about 32 GB of memory when training Qwen2.5-0.5B.</description><pubDate>Mon, 03 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://github.com/ml-explore/mlx-examples/pull/1233&quot;&gt;GRPO will soon be added to Apple MLX&lt;/a&gt; — The PR now works, using about 32 GB of memory when training &lt;code&gt;Qwen2.5-0.5B&lt;/code&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/SeungyounShin/minimal-r1&quot;&gt;Minimal-R1&lt;/a&gt; — Another excellent reproduction of DeepSeek R1 with GRPO, using only an 8xH100 server. It addresses the &lt;a href=&quot;https://github.com/huggingface/open-r1/issues/65&quot;&gt;issue of scalability&lt;/a&gt; in Hugging Face’s Open-R1 when generating long completions. What makes it stand out is that it doesn’t depend on TRL, and has its own GRPO implementation. It dedicates one GPU to vLLM generation and one GPU to the reference model.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://x.com/Afinetheorem/status/1886206439582015870&quot;&gt;Kevin Bryan shares his view on OpenAI Deep Research&lt;/a&gt; — Kevin Bryan from the University of Toronto shares his early experiences with OpenAI’s Deep Research. He is extremely upbeat about it, even sharing &lt;a href=&quot;https://kevinbryanecon.com/o3WhatCanWeDo.pdf&quot;&gt;a paper&lt;/a&gt; that Deep Research (a.k.a. the o3 model with web search capabilities) &lt;a href=&quot;https://x.com/Afinetheorem/status/1886245511046271194&quot;&gt;wrote in 15 minutes&lt;/a&gt;, as well as &lt;a href=&quot;https://kevinbryanecon.com/o3InnovationTheory.pdf&quot;&gt;another paper&lt;/a&gt; that is more theoretical.&lt;/p&gt;
&lt;p&gt;Here are some interesting quotes from what Prof. Bryan said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Nick Pretnar asks:&lt;/em&gt; Can it simultaneously write a paper + model code, estimate/calibrate such model, discern which results are relevant to discuss then present such results in a way humans can understand?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Kevin Bryan:&lt;/em&gt; That’s beyond current capabilities. But the proof of concept is pretty clear. At this point, it’s by far most useful as a complement — you should be writing your code with Cursor plus frontier models, having AI supplement and check analysis, having AI check proof accuracy, etc.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is what I would call a &lt;em&gt;human-in-the-loop&lt;/em&gt; approach to academic research. But of course, when abused, it could flood the landscape of academic research papers with mediocre AI-generated content in the near future.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://wtfhappenedin1971.com/&quot;&gt;WTF happened in 1971?&lt;/a&gt; — 1971 is indeed a special year: it was when Elon Musk, Marc Andreessen, Ma Huateng, Liu Yunhao, and I were born.&lt;/p&gt;
</content:encoded><category>2025</category><category>workflows</category><category>frameworks</category><category>papers</category><author>Baochun Li</author></item><item><title>Simple GRPO Implementations and DeepSeek FAQ Highlights</title><link>https://baochun.org/2025-02-02/</link><guid isPermaLink="true">https://baochun.org/2025-02-02/</guid><description>Another simple DeepSeek R1 reproduction — This reproduction of GRPO has one distinct feature: it is exceedingly simple and quite elegant.</description><pubDate>Sun, 02 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://github.com/emailtovamos/DeepSeekR1Zero/&quot;&gt;Another simple DeepSeek R1 reproduction&lt;/a&gt; — This reproduction of GRPO has one distinct feature: it is exceedingly simple and quite elegant. To run it on the Mac, I only need to make a few minor changes, such as removing the &lt;code&gt;bitsandbytes&lt;/code&gt; quantization, which only works with CUDA. I also used the following &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[project]
name = &amp;quot;grpo&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
description = &amp;quot;DeepSeek R1 reproduction using small models&amp;quot;
readme = &amp;quot;README.md&amp;quot;
requires-python = &amp;quot;&amp;gt;3.11, &amp;lt;=3.12&amp;quot;
dependencies = [
    &amp;quot;torch&amp;quot;,
    &amp;quot;accelerate&amp;quot;,
    &amp;quot;transformers&amp;quot;,
    &amp;quot;datasets&amp;quot;,
    &amp;quot;tqdm&amp;quot;,
    &amp;quot;wandb&amp;quot;
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;uv run R1ZeroTrain.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Out of the several DeepSeek R1 reproductions, this is my favourite. Not only is it simple and free of dependencies on any external RL library (such as &lt;code&gt;TRL&lt;/code&gt; and &lt;code&gt;veRL&lt;/code&gt;), it also shows off some of the nice features of GRPO. Obviously, due to its simplicity, its GRPO implementation is not complete and may need more work. But this is an educational codebase, and the author even posted &lt;a href=&quot;https://www.youtube.com/watch?v=hRSzhn_lDd8&quot;&gt;a YouTube video&lt;/a&gt;, which I will try to find some time to watch.&lt;/p&gt;
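&lt;p&gt;To make that last point concrete: the defining feature of GRPO is that it scores a group of sampled completions for the same prompt and normalizes each reward against the group’s mean and standard deviation, so no separate value model (critic) is needed. A minimal sketch of that group-relative advantage computation (my own illustration, not code from this repo):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def group_relative_advantages(rewards, eps=1e-4):
    # rewards has shape (num_prompts, group_size): one scalar reward per
    # sampled completion. Each reward is normalized against its own group,
    # which takes the place of the learned critic used in PPO.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled completions scored by a rule-based reward:
print(group_relative_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]])))
&lt;/code&gt;&lt;/pre&gt;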
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://openai.com/index/introducing-deep-research/&quot;&gt;OpenAI releases Deep Research&lt;/a&gt; — ChatGPT Pro users who pay $200 a month get 100 Deep Research questions per month. No coding examples in the introduction.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/baochunli/mini-r1/blob/main/mac/train.py&quot;&gt;DeepSeek R1 reproduction now runs on my Mac&lt;/a&gt; — With a slight modification to &lt;code&gt;train.py&lt;/code&gt; to turn off flash attention 2 (sketched at the end of this entry), I got &lt;a href=&quot;https://github.com/Mohammadjafari80/GSM8K-RLVR&quot;&gt;the DeepSeek R1’s GRPO reproduction on small models with GSM8K&lt;/a&gt; running on my Mac, with the following &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[project]
name = &amp;quot;grpo&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
description = &amp;quot;DeepSeek R1 reproduction using small models&amp;quot;
readme = &amp;quot;README.md&amp;quot;
requires-python = &amp;quot;&amp;gt;=3.12&amp;quot;
dependencies = [
  &amp;quot;peft&amp;gt;=0.14.0&amp;quot;,
  &amp;quot;torch&amp;gt;=2.6.0&amp;quot;,
  &amp;quot;torchvision&amp;gt;=0.21.0&amp;quot;,
  &amp;quot;transformers&amp;gt;=4.48.2&amp;quot;,
  &amp;quot;trl&amp;gt;=0.14.0&amp;quot;,
  &amp;quot;wandb&amp;gt;=0.19.5&amp;quot;,
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and the command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;uv run train.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On my late-2021 M1 Max 64GB MacBook Pro, it runs around 8.6 times slower than the NVIDIA RTX 4090, completing each RL step in about 403 seconds rather than 47 seconds on the 4090. Memory usage peaks at 58 GB.&lt;/p&gt;
&lt;p&gt;Interestingly, on my server with 3 NVIDIA RTX A4500 GPUs (each with 20 GB of CUDA memory), each step takes around 193 seconds, about 4x slower than the 4090. Out of a total of 60 GB CUDA memory, 23 GB is utilized[^1]. At least for this training session, the M1 Max (without using flash attention 2) is only roughly 2x slower than 3 A4500s.&lt;/p&gt;
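&lt;p&gt;As for the modification itself: turning off flash attention 2 usually comes down to the &lt;code&gt;attn_implementation&lt;/code&gt; argument when the model is loaded. A minimal sketch, assuming &lt;code&gt;train.py&lt;/code&gt; loads the model through Hugging Face transformers (the model name here is only a placeholder, and the exact code in the repo may differ):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
from transformers import AutoModelForCausalLM

# Flash attention 2 is CUDA-only, so select it only when CUDA is available
# and fall back to PyTorch&apos;s built-in scaled dot-product attention on the Mac.
attn = &apos;flash_attention_2&apos; if torch.cuda.is_available() else &apos;sdpa&apos;

model = AutoModelForCausalLM.from_pretrained(
    &apos;Qwen/Qwen2.5-0.5B&apos;,  # placeholder model id, for illustration only
    torch_dtype=torch.bfloat16,
    attn_implementation=attn,
)
&lt;/code&gt;&lt;/pre&gt;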
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://stratechery.com/2025/deepseek-faq/&quot;&gt;DeepSeek FAQ&lt;/a&gt; — I have long admired the clarity of Ben Thompson’s writing, and this article on DeepSeek is no exception. It is indeed a long read, but it is worth the time. I enjoyed reading about DeepSeek V2, which very few others mentioned:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Let’s work backwards: what was the V2 model, and why was it important?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The DeepSeek-&lt;code&gt;V2&lt;/code&gt; model introduced two important breakthroughs: DeepSeekMoE and DeepSeekMLA. The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each.&lt;/p&gt;
&lt;p&gt;DeepSeekMoE, as implemented in &lt;code&gt;V2&lt;/code&gt;, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek’s approach made training more efficient as well.&lt;/p&gt;
&lt;p&gt;DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you both need to load the model into memory and also load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And it concludes with an upbeat note on competition:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;China is also a big winner, in ways that I suspect will only become apparent over time. Not only does the country have access to DeepSeek, but I suspect that DeepSeek’s relative success to America’s leading AI labs will result in a further unleashing of Chinese innovation as they realize they can compete.&lt;/p&gt;
&lt;p&gt;That leaves America, and a choice we have to make. We could, for very logical reasons, double down on defensive measures, like massively expanding the chip ban and imposing a permission-based regulatory regime on chips and semiconductor equipment that mirrors the E.U.’s approach to tech; alternatively, we could realize that we have real competition, and actually give ourself permission to compete. Stop wringing our hands, stop campaigning for regulations — indeed, go the other way, and cut out all of the cruft in our companies that has nothing to do with winning. If we choose to compete we can still win, and, if we do, we will have a Chinese company to thank.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;[^1]: This is with vLLM turned off. With it turned on, the server with 3 A4500s always ran out of CUDA memory, for reasons that are, at this point, still unknown to me.&lt;/p&gt;
</content:encoded><category>2025</category><category>workflows</category><category>CLI</category><category>frameworks</category><author>Baochun Li</author></item><item><title>Reproducing DeepSeek R1 GRPO on Consumer Hardware</title><link>https://baochun.org/2025-02-01/</link><guid isPermaLink="true">https://baochun.org/2025-02-01/</guid><description>Fourth attempt on reproducing DeepSeek R1’s GRPO on small models — The third fourth time is the charm. I can successfully run this repo, without activating vLLM.</description><pubDate>Sat, 01 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://github.com/Mohammadjafari80/GSM8K-RLVR&quot;&gt;Fourth attempt on reproducing DeepSeek R1’s GRPO on small models&lt;/a&gt; — The &lt;s&gt;third&lt;/s&gt; fourth time is the charm. I can successfully run this repo, without activating vLLM (keep &lt;code&gt;vllm=true&lt;/code&gt; uncommented in the source code), on a single NVIDIA RTX 4090 with 24 GB CUDA memory, training the &lt;code&gt;Qwen2.5-Math-1.5B&lt;/code&gt; model with the &lt;code&gt;gsm8k&lt;/code&gt; dataset.&lt;/p&gt;
&lt;p&gt;I used the following &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[project]
name = &amp;quot;grpo&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
description = &amp;quot;DeepSeek R1 reproduction using small models&amp;quot;
readme = &amp;quot;README.md&amp;quot;
requires-python = &amp;quot;&amp;gt;=3.11, &amp;lt;=3.12&amp;quot;
dependencies = [
    &amp;quot;torch&amp;quot;,
    &amp;quot;transformers&amp;quot;,
    &amp;quot;datasets&amp;quot;,
    &amp;quot;peft&amp;quot;,
    &amp;quot;wandb&amp;quot;,
    &amp;quot;vllm&amp;quot;,
    &amp;quot;trl&amp;quot;,
    &amp;quot;flash-attn&amp;quot;,
]

[tool.uv]
no-build-isolation-package = [&amp;quot;flash-attn&amp;quot;]

[tool.uv.sources]
torch = [
  { index = &amp;quot;pytorch-cu121&amp;quot;, marker = &amp;quot;sys_platform == &apos;linux&apos; or sys_platform == &apos;win32&apos;&amp;quot; },
]
torchvision = [
  { index = &amp;quot;pytorch-cu121&amp;quot;, marker = &amp;quot;sys_platform == &apos;linux&apos; or sys_platform == &apos;win32&apos;&amp;quot; },
]

[[tool.uv.index]]
name = &amp;quot;pytorch-cu121&amp;quot;
url = &amp;quot;https://download.pytorch.org/whl/cu121&amp;quot;
explicit = true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the following command to run the repo:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;uv run train.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I obtained the following result after around 6 hours and over 450 steps:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/archive/February%201%202025.png&quot; alt=&quot;Training result from February 1, 2025&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb&quot;&gt;Third attempt on reproducing DeepSeek R1’s GRPO on small models&lt;/a&gt; — Will Brown’s GRPO reproduction uses the &lt;code&gt;openai/gsm8k&lt;/code&gt; dataset with 7470 samples, rather than the Countdown Game dataset in the two previous attempts — &lt;a href=&quot;https://github.com/Jiayi-Pan/TinyZero&quot;&gt;TinyZero&lt;/a&gt; and &lt;a href=&quot;https://www.philschmid.de/mini-deepseek-r1&quot;&gt;Mini-R1&lt;/a&gt; — which is much more meaningful. It has been shown by others that even the small &lt;code&gt;Qwen2.5-0.5B&lt;/code&gt; model can be trained from 41.6% to 51% on the &lt;code&gt;gsm8k&lt;/code&gt; test set. I will try to reproduce this result some time, but for now it ran out of CUDA memory on a single NVIDIA RTX A4500 with 20 GB of CUDA memory, even when training the &lt;code&gt;Qwen2.5-0.5B&lt;/code&gt; model.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/&quot;&gt;Home server at $2000 for DeepSeek R1 at 4-bit quantization&lt;/a&gt; — $2000 home server, running the DeepSeek R1 671b model at 4-bit quantization and 3.5-4 tokens per second.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://build.nvidia.com/deepseek-ai/deepseek-r1&quot;&gt;NVIDIA hosts DeepSeek R1&lt;/a&gt; — much slower than &lt;a href=&quot;https://lambda.chat&quot;&gt;Lambda Labs&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://openai.com/index/openai-o3-mini/&quot;&gt;OpenAI o3-mini&lt;/a&gt; — On ChatGPT Plus, the rate limits are 150 messages per day for &lt;code&gt;o3-mini-medium&lt;/code&gt;, and 50 messages per week for &lt;code&gt;o3-mini-high&lt;/code&gt;. The latter is designed to be the strongest model on coding.&lt;/p&gt;
</content:encoded><category>2025</category><category>workflows</category><category>CLI</category><category>frameworks</category><author>Baochun Li</author></item><item><title>Running DeepSeek R1 on Lambda Labs and Notes on Ghostty</title><link>https://baochun.org/2025-01-31/</link><guid isPermaLink="true">https://baochun.org/2025-01-31/</guid><description>Lambda Labs hosts DeepSeek R1 — the dashboard is simple, nice to look at, free to use, and pretty fast when generating tokens. Overall, an excellent user experience.</description><pubDate>Fri, 31 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://lambda.chat/&quot;&gt;Lambda Labs hosts DeepSeek R1&lt;/a&gt; — the dashboard is simple, nice to look at, free to use, and pretty fast when generating tokens. Overall, an excellent user experience. The DeepSeek Llama 3.3 70B is also available, and it is much faster: reasoning is done in 9 seconds for my question &lt;em&gt;What are the axioms of probability theory?&lt;/em&gt;, as opposed to 69 seconds with DeepSeek R1 671B.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://ghostty.org/&quot;&gt;Ghostty&lt;/a&gt; — &lt;a href=&quot;https://ghostty.org/docs/install/release-notes/1-1-0&quot;&gt;version 1.1.0&lt;/a&gt; is available with lots of updates and bug fixes. The best terminal emulator becomes even better.&lt;/p&gt;
</content:encoded><category>2025</category><author>Baochun Li</author></item><item><title>Fine-Tuning Open LLMs in 2025 with Hugging Face and Mini-R1</title><link>https://baochun.org/2025-01-30/</link><guid isPermaLink="true">https://baochun.org/2025-01-30/</guid><description>How to fine-tune open LLMs in 2025 with Hugging Face — Philipp Schmid a Technical Lead at Hugging Face, posted this article on fine-tuning LLMs using Hugging Face.</description><pubDate>Thu, 30 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://www.philschmid.de/fine-tune-llms-in-2025&quot;&gt;How to fine-tune open LLMs in 2025 with Hugging Face&lt;/a&gt; — &lt;a href=&quot;https://www.philschmid.de/&quot;&gt;Philipp Schmid&lt;/a&gt;, a Technical Lead at Hugging Face, posted this article on fine-tuning LLMs using Hugging Face tools, without using the &lt;a href=&quot;https://unsloth.ai/&quot;&gt;Unsloth&lt;/a&gt; API. I find it comprehensive and will need to give it a try myself.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://www.philschmid.de/mini-deepseek-r1&quot;&gt;Mini-R1&lt;/a&gt; — &lt;a href=&quot;https://www.philschmid.de/&quot;&gt;Philipp Schmid&lt;/a&gt; also posted this interesting reproduction of DeepSeek R1’s RL training. Similar to &lt;a href=&quot;https://github.com/Jiayi-Pan/TinyZero&quot;&gt;TinyZero&lt;/a&gt;, it used the Countdown Game as the task, but the article is much better written.&lt;/p&gt;
&lt;p&gt;Mini-R1 used Hugging Face’s own &lt;a href=&quot;https://huggingface.co/docs/trl/index&quot;&gt;TRL&lt;/a&gt;, designed to train transformer language models with RL in the post-training phase, which Hugging Face introduced in &lt;a href=&quot;https://github.com/huggingface/smol-course/tree/main/2_preference_alignment&quot;&gt;its smol course&lt;/a&gt;. To support multi-GPU training, it used &lt;a href=&quot;https://github.com/microsoft/DeepSpeed&quot;&gt;DeepSpeed&lt;/a&gt;. In contrast, TinyZero used ByteDance’s &lt;a href=&quot;https://github.com/volcengine/verl&quot;&gt;veRL&lt;/a&gt; for both RL and distributed training, which doesn’t have either TRL or DeepSpeed in its &lt;a href=&quot;https://github.com/volcengine/verl/blob/main/pyproject.toml&quot;&gt;dependencies&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;veRL is based on &lt;a href=&quot;https://arxiv.org/abs/2409.19256v2&quot;&gt;HybridFlow&lt;/a&gt;, a University of Hong Kong/ByteDance paper published in EuroSys 2025, co-authored by Prof. &lt;a href=&quot;https://i.cs.hku.hk/~cwu/index.html&quot;&gt;Chuan Wu&lt;/a&gt; from the University of Hong Kong. I will allocate some time to study this paper in greater detail.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/marketplace/models/&quot;&gt;Microsoft added DeepSeek R1 to GitHub Models&lt;/a&gt; — I tried it with a simple question: not only was the inference speed astonishingly low, but errors also occurred before the answer was complete. It is unusable at this point.&lt;/p&gt;
</content:encoded><category>2025</category><category>workflows</category><category>frameworks</category><author>Baochun Li</author></item><item><title>DeepSeek, Export Controls, and Open-Weight AI Debates</title><link>https://baochun.org/2025-01-29/</link><guid isPermaLink="true">https://baochun.org/2025-01-29/</guid><description>On DeepSeek and Export Controls — Dario Amodei, Anthropic&apos;s CEO, wrote a fairly long editorial on DeepSeek.</description><pubDate>Wed, 29 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://darioamodei.com/on-deepseek-and-export-controls&quot;&gt;On DeepSeek and Export Controls&lt;/a&gt; — Dario Amodei, Anthropic’s CEO, wrote a fairly long editorial on DeepSeek. However, it doesn’t mention at all the fact that DeepSeek’s models are open-weight models under a permissive MIT license, while Anthropic’s and OpenAI’s models remain closed-weight, with no transparency on the technologies used for either training or inference. At one point, Amodei mentioned that both DeepSeek and OpenAI o1 used RL, and used this to imply that DeepSeek’s use of RL to train R1-Zero is not so innovative. But we don’t know &lt;em&gt;how&lt;/em&gt; OpenAI used RL to train o1, except that o1 &lt;em&gt;“uses a chain of thought when attempting to solve a problem,”&lt;/em&gt; and that reinforcement learning has been used to train it[^1]. It could be the case that DeepSeek’s use of RL for train-time compute is very different from o1’s, and the fact that its &lt;a href=&quot;https://arxiv.org/pdf/2402.03300&quot;&gt;affiliated technical report&lt;/a&gt; goes into enough technical detail on GRPO to make it &lt;a href=&quot;https://github.com/Mohammadjafari80/GSM8K-RLVR&quot;&gt;fully reproducible&lt;/a&gt; is much more noteworthy.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=bAWV_yrqx4w&quot;&gt;DeepSeekMath Paper Explained&lt;/a&gt; — Yannic Kilcher gave this one-hour explanation of the &lt;a href=&quot;https://arxiv.org/pdf/2402.03300&quot;&gt;DeepSeekMath paper&lt;/a&gt;. I watched the first five minutes, and from minute 30 onward for the coverage of GRPO. His explanations of GRPO are top-notch. The final five minutes, on Section 5.2.2 (“Why RL Works”), are insightful and worth tuning into.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://x.com/carrigmat/status/1884244369907278106&quot;&gt;Complete hardware for the full DeepSeek R1 at Q8 quantization, at $6000&lt;/a&gt; — The fact that this CPU-only server can generate at 6-8 tokens per second — the same as human reading speed — shows the very substantial advantage of Mixture-of-Experts (MoE) models when running on CPU-only home servers, as compared to dense models such as Llama 3.1 405B. Assembling such a server is non-trivial and not for the faint of heart, but it has certainly been proven possible.&lt;/p&gt;
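&lt;p&gt;To see why MoE helps so much here, a rough back-of-the-envelope sketch (my own numbers, not from the linked post): with Q8 weights, each generated token only needs the &lt;em&gt;active&lt;/em&gt; parameters streamed from RAM, and DeepSeek R1 activates roughly 37B of its 671B parameters per token, while a dense model must stream all of its weights.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Back-of-the-envelope: tokens/second is roughly memory bandwidth divided by
# the bytes of weights read per token (Q8 is about 1 byte per parameter).
# The bandwidth figure is an assumption for a dual-socket EPYC build.
BANDWIDTH_BYTES_S = 300e9      # assumed effective RAM bandwidth (300 GB/s)
ACTIVE_PARAMS_R1 = 37e9        # DeepSeek R1 (MoE): ~37B active of 671B total
PARAMS_LLAMA_405B = 405e9      # Llama 3.1 405B (dense): all parameters active

for name, params in [(&apos;DeepSeek R1 (MoE)&apos;, ACTIVE_PARAMS_R1),
                     (&apos;Llama 3.1 405B (dense)&apos;, PARAMS_LLAMA_405B)]:
    print(f&apos;{name}: ~{BANDWIDTH_BYTES_S / params:.1f} tokens/s&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under these assumptions the MoE model lands at roughly 8 tokens per second, right in line with the 6-8 tokens per second reported for the build, while the dense 405B model would manage less than one.&lt;/p&gt;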
&lt;p&gt;[^1]: &lt;a href=&quot;https://openai.com/index/learning-to-reason-with-llms/&quot;&gt;Learning to reason with LLMs&lt;/a&gt;, OpenAI, September 12, 2024.&lt;/p&gt;
</content:encoded><category>2025</category><category>papers</category><author>Baochun Li</author></item><item><title>The Illustrated DeepSeek-R1: A Clear Visual Walkthrough</title><link>https://baochun.org/2025-01-28/</link><guid isPermaLink="true">https://baochun.org/2025-01-28/</guid><description>The Illustrated DeepSeek-R1 — Jay Alammar, the author of O&apos;Reilly’s Hands-On Large Language Models, wrote a short piece on explaining DeepSeek R1 at a high level.</description><pubDate>Tue, 28 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1&quot;&gt;The Illustrated DeepSeek-R1&lt;/a&gt; — Jay Alammar, the author of O’Reilly’s &lt;a href=&quot;https://www.llm-book.com/&quot;&gt;Hands-On Large Language Models&lt;/a&gt;, wrote a short piece on explaining DeepSeek R1 at a high level. I found it easy to read and the illustrations are pleasing to the eye.&lt;/p&gt;
</content:encoded><category>2025</category><author>Baochun Li</author></item><item><title>Qwen 2.5 7B 1M Local Testing and RL Survey Notes</title><link>https://baochun.org/2025-01-27/</link><guid isPermaLink="true">https://baochun.org/2025-01-27/</guid><description>Qwen 2.5 7B 1M — I have just tried Qwen&apos;s latest local model, the 7B 1M, locally in LM Studio 0.3.8 (Build 4). I loaded an entire PhD thesis into the model, and LM Studio gleefully chose inject-full-content as its content injection strategy.</description><pubDate>Mon, 27 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://qwenlm.github.io/blog/qwen2.5-1m/&quot;&gt;Qwen 2.5 7B 1M&lt;/a&gt; — I have just tried Qwen’s latest local model, the 7B 1M, locally in &lt;a href=&quot;https://lmstudio.ai/&quot;&gt;LM Studio&lt;/a&gt; 0.3.8 (Build 4). I loaded an entire PhD thesis into the model, and LM Studio gleefully chose &lt;code&gt;inject-full-content&lt;/code&gt; as its content injection strategy, rather than &lt;code&gt;retrieval&lt;/code&gt;, which uses — the notoriously useless, in my humble opinion — RAG. This was not feasible before using a previous model, such as the DeepSeek R1 Distill Qwen 7B, with a context length of 128K.&lt;/p&gt;
&lt;p&gt;It took 38 minutes to inject the PhD thesis (166 pages), with the fans in my MacBook Pro (M1 Max, 64 GB of memory) blowing at full speed. Once the content is injected, the model generates 2 output tokens per second, and the next question needs only 21 seconds to the first token. So asking this model to read an entire PhD thesis works on a local Mac, but one would have to be a bit more patient. LM Studio reports that 20 GB of RAM is used after the model is loaded, with the context length set to 256K.&lt;/p&gt;
&lt;p&gt;Overall, this is indeed a very useful model for local use.&lt;/p&gt;
&lt;p&gt;P.S. Of course, if data privacy is not a concern, one can also use the 14B 1M model available on &lt;a href=&quot;https://chat.qwenlm.ai/&quot;&gt;Qwen chat&lt;/a&gt;. I tried it and it takes about 2 minutes to inject the entire PhD thesis and answer the first question. It’s interesting to observe that the time to first token for the second question is not much faster, taking about a minute and a half. The quality of the summaries is quite solid, but the language is not much easier to understand than the original thesis. This implies that if the original document is not well written, the summaries will not be too helpful either.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2501.09686&quot;&gt;Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models&lt;/a&gt; — a recently updated (v3) survey of reinforced reasoning with LLMs from Tsinghua University. After a quick read, I felt it was already somewhat out of date, despite having been last updated only a few days ago. The &lt;a href=&quot;https://arxiv.org/abs/2501.12948&quot;&gt;DeepSeek R1 technical report&lt;/a&gt; has not been cited yet, for example. The paper spends quite a bit of space on the Process Reward Model (PRM):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Process Reward Model (PRM) based Reinforcement Learning represents a significant advancement in LLM reasoning, emphasizing the evaluation of intermediate steps rather than solely focusing on end-state outcomes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While discussing PRMs, it did include a brief mention of GRPO, with a citation to the &lt;a href=&quot;https://arxiv.org/abs/2402.03300&quot;&gt;DeepSeekMath&lt;/a&gt; paper that originally introduced it back in February 2024. The paper also spends considerable space discussing the use of Monte Carlo Tree Search (MCTS).&lt;/p&gt;
&lt;p&gt;However, the &lt;a href=&quot;https://arxiv.org/abs/2501.12948&quot;&gt;DeepSeek R1 technical report&lt;/a&gt; found both PRM and MCTS to be unsuccessful, at least in DeepSeek’s own attempts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grain step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model needs additional training resources and it complicates the whole training pipeline. In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time compute scalability. This approach involves breaking answers into smaller parts to allow the model to explore the solution space systematically. To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search. For training, we first use collected prompts to find answers via MCTS guided by a pre-trained value model. Subsequently, we use the resulting question-answer pairs to train both the actor model and the value model, iteratively refining the process.&lt;/p&gt;
&lt;p&gt;However, this approach encounters several challenges when scaling up the training. First, unlike chess, where the search space is relatively well-defined, token generation presents an exponentially larger search space. To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-grained value model is inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo’s core success relied on training a value model to progressively enhance its performance, this principle proves difficult to replicate in our setup due to the complexities of token generation.&lt;/p&gt;
&lt;/blockquote&gt;
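&lt;p&gt;To make the MCTS passage a bit more concrete, here is a heavily simplified sketch of tree search over reasoning steps with a per-node extension limit, the workaround the report mentions for the exploding token search space. This is my own toy illustration with a stubbed value model and stubbed step proposals, not DeepSeek’s code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math, random

MAX_EXTENSIONS = 4      # cap on how many children any single node may spawn

class Node:
    def __init__(self, step, parent=None):
        self.step, self.parent, self.children = step, parent, []
        self.visits, self.value_sum = 0, 0.0

def value_model(step):
    return random.random()          # stub for a trained value model

def propose_step(node):
    return f"{node.step}.{len(node.children)}"    # stub for sampling the next reasoning step

def select(node):
    # UCB-style choice among the children of a fully-extended node
    def ucb(c):
        exploit = c.value_sum / (c.visits + 1e-9)
        explore = math.sqrt(2 * math.log(node.visits + 1) / (c.visits + 1e-9))
        return exploit + explore
    return max(node.children, key=ucb)

def search(root, iterations=200):
    for _ in range(iterations):
        node = root
        # descend only through nodes that have already hit their extension limit
        while node.children and len(node.children) &amp;gt;= MAX_EXTENSIONS:
            node = select(node)
        if len(node.children) &amp;lt; MAX_EXTENSIONS:
            child = Node(propose_step(node), parent=node)
            node.children.append(child)
            node = child
        reward = value_model(node.step)
        while node is not None:     # back-propagate the value estimate to the root
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    return max(root.children, key=lambda c: c.visits)

best = search(Node("root"))
print(best.step, best.visits)
&lt;/code&gt;&lt;/pre&gt;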
</content:encoded><category>2025</category><category>papers</category><category>frameworks</category><author>Baochun Li</author></item><item><title>Nvidia, DeepSeek, and RL Reasoning: Long-Form Analysis Notes</title><link>https://baochun.org/2025-01-26/</link><guid isPermaLink="true">https://baochun.org/2025-01-26/</guid><description>Although it’s quite long, The Short Case for Nvidia Stock is a fascinating read. Also, agents are not happening yet.</description><pubDate>Sun, 26 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda&quot;&gt;The Short Case for Nvidia Stock&lt;/a&gt; — I spent less than an hour reading a pretty substantial portion of this article. It’s so good that I will need to allocate some time to read it again. The entire article, and especially the DeepSeek portion of it, is highly recommended, even if one is not interested in investing. It’s a detailed outlook for the entire AI industry.&lt;/p&gt;
&lt;p&gt;Reading it a second time, I noticed that the article covers tech I have been following quite closely as well:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It mentioned how &lt;a href=&quot;https://cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem&quot;&gt;Cerebras&lt;/a&gt; solved its yield problem; I had previously read about its &lt;a href=&quot;https://cerebras.ai/blog/cepo&quot;&gt;CePO&lt;/a&gt; test-time compute strategy;&lt;/li&gt;
&lt;li&gt;It mentioned &lt;a href=&quot;https://groq.com/&quot;&gt;Groq&lt;/a&gt;, and I have tried its excellent and speedy inference service with a free account;&lt;/li&gt;
&lt;li&gt;It mentioned George Hotz’s Tiny Corp. and its &lt;a href=&quot;https://tinygrad.org/&quot;&gt;tinygrad&lt;/a&gt;, which I have been closely following on &lt;a href=&quot;https://x.com/__tinygrad__&quot;&gt;X&lt;/a&gt;. Back in the day, George Hotz was famous for jailbreaking the original iPhone as a teenager;&lt;/li&gt;
&lt;li&gt;It mentioned &lt;a href=&quot;https://ml-explore.github.io/mlx/build/html/index.html&quot;&gt;MLX&lt;/a&gt;, which, as the article said, provides a PyTorch-like API that can run efficiently on Apple Silicon (a small sketch of this API follows after this list), showing how abstraction layers can enable AI workloads to run on completely different architectures. MLX is particularly interesting as it supports distributed computation — both training and inference — across multiple Macs. And its main contributor, &lt;a href=&quot;https://x.com/awnihannun&quot;&gt;Awni Hannun&lt;/a&gt;, mentioned today that &lt;a href=&quot;https://x.com/awnihannun/status/1883276535643455790&quot;&gt;DeepSeek R1 can run with 4-bit quantization across three 192 GB M2 Ultra Mac Studios&lt;/a&gt; at 12 tokens per second, requiring a minimum of 450 GB GPU memory;&lt;/li&gt;
&lt;li&gt;And of course, it covered DeepSeek R1 in sufficient technical detail.&lt;/li&gt;
&lt;/ul&gt;
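&lt;p&gt;Since MLX came up, here is a toy example of what its PyTorch-like, lazily evaluated API looks like: a few lines of gradient descent on a least-squares problem. This is my own illustration of the programming model only, and has nothing to do with the distributed DeepSeek R1 setup mentioned above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import mlx.core as mx

def loss(w, x, y):
    # least-squares loss; operations are recorded lazily
    return mx.mean((x @ w - y) ** 2)

x = mx.random.normal((256, 16))
w_true = mx.random.normal((16,))
y = x @ w_true

w = mx.zeros((16,))
grad_fn = mx.grad(loss)      # differentiates with respect to the first argument
for _ in range(500):
    w = w - 0.1 * grad_fn(w, x, y)

mx.eval(w)                   # force the lazy computation graph to evaluate
print(mx.mean(mx.abs(w - w_true)))
&lt;/code&gt;&lt;/pre&gt;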
&lt;p&gt;Wow, what a gem as a long-form read!&lt;/p&gt;
&lt;p&gt;P.S.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chamath Palihapitiya &lt;a href=&quot;https://x.com/chamath/status/1883579259769462819?s=46&amp;amp;t=A2DgT1wxhfYAPII40irQMw&quot;&gt;also thought the article was very good&lt;/a&gt;:&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;With R1, DeepSeek essentially cracked one of the holy grails of AI: getting models to reason step-by-step without relying on massive supervised datasets. Their DeepSeek-R1-Zero experiment showed something remarkable: using pure reinforcement learning with carefully crafted reward functions, they managed to get models to develop sophisticated reasoning capabilities completely autonomously. This wasn’t just about solving problems— the model organically learned to generate long chains of thought, self-verify its work, and allocate more computation time to harder problems.&lt;/p&gt;
&lt;p&gt;The technical breakthrough here was their novel approach to reward modeling. Rather than using complex neural reward models that can lead to “reward hacking” (where the model finds bogus ways to boost their rewards that don’t actually lead to better real-world model performance), they developed a clever rule-based system that combines accuracy rewards (verifying final answers) with format rewards (encouraging structured thinking). This simpler approach turned out to be more robust and scalable than the process-based reward models that others have tried.&lt;/p&gt;
&lt;/blockquote&gt;
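&lt;p&gt;The “clever rule-based system” in that quote is easy to picture. Here is a toy sketch of a reward that combines an accuracy check on the final answer with a format check for structured reasoning; the tag names and the weighting are my own made-up illustration, not DeepSeek’s actual implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

# Toy rule-based reward (illustration only): an accuracy term that verifies
# the final answer plus a format term that checks for a reasoning block.
THINK_ANSWER = re.compile(
    r"&amp;lt;think&amp;gt;.+?&amp;lt;/think&amp;gt;\s*&amp;lt;answer&amp;gt;(.+?)&amp;lt;/answer&amp;gt;", re.DOTALL)

def rule_based_reward(response, reference_answer):
    match = THINK_ANSWER.search(response)
    format_reward = 1.0 if match else 0.0
    predicted = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if predicted == reference_answer.strip() else 0.0
    return accuracy_reward + 0.5 * format_reward

print(rule_based_reward("&amp;lt;think&amp;gt;2 + 2 = 4&amp;lt;/think&amp;gt;&amp;lt;answer&amp;gt;4&amp;lt;/answer&amp;gt;", "4"))   # 1.5
&lt;/code&gt;&lt;/pre&gt;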
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The article turned out to be extraordinarily prescient: Nvidia stock was &lt;a href=&quot;https://www.wsj.com/livecoverage/stock-market-today-dow-sp500-nasdaq-live-01-27-2025/card/nvidia-stock-is-down-more-than-10-here-s-why--sZmsM8tvQFTS3iUBASHa&quot;&gt;down by over 14%&lt;/a&gt; around 11 a.m. the next morning, after this article was written, and the tech-heavy Nasdaq Composite fell 2.5%.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://simonwillison.net/2025/Jan/27/deepseek-nvidia/&quot;&gt;Simon Willison&lt;/a&gt; likes it too, calling it &lt;em&gt;“Long, excellent piece by Jeffrey Emanuel capturing the current state of the AI/LLM industry.”&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;The real joy of this article is the way it describes technical details of modern LLMs in a relatively accessible manner. I love this description of the inference-scaling tricks used by O1 and R1, compared to traditional transformers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://hkust-nlp.notion.site/simplerl-reason&quot;&gt;7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient&lt;/a&gt; — Interesting. DeepSeek R1’s RL training techniques can be successfully applied to smaller models as well, at least for simple math datasets.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://trite-song-d6a.notion.site/Deepseek-R1-for-Everyone-1860af77bef3806c9db5e5c2a256577d&quot;&gt;DeepSeek R1 for Everyone&lt;/a&gt; and &lt;a href=&quot;https://lunar-joke-35b.notion.site/Deepseek-v3-101-169ba4b6a3fa8090a7aacaee1a1cefaa?pvs=24&quot;&gt;DeepSeek V3 101&lt;/a&gt; — From a brief skim, these look promising as accessible explanations of some of the technical details of DeepSeek R1 and V3.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://simonwillison.net/2024/Dec/31/llms-in-2024/#-agents-still-haven-t-really-happened-yet&quot;&gt;“Agents” still haven’t really happened yet&lt;/a&gt; — “If you tell me that you are building “agents”, you’ve conveyed almost no information to me at all. Without reading your mind I have no way of telling which of the dozens of possible definitions you are talking about.”&lt;/p&gt;
</content:encoded><category>2025</category><category>papers</category><author>Baochun Li</author></item><item><title>Open-R1 and TinyZero: Early DeepSeek R1 Reproductions</title><link>https://baochun.org/2025-01-25/</link><guid isPermaLink="true">https://baochun.org/2025-01-25/</guid><description>Open-R1 — Hugging Face started to reproduce DeepSeek R1 in the open, and discussed the R1 technical report in a recorded YouTube video.</description><pubDate>Sat, 25 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://github.com/huggingface/open-r1&quot;&gt;Open-R1&lt;/a&gt; — Hugging Face started to reproduce DeepSeek R1 in the open, and discussed the &lt;a href=&quot;https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf&quot;&gt;R1 technical report&lt;/a&gt; in a recorded &lt;a href=&quot;https://www.youtube.com/watch?v=1xDVbu-WaFo&quot;&gt;YouTube video&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/Jiayi-Pan/TinyZero&quot;&gt;TinyZero&lt;/a&gt; — a reproduction of DeepSeek R1 Zero in countdown and multiplication tasks.&lt;/p&gt;
</content:encoded><category>2025</category><category>papers</category><author>Baochun Li</author></item><item><title>What I’ve Been Reading</title><link>https://baochun.org/2025-01-24/</link><guid isPermaLink="true">https://baochun.org/2025-01-24/</guid><description>This website is a space for storing — and sharing, if anyone cares about these — some of the websites, code repositories, and tweets that I have read.</description><pubDate>Fri, 24 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This website is a space for storing — and sharing, if anyone cares about these — some of the websites, code repositories, and tweets that I have read. They are mostly about technology, but not necessarily tech that is currently in the spotlight. They are stored here because I thought they are worthy of preserving for a longer period of time.&lt;/p&gt;
&lt;p&gt;I have two relatively quick ways of preserving quality content. I can bookmark a link in a web browser or app; it is painless, but those bookmarks are too easy to misplace or forget about, and pretty difficult to search for after accumulating a larger number over time. For longevity, I can also place the link in a personal note (such as iOS Notes or &lt;a href=&quot;https://obsidian.md/&quot;&gt;Obsidian&lt;/a&gt;), which takes more effort and time, and thus the motivation for doing so in the long run is a bit questionable: why would anyone diligently copy and paste links to a personal note every time something interesting comes up?&lt;/p&gt;
&lt;p&gt;The beauty of sharing links publicly, besides the nature of sharing itself, is to add a slice of motivation to the cocktail: it &lt;em&gt;motivates&lt;/em&gt; me to do the work of copying and pasting. It also motivates me to add a bit of commentary, which records what I was thinking while reading the linked content. Some call such a publicly shared website of links and commentary a &lt;em&gt;digital garden&lt;/em&gt; or a &lt;em&gt;microblog&lt;/em&gt;, terms that I don’t quite like. &lt;a href=&quot;https://simonwillison.net/2024/Dec/22/link-blog/&quot;&gt;Simon Willison&lt;/a&gt; and &lt;a href=&quot;https://daringfireball.net/linked/2025/01/02/willisons-approach-to-running-a-link-blog&quot;&gt;John Gruber&lt;/a&gt; called such a style a &lt;em&gt;link blog&lt;/em&gt;. I will simply call it “What I’ve been reading.”&lt;/p&gt;
</content:encoded><category>2025</category><author>Baochun Li</author></item></channel></rss>