Saturday

Unsloth.ai’s GRPO — the Unsloth implementation of GRPO appears to use significantly less GPU memory than standard implementations, and it supports both QLoRA and LoRA.
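
Below is a minimal sketch of what GRPO fine-tuning with a QLoRA adapter might look like, assuming the Unsloth FastLanguageModel API and TRL’s GRPOTrainer; the base model id, dataset, and toy reward function are illustrative assumptions, not details from the announcement.

```python
# A minimal sketch of GRPO with QLoRA, assuming the Unsloth FastLanguageModel
# API and TRL's GRPOTrainer; model id, dataset, and reward are illustrative.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

# Load the base model in 4-bit (QLoRA) and attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",  # assumed model id
    max_seq_length=1024,
    load_in_4bit=True,  # set False for plain (16-bit) LoRA
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 100 characters.
    return [float(-abs(100 - len(c))) for c in completions]

# Any dataset with a "prompt" column works; this one is from the TRL docs.
dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-out", num_generations=4,
                    max_completion_length=128),
    train_dataset=dataset,
)
trainer.train()
```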


S1-style test-time scaling with MLX — Awni Hannun, the primary architect of MLX, posted a simple implementation of S1-style test-time scaling that runs DeepSeek R1 distilled models locally, in only 138 lines of Python. Simplicity at its best.
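
The core s1 trick is “budget forcing”: if the model closes its reasoning before a token budget is spent, strip the end-of-thinking tag and append “Wait” so it keeps thinking. Here is a minimal sketch of that idea assuming the mlx-lm load/generate API; the model id and budget are illustrative, and this is not Awni’s actual implementation.

```python
# A minimal sketch of s1-style "budget forcing" on top of mlx-lm, assuming
# its load/generate API; model id and numbers below are assumptions.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How many primes are there below 100?"}],
    add_generation_prompt=True,
    tokenize=False,
)

min_thinking_tokens = 512  # force more reasoning until this budget is spent
max_retries = 3            # cap the number of forced continuations

response = ""
for _ in range(max_retries + 1):
    response += generate(model, tokenizer, prompt=prompt + response,
                         max_tokens=2048)
    # If the model ended its reasoning before spending the budget, discard
    # the premature answer and append "Wait" to force more thinking.
    if ("</think>" in response
            and len(tokenizer.encode(response)) < min_thinking_tokens):
        response = response.split("</think>")[0] + "\nWait,"
    else:
        break

print(response)
```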


Spinning Up in Deep RL — Excellent introduction to deep reinforcement learning, with enough math to be rigorous while skipping unnecessary formalism. It comes with PyTorch implementations of the algorithms. As its introduction puts it:

However, while there are many resources to help people quickly ramp up on deep learning, deep reinforcement learning is more challenging to break into. To begin with, a student of deep RL needs to have some background in math, coding, and regular deep learning. Beyond that, they need both a high-level view of the field—an awareness of what topics are studied in it, why they matter, and what’s been done already—and careful instruction on how to connect algorithm theory to algorithm code.

The high-level view is hard to come by because of how new the field is. There is not yet a standard deep RL textbook, so most of the knowledge is locked up in either papers or lecture series, which can take a long time to parse and digest. And learning to implement deep RL algorithms is typically painful, because either

  • the paper that publishes an algorithm omits or inadvertently obscures key design details,
  • or widely-public implementations of an algorithm are hard to read, hiding how the code lines up with the algorithm.

Connecting algorithm theory to algorithm code is exactly what’s sorely missing in many other online books and resources, especially in reinforcement learning. Many of them use Jupyter notebooks, which are a horrible way to learn from source code.
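
Spinning Up makes this connection well: its “Intro to Policy Optimization” walks from the policy gradient theorem to a runnable script. A minimal PyTorch sketch of the core idea follows; the network size and the fake batch are illustrative, not Spinning Up’s actual code.

```python
# A sketch of the policy-gradient "pseudo-loss" idea from Spinning Up's
# simplest policy gradient; shapes and data here are illustrative.
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_acts = 4, 2  # e.g. CartPole-sized; arbitrary for this sketch

# Policy network: maps observations to action logits.
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(),
                       nn.Linear(32, n_acts))

def compute_loss(obs, acts, weights):
    # The "pseudo-loss": minimizing it ascends the policy gradient estimate
    #   g = E[ grad log pi(a|s) * R(tau) ].
    logp = Categorical(logits=policy(obs)).log_prob(acts)
    return -(logp * weights).mean()

# Fake batch of trajectory data, just to show the shapes involved.
obs = torch.randn(8, obs_dim)          # observations
acts = torch.randint(0, n_acts, (8,))  # actions taken
weights = torch.randn(8)               # e.g. return of each action's episode

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
optimizer.zero_grad()
compute_loss(obs, acts, weights).backward()
optimizer.step()
```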