Thursday
DOGE: Make AI Conferences Great Again — Zeyuan Allen-Zhu wrote a very interesting piece on using LLMs as arbitrators in reviewer-author discussions and in the paper review process more broadly. Allen-Zhu is a co-author of the 2021 LoRA paper, which, with over 11,000 citations, has become the de facto standard for parameter-efficient fine-tuning and is widely used throughout the machine learning community.
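For context, LoRA freezes the pretrained weight matrix W and trains only a low-rank additive update BA, so the layer computes Wx + BAx with far fewer trainable parameters. A minimal PyTorch sketch of the idea (the class name, rank, and scaling defaults here are my own illustration, not the paper's released code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update.

    Computes W x + (alpha / r) * B A x, where W is frozen and only the
    rank-r factors A and B are trained. Names and defaults are illustrative.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        # A projects down to rank r, B projects back up; B starts at zero
        # so the wrapped layer initially matches the pretrained one.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrap an existing projection layer:
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 4096))
```

Initializing B to zero means training starts from exactly the pretrained behavior, which is part of what makes LoRA safe to bolt onto an existing model.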
Allen-Zhu's piece notes the surprising fact that the LoRA paper was initially rejected by NeurIPS 2021, even after the author rebuttal. I take this as clear evidence that the paper review system is broken, at least in the ML/AI community, which makes the proposal of using LLMs to improve the fairness of the review process all the more interesting.
P.S. It looks like the widely cited paper "Distilling the Knowledge in a Neural Network", co-authored by Geoffrey Hinton and Jeff Dean, was also rejected by NeurIPS 2014 and later appeared in the NeurIPS 2014 Deep Learning Workshop. It has since received over 23,000 citations.
s1: Simple test-time scaling — Stanford University showed in this paper that, by fine-tuning the Qwen2.5-32B-Instruct model on a curated, high-quality dataset of only 1,000 samples, and by appending "Wait" whenever the model tries to stop thinking, forcing it to reason for longer, a 32B model can perform on par with o1-preview. This is arguably the simplest way to do test-time scaling over the total number of thinking tokens, and it appears to work well.
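The "Wait" trick, which the paper calls budget forcing, is easy to sketch: when the model emits its end-of-thinking delimiter, cut it off and splice in "Wait" so decoding continues. A rough illustration with Hugging Face transformers follows; the delimiter string, prompt format, and budgets are my assumptions for the sketch, not the authors' released code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-32B-Instruct"  # the base model s1 fine-tunes
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

END_THINK = "</think>"  # assumed end-of-thinking delimiter
text = "<think>\nQuestion: what is 17 * 24? Let me work it out."
num_waits = 2           # how many times to force extra thinking

for _ in range(num_waits):
    inputs = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    text = tok.decode(out[0], skip_special_tokens=False)
    if END_THINK in text:
        # The model tried to stop: drop the delimiter and append "Wait"
        # so the next generate call continues the reasoning trace.
        text = text.split(END_THINK)[0] + " Wait,"

# Final pass: let the model finish thinking and produce its answer.
inputs = tok(text, return_tensors="pt").to(model.device)
final = tok.decode(model.generate(**inputs, max_new_tokens=1024)[0])
print(final)
```

Note that budget forcing in the paper also works in the other direction: forcing the end-of-thinking delimiter early caps the thinking-token budget, and the same loop structure covers that case.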
Interestingly, the paper stated:
The concurrently released r1-32B shows stronger performance than s1-32B while also only using SFT (DeepSeek-AI et al., 2025). However, it is trained on 800× more reasoning samples. It is an open question whether one can achieve their performance with just 1,000 samples.
While DeepSeek's r1-32B was indeed trained on far more reasoning samples, once training is complete it does not need to lean on extra test-time compute, which degrades the user experience through longer waiting times.