Monday

Qwen 2.5 7B 1M — I have just tried Qwen’s latest long-context model, the 7B 1M, locally in LM Studio 0.3.8 (Build 4). I loaded an entire PhD thesis into the model, and LM Studio gleefully chose inject-full-content as its content injection strategy, rather than retrieval, which relies on RAG (notoriously unreliable, in my humble opinion). This was not feasible with previous models such as DeepSeek R1 Distill Qwen 7B, whose context length is limited to 128K.
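
To make the trade-off concrete, here is a minimal sketch of the kind of decision involved: full-content injection is only an option when the document fits in the context window, otherwise the tool has to fall back to retrieval. The file name, the 4-characters-per-token heuristic, and the decision rule are my own illustration, not LM Studio’s actual logic.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English prose.
    return len(text) // 4

def pick_injection_strategy(document: str, context_length: int,
                            reserve_for_output: int = 4096) -> str:
    """Return 'inject-full-content' if the document fits, otherwise fall back to retrieval/RAG."""
    if estimate_tokens(document) + reserve_for_output <= context_length:
        return "inject-full-content"
    return "retrieval"

thesis = open("thesis.txt").read()  # hypothetical plain-text export of the 166-page thesis
print(pick_injection_strategy(thesis, context_length=256_000))
print(pick_injection_strategy(thesis, context_length=128_000))
```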

It took 38 minutes to inject the 166-page thesis, with the fans of my MacBook Pro (M1 Max, 64 GB of memory) blowing at full speed. Once the content is injected, the model generates about 2 output tokens per second, and with the content already in context, a follow-up question needs only 21 seconds to the first token. So asking this model to read an entire PhD thesis does work on a local Mac, but one has to be a bit patient. LM Studio reports about 20 GB of RAM in use after the model is loaded with the context length set to 256K.
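
For anyone who wants to reproduce the timing numbers, here is a minimal sketch that measures time to first token against LM Studio’s local OpenAI-compatible server. The port (1234 is LM Studio’s default), the model identifier, and the thesis file name are assumptions; adjust them to whatever your local setup shows. Counting one streamed chunk as one token is only an approximation.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

thesis = open("thesis.txt").read()  # hypothetical plain-text export of the thesis

start = time.time()
stream = client.chat.completions.create(
    model="qwen2.5-7b-instruct-1m",  # hypothetical identifier; use the one LM Studio displays
    messages=[
        {"role": "user",
         "content": thesis + "\n\nSummarize the main contributions of this thesis."},
    ],
    stream=True,
)

first_token_at = None
chunks = 0
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.time()
            print(f"time to first token: {first_token_at - start:.1f}s")
        chunks += 1
        print(delta, end="", flush=True)

print(f"\n~{chunks / (time.time() - first_token_at):.1f} tokens/s after the first token")
```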

Overall, this is indeed a very useful model for local use.

P.S. Of course, if data privacy is not a concern, one can also use the 14B 1M model available on Qwen Chat. I tried it, and it takes about 2 minutes to inject the entire PhD thesis and answer the first question. Interestingly, the time to first token for the second question is not much faster, at about a minute and a half. The quality of the summaries is quite solid, but the language is not much easier to understand than the original thesis. This implies that if the original document is not well written, the summaries will not be much help either.


Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models — a recently updated (v3) survey of reinforced reasoning with LLMs from Tsinghua University. After a quick read, I felt it is already somewhat out of date, even though it was last updated only a few days ago. The DeepSeek R1 technical report, for example, is not yet cited. The paper spends quite a bit of space on the Process Reward Model (PRM):

Process Reward Model (PRM) based Reinforcement Learning represents a significant advancement in LLM reasoning, emphasizing the evaluation of intermediate steps rather than solely focusing on end-state outcomes.
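
A toy sketch of the distinction the survey is drawing, with dummy scoring functions standing in for what would really be trained reward models: an outcome reward scores only the final answer, while a process reward scores every intermediate step and aggregates (taking the minimum, so the weakest step caps the solution score, is one common choice). Everything here is my own illustration.

```python
from typing import Callable, List

def outcome_reward(final_answer: str, reference: str) -> float:
    # Outcome supervision: a single reward for the end state only.
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_reward(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    # Process supervision: score every intermediate step, then aggregate.
    scores = [step_scorer(step) for step in steps]
    return min(scores) if scores else 0.0

# Example with a dummy scorer that "trusts" steps containing an equation.
steps = ["Let x be the unknown.", "2x + 3 = 11, so 2x = 8.", "Therefore x = 4."]
dummy_scorer = lambda step: 0.9 if "=" in step else 0.5
print(outcome_reward("x = 4", "x = 4"))    # 1.0
print(process_reward(steps, dummy_scorer)) # 0.5 (capped by the weakest step)
```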

While discussing PRMs, the survey does briefly mention GRPO, citing the DeepSeekMath paper that originally introduced it back in February 2024. The survey also devotes considerable space to the use of Monte Carlo Tree Search (MCTS).
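
For context, the core trick of GRPO as described in the DeepSeekMath paper is to drop the learned critic and normalize each sampled output’s reward against its own group. A minimal sketch of that advantage computation follows; the reward values and epsilon are placeholders, and the actual policy update feeds these advantages into a clipped PPO-style objective with a KL penalty, which I omit.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Advantage of each sampled output relative to its own group (no value model)."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Rewards for G = 4 sampled completions of the same prompt (1 = correct, 0 = wrong).
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # positive for correct answers, negative for the rest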

However, the DeepSeek R1 technical report found both PRM and MCTS to be unsuccessful, at least in DeepSeek’s own attempts:

PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grain step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model needs additional training resources and it complicates the whole training pipeline. In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.

Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time compute scalability. This approach involves breaking answers into smaller parts to allow the model to explore the solution space systematically. To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search. For training, we first use collected prompts to find answers via MCTS guided by a pre-trained value model. Subsequently, we use the resulting question-answer pairs to train both the actor model and the value model, iteratively refining the process.

However, this approach encounters several challenges when scaling up the training. First, unlike chess, where the search space is relatively well-defined, token generation presents an exponentially larger search space. To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-grained value model is inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo’s core success relied on training a value model to progressively enhance its performance, this principle proves difficult to replicate in our setup due to the complexities of token generation.
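
To make the quoted setup easier to picture, here is a toy sketch of a value-model-guided tree search over reasoning steps, with a UCT selection rule and a cap on how many children each node may expand (the “maximum extension limit” the report mentions). The step generator and the value model are dummies, and all parameters are my own; in the real setup both would be LLM calls, which is exactly where the fine-grained value estimation becomes hard.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional

MAX_CHILDREN = 3   # extension limit per node
C_UCT = 1.4        # exploration constant

@dataclass
class Node:
    steps: List[str]                       # partial chain of reasoning steps
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def uct(self) -> float:
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = C_UCT * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def propose_step(steps: List[str]) -> str:
    # Dummy step generator; a real system would prompt the policy model here.
    return f"step {len(steps) + 1} (variant {random.randint(0, 9)})"

def value_model(steps: List[str]) -> float:
    # Dummy value model; a real system would score the partial solution with a trained model.
    return random.random()

def search(root: Node, iterations: int = 100, max_depth: int = 5) -> Node:
    for _ in range(iterations):
        # Selection: descend by UCT while the current node is fully extended.
        node = root
        while len(node.children) >= MAX_CHILDREN:
            node = max(node.children, key=Node.uct)
        # Expansion: add one child unless the depth limit is reached.
        if len(node.steps) < max_depth:
            child = Node(steps=node.steps + [propose_step(node.steps)], parent=node)
            node.children.append(child)
            node = child
        # Evaluation and backpropagation of the value estimate.
        value = value_model(node.steps)
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # The most visited child of the root is the preferred first step.
    return max(root.children, key=lambda n: n.visits)

best = search(Node(steps=[]))
print(best.steps)
```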