Autoresearch


It has been two weeks since Andrej Karpathy released Autoresearch.

It is built on a simple idea: give an AI agent an environment where it knows which benchmark it should run and optimize against, then ask it to repeatedly take actions to optimize the project for that benchmark. Runs of the benchmark decide whether each optimization is kept or discarded, and the optimizations that are kept accumulate over time.
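At its core, this loop is greedy hill-climbing against the benchmark. A minimal sketch of the idea, where all the names and the toy "project" are mine for illustration, not Autoresearch's actual code:

```python
import random

def autoresearch_loop(benchmark, propose, apply_fn, revert_fn, steps=200):
    """Keep-or-discard loop: propose a change, re-run the benchmark,
    keep the change only if the score improves, otherwise revert."""
    best = benchmark()
    for _ in range(steps):
        change = propose()
        apply_fn(change)
        score = benchmark()
        if score > best:
            best = score          # keep: improvements accumulate
        else:
            revert_fn(change)     # discard: restore previous state
    return best

# Toy "project": nudge a parameter vector so the benchmark
# (negative squared distance to a target) improves over time.
target = [3.0, -1.0, 2.0]
params = [0.0, 0.0, 0.0]

def benchmark():
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def propose():
    i = random.randrange(len(params))
    return (i, random.uniform(-0.5, 0.5))

def apply_fn(change):
    i, delta = change
    params[i] += delta

def revert_fn(change):
    i, delta = change
    params[i] -= delta

random.seed(0)
initial = benchmark()
final = autoresearch_loop(benchmark, propose, apply_fn, revert_fn)
```

The real system replaces the toy mutations with an agent editing code, but the control flow is the same: the benchmark is the sole arbiter of what survives.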

Surprisingly, such a simple idea turns out to be extremely effective. As Karpathy proclaimed:

…in any case no one could tell if that’s right or wrong as the “code” is now a self-modifying binary that has grown beyond human comprehension.

The key is to define a precise benchmark that can be used to evaluate any solution to a problem, so that an AI agent — or multiple collaborating agents — can run this benchmark to decide whether an idea should be kept or discarded. Naturally, since this requirement is not too exacting, quite a large number of projects have spun up, including my own experiments trying the idea on the Days discrete-event network simulator, improving performance by over 25%. Autoresearch doesn’t really care about what you wish to optimize, as long as some precise benchmark is defined.

This requirement, however, is not satisfied by many academic research papers. Often, one has to read between the lines to figure out what a paper is trying to optimize for. A paper can run to 10 pages without a single prescribed benchmark that precisely captures the problem it wishes to solve and shows how the paper advances the state of the art. In my own words, these papers are not autoresearch-friendly.

Here are some noteworthy autoresearch projects over the past two weeks:

Shopify’s CEO, Tobi Lütke, announced that he and David Cortés implemented Autoresearch as a Pi extension, pi-autoresearch, in about 2500 lines of TypeScript.

My own experiments in Days used this extension, and it worked extremely well. Without any prompt beyond /autoresearch, it would automatically dig into the codebase to find the most suitable benchmark to optimize for; once I provided a specific benchmark, it would switch to the one in my explicit prompt. For the initial benchmark, which included a routing protocol implementation, the agent got a bit too eager and coded a custom routing implementation for FatTree topologies only, breaking routing whenever the topology is not a FatTree. Overall, however, autoresearch cut runtime by about 25% on this particular benchmark, which is quite a bit given that the codebase has already been through many rounds of optimization.

Since its inception, the pi-autoresearch extension has kept evolving. Two noteworthy improvements have landed in the past three days: a confidence score for each experiment, and actionable side information recording why an optimization was discarded.
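I have not looked at the new schema, but a per-experiment record carrying both of these signals might look roughly like this (field names are my guesses, not pi-autoresearch's actual fields):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExperimentRecord:
    """Hypothetical per-experiment record: a confidence score for the
    measured delta, plus a reason string when the change is discarded."""
    description: str
    baseline_score: float
    candidate_score: float
    confidence: float              # 0.0-1.0: how sure the agent is the delta is real
    kept: bool
    discard_reason: Optional[str] = None

# Example: a regression too small to trust on a noisy benchmark.
rec = ExperimentRecord(
    description="cache next-hop lookups in the routing table",
    baseline_score=100.0,
    candidate_score=98.5,
    confidence=0.4,
    kept=False,
    discard_reason="delta within run-to-run noise; needs more repetitions",
)
```

The value of recording discard reasons is that later iterations of the loop can avoid re-proposing ideas that already failed, or retry them under better measurement conditions.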

Nous Research used its open-source Hermes agent to write a novel with autoresearch. The benchmark, in the autoresearch sense, is reader_panel.py, which uses four personas from Claude Opus 4.6 — the editor, the genre reader, the writer, and the first reader — to review the novel. It also runs review.py, which prompts Claude Opus 4.6 with the following dual-persona prompt:

Read the below novel, “{title}”. Review it first as a literary critic (like a newspaper book review) and then as a professor of fiction. In the later review, give specific, actionable suggestions for any defects you find. Be fair but honest. You don’t have to find defects.

I like the inclusion of “You don’t have to find defects” in the prompt. Strictly speaking, these reviews are not a precise benchmark, as Karpathy mentioned:

Not exactly verifiable but might still work quite well given some effort.

Though the reviews may be helpful for editing the writeup, playing the same role as the keep-or-discard signal for experiments, the loop can also iterate confidently towards mediocre results. Still, this is a worthy experiment in autoresearching writing of any kind, not just fiction.
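For the loop to use such a panel at all, the four reviews have to collapse into one comparable number. A minimal sketch of what that reduction might look like — with a stubbed model call, since I have not read reader_panel.py's internals and the function names here are mine:

```python
# The persona list matches the post; call_model is a stand-in for a
# real LLM API call that would return a numeric score per persona.

PERSONAS = ["the editor", "the genre reader", "the writer", "the first reader"]

def call_model(persona: str, novel: str) -> float:
    """Placeholder for an actual Claude call; a real implementation would
    send a persona-specific prompt and parse a 1-10 score from the reply."""
    raise NotImplementedError

def panel_score(novel: str, scorer=call_model) -> float:
    """Average the four persona scores into one number the loop can compare."""
    scores = [scorer(p, novel) for p in PERSONAS]
    return sum(scores) / len(scores)

# With a deterministic stub scorer, the panel reduces to a plain average:
stub = lambda persona, novel: {"the editor": 6.0, "the genre reader": 7.0,
                               "the writer": 5.0, "the first reader": 8.0}[persona]
score = panel_score("draft text...", scorer=stub)
```

Of course, averaging LLM judgments is exactly where the "not exactly verifiable" caveat bites: the number is comparable across runs, but not objectively grounded.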

Autoresearching Apple’s LLM in a Flash to run Qwen 397B locally, by Dan Woods, is a mind-boggling advance towards running large models off SSDs on Macs. With freshly coded Objective-C, the AI agent improved the performance of running a Qwen 3.5 397B MoE model on a MacBook Pro to around 6 tokens/second, which is extremely impressive. It also showcases the immense power of autoresearch, and of AI agents in general, given the right context to get started.