What comes after LLMs?
Recent progress in AI has largely been driven by training on human-generated data at massive scale. In the case of large language models (LLMs), this means pre-training on trillions of tokens of text, followed by fine-tuning on human preferences. The result is systems that are incredibly capable across innumerable domains, but whose abilities are fundamentally constrained by imitation.
While imitating humans has been enough until now, it's unlikely to be sufficient on its own to achieve superhuman intelligence. The knowledge available in human data is finite and approaching its limit: most high-quality sources have already been used or soon will be. We need a new approach.
To understand what might come next, we can look to earlier breakthroughs in AI, and to the thinking of some of the field's most influential researchers (Sutton, Silver).
Beyond imitation
In 1997, IBM's Deep Blue beat Garry Kasparov, then world champion, at chess. It was the first time a machine had beaten a reigning world champion in a match, and it remains a historic moment in AI. Deep Blue used a combination of human-crafted features (around 8,000 of them) and massive brute-force search to select the best next move (it could evaluate roughly 200 million positions per second).
Nearly 20 years later, Google DeepMind introduced AlphaGo to the world. Instead of chess, the game here was (you guessed it) Go, a game with roughly 10^170 possible board configurations, more than the number of atoms in the known universe, and long seen as a grand challenge for AI. The team at DeepMind took a completely different approach from IBM's. There were no human-crafted features. Instead, AlphaGo was initially trained on human Go games, to learn how the best humans played. But the aim wasn't to match the best human Go players; the aim was to beat them. So from there, DeepMind used search combined with reinforcement learning and self-play. AlphaGo played against itself millions of times to generate its own training data, allowing it to keep learning beyond human data.
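To make the self-play idea concrete, here is a toy sketch in Python: two copies of the same tabular policy play a trivial take-1-or-2-stones game against each other and learn purely from the win/lose outcome. It is purely illustrative and not AlphaGo's actual algorithm, which combines deep policy and value networks with Monte Carlo tree search.

```python
# A minimal self-play sketch (illustrative only, not AlphaGo's algorithm).
# Two copies of the same tabular policy play a trivial Nim-like game
# (take 1 or 2 stones; taking the last stone wins) and learn from the outcome.
import random
from collections import defaultdict

Q = defaultdict(float)          # state-action values, shared by both "players"
ACTIONS = (1, 2)
EPSILON, ALPHA = 0.1, 0.5

def choose(stones):
    """Epsilon-greedy move selection from the shared value table."""
    legal = [a for a in ACTIONS if a <= stones]
    if random.random() < EPSILON:
        return random.choice(legal)
    return max(legal, key=lambda a: Q[(stones, a)])

for episode in range(20_000):
    stones, history = 10, []          # history of (state, action) per move
    while stones > 0:
        action = choose(stones)
        history.append((stones, action))
        stones -= action
    # The player who made the last move wins (+1); the other loses (-1).
    reward = 1.0
    for state, action in reversed(history):
        Q[(state, action)] += ALPHA * (reward - Q[(state, action)])
        reward = -reward              # alternate perspective each ply

# Self-play should typically discover the winning strategy (leave a multiple
# of 3): from 10 stones the greedy first move converges to taking 1.
print(max(ACTIONS, key=lambda a: Q[(10, a)]))
```

Even in this tiny setting the principle is the same: the data the system learns from is generated by its own play, so there is no human ceiling on how good that data can get.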
In 2016 AlphaGo defeated then world champion Lee Sedol 4-1 in what, to those of us in the field of AI, is still seen as a turning point. There was one moment above all that shook the world: move 37. This move by AlphaGo was so unusual that observers and commentators thought the system had malfunctioned; AlphaGo itself estimated there was only about a one-in-ten-thousand chance a human would have played it. It visibly stunned Lee Sedol. Yet this move became a defining moment of the game and enabled AlphaGo to seize victory. This is the power of reinforcement learning and self-play: models can find these so-called "out of distribution" actions that imitation alone would never discover.
The story doesn't end there, though. The next iteration from DeepMind was AlphaGo Zero. The "Zero" reflects the fact that no human data whatsoever was given to it, just the rules of the game and a reward signal. AlphaGo Zero learned entirely by playing against itself. It then played AlphaGo a hundred times and won 100-0. Removing human input from the system enabled it to dramatically improve; human data had been holding it back. Next came AlphaZero, which again used no human priors but had a more general architecture, allowing the same algorithm to master chess, shogi and Go. It continued the trend of beating what came before it by leveraging search, self-play, and reinforcement learning.
Beyond game environments came AlphaProof, which recently became the first program to reach medal-level performance at the International Mathematical Olympiad (IMO). Initially trained on ~100,000 formal proofs, created over many years by human mathematicians, AlphaProof's reinforcement learning algorithm subsequently generated one hundred million more through continual interaction with a formal proving system. This focus on interactive experience allowed AlphaProof to explore mathematical possibilities beyond the confines of pre-existing formal proofs, and to discover solutions to novel and challenging problems.
We've seen recent progress in applying reinforcement learning techniques to LLMs too, most notably with DeepSeek's series of "reasoning" models. These started with a pre-trained "base model", then in subsequent rounds of training applied reinforcement learning in verifiable domains, such as coding problems with clear right/wrong answers the models had to work towards. The models spontaneously developed behaviours such as reflection (revisiting and re-evaluating their prior reasoning steps) and the exploration of alternative problem-solving strategies, without any human involvement. Interestingly, these reasoning models had a tendency to generate illegible reasoning traces, such as mixing multiple languages within a single response, but when the reward function was modified to enforce language consistency, model performance actually decreased. In other words, human-like output wasn't aligned with optimal reasoning performance.
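The key ingredient in this kind of training is a programmatic, verifiable reward. As a rough illustration (not DeepSeek's actual implementation, which also rewards output format), a reward function for coding problems might simply run the generated code against hidden tests:

```python
# A sketch of a verifiable reward for code generation: run the model's
# solution against hidden tests and return a binary pass/fail signal.
# Illustrative only; the names and details here are assumptions.
import os
import subprocess
import sys
import tempfile
import textwrap

def code_reward(model_solution: str, test_code: str, timeout: float = 5.0) -> float:
    """Return 1.0 if the generated code passes the tests, else 0.0."""
    program = textwrap.dedent(model_solution) + "\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)

# A generated solution that passes earns reward 1.0; a wrong one earns 0.0.
solution = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(code_reward(solution, tests))   # 1.0
```

Because the signal comes from execution rather than from human judgement, the model is free to reach a correct answer however it likes, which is exactly how the unusual, non-human-like reasoning traces emerge.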
The Bitter Lesson
What I've just described can be summarised as The Bitter Lesson, coined by Richard Sutton, the godfather of reinforcement learning: building in human knowledge actually harms performance in the long run. More formally, "general methods that leverage computation are ultimately the most effective, and by a large margin". The "human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation". Human ingenuity can lead to short-term success, but "breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach".
The power of these approaches lies in the nature of self-generated experience. As Richard Sutton puts it: "In uncharted territory - where one would expect learning to be most beneficial - an agent must be able to learn from its own experience." As the system improves, it naturally encounters more sophisticated challenges matched to its current capabilities. The learning process becomes open-ended: as long as it can act, observe, and adapt, the system can keep getting stronger. There is no limit.
Of course, games are particularly well suited to reinforcement learning because they have clear, verifiable outcomes: win, lose or draw. DeepSeek has shown that reinforcement learning can be applied to broader domains with verifiable solutions (and, remarkably, that the resulting capabilities transfer to non-verifiable domains). But it's also true that verifiable outcomes abound in the "real world": likes, follows, profit/loss, weight loss, job offers. Learning how to exploit these signals will be the next great leap.
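As a toy illustration of exploiting such a signal, here is a hypothetical epsilon-greedy bandit that learns which style of post earns the most simulated "likes". Everything here (the styles, the like rates) is made up; the point is only that the feedback comes from outcomes in the environment rather than from imitation.

```python
# A toy agent learning from a real-world-style signal (simulated "likes")
# with a simple epsilon-greedy bandit. Entirely hypothetical values.
import random

STYLES = ["tutorial", "hot_take", "long_read"]      # hypothetical actions
TRUE_LIKE_RATE = {"tutorial": 0.30, "hot_take": 0.55, "long_read": 0.15}

estimates = {s: 0.0 for s in STYLES}   # running estimate of likes per post
counts = {s: 0 for s in STYLES}
EPSILON = 0.1

for t in range(5_000):
    # Explore occasionally, otherwise exploit the current best estimate.
    if random.random() < EPSILON:
        style = random.choice(STYLES)
    else:
        style = max(STYLES, key=estimates.get)
    # The environment (the audience) provides the verifiable outcome.
    liked = 1.0 if random.random() < TRUE_LIKE_RATE[style] else 0.0
    counts[style] += 1
    estimates[style] += (liked - estimates[style]) / counts[style]

print(max(STYLES, key=estimates.get))   # converges to "hot_take" here
```

Even this trivial agent has to balance exploration against exploitation, the same tension that real-world agents will face at far greater scale.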
The systems that will define the next era of AI won't learn only by imitation. They will build on it by learning from their own experience. The next era will give us agents that can learn over long periods of time, with their own stream of experiences, continually learning from the signals and rewards that follow their actions. Agents will generate their own data based on their experiences; human data will no longer impose a ceiling. We're not there yet, and there are significant challenges to overcome: we'll need algorithmic advances, and we'll need advances in reward design and value estimation. The same core reinforcement learning challenges will still apply - credit assignment, exploration vs. exploitation, stability over long horizons - but now in harder, noisier, and more dynamic contexts, with ambiguous rewards, sparse feedback, and shifting objectives. But we've already seen Google, Anthropic and OpenAI launch agents capable of autonomously performing tasks through web browser interactions, feasibly giving them the opportunity to begin learning from real-world signals and to discover on their own what works and what doesn't. So perhaps we're at the start of the transition.
What is the move 37 in medical treatment? In climate strategy? In power generation?
To quote David Silver, welcome to the era of experience.
A note on RLHF
Reinforcement learning from human feedback (RLHF) is central to how today's LLMs are refined. But RLHF is not really RL: it's not experience-driven. The feedback doesn't come from consequences in the world, but from what human annotators say looks right. RLHF works well for alignment and for making models useful, but it constrains them to stay within the bounds of what humans already know and expect. And we want to go beyond what humans know. RLHF systems cannot go beyond human knowledge; if a human doesn't recognise a good idea, the system never will either. RLHF would have dismissed move 37.
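To see why RLHF optimises agreement with annotators rather than outcomes in the world, consider the pairwise preference loss typically used to train the reward model. This is a generic Bradley-Terry style sketch, not any particular lab's implementation:

```python
# A minimal sketch of the pairwise preference loss commonly used to train an
# RLHF reward model (Bradley-Terry style). The "reward" scores what annotators
# prefer, not what happened in the world.
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the chosen response outranks the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# If the reward model already rates the human-preferred answer higher,
# the loss is small; if it disagrees with the annotators, the loss is large.
print(round(preference_loss(2.0, 0.5), 3))   # ~0.201
print(round(preference_loss(0.5, 2.0), 3))   # ~1.701
```

The only thing this objective can ever reward is what human raters already recognise as good, which is precisely why it could never have endorsed move 37.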
References / Related reading
Reinforcement Learning: An Introduction, Sutton & Barto
David Silver's "Era of Experience" position paper goes into much more detail on what will be needed if we're to make the transition. A must read:
The Era of Experience (DeepMind)
How IBMâs Deep Blue Beat World Champion Chess Player Garry Kasparov (IEEE Spectrum)
The Bitter Lesson (Rich Sutton)
AlphaZero: Shedding New Light on Chess, Shogi and Go (DeepMind)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek)
Another interesting path forward from LeCun:
V-JEPA: Yann LeCun's AI Model for Video - Joint Embedding Predictive Architecture (Meta AI)