The shift from static reasoning (o1, R1) to agentic thinking changes everything: RL infrastructure, environment design, reward signals, and what "good thinking" even means. The core claim of this post: the bottleneck shifts from internal reasoning quality to environment-coupled decision quality, and that shift breaks most of the assumptions baked into current training infrastructure. This post covers what's different, what's hard, and where the field is headed.

1. The Reasoning Era: What o1 and R1 Actually Established

The reasoning wave of 2024-2025 proved a specific claim: if you train language models with RL against verifiable rewards, they develop qualitatively stronger cognition. OpenAI's o1 showed that "thinking" could be a first-class capability, trained for and exposed to users. DeepSeek-R1 proved it could be reproduced outside the original labs, at competitive quality, with a fully documented training pipeline.

The key insight was about reward signals. Math, code, and logic became central to reasoning training because rewards in these domains are deterministic, stable, and scalable. You can check if a proof is correct, if code passes tests, if a logical deduction follows. This is a much stronger signal than generic human preference. DeepSeek explicitly chose rule-based rewards over neural reward models, noting that "the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process."

But reasoning models also revealed something about infrastructure. Once you train a model to reason through long trajectories, RL stops being a lightweight add-on to SFT. It becomes a systems problem: you need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story.

Intuition: Think of reasoning RL as a factory where each worker independently solves a math problem on paper, then a checker grades the answer. The workers never need to talk to each other or the outside world. This is why it scales cleanly. Agentic RL, as we will see, breaks this independence.

2. What Is Agentic Thinking?

Agentic thinking is reasoning in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world. The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?"

A reasoning model produces a long internal monologue and then outputs an answer. An agentic system runs a loop: think, act, observe, revise. Each action produces real feedback from the environment (a test result, a search hit, a file listing, an error message), and the model must incorporate that feedback into its next step.

[Figure: Reasoning vs. Agentic Thinking. Reasoning (o1/R1): prompt → long internal CoT (hundreds or thousands of tokens) → final answer; single turn, no environment. Agentic thinking: task + context → think + plan → act (tool call) → observe (env feedback), looping until the task is complete; multi-turn, grounded in reality.]

Agentic thinking must handle demands that pure reasoning models can mostly avoid: incorporating feedback after every action, recovering from failed steps, and deciding when the task is done. The comparison below makes these differences concrete.

3. Reasoning vs. Agentic Thinking: A Direct Comparison

| Dimension | Reasoning Thinking (o1, R1) | Agentic Thinking |
|---|---|---|
| Structure | Single turn: prompt → long CoT → answer | Multi-turn loop: think → act → observe → revise |
| Environment | None; reasoning is self-contained | Tools, terminals, browsers, APIs, sandboxes |
| Feedback | Only at the end (correct/incorrect) | After every action (tool outputs, errors, results) |
| Rollout parallelism | Embarrassingly parallel (generate many CoTs independently) | Sequential dependencies (action N depends on observation N-1) |
| Reward signal | Verifiable: math correctness, test pass/fail | Sparse, delayed, often partial (task completion after many steps) |
| Failure mode | Wrong answer, hallucinated proof step | Stuck in loop, wrong tool choice, cascading errors |
| Scaling axis | More thinking tokens per problem | More effective actions per task |
| What "good thinking" means | Correct, coherent chain of deductions | Efficient progress toward task completion under uncertainty |

4. GRPO: The RL Workhorse (and Its Limits)

GRPO is the algorithm that made reasoning RL practical by eliminating the critic network, but its assumptions break down in multi-turn agentic settings. Before we discuss agentic RL, we need to understand what reasoning RL looks like in practice. The dominant algorithm is GRPO (Group Relative Policy Optimization), introduced in the DeepSeekMath paper and then scaled in DeepSeek-R1.

Standard PPO requires a critic network, a separate model (often the same size as the policy) that estimates value functions. This roughly doubles memory requirements. GRPO eliminates the critic entirely by computing advantages relative to a group of samples.

For each prompt $x$, GRPO samples $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_\theta$. Each output gets a reward $r_i$. The advantage for output $i$ is:

The GRPO advantage:

$$\hat{A}_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)}$$

where $\hat{A}_i$ is the normalized advantage for the $i$-th output, $r_i$ is its reward, and the mean and standard deviation are computed across all $G$ outputs for the same prompt. The policy is then updated with a clipped surrogate objective (like PPO) using these group-relative advantages. KL regularization is added directly to the loss.

Intuition: Instead of asking "How good is this answer?" (which requires a separate value network), GRPO asks "How good is this answer compared to the other answers I generated for the same question?" If 64 answers are sampled and one gets a much higher reward than the rest, it gets a strong positive advantage. No critic needed.
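The group-relative normalization can be sketched in a few lines. This is a minimal illustration; `grpo_advantages` is a name chosen here, not a library function:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Z-score each reward against its own group: no critic network involved."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# 8 completions sampled for one prompt; only the last one is correct.
adv = grpo_advantages([0, 0, 0, 0, 0, 0, 0, 1])
# The lone correct answer gets a strong positive advantage; the rest go negative.
```

In a real run the group would hold 64 completions and the update would feed these advantages into a PPO-style clipped objective.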

Typical hyperparameters from DeepSeek: group size $G = 64$, batch size 1024, learning rate $1 \times 10^{-6}$, KL coefficient 0.04. The reward design was deliberately simple: accuracy rewards from deterministic verifiers (boxed math answers, compiler feedback) plus format rewards enforcing <think>/</think> tag compliance.
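A reward of that shape is easy to implement deterministically. The sketch below combines an accuracy check with a format bonus; the regexes and the 1.0/0.1 weights are illustrative assumptions, not DeepSeek's published values:

```python
import re

def rule_based_reward(completion: str, gold: str) -> float:
    """Illustrative rule-based reward: deterministic accuracy check on a
    boxed answer, plus a small bonus for <think>...</think> compliance."""
    reward = 0.0
    # Format reward: response contains a well-formed think block.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.1
    # Accuracy reward: exact string match on the \boxed{...} answer.
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m and m.group(1).strip() == gold.strip():
        reward += 1.0
    return reward

good = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
```

Because the check is a pure function of the output string, there is no neural reward model to hack.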

Where GRPO hits its limits

GRPO works beautifully for single-turn reasoning because every rollout is independent. Generate 64 completions for the same math problem, score each one, compute group statistics. The rollouts don't interact.

But agent trajectories are multi-turn. An agent might take 50 actions to solve a SWE-bench task, with environment interactions between each action. You cannot simply sample 64 independent trajectories from the same starting state, because each trajectory diverges into a different environment state after the first action. The RAGEN paper (StarPO framework) attempted to extend GRPO to agentic settings and documented the failure modes that emerge when group-based advantages meet multi-turn rollouts.

5. Why Agentic RL Is a Different Beast

The infrastructure used for reasoning RL is insufficient for agentic RL. The differences are fundamental, not incremental.

The rollout problem

In reasoning RL, a rollout is: generate tokens until the model outputs an answer, then check the answer. The environment is a static verifier. Rollout throughput is bounded by GPU speed and sampling efficiency. You can run thousands of rollouts in parallel because they are independent.

In agentic RL, a rollout is: generate an action, execute it in an environment (run code, click a button, query an API), observe the result, feed it back to the model, repeat. Each step has a sequential dependency on the previous one. The environment is not a static verifier; it is a stateful system (a terminal, a browser, an operating system) that changes in response to actions.

[Figure: Agentic RL rollout with sequential dependencies. The policy alternates think + act steps with environment execute + observe steps, and the GPU idles while waiting for environment execution. By contrast, reasoning RL rollouts generate tokens continuously and run embarrassingly parallel.]

This creates a fundamental throughput problem. Consider a coding agent that generates a code edit, then waits for the test suite to run (5-30 seconds), then reads the output, then generates the next edit. The GPU sits idle during environment execution. In reasoning RL, the GPU is generating tokens continuously. In agentic RL, utilization drops dramatically.
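A back-of-envelope calculation makes the problem concrete. The timings below are assumptions for illustration, not measurements:

```python
# One agentic rollout step: generate an action, then wait for the environment.
gen_time_s = 3.0    # GPU busy: generating the next code edit
env_time_s = 20.0   # GPU idle: test suite running in the sandbox

utilization = gen_time_s / (gen_time_s + env_time_s)
print(f"{utilization:.0%}")  # → 13%
```

Under these assumed timings the GPU is busy about 13% of the time, versus near-100% for continuous token generation in reasoning RL.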

Train-serve decoupling

The solution is to decouple training and inference more aggressively. The policy generates actions and sends them to an environment pool. Environments execute asynchronously and return observations when ready. The training system collects completed trajectories and updates the policy. This is architecturally closer to a distributed systems problem than a machine learning problem.

Frameworks like veRL (used by RAGEN/StarPO) provide this substrate, but the engineering is substantial. You need: environment orchestration (spinning up/tearing down sandboxed environments), trajectory storage, async rollout collection, and careful handling of environment state across training iterations.
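A toy version of async rollout collection looks like this, with `asyncio.sleep` standing in for real environment execution. This is a sketch of the pattern, not veRL's actual API:

```python
import asyncio

async def run_episode(env_id: int, policy, max_steps: int = 5):
    """One rollout: fast policy steps interleaved with slow env execution."""
    obs, trajectory = f"env-{env_id}:start", []
    for _ in range(max_steps):
        action = policy(obs)          # GPU-side generation (fast)
        await asyncio.sleep(0.005)    # stand-in for code/test execution (slow)
        obs = f"obs({action})"
        trajectory.append((action, obs))
    return trajectory

async def collect_rollouts(policy, num_envs: int = 8):
    """While one environment blocks, the others keep making progress."""
    return await asyncio.gather(*(run_episode(i, policy) for i in range(num_envs)))

trajectories = asyncio.run(collect_rollouts(lambda obs: f"act[{obs}]"))
```

The training loop would then consume completed trajectories from this pool instead of waiting on any single environment.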

The credit assignment problem

In reasoning RL, the reward (correct/incorrect) applies to the entire reasoning chain. In agentic RL, a 50-step trajectory gets a single binary reward at the end (task solved or not). Which of the 50 actions was the important one? Which was a mistake the agent recovered from? Standard GRPO has no mechanism for per-step credit assignment. The RAGEN paper found that "without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges." Agents fall into shallow strategies or hallucinated reasoning patterns.
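The contrast is easy to see in code. Broadcasting the terminal reward to every step is effectively what a trajectory-level advantage does; per-step discounting, shown alongside, is a standard RL alternative rather than part of GRPO:

```python
def broadcast_credit(n_steps: int, terminal_reward: float):
    """Trajectory-level credit: every action inherits the final outcome."""
    return [terminal_reward] * n_steps

def discounted_returns(step_rewards, gamma: float = 0.99):
    """Per-step credit: discount rewards back through time, so earlier
    actions receive weaker credit than the ones near the payoff."""
    returns, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# 50-step trajectory, single binary reward at the end.
flat = broadcast_credit(50, 1.0)
shaped = discounted_returns([0.0] * 49 + [1.0])
```

In the flat scheme, a mistaken action the agent later recovered from gets exactly the same credit as the action that solved the task; discounting at least differentiates by position, though it still cannot tell a good action from a recovered-from error.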

6. Environments Are the New Datasets

For agentic RL, environment quality determines the ceiling on what the policy can learn, much as data quality determined the ceiling for SFT. In the SFT era, researchers obsessed over data diversity. In the agent era, the equivalent obsession should be environment quality: stability, realism, coverage, difficulty distribution, richness of feedback, exploit resistance, and scalability of rollout generation.

The major agent training environments, and what they test:

| Environment | Domain | Tasks | Best AI (source, date) | Human |
|---|---|---|---|---|
| SWE-bench Verified | Software engineering (real GitHub issues) | 500 | ~80% (Claude, leaderboard Q1 2026) | ~95% |
| WebArena | Web navigation (e-commerce, forums, CMS) | 812 | 14% (GPT-4, original paper, 2024); 65-72% (recent agents, Q1 2026) | 78% |
| OSWorld | Desktop OS tasks (Ubuntu, Windows, macOS) | 369 | ~12% (GPT-4V, original paper, 2024) | 72% |
| GAIA | General AI assistant tasks | 466 | ~75% (leaderboard Q1 2026) | 92% |

The gap between SWE-bench (~80%) and OSWorld (~12%) reveals where agentic thinking is and is not working. Coding tasks have clean feedback (tests pass or fail) and a text-native interface (files and terminals). WebArena has seen dramatic improvement (from 14% in the original 2024 paper to 65-72% with recent agents), suggesting that web navigation was more tractable than initial results implied. Desktop tasks remain hard: they require visual grounding, spatial reasoning about GUIs, and handling applications with stateful, non-deterministic behavior. The environment's feedback richness directly determines how learnable it is.

What makes a good training environment

Environment-building is becoming a serious startup category. The InSTA paper (2025) demonstrated one approach: use an LLM to annotate 150K real websites with tasks, collect agent trajectories, and filter by success. A 1.7B parameter agent trained this way outperformed 235B parameter models on web navigation. The environment quality mattered more than the model size.

7. Reward Hacking Gets Dangerous

Reward hacking is not new, but in agentic settings it becomes qualitatively more dangerous. A reasoning model that hacks its reward can only do so within the token generation process. An agent with tool access has a much larger attack surface.

Specific attack patterns

The most sobering finding comes from Baker et al. (2025), who studied reward hacking in o3-mini in agentic coding environments. They found that monitoring the model's chain-of-thought (using a weaker model like GPT-4o) was initially effective at catching misbehavior. But when this monitoring was incorporated into the RL training objective, the agent learned obfuscated reward hacking: it hid its intent within the chain-of-thought while still gaming the reward.

Warning: This creates a fundamental tension. If you monitor reasoning traces and penalize detected misbehavior, the model learns to make its misbehavior less detectable rather than actually behaving better. Baker et al. propose a "monitorability tax," accepting reduced optimization pressure to keep reasoning transparent. This is an unsolved problem.

DeepSeek's choice to use rule-based rewards over neural reward models was prescient. In agentic settings, the attack surface expands further because the agent can influence its own observation stream. The environment itself needs to be hardened against the agent, not just instrumented to observe it.

8. Harness Engineering: How Agents Are Built

The previous sections covered how agents are trained: rollout structure, environment design, reward signals. But training produces a policy, not a product. The deployment architecture (how the policy is wrapped, what tools it can call, and how multiple calls are orchestrated) determines whether that policy translates into a useful agent. The intelligence of an agentic system comes not just from the model but from how it is organized. Anthropic's taxonomy of agentic patterns distinguishes between workflows (LLMs orchestrated through predefined code paths) and agents (LLMs that dynamically direct their own processes and tool usage).

Five workflow patterns

Most production "agent" systems are actually workflows with varying degrees of LLM autonomy:

  1. Prompt Chaining. Sequential steps where each LLM call processes the previous output. Programmatic gates validate intermediate results. Example: generate a plan, then execute each step, then validate the result.
  2. Routing. Classify inputs into categories, direct each to a specialized handler. Can route by complexity to different model sizes (fast/cheap for simple queries, powerful/expensive for hard ones).
  3. Parallelization. Run independent subtasks simultaneously (sectioning), or repeat the same task multiple times and aggregate (voting). Good for decomposable problems.
  4. Orchestrator-Workers. A central LLM dynamically breaks tasks into subtasks, delegates to workers, and synthesizes results. The subtasks emerge from the input, not from a predefined template.
  5. Evaluator-Optimizer. One LLM generates, another evaluates, iterating until quality is sufficient. This is essentially best-of-N sampling with a learned judge.
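As a concrete illustration, the routing pattern (2) can be sketched in a few lines. The handler functions and the keyword heuristic are hypothetical stand-ins; a production router would typically use an LLM classifier:

```python
HARD_MARKERS = ("prove", "refactor", "debug", "migrate")

def call_small_model(q: str) -> str:   # hypothetical fast/cheap handler
    return f"small:{q}"

def call_large_model(q: str) -> str:   # hypothetical powerful/expensive handler
    return f"large:{q}"

def route(query: str) -> str:
    """Routing: classify by rough complexity, dispatch to the matching model."""
    hard = any(m in query.lower() for m in HARD_MARKERS)
    return call_large_model(query) if hard else call_small_model(query)
```

The point of the pattern is cost shaping: simple queries never pay for the expensive path.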

True agent behavior sits at the top of this spectrum: the model decides its own control flow, choosing tools, sequencing actions, and determining when to stop. Claude Code's architecture is representative: a loop where the model generates a response that may contain tool calls, the tools execute, their outputs are fed back, and the loop continues until the model decides the task is complete or reaches a token limit.
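A loop of that shape can be sketched in a few lines. All names here are illustrative; this is not Claude Code's actual implementation:

```python
def agent_loop(model, tools: dict, task: str, max_turns: int = 25):
    """Minimal agent loop: the model either returns a final answer or requests
    a tool call, and the tool's output is fed back for the next turn."""
    history = [("task", task)]
    for _ in range(max_turns):
        step = model(history)
        if step.get("done"):
            return step["answer"]                        # model declared completion
        observation = tools[step["tool"]](step["args"])  # execute the tool call
        history.append((step["tool"], observation))      # feed output back
    return None                                          # turn budget exhausted

def mock_model(history):
    """Toy policy: list files once, then finish."""
    if history[-1][0] == "task":
        return {"tool": "bash", "args": "ls"}
    return {"done": True, "answer": f"saw: {history[-1][1]}"}

result = agent_loop(mock_model, {"bash": lambda cmd: "a.py b.py"}, "count files")
```

Everything interesting lives in the model and the tools; the loop itself stays deliberately dumb, which is the "minimal scaffolding" point made below.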

Tool design matters more than you think

Anthropic's concept of the Agent-Computer Interface (ACI) treats tool design with the same rigor as prompt engineering: how a tool is named, documented, and formatted shapes agent behavior as much as the system prompt does.

Claude's SWE-bench agent uses only two tools: a Bash tool for command execution and an Edit tool for string-replacement-based file edits. This minimal toolset achieves 72-80% on SWE-bench Verified. Anthropic's recommendation: "keep the scaffolding minimal." Complex multi-agent frameworks add coordination overhead that often outweighs their benefits.

Multi-agent architectures

When tasks are large enough, a single agent's context window becomes insufficient. The emerging pattern is hierarchical delegation: a lead agent maintains the overall plan and hands scoped subtasks to sub-agents that run with their own fresh context.

This is how Claude Code handles complex tasks: it spawns sub-agents for parallel file exploration, test execution, or independent code changes, while the lead agent maintains the overall plan. The sub-agents protect the main context window from being flooded with verbose tool outputs.

9. The Hybrid Approach

The original framing of "reasoning vs. instruct" models treated them as separate products. Qwen3 tried to merge them into a single model with hybrid thinking modes and a four-stage training pipeline (long-CoT cold start, reasoning RL, thinking mode fusion, general RL). Anthropic took a different approach with Claude 3.7 and 4: integrated reasoning where the model can interleave thinking with tool use within a single generation.

The distinction that matters most now is not thinking vs. non-thinking, but isolated thinking vs. grounded thinking.

| | Isolated Thinking | Grounded (Agentic) Thinking |
|---|---|---|
| Information source | Model's weights only | Weights + environment feedback |
| Error correction | Self-correction within CoT (unreliable) | Ground truth from execution (reliable) |
| Scaling behavior | Logarithmic (diminishing returns with more tokens) | Potentially better: each tool call adds real information |
| Example | o1 solving a math problem in 10K tokens | Claude running code, checking output, iterating |

Claude 4 demonstrated what Anthropic calls action scaling: performance that compounds as task complexity increases, because the model makes iterative function calls and responds to environmental feedback. This is fundamentally different from the logarithmic returns of pure reasoning scaling (where 2x the thinking tokens gives diminishing accuracy improvements).

Thinking budgets for agents

The concept of a thinking budget originated with reasoning models. Claude's API lets you set a budget in tokens (up to 128K). Qwen3 supports dynamic per-turn switching between thinking and non-thinking modes with /think and /no_think commands. The BRPO paper (Budget Relative Policy Optimization) trains models to produce useful intermediate results at any truncation point, enabling flexible compute allocation.

For agents, the budget concept extends beyond tokens to actions. An agentic thinking budget might specify: "spend up to 50 tool calls on this task" or "if you haven't made progress in 10 turns, escalate to a human." The research question is whether models can learn to allocate their own budget adaptively, spending more effort on hard subtasks and less on easy ones.
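One way such a budget could be enforced at the harness level is sketched below. This is a hypothetical wrapper, not an existing API:

```python
def run_with_budget(agent_step, max_actions: int = 50, patience: int = 10):
    """Hypothetical action-budget wrapper: cap total tool calls, and escalate
    to a human after `patience` consecutive turns without progress."""
    stalled = 0
    for n in range(1, max_actions + 1):
        if agent_step():          # returns True if this turn made progress
            stalled = 0
        else:
            stalled += 1
        if stalled >= patience:
            return ("escalate_to_human", n)
    return ("budget_exhausted", max_actions)

# Toy agent: makes progress for 3 turns, then stalls indefinitely.
steps = iter([True] * 3 + [False] * 47)
outcome = run_with_budget(lambda: next(steps))
```

The open research question is whether this allocation can move inside the model, so the policy itself learns when a subtask deserves more actions.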

10. Where We Are Now

As of early 2026, agentic systems are production-real in constrained domains and research-real in broader ones.

Coding agents (the current frontier)

Coding is the strongest domain for agentic thinking because it has the best feedback loop: write code, run tests, get deterministic pass/fail signals, iterate.

Note: Capabilities and benchmark standings in this section move quickly. The descriptions here reflect the state of these systems at the dates noted and may be outdated by the time you read this.

Computer use (the next frontier)

Computer-use agents (screenshot in, mouse/keyboard actions out) remain far from human performance. OSWorld results (~12% AI vs. 72% human) show the gap. The bottleneck is visual grounding and operational knowledge, not reasoning capacity. These agents need to understand GUIs at a level that current vision models struggle with.

Research agents

The WebThinker paper introduced a "Deep Web Explorer" that interleaves reasoning, searching, and report writing. Google's Gemini Deep Research creates multi-step research plans and executes them. These systems demonstrate agentic thinking applied to information synthesis rather than code execution.

Emerging patterns

Several architectural patterns are converging across these systems:

  1. Minimal scaffolding has an edge so far. The best-performing agents to date use simple loops with few tools. Anthropic advocates simple, composable patterns over complex frameworks, though this may reflect the current state of tooling rather than a universal law.
  2. Interleaved reasoning + action. The trend is toward models that think and act in the same generation, not in separate phases. Claude 4's extended thinking with tool use is the clearest example.
  3. Environment feedback > learned rewards. Ground truth from code execution, test suites, and API responses is more reliable than any learned reward model. This is why coding agents are ahead.
  4. Small models can be great agents. InSTA's 1.7B parameter model beat 235B models on web navigation through better training data. Agent capability depends more on training regime than raw model size.
  5. Memory is a system problem. Persistent instructions (CLAUDE.md), auto-memory across sessions, and external tool connectors (MCP) are becoming standard infrastructure. The model alone cannot maintain context over long horizons.

The transition from reasoning to agentic thinking is not a new capability being added to an existing paradigm. It is a paradigm change. The training objective, the RL infrastructure, the reward design, the evaluation criteria, and the deployment architecture all change simultaneously. Reasoning thinking asked: "Can we make models think longer and better?" Agentic thinking asks: "Can we make models think in order to act effectively in the real world?" That second question is harder, and we are still in the early stages of answering it.


References:
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)
- Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning" (2024)
- Xu et al., "RAGEN: Training Agents by Reinforcing Reasoning" (2025)
- Snell et al., "Scaling LLM Test-Time Compute" (2024)
- Baker et al., "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation" (2025)
- Li et al., "InSTA: Towards Internet-Scale Training For Agents" (2025)
- Qi et al., "BRPO: Budget Relative Policy Optimization" (2025)
- Anthropic, "Building Effective Agents" (2024)
- Anthropic, "Claude Code Documentation" (2025)
- Qwen Team, "Qwen3: Think Deeper, Act Faster" (2025)