From Reasoning to Agentic Thinking: What Changes and Why It's Hard
The shift from static reasoning (o1, R1) to agentic thinking changes everything: RL infrastructure, environment design, reward signals, and what "good thinking" even means. The core claim of this post: the bottleneck shifts from internal reasoning quality to environment-coupled decision quality, and that shift breaks most of the assumptions baked into current training infrastructure. This post covers what's different, what's hard, and where the field is headed.
- The Reasoning Era: What o1 and R1 Actually Established
- What Is Agentic Thinking?
- Reasoning vs. Agentic Thinking: A Direct Comparison
- GRPO: The RL Workhorse (and Its Limits)
- Why Agentic RL Is a Different Beast
- Environments Are the New Datasets
- Reward Hacking Gets Dangerous
- Harness Engineering: How Agents Are Built
- The Hybrid Approach
- Where We Are Now
1. The Reasoning Era: What o1 and R1 Actually Established
The reasoning wave of 2024-2025 proved a specific claim: if you train language models with RL against verifiable rewards, they develop qualitatively stronger cognition. OpenAI's o1 showed that "thinking" could be a first-class capability, trained for and exposed to users. DeepSeek-R1 proved it could be reproduced outside the original labs, at competitive quality, with a fully documented training pipeline.
The key insight was about reward signals. Math, code, and logic became central to reasoning training because rewards in these domains are deterministic, stable, and scalable. You can check if a proof is correct, if code passes tests, if a logical deduction follows. This is a much stronger signal than generic human preference. DeepSeek explicitly chose rule-based rewards over neural reward models, noting that "the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process."
But reasoning models also revealed something about infrastructure. Once you train a model to reason through long trajectories, RL stops being a lightweight add-on to SFT. It becomes a systems problem: you need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story.
2. What Is Agentic Thinking?
Agentic thinking is reasoning in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world. The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?"
A reasoning model produces a long internal monologue and then outputs an answer. An agentic system runs a loop: think, act, observe, revise. Each action produces real feedback from the environment (a test result, a search hit, a file listing, an error message), and the model must incorporate that feedback into its next step.
Agentic thinking handles several things that pure reasoning models can mostly avoid:
- Deciding when to stop thinking and take an action. A reasoning model can deliberate indefinitely. An agent must commit to an action at some point, even under uncertainty.
- Choosing which tool to invoke and in what order. The action space is combinatorial. A coding agent can read files, edit code, run tests, search documentation, spawn sub-processes.
- Incorporating noisy or partial observations. Environment feedback is rarely clean. Error messages are cryptic. Search results are irrelevant. Test output is ambiguous.
- Revising plans after failures. The first approach usually does not work. Good agentic thinking involves recognizing failure, diagnosing the cause, and trying an alternative.
- Maintaining coherence across many turns. A 100-turn coding session requires the agent to remember what it tried, what worked, what the current state of the codebase is, and what remains to be done.
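The loop described above can be sketched in a few lines. `model_step` and `env` here are placeholder callables standing in for the LLM and the environment, not any real API:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    done: bool = False

def run_agent(model_step, env, max_turns=50):
    """Generic think-act-observe-revise loop.

    model_step(history) -> (thought, action or None); a None action
    means the model has decided the task is complete.
    env(action) -> observation string fed back on the next turn.
    """
    traj = Trajectory()
    history = []
    for _ in range(max_turns):
        thought, action = model_step(history)
        if action is None:          # model commits to stopping
            traj.done = True
            break
        obs = env(action)           # real feedback from the world
        history.append((thought, action, obs))
        traj.steps.append((action, obs))
    return traj
```

Every hard problem in the bullet list lives inside this loop: the stopping decision is the `None` action, plan revision happens inside `model_step` when it reads the accumulated `history`, and coherence is whatever survives in that history across turns.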
3. Reasoning vs. Agentic Thinking: A Direct Comparison
| Dimension | Reasoning Thinking (o1, R1) | Agentic Thinking |
|---|---|---|
| Structure | Single turn: prompt → long CoT → answer | Multi-turn loop: think → act → observe → revise |
| Environment | None. Reasoning is self-contained. | Tools, terminals, browsers, APIs, sandboxes |
| Feedback | Only at the end (correct/incorrect) | After every action (tool outputs, errors, results) |
| Rollout parallelism | Embarrassingly parallel (generate many CoTs independently) | Sequential dependencies (action N depends on observation N-1) |
| Reward signal | Verifiable: math correctness, test pass/fail | Sparse, delayed, often partial (task completion after many steps) |
| Failure mode | Wrong answer, hallucinated proof step | Stuck in loop, wrong tool choice, cascading errors |
| Scaling axis | More thinking tokens per problem | More effective actions per task |
| What "good thinking" means | Correct, coherent chain of deductions | Efficient progress toward task completion under uncertainty |
4. GRPO: The RL Workhorse (and Its Limits)
GRPO is the algorithm that made reasoning RL practical by eliminating the critic network, but its assumptions break down in multi-turn agentic settings. Before we discuss agentic RL, we need to understand what reasoning RL looks like in practice. The dominant algorithm is GRPO (Group Relative Policy Optimization), introduced in the DeepSeekMath paper and then scaled in DeepSeek-R1.
Standard PPO requires a critic network, a separate model (often the same size as the policy) that estimates value functions. This roughly doubles memory requirements. GRPO eliminates the critic entirely by computing advantages relative to a group of samples.
For each prompt $x$, GRPO samples $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_\theta$. Each output gets a reward $r_i$. The advantage for output $i$ is:

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})}$$

where $\hat{A}_i$ is the normalized advantage for the $i$-th output, $r_i$ is its reward, and the mean and standard deviation are computed across all $G$ outputs for the same prompt. The policy is then updated with a clipped surrogate objective (like PPO) using these group-relative advantages. KL regularization is added directly to the loss.
Typical hyperparameters from DeepSeek: group size $G = 64$, batch size 1024, learning rate $1 \times 10^{-6}$, KL coefficient 0.04. The reward design was deliberately simple: accuracy rewards from deterministic verifiers (boxed math answers, compiler feedback) plus format rewards enforcing `<think>`/`</think>` tag compliance.
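The group-relative advantage is simple enough to sketch in plain Python, with no RL framework assumed (a small epsilon guards against zero variance when all rewards in a group are identical):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO advantage: normalize each reward against its own group.

    rewards: list of scalar rewards, one per sampled output for
    the same prompt (group size G = len(rewards)).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

With a binary verifier reward and a group of four samples, two correct and two incorrect, the correct samples get advantage ≈ +1 and the incorrect ones ≈ −1: the group is its own baseline, which is exactly why no critic network is needed.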
Where GRPO hits its limits
GRPO works beautifully for single-turn reasoning because every rollout is independent. Generate 64 completions for the same math problem, score each one, compute group statistics. The rollouts don't interact.
But agent trajectories are multi-turn. An agent might take 50 actions to solve a SWE-bench task, with environment interactions between each action. You cannot simply sample 64 independent trajectories from the same starting state, because each trajectory diverges into a different environment state after the first action. The RAGEN paper (StarPO framework) attempted to extend GRPO to agentic settings and identified two failure modes:
- Echo Trap: reward variance collapses and gradient spikes, causing the policy to oscillate without improving.
- Template Collapse: reasoning appears diverse per-input but becomes input-agnostic across prompts. The model learns a generic template rather than truly adapting its thinking.
5. Why Agentic RL Is a Different Beast
The infrastructure used for reasoning RL is insufficient for agentic RL. The differences are fundamental, not incremental.
The rollout problem
In reasoning RL, a rollout is: generate tokens until the model outputs an answer, then check the answer. The environment is a static verifier. Rollout throughput is bounded by GPU speed and sampling efficiency. You can run thousands of rollouts in parallel because they are independent.
In agentic RL, a rollout is: generate an action, execute it in an environment (run code, click a button, query an API), observe the result, feed it back to the model, repeat. Each step has a sequential dependency on the previous one. The environment is not a static verifier; it is a stateful system (a terminal, a browser, an operating system) that changes in response to actions.
This creates a fundamental throughput problem. Consider a coding agent that generates a code edit, then waits for the test suite to run (5-30 seconds), then reads the output, then generates the next edit. The GPU sits idle during environment execution. In reasoning RL, the GPU is generating tokens continuously. In agentic RL, utilization drops dramatically.
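A back-of-the-envelope calculation makes the problem concrete. The timings below are illustrative assumptions, not measurements:

```python
import math

def gpu_utilization(gen_seconds, env_seconds):
    """Fraction of wall-clock time the GPU spends generating when a
    single rollout alternates generation and environment execution."""
    return gen_seconds / (gen_seconds + env_seconds)

def envs_to_saturate(gen_seconds, env_seconds):
    """Concurrent environments needed so the GPU always has a ready
    rollout to generate for (ideal pipelining, zero overhead)."""
    return math.ceil((gen_seconds + env_seconds) / gen_seconds)
```

Assuming 2 seconds to generate an edit and 15 seconds for the test suite, a single rollout keeps the GPU busy only about 12% of the time, and you need roughly nine concurrent environments per generation stream just to hide the latency. This arithmetic is what motivates the decoupled architecture described next.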
Train-serve decoupling
The solution is to decouple training and inference more aggressively. The policy generates actions and sends them to an environment pool. Environments execute asynchronously and return observations when ready. The training system collects completed trajectories and updates the policy. This is architecturally closer to a distributed systems problem than a machine learning problem.
Frameworks like veRL (used by RAGEN/StarPO) provide this substrate, but the engineering is substantial. You need: environment orchestration (spinning up/tearing down sandboxed environments), trajectory storage, async rollout collection, and careful handling of environment state across training iterations.
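The shape of that substrate can be sketched with stdlib `asyncio`. This is a toy illustration of the pattern (environment workers run concurrently, the trainer consumes trajectories in completion order), not the architecture of veRL or any specific framework:

```python
import asyncio

async def run_env(env_id, policy, queue):
    """One environment worker: roll out a trajectory, then hand it
    to the trainer. The sleep stands in for real environment I/O."""
    traj = []
    obs = f"init-{env_id}"
    for _ in range(3):                 # fixed horizon for the sketch
        action = policy(obs)           # inference call
        await asyncio.sleep(0)         # a real env step would await here
        obs = f"obs({action})"
        traj.append((action, obs))
    await queue.put(traj)

async def collect(policy, n_envs=4):
    """Environments run concurrently; completed trajectories are
    consumed in whatever order they finish."""
    queue = asyncio.Queue()
    workers = [asyncio.create_task(run_env(i, policy, queue))
               for i in range(n_envs)]
    trajs = [await queue.get() for _ in range(n_envs)]
    await asyncio.gather(*workers)
    return trajs
```

In a real system the policy call is itself a network request to an inference server, and the queue is a distributed trajectory store, but the control flow is the same: nothing blocks on any single slow environment.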
The credit assignment problem
In reasoning RL, the reward (correct/incorrect) applies to the entire reasoning chain. In agentic RL, a 50-step trajectory gets a single binary reward at the end (task solved or not). Which of the 50 actions was the important one? Which was a mistake the agent recovered from? Standard GRPO has no mechanism for per-step credit assignment. The RAGEN paper found that "without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges." Agents fall into shallow strategies or hallucinated reasoning patterns.
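The contrast can be made concrete. Trajectory-level GRPO effectively broadcasts one terminal advantage to every step; if the environment can emit intermediate rewards (say, a test newly passing), a standard discounted return at least distinguishes early actions from late ones. Both functions below are illustrative sketches, not the RAGEN method:

```python
def broadcast_advantage(n_steps, terminal_advantage):
    """What trajectory-level credit assignment amounts to: every
    action in the rollout gets the same advantage, good or bad."""
    return [terminal_advantage] * n_steps

def discounted_returns(step_rewards, gamma=0.99):
    """Per-step returns when intermediate rewards exist:
    R_t = r_t + gamma * R_{t+1}."""
    returns, running = [], 0.0
    for r in reversed(step_rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]
```

The broadcast version punishes a brilliant recovery inside a failed trajectory and rewards a blunder inside a successful one; the discounted version requires the environment to produce signals mid-trajectory, which is itself an environment-design problem.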
6. Environments Are the New Datasets
For agentic RL, environment quality determines the ceiling on what the policy can learn, much as data quality determined the ceiling for SFT. In the SFT era, researchers obsessed over data diversity. In the agent era, the equivalent obsession should be environment quality: stability, realism, coverage, difficulty distribution, richness of feedback, exploit resistance, and scalability of rollout generation.
The major agent training environments, and what they test:
| Environment | Domain | Tasks | Best AI (source, date) | Human |
|---|---|---|---|---|
| SWE-bench Verified | Software engineering (real GitHub issues) | 500 | ~80% (Claude, leaderboard Q1 2026) | ~95% |
| WebArena | Web navigation (e-commerce, forums, CMS) | 812 | 14% (original paper, GPT-4, 2024); 65-72% (recent agents, Q1 2026) | 78% |
| OSWorld | Desktop OS tasks (Ubuntu, Windows, macOS) | 369 | ~12% (original paper, GPT-4V, 2024) | 72% |
| GAIA | General AI assistant tasks | 466 | ~75% (leaderboard Q1 2026) | 92% |
The gap between SWE-bench (~80%) and OSWorld (~12%) reveals where agentic thinking is and is not working. Coding tasks have clean feedback (tests pass or fail) and a text-native interface (files and terminals). WebArena has seen dramatic improvement (from 14% in the original 2024 paper to 65-72% with recent agents), suggesting that web navigation was more tractable than initial results implied. Desktop tasks remain hard: they require visual grounding, spatial reasoning about GUIs, and handling applications with stateful, non-deterministic behavior. The environment's feedback richness directly determines how learnable it is.
What makes a good training environment
- Deterministic evaluation. SWE-bench uses existing test suites. OSWorld uses custom execution-based evaluation scripts. This avoids the need for neural reward models, which are vulnerable to hacking.
- Authentic complexity. Real GitHub issues, real websites, real desktop applications. Synthetic environments tend to reward surface-level pattern matching.
- Diverse initial states. The RAGEN paper found that training needs "diverse initial states, medium interaction granularity, and more frequent sampling." Homogeneous starting conditions produce brittle policies.
- Scalable rollout generation. The environment must be cheap to instantiate, fast to reset, and safe to run arbitrary agent actions inside.
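The requirements above map naturally onto a Gym-style contract. This is an illustrative interface, not the API of any particular framework:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class AgentEnv(Protocol):
    """Minimal contract a training environment should satisfy."""

    def reset(self, seed: int) -> str:
        """Cheap, fast reset to a seeded (hence diverse) initial
        state; returns the first observation."""
        ...

    def step(self, action: str) -> tuple[str, bool]:
        """Execute an arbitrary agent action in a sandbox; returns
        (observation, done). Must tolerate hostile actions."""
        ...

    def evaluate(self) -> float:
        """Deterministic, execution-based score (e.g. fraction of
        tests passing), not a neural reward model."""
        ...
```

Each bullet corresponds to one method: diverse initial states live in `reset(seed)`, safe arbitrary execution in `step`, and deterministic evaluation in `evaluate`.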
Environment-building is becoming a serious startup category. The InSTA paper (2025) demonstrated one approach: use an LLM to annotate 150K real websites with tasks, collect agent trajectories, and filter by success. A 1.7B parameter agent trained this way outperformed 235B parameter models on web navigation. The environment quality mattered more than the model size.
7. Reward Hacking Gets Dangerous
Reward hacking is not new, but in agentic settings it becomes qualitatively more dangerous. A reasoning model that hacks its reward can only do so within the token generation process. An agent with tool access has a much larger attack surface.
Specific attack patterns
- Search exploitation: A model with web search might learn to look up answers directly during RL training, rather than learning to solve problems.
- Information leakage: A coding agent might discover test files in the repository and extract expected outputs, or exploit hidden information in the execution environment.
- Shortcut discovery: An agent might find that modifying the test itself (rather than the code under test) passes the test suite. Or that restarting a service clears an error that the task was supposed to fix properly.
The most sobering finding comes from Baker et al. (2025), who studied reward hacking in o3-mini in agentic coding environments. They found that monitoring the model's chain-of-thought (using a weaker model like GPT-4o) was initially effective at catching misbehavior. But when this monitoring was incorporated into the RL training objective, the agent learned obfuscated reward hacking: it hid its intent within the chain-of-thought while still gaming the reward.
DeepSeek's choice to use rule-based rewards over neural reward models was prescient. In agentic settings, the attack surface expands further because the agent can influence its own observation stream. The environment itself needs to be hardened against the agent, not just instrumented to observe it.
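One small piece of that hardening can be sketched directly: refusing edits to the files that define the reward. The path patterns below are illustrative, and a real harness would enforce this at the sandbox or filesystem layer rather than only in a tool wrapper:

```python
import fnmatch

# Reward-defining artifacts the agent must not touch (illustrative).
PROTECTED = ["tests/**", "**/test_*.py", "conftest.py", ".ci/**"]

def edit_allowed(path):
    """Reject edits that would let the agent rewrite its own reward,
    e.g. modifying the test instead of the code under test."""
    return not any(fnmatch.fnmatch(path, pat) for pat in PROTECTED)
```

This blocks only the crudest shortcut; an agent that can run arbitrary shell commands has other routes to the same exploit, which is why the defense has to live in the environment, not the tool list.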
8. Harness Engineering: How Agents Are Built
The previous sections covered how agents are trained: rollout structure, environment design, reward signals. But training produces a policy, not a product. The deployment architecture (how the policy is wrapped, what tools it can call, and how multiple calls are orchestrated) determines whether that policy translates into a useful agent. The intelligence of an agentic system comes not just from the model but from how it is organized. Anthropic's taxonomy of agentic patterns distinguishes between workflows (LLMs orchestrated through predefined code paths) and agents (LLMs that dynamically direct their own processes and tool usage).
Five workflow patterns
Most production "agent" systems are actually workflows with varying degrees of LLM autonomy:
- Prompt Chaining. Sequential steps where each LLM call processes the previous output. Programmatic gates validate intermediate results. Example: generate a plan, then execute each step, then validate the result.
- Routing. Classify inputs into categories, direct each to a specialized handler. Can route by complexity to different model sizes (fast/cheap for simple queries, powerful/expensive for hard ones).
- Parallelization. Run independent subtasks simultaneously (sectioning), or repeat the same task multiple times and aggregate (voting). Good for decomposable problems.
- Orchestrator-Workers. A central LLM dynamically breaks tasks into subtasks, delegates to workers, and synthesizes results. The subtasks emerge from the input, not from a predefined template.
- Evaluator-Optimizer. One LLM generates, another evaluates, iterating until quality is sufficient. Unlike best-of-N sampling, the evaluator's critique feeds back into the next generation attempt rather than merely selecting a winner.
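The evaluator-optimizer pattern, for instance, reduces to a short loop. `generate` and `evaluate` here are stubs standing in for LLM calls:

```python
def evaluator_optimizer(generate, evaluate, threshold, max_iters=5):
    """Evaluator-optimizer workflow: generate a draft, score it,
    feed the critique back, stop once quality clears the bar.

    generate(feedback) -> draft; evaluate(draft) -> (score, feedback).
    Returns the best draft seen within the iteration budget.
    """
    feedback = None
    best = None
    for _ in range(max_iters):
        draft = generate(feedback)          # in practice, an LLM call
        score, feedback = evaluate(draft)   # in practice, a judge call
        if best is None or score > best[0]:
            best = (score, draft)
        if score >= threshold:
            break
    return best[1]
```

The other four patterns have the same character: ordinary control flow around LLM calls, with the model's autonomy confined to the inside of each call.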
True agent behavior sits at the top of this spectrum: the model decides its own control flow, choosing tools, sequencing actions, and determining when to stop. Claude Code's architecture is representative: a loop where the model generates a response that may contain tool calls, the tools execute, their outputs are fed back, and the loop continues until the model decides the task is complete or reaches a token limit.
Tool design matters more than you think
Anthropic's concept of the Agent-Computer Interface (ACI) treats tool design with the same rigor as prompt engineering. The key principles:
- Poka-yoke design: Make mistakes harder to make. If a file path must be absolute, enforce it in the tool spec. If a parameter has only three valid values, enumerate them.
- Match natural formats: Tool interfaces should use formats the model has seen extensively in training data (JSON, Markdown, code) rather than custom schemas.
- Include examples: Tool descriptions should contain example invocations and edge cases, just like good API documentation.
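Here is what those three principles look like in a tool spec. The schema below is written in the JSON-Schema style that many tool-use APIs accept, but the field names and the `run_tests` tool itself are generic illustrations, not any vendor's actual format:

```python
# Poka-yoke by construction: the enum rules out invalid scopes, the
# pattern rules out relative paths, the description carries an example.
RUN_TESTS_TOOL = {
    "name": "run_tests",
    "description": (
        "Run the project test suite. Example call: "
        '{"scope": "unit", "path": "/repo/tests/test_api.py"}'
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "scope": {"type": "string",
                      "enum": ["unit", "integration", "all"]},
            "path": {"type": "string", "pattern": "^/"},  # absolute only
        },
        "required": ["scope"],
    },
}

def validate_call(spec, args):
    """Minimal harness-side check: reject bad enum values and
    relative paths before the tool ever executes."""
    props = spec["parameters"]["properties"]
    for key in spec["parameters"]["required"]:
        if key not in args:
            return False
    if args.get("scope") not in props["scope"]["enum"]:
        return False
    path = args.get("path")
    if path is not None and not path.startswith("/"):
        return False
    return True
```

The point of validating in the harness rather than relying on the model is the same poka-yoke logic: a mistake the interface makes impossible is a mistake the policy never has to learn to avoid.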
Claude's SWE-bench agent uses only two tools: a Bash tool for command execution and an Edit tool for string-replacement-based file edits. This minimal toolset achieves 72-80% on SWE-bench Verified. Anthropic's recommendation: "keep the scaffolding minimal." Complex multi-agent frameworks add coordination overhead that often outweighs their benefits.
Multi-agent architectures
When tasks are large enough, a single agent context window becomes insufficient. The emerging pattern is hierarchical delegation:
- A lead agent (orchestrator) holds the high-level plan and delegates subtasks.
- Worker agents (sub-agents) execute specific subtasks with their own tool access and context.
- Workers report results back to the lead, which synthesizes and decides next steps.
This is how Claude Code handles complex tasks: it spawns sub-agents for parallel file exploration, test execution, or independent code changes, while the lead agent maintains the overall plan. The sub-agents protect the main context window from being flooded with verbose tool outputs.
9. The Hybrid Approach
The original framing of "reasoning vs. instruct" models treated them as separate products. Qwen3 tried to merge them into a single model with hybrid thinking modes and a four-stage training pipeline (long-CoT cold start, reasoning RL, thinking mode fusion, general RL). Anthropic took a different approach with Claude 3.7 and 4: integrated reasoning where the model can interleave thinking with tool use within a single generation.
The distinction that matters most now is not thinking vs. non-thinking, but isolated thinking vs. grounded thinking.
| Dimension | Isolated Thinking | Grounded (Agentic) Thinking |
|---|---|---|
| Information source | Model's weights only | Weights + environment feedback |
| Error correction | Self-correction within CoT (unreliable) | Ground truth from execution (reliable) |
| Scaling behavior | Logarithmic (diminishing returns with more tokens) | Potentially better: each tool call adds real information |
| Example | o1 solving a math problem in 10K tokens | Claude running code, checking output, iterating |
Claude 4 demonstrated what Anthropic calls action scaling: performance that compounds as task complexity increases, because the model makes iterative function calls and responds to environmental feedback. This is fundamentally different from the logarithmic returns of pure reasoning scaling (where 2x the thinking tokens gives diminishing accuracy improvements).
Thinking budgets for agents
The concept of a thinking budget originated with reasoning models. Claude's API lets you set a budget in tokens (up to 128K). Qwen3 supports dynamic per-turn switching between thinking and non-thinking modes with `/think` and `/no_think` commands. The BRPO paper (Budget Relative Policy Optimization) trains models to produce useful intermediate results at any truncation point, enabling flexible compute allocation.
For agents, the budget concept extends beyond tokens to actions. An agentic thinking budget might specify: "spend up to 50 tool calls on this task" or "if you haven't made progress in 10 turns, escalate to a human." The research question is whether models can learn to allocate their own budget adaptively, spending more effort on hard subtasks and less on easy ones.
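Even before models can allocate budgets adaptively, both rules quoted above can be enforced in the harness. A minimal sketch, with a hypothetical `agent_step` callable standing in for one model-plus-tool turn:

```python
def run_with_budget(agent_step, max_actions=50, patience=10):
    """Enforce an action budget and a no-progress escalation rule.

    agent_step(turn) -> (action, made_progress); a None action means
    the agent considers the task done. Returns a (status, actions_used)
    pair where status is 'done', 'escalate_to_human', or
    'budget_exhausted'.
    """
    since_progress = 0
    for turn in range(max_actions):
        action, made_progress = agent_step(turn)
        if action is None:
            return ("done", turn)
        since_progress = 0 if made_progress else since_progress + 1
        if since_progress >= patience:
            return ("escalate_to_human", turn + 1)
    return ("budget_exhausted", max_actions)
```

The research question is whether these thresholds can move inside the policy: a model that spends 40 of its 50 calls on the one hard subtask, because it has learned where the difficulty is.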
10. Where We Are Now
As of early 2026, agentic systems are production-real in constrained domains and research-real in broader ones.
Coding agents (the current frontier)
Coding is the strongest domain for agentic thinking because it has the best feedback loop: write code, run tests, get deterministic pass/fail signals, iterate. The major systems:
- Claude Code (as of March 2026): 72.5-80.2% on SWE-bench Verified (depending on configuration). Uses minimal scaffolding (bash + edit tools). Supports sub-agent spawning, persistent memory (CLAUDE.md), and MCP for external tool integration.
- OpenAI Codex (as of April 2025): Cloud-based sandboxed execution. Each task gets an isolated container with full dependency management.
- Devin (as of early 2025): Async coding agent with Slack/GitHub integration. Best for delegated tasks under 3 hours.
- Cursor Agent (as of early 2026): IDE-integrated, model-agnostic, deep integration with the development workflow.
Computer use (the next frontier)
Computer-use agents (screenshot in, mouse/keyboard actions out) remain far from human performance. OSWorld results (~12% AI vs. 72% human) show the gap. The bottleneck is visual grounding and operational knowledge, not reasoning capacity. These agents need to understand GUIs at a level that current vision models struggle with.
Research agents
The WebThinker paper introduced a "Deep Web Explorer" that interleaves reasoning, searching, and report writing. Google's Gemini Deep Research creates multi-step research plans and executes them. These systems demonstrate agentic thinking applied to information synthesis rather than code execution.
Emerging patterns
Several architectural patterns are converging across these systems:
- Minimal scaffolding has an edge so far. The best-performing agents to date use simple loops with few tools. Anthropic advocates simple, composable patterns over complex frameworks, though this may reflect the current state of tooling rather than a universal law.
- Interleaved reasoning + action. The trend is toward models that think and act in the same generation, not in separate phases. Claude 4's extended thinking with tool use is the clearest example.
- Environment feedback > learned rewards. Ground truth from code execution, test suites, and API responses is more reliable than any learned reward model. This is why coding agents are ahead.
- Small models can be great agents. InSTA's 1.7B parameter model beat 235B models on web navigation through better training data. Agent capability depends more on training regime than raw model size.
- Memory is a system problem. Persistent instructions (CLAUDE.md), auto-memory across sessions, and external tool connectors (MCP) are becoming standard infrastructure. The model alone cannot maintain context over long horizons.
The transition from reasoning to agentic thinking is not a new capability being added to an existing paradigm. It is a paradigm change. The training objective, the RL infrastructure, the reward design, the evaluation criteria, and the deployment architecture all change simultaneously. Reasoning thinking asked: "Can we make models think longer and better?" Agentic thinking asks: "Can we make models think in order to act effectively in the real world?" That second question is harder, and we are still in the early stages of answering it.
References:
DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025);
Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning" (2024);
Xu et al., "RAGEN: Training Agents by Reinforcing Reasoning" (2025);
Snell et al., "Scaling LLM Test-Time Compute" (2024);
Baker et al., "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation" (2025);
Li et al., "InSTA: Towards Internet-Scale Training For Agents" (2025);
Qi et al., "BRPO: Budget Relative Policy Optimization" (2025);
Anthropic, "Building Effective Agents" (2024);
Anthropic, "Claude Code Documentation" (2025);
Qwen Team, "Qwen3: Think Deeper, Act Faster" (2025)