Solenya

The Three Waves of Agentic Tools: From API Wrappers to Self-Optimizing Sub-Agents

Marcus Gawronsky

The agentic AI industry is currently caught in a hype cycle built on brittle foundations. When we talk about "AI tools" today, we mostly mean a monolithic, generalized Large Language Model (LLM) emitting a JSON payload to an API, with the raw response blindly appended back into its context window. This is just the first wave.

We are in the midst of a rapid evolution toward intelligent, self-optimizing sub-agents. This shift from "dumb pipes" to coordinated cognitive specialists isn't just a software design preference—it is a mandatory transition forced by a convergence of physical limits and algorithmic breakthroughs. Specifically, three forces are driving this evolution: Hardware physics (memory bandwidth and attention costs), Architectural philosophy (the legacy of macro-delegation), and Training methodologies (post-training techniques that make specialization cheap). Together, these drivers are pushing the industry through three distinct waves of tool use.

The Catalyst: Three Forces Driving Delegation

The instinct to avoid monolithic compute isn't new; it is deeply embedded in deep learning architecture. Google's "Pathways" vision theorized years ago that a single AI model shouldn't activate its entire multi-trillion parameter network for every trivial query. This birthed the modern reliance on Mixture-of-Experts (MoE) architectures, which demonstrate delegation within a single model via sparse routing. Speculative Decoding applied this economic logic to the coordination between entirely separate models, pairing a fast, tiny "Draft Model" with a massive "Director Model" to verify tokens in parallel.
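The draft-and-verify economics of speculative decoding can be sketched with a toy simulation. Both models below are stubs, and the 80% acceptance rate is an arbitrary assumption; the point is the shape of the loop, in which the cheap model proposes a run of tokens and the expensive model rules on all of them at once.

```python
import random

random.seed(0)

# Toy simulation of speculative decoding: a cheap "draft" model proposes
# k tokens; the expensive "director" model verifies them in one parallel
# pass and keeps the longest accepted prefix. All model behavior here is
# stubbed -- illustrative only.

def draft_model(prefix, k=4):
    """Cheap model: proposes k candidate tokens."""
    return [f"tok{len(prefix) + i}" for i in range(k)]

def director_accepts(prefix, token):
    """Expensive model's verdict on one proposed token (stub: agrees 80%)."""
    return random.random() < 0.8

def speculative_step(prefix, k=4):
    """One decode step: accept the draft's tokens until first disagreement."""
    proposals = draft_model(prefix, k)
    accepted = []
    for tok in proposals:
        if director_accepts(prefix + accepted, tok):
            accepted.append(tok)
        else:
            break  # in practice the director emits its own token here
    return accepted

out = []
while len(out) < 12:
    step = speculative_step(out)
    # Fall back to a single director-generated token if nothing was accepted.
    out.extend(step if step else ["director_tok%d" % len(out)])

print(len(out))
```

Because most steps accept several draft tokens at once, the expensive model runs far fewer forward passes than the number of tokens produced.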

This architectural philosophy is now colliding with severe Hardware Imperatives. Generating text with Transformers is inherently memory-bound. The self-attention mechanism scales quadratically with sequence length, generating massive Key-Value (KV) cache requirements. As models scale to million-token context windows, KV caches frequently exceed physical GPU memory limits, forcing inference engines like SGLang to implement complex hierarchical caching (HiCache) across GPU HBM, CPU DRAM, and external storage just to serve requests. And this is happening amidst a wider global RAM shortage, as the voracious appetite for expensive High-Bandwidth Memory (HBM) cannibalizes standard DDR fabrication lines. The economics are stark: using a trillion-parameter model to summarize massive HTML files consumes scarce HBM that should be reserved for actual complex reasoning.
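The memory pressure is easy to see with back-of-envelope arithmetic. The model geometry below is an assumption (roughly a 70B-class model with grouped-query attention), not the published spec of any particular system:

```python
# Back-of-envelope KV cache sizing. The geometry is an illustrative
# assumption, not a spec for any particular model.
n_layers = 80        # transformer layers
n_kv_heads = 8       # KV heads (grouped-query attention: far fewer than query heads)
head_dim = 128       # dimension per head
seq_len = 1_000_000  # a "million-token" context window
bytes_per_val = 2    # fp16/bf16

# Keys and values are both cached: 2 tensors per layer.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val
print(f"{kv_bytes / 1e9:.1f} GB")  # 327.7 GB for ONE request
```

A single million-token request needs roughly 328 GB of KV cache, several times the HBM of an 80 GB accelerator, before a single weight is loaded. Hierarchical caching across HBM, DRAM, and disk is not an optimization here; it is the only way to serve the request at all.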

The final catalyst is Training Methodology. If hardware provides the motive for smaller tools, new training techniques—specifically Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO)—provide the means. They have made it radically cheaper to specialize small, hyper-narrow models that vastly outperform generalized giants at specific tasks.

These three forces—Architecture, Hardware, and Training—have initiated three distinct waves of agentic tooling.

Wave 1: APIs as Tools

Much of the current "agentic AI" hype is built on MCP (Model Context Protocol). Using MCP, a generalized LLM generates a JSON payload, an application intercepts it, executes a static Python script or queries a SQL database, and blindly appends the raw string output back into the model's context window. The pattern is simple, universally accessible, and has powered the explosion of MCP tool servers, LangChain integrations, and enterprise "copilot" deployments.

The examples are everywhere. When GitHub Copilot creates a Jira ticket on your behalf, no intelligence exists in the tool—Copilot generates structured parameters, a middleware layer calls the Jira REST API, and the raw JSON response is stuffed back into the context window. When Gemini performs a web search mid-conversation, the same pattern applies: the model emits a search query, a Google Search API returns results, and the serialized response is concatenated into the prompt for the next generation step. The tool is a dumb pipe. All reasoning happens in the orchestrator.
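The whole Wave 1 pattern fits in a few lines. Every name below (`call_llm`, `search_web`, `TOOL_REGISTRY`) is hypothetical; what matters is the shape of the loop, in which the model emits JSON, the host executes it, and the raw output is appended verbatim:

```python
import json

# Minimal sketch of the Wave-1 "dumb pipe" loop. All names are
# hypothetical stand-ins for a chat API and a search API.

def call_llm(messages):
    """Stand-in for a chat-completion API; returns a tool-call request."""
    return {"tool": "search_web", "arguments": {"query": "KV cache HBM"}}

def search_web(query):
    """Stand-in for a search API; returns a big raw payload."""
    return json.dumps({"results": [{"title": "…", "snippet": "…"}] * 50})

TOOL_REGISTRY = {"search_web": search_web}

messages = [{"role": "user", "content": "What is filling up my HBM?"}]
tool_call = call_llm(messages)                      # model emits a JSON payload
raw = TOOL_REGISTRY[tool_call["tool"]](**tool_call["arguments"])
messages.append({"role": "tool", "content": raw})   # raw string stuffed back in

print(len(raw))  # thousands of characters of unfiltered payload now in context
```

No reasoning happens between the API response and the context window; the orchestrator pays full attention cost for every byte the tool returns.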

This works—until it doesn't. The pattern is brittle, latency-heavy, and results in massive context pollution. More importantly, it forces the orchestrator to consume thousands of tokens of irrelevant API responses that dilute its attention mechanism and inflate inference costs. Relying on generalized models to parse raw outputs from dumb APIs is directly responsible for the hardware bottlenecks described above. To solve the context problem, we had to stop stuffing raw data into the context window. We needed tools that could reason about the data before handing it back.

Wave 2: Models as Tools

The second wave begins with a recognition: the tools themselves should be intelligent. MCP standardized the transport layer (Wave 1), but it says nothing about what happens inside the tool. The real shift is that small models, trained via SFT, DPO, or GRPO on narrow tasks, are operating as specialized tools (Scaling Agents Authors, 2025). The tool is no longer a script—it's a model.

Jina ReaderLM-v2: Solving the Hardware Bottleneck

Open-web navigation presents severe friction for generalized agents. Feeding raw HTML into a monolithic LLM introduces tens of thousands of tokens of noisy markup. This directly triggers the hardware bottleneck: quadratic attention bloat and massive KV cache memory pressure that inflate inference costs and trigger reasoning hallucinations.

Jina.ai's ReaderLM-v2 (Jina AI, 2025) treats HTML-to-Markdown conversion not as a regex task but as a deep learning translation problem. Built on the Qwen2.5-1.5B-Instruct architecture (1.54 billion parameters), the model was pretrained on a dataset of ten million HTML documents averaging 56,000 tokens each. The output is explicitly optimized for agentic consumption: reliable LaTeX for equations, correctly formatted nested Markdown tables, and a compiled "Buttons & Links" summary specifically designed for autonomous web-navigation agents. Empirically, the markdown output uses 33–40% fewer tokens than raw HTML for identical semantic content. ReaderLM serves as an intelligent compressor, saving the orchestrator from crippling prefill costs and KV cache exhaustion.
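Plugging the article's own figures together shows what the compressor buys per page, using the 56,000-token average document size and the 33–40% savings range:

```python
# Illustrative arithmetic from the figures above: pages averaging 56,000
# HTML tokens, with markdown output 33-40% smaller for the same content.
html_tokens = 56_000

saved_min = round(html_tokens * 0.33)   # 18,480 tokens saved per page
saved_max = round(html_tokens * 0.40)   # 22,400 tokens saved per page

md_range = (html_tokens - saved_max, html_tokens - saved_min)
print(f"markdown per page: {md_range[0]:,}-{md_range[1]:,} tokens")
```

Roughly 18,000–22,000 tokens per page never enter the orchestrator's context, and since prefill attention cost grows superlinearly with sequence length, the savings in compute and KV cache are larger than the raw token ratio suggests.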

FireRed-OCR: The Power of Specialized Training

While ReaderLM handles the open web, document parsing reveals a different problem that highlights the importance of specialized training. Massive Vision-Language Models (VLMs) like GPT-4o and Gemini 3.0 Pro frequently fail at rigorous structural tasks—a phenomenon called "Structural Hallucination" (FireRed Team, 2026). Because foundational RLHF training optimizes for conversational helpfulness, these models tend to misalign table rows or fail to close nested markup tags. Standard RLHF rarely penalizes a misaligned table column in a 10,000-token output.

FireRed-OCR-2B (FireRed Team, 2026; MarkTechPost, 2026) solves this via Format-Constrained GRPO. Unlike PPO, GRPO estimates baselines directly from group averages of generated outputs, avoiding the massive VRAM overhead of separate value models. This enables a reward function that is entirely rule-based and absolute: positive reinforcement exclusively for maintaining table integrity (consistent row and column counts) and ensuring hierarchical closure (all opened tags correctly closed). No reward for semantic helpfulness—only for structural precision. On the OmniDocBench v1.5 benchmark, this 2-billion-parameter tool obliterates models 100x its size that were trained for the wrong objective.
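A rule-based, format-constrained reward in this spirit is simple to sketch. The checks and weights below are assumptions for illustration, not the paper's specification, and the group-relative baseline shows how GRPO sidesteps a learned value model:

```python
import re
import statistics

# Sketch of a rule-based "format-constrained" reward in the spirit of
# FireRed-OCR's GRPO setup. Reward terms and weights are illustrative
# assumptions, not the paper's specification.

def table_integrity(md: str) -> bool:
    """Every row of a markdown table must have the same column count."""
    rows = [ln for ln in md.splitlines() if ln.strip().startswith("|")]
    counts = {ln.strip().strip("|").count("|") for ln in rows}
    return len(counts) <= 1

def tags_closed(md: str) -> bool:
    """All opened HTML-style tags must be correctly closed (stack check)."""
    stack = []
    for m in re.finditer(r"<(/?)(\w+)>", md):
        closing, name = m.group(1), m.group(2)
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

def reward(md: str) -> float:
    """Absolute, rule-based reward: structure only, no semantic score."""
    return float(table_integrity(md)) + float(tags_closed(md))

def grpo_advantages(rewards):
    """GRPO baseline: normalize against the group's own mean and std,
    instead of querying a separate learned value model."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]

group = ["| a | b |\n| 1 | 2 |", "| a | b |\n| 1 |", "<b>ok</b>", "<b>bad"]
rs = [reward(md) for md in group]
print(rs, grpo_advantages(rs))
```

The reward is binary and absolute: a dropped table cell or an unclosed tag costs a full point regardless of how fluent the surrounding text is, which is exactly the signal conversational RLHF never provides.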

The Broader Wave 2 Ecosystem

These are not isolated cases. DeepSeek-OCR achieves 16x vision token compression by reducing a 1024×1024 document page to just 256 tokens—enabling long-document ingestion that would overwhelm conventional pipelines. Perplexity's Sonar models (sonar-deep-research) are specialized retrieval sub-agents engineered for multi-step academic search rather than standard chat. Firecrawl ($14.5M Series A, $100M+ valuation) broadcasts extraction capabilities via MCP servers so any orchestrating agent can plug them in instantly. Each represents the same principle: a narrow model, optimized for a specific task, operating as a tool that larger agents invoke.

Wave 3: Self-Optimizing Sub-Agents

The third wave dissolves the boundary between "tool" and "agent" entirely. In Wave 2, humans design the specialization—we choose the training data, craft the reward functions, and orchestrate the tool calls. In Wave 3, the environment itself uses reinforcement learning to discover the optimal behaviors for these tools. Defined by peer-to-peer protocols like A2A (Agent-to-Agent) (OneReach, 2026) and continuous RL pipelines, this wave produces sub-agents whose operating strategies emerge from reward optimization, not human engineering.

The paradigm shift: It is no longer just orchestrator models getting better at using tools in RLVR environments; the tools themselves are utilizing pretraining and RL to optimize their own specific behaviors (Scaling Agents Authors, 2025).

Relace.ai: The Tool That Learned to Be a Better Tool

Static Retrieval-Augmented Generation (RAG) has been the default mechanism for injecting external context into LLMs. But generalized models frequently fail to retrieve all relevant files in a single pass, polluting the context with noise.

Relace.ai (Relace, 2026) replaces static RAG with "Fast Agentic Search" (FAS)—a specialized SLM trained purely on traversing codebases via an on-policy reinforcement learning pipeline. The reward function is intricately designed for the economics of agentic latency and precision:

  • F_β scoring on GitHub data: Relace uses an F_β score that weights recall twice as heavily as precision. The model is penalized for missing files edited in a ground-truth commit, but not for retrieving useful reference files.
  • Parallelism penalty from the environment: The RL environment aggressively penalizes sequential, step-by-step reasoning. To maximize reward, the model was forced to unlearn linear exploration and instead execute 4 to 12 simultaneous tool calls (grep_search, view_file, bash) per generation turn.
  • Specialized code-patching: Using discrete diffusion approaches and routing optimizations, Relace's models apply patches at 10,000 tokens per second.
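The recall-weighted scoring in the first bullet is standard F_β with β = 2, which counts recall twice as heavily as precision. The file names below are made up for illustration:

```python
# F-beta on retrieved-file sets, with beta = 2 so recall counts twice as
# heavily as precision. File names are hypothetical.

def f_beta(retrieved: set, relevant: set, beta: float = 2.0) -> float:
    tp = len(retrieved & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

ground_truth = {"auth.py", "models.py"}        # files edited in the commit
run_a = {"auth.py", "models.py", "utils.py"}   # extra reference file retrieved
run_b = {"auth.py"}                            # missed a ground-truth file

print(round(f_beta(run_a, ground_truth), 3))   # 0.909: mild penalty
print(round(f_beta(run_b, ground_truth), 3))   # 0.556: heavy penalty
```

Grabbing an extra reference file barely dents the score (0.909), while missing one ground-truth file nearly halves it (0.556), which is precisely the asymmetry a search sub-agent feeding a coding model should have.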

In production, FAS operates as a high-speed filtration membrane. The sub-agent autonomously explores the repository and uses a specialized report_back function to hand a perfectly curated, miniaturized context window to the larger "Oracle" coding agent.

This is the definitional shift from Wave 2 to Wave 3. No human specified "use parallel tool calls." The RL environment discovered that strategy to maximize its reward. The tool learned how to be a better tool.

Beyond Code: RL-Optimized Retrieval

In the episode "The Doctor's Wife," the TARDIS tells the Doctor: "You didn't always take me where I wanted to go... but I always took you where you needed to go" (Gaiman, 2011). The Doctor sets the destination; the TARDIS optimizes the route through its own experience—sometimes overriding the explicit instruction to satisfy a deeper objective.

This is the critical insight of Wave 3: where the gradients flow. In a conventional agentic system, reward signals optimize the orchestrator—the agent learns to call tools more effectively. In Wave 3, the reward propagates through the tool's own parameters. The orchestrator may still mediate the conversation, but the tool's discovered strategy is the driver, steering execution toward the deeper objective.

The same logic applies to retrieval. In multimodal product search, an agentic system might receive a direct query for 'running shoes'. A Wave 1 system passes this string to Elasticsearch. A Wave 2 system might use a specialized embedding model to ensure semantic accuracy. A Wave 3 retrieval sub-agent, optimized via RL on successful checkout events, might learn to autonomously inspect the user's past purchases, cross-reference local weather conditions, and retrieve trail-running shoes if it's raining in their area. No human programmed this specific multi-step logic. The environment simply rewarded successful outcomes, and the retrieval tool learned to orchestrate its own complex sub-queries to achieve them.
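A toy sketch makes "the reward propagates through the tool's parameters" concrete. Everything below—the environment, the checkout reward, the update rule, the single learnable parameter—is an invented stand-in for an outcome-driven RL pipeline, not any vendor's implementation:

```python
import math
import random

random.seed(7)

# A retrieval tool with ONE learnable parameter: the probability that it
# inspects extra context (weather, purchase history) before retrieving.
# Environment, reward, and update rule are all illustrative stand-ins.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

z = -1.0  # logit of P(inspect extra context); starts low

for _ in range(5000):
    theta = sigmoid(z)
    inspects = random.random() < theta          # tool samples its own strategy
    # Stub environment: checkouts succeed more often when retrieval
    # was grounded in extra context.
    reward = 1.0 if random.random() < (0.8 if inspects else 0.3) else 0.0
    # REINFORCE-style score-function update on the Bernoulli logit.
    grad_log_pi = (1.0 - theta) if inspects else -theta
    z += 0.1 * reward * grad_log_pi

print(f"P(inspect) after training: {sigmoid(z):.2f}")
```

No line of code says "check the weather." The inspect probability rises purely because context-grounded retrievals convert to checkouts more often, which is the Wave 3 mechanism in miniature: the behavior is discovered by the reward, not programmed.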

The Exit From the Hype Cycle

The trajectory is clear. Wave 1 gave us universal connectivity—any model could call any API. Wave 2 gave us intelligent compression—specialized SLMs that pre-digest the world's noise into clean, compact signals. Wave 3 dissolves the distinction between tool and agent entirely, producing sub-agents whose strategies are discovered through reward optimization rather than hand-engineered by humans.

The implications are architectural. The monolithic "one model to rule them all" paradigm is giving way to cognitive supply chains: networks of narrow specialists, each self-optimizing for its niche, coordinated through lightweight protocols like MCP and A2A (OneReach, 2026). The orchestrator doesn't need to be omniscient—it needs to be a good delegator. And delegation, as MoE and Speculative Decoding proved years ago, is not a workaround. It is the architecture.

For practitioners, the mandate is straightforward. Stop stuffing raw API responses into million-token context windows. Start treating tool development as a training problem—where the reward function, not the prompt, defines what "good" looks like. The hardware constraints aren't going away; HBM will remain scarce and expensive. The organizations that thrive will be those that learn to spend attention wisely, routing each cognitive task to the smallest, most specialized model that can handle it.

The hype cycle promised that scaling a single model would solve everything. The physics disagrees. The future belongs to systems that delegate—and to the tools blurring every line between agent and sub-agent.

FireRed Team. (2026). FireRed-OCR Technical Report. https://arxiv.org/abs/2603.01840
Gaiman, N. (2011). The Doctor’s Wife. https://www.imdb.com/title/tt1721226/
Jina AI. (2025). ReaderLM-v2: Search Foundation Models. https://jina.ai/models/ReaderLM-v2/
MarkTechPost. (2026). FireRedTeam Releases FireRed-OCR-2B Utilizing GRPO to Solve Structural Hallucinations in Tables and LaTeX for Software Developers. https://www.marktechpost.com/2026/03/01/fireredteam-releases-firered-ocr-2b-utilizing-grpo-to-solve-structural-hallucinations-in-tables-and-latex-for-software-developers/
OneReach. (2026). Top 5 Open Protocols for Building Multi-Agent AI Systems 2026. https://onereach.ai/blog/power-of-multi-agent-ai-open-protocols/
Relace. (2026). Exploiting Parallel Tool Calls to Make Agentic Search 4x Faster. https://relace.ai/blog/fast-agentic-search
Scaling Agents Authors. (2025). Scaling Agents via Continual Pre-training. https://www.researchgate.net/publication/395541293_Scaling_Agents_via_Continual_Pre-training