Solenya
Tags: agents, slm, cognitive-core, qwen, mcp

The Arrival of the Cognitive Core: Why Tools Are Now Everything

Marcus Gawronsky

"I had a prediction a while back that I almost feel like we can get cognitive cores that are very good at even like a billion parameters... if you talk to a billion parameter model I think in 20 years you can actually have a very productive conversation, it thinks and it's a lot more like a human, but if you ask it some factual question might have to look it up—but it knows that it doesn't know and it might have to look it up and it will just do all the reasonable things."

— Andrej Karpathy, Dwarkesh Podcast (Karpathy & Patel, 2024)

The Prophecy, Fulfilled Early

Karpathy's thesis is precise: trillion-parameter LLMs waste vast capacity on "memory work"—functioning as bloated, blurry databases of internet facts rather than pure logic engines. The optimal architecture strips memorization out of the weights and pushes it into external memory: search indices, vector stores, tool APIs. What remains is a reasoning-dense "Cognitive Core" focused strictly on System 2 thinking—pattern recognition, planning, and real-time retrieval. A model that knows that it doesn't know and reaches for the right tool instead.

He theorized we might achieve this at around 1 billion parameters within two decades. The industry arrived there in under two years. And it changes everything, because a cognitive core without tools is useless. If your AI can reason but can't act, it's just an expensive autocomplete. The model has become a commodity. The tool layer has become the entire product.

Qwen 3.5 4B: The Cognitive Core You Can Self-Host

Qwen 3.5 4B (Qwen Team, 2026) is a fully dense model—all 4 billion parameters activate on every token. On the Artificial Analysis Intelligence Index, it crosses a threshold that no model this small has reached before.

[Figure: Artificial Analysis Intelligence Index, showing Qwen 3.5 4B punching far above its weight class alongside frontier MoE models.]
Qwen 3.5 4B closes the gap with models 50–250x its total parameter count on pure reasoning metrics. The chart also shows the MoE frontier: Kimi K2 (1T total, 32B active), MiniMax M2 (230B total, 10B active), and GLM-4.5V (106B total, 12B active).

To appreciate the scale of what is happening here, consider the timeline. The models Qwen 3.5 4B is closing the gap with were themselves the undisputed state of the art months ago. Claude Opus 4, one of the frontier models on this chart, was only superseded by Opus 4.1 in August 2025 (Anthropic, 2025), less than seven months before this article's publication. A 4-billion-parameter model that you can run on a laptop is approaching the reasoning performance of systems that defined the frontier the previous quarter. The velocity of convergence is the story.

For 99% of enterprise applications—customer-facing chatbots, internal Q&A, document triage, order-status queries—this model is good enough. Not as a compromise. As a genuine solution. Tooling ecosystems like Unsloth (Unsloth, 2026) have optimized the inference stack so aggressively that Qwen 3.5 4B fine-tunes in under 5GB of VRAM and runs inference on a single consumer GPU. An enterprise can self-host this on existing infrastructure, behind its own firewall, with its own data governance. No API bills. No vendor lock-in. No latency penalty from round-tripping to a frontier provider.

But this is not just a cost story. It is the fulfillment of Karpathy's cognitive core prophecy. Qwen 3.5 4B doesn't achieve its reasoning performance by memorizing StackOverflow or Wikipedia. Its weights encode algorithms—patterns of logical deduction, code structure, arithmetic. Ask it a factual question and it will often be wrong or vague. Give it a search tool, a database connection, and a structured API, and it becomes formidable. The model is only as capable as the tools it can reach.

This is the fundamental inversion. In the scaling-law era, the model was the product—bigger weights meant more knowledge meant better answers. In the cognitive core era, the model is the engine and the tools are the product. A 4B reasoning core with access to a rich, well-structured tool environment will outperform a 70B monolith with no tools every time. The value has migrated from the weights to the wiring.
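This inversion can be made concrete with a minimal dispatch loop. Everything below is a sketch: the `lookup_order_status` tool and the stubbed `fake_core` planner are hypothetical stand-ins for illustration, not any real model or API.

```python
# Minimal sketch of a cognitive-core agent loop: the model only reasons;
# every fact comes from a tool. All names here are hypothetical.

def lookup_order_status(order_id: str) -> str:
    """Stub for a structured API the core delegates to."""
    return {"A-1001": "shipped", "A-1002": "processing"}.get(order_id, "unknown")

TOOLS = {"lookup_order_status": lookup_order_status}

def fake_core(question: str) -> dict:
    """Stand-in for the reasoning core: it decides *which* tool to call
    rather than answering from memorized knowledge."""
    order_id = question.split()[-1].strip("?")
    return {"tool": "lookup_order_status", "args": {"order_id": order_id}}

def run_agent(question: str) -> str:
    call = fake_core(question)                    # core plans the tool call
    result = TOOLS[call["tool"]](**call["args"])  # environment executes it
    return f"Order status: {result}"              # core would phrase the answer

print(run_agent("What is the status of order A-1001?"))  # Order status: shipped
```

Swap the stubs for a real 4B model and real endpoints and the shape stays the same: the quality of the answer is bounded entirely by the quality of what sits in the tool registry.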

The Landscape: Everyone Is Converging on the Same Insight

Qwen 3.5 4B is not an anomaly. It is the most dramatic proof of a convergence visible across every major open-weight lab. The same structural insight—lean reasoning core, heavy tool reliance—is appearing at vastly different scales:

| Model | Architecture | Total Params | Active Params | Context | Optimized For |
|---|---|---|---|---|---|
| Qwen 3.5 4B (Qwen Team, 2026) | Dense | 4B | 4B | 32K | General reasoning, code |
| MiniMax M2 (MiniMax Team, 2026) | MoE (256 experts) | 230B | 10B | 128K | Coding, agentic tool use |
| GLM-4.5V (V Team et al., 2025) | MoE | 106B | 12B | 64K | Vision-language, GUI agents |
| Kimi K2 (Kimi Team, 2025) | MoE (384 experts) | 1T | 32B | 128K | Agentic intelligence, tool use |

MiniMax M2 carries 230 billion total parameters but activates only 10 billion per token—and ranks #1 among open-source models on the Artificial Analysis composite intelligence score. Their documentation articulates the economics plainly: "10B activations = responsive agent loops + better unit economics" (MiniMax Team, 2026). GLM-4.5V packs 106 billion parameters but activates just 12 billion, achieving state-of-the-art across 42 vision-language benchmarks—and ships with a dedicated GUI agent mode designed specifically for tool-mediated screen interaction. Kimi K2 pushes the MoE frontier to a full 1 trillion parameters with 384 routed experts, but activates only 32 billion per token. It was explicitly engineered for "agentic intelligence" with tool use as a first-class training objective—not an afterthought bolted onto a chatbot.

Every one of these models embodies the same thesis. The total parameter count stores distributed knowledge across expert networks; the active parameter count is the cognitive core. Kimi K2 doesn't activate a trillion parameters to answer your question. It routes to 8 of 384 specialists and reasons with 32 billion. The rest is dormant memory, consulted only when routing logic determines relevance. The architecture itself is a delegation system—and the model's usefulness is entirely gated by the quality of what it can delegate to.
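The delegation arithmetic can be sketched in a few lines. The gate below scores experts randomly rather than with a learned router; the only point is that top-k routing leaves the vast majority of experts dormant (8 of 384, using the Kimi K2 figures from the text).

```python
import random

NUM_EXPERTS = 384   # total routed experts (Kimi K2 figure from the text)
TOP_K = 8           # experts consulted per token

def route(token_scores):
    """Pick the top-k experts by gate score; all others stay dormant."""
    ranked = sorted(range(NUM_EXPERTS), key=lambda e: token_scores[e], reverse=True)
    return ranked[:TOP_K]

random.seed(0)
scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in for a learned gate
active = route(scores)

print(len(active), "of", NUM_EXPERTS, "experts active")
print(f"active fraction: {len(active) / NUM_EXPERTS:.1%}")
```

Roughly 2% of the expert pool is touched per token; the other 98% is the "dormant memory" the paragraph above describes.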

Why Tools Become Everything

The cognitive core's relationship to tools is not optional—it is constitutional. A model that discards rote memorization in favor of reasoning density doesn't prefer to use tools. It must use tools. Every factual question, every data retrieval, every action in the world requires an external call. This is the structural consequence of Karpathy's architecture: the leaner the core, the heavier the tool dependency.

Consider the protocol layer. Standards like the Model Context Protocol (MCP) for tool invocation, Agent-to-Agent (A2A) for peer delegation, and AG-UI for interface synchronization were once infrastructure investments—plumbing you might get around to eventually. For a cognitive core, they are the nervous system. A core without MCP-compliant tool endpoints is a reasoning engine with nothing to reason about. A storefront without A2A agent cards is invisible to buyer agents running their own self-hosted cores. The protocols are no longer plumbing—they are the substrate of intelligence itself.
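MCP rides on JSON-RPC 2.0, so a tool invocation is, roughly, a request shaped like the one below. The tool name and arguments are invented for illustration; the MCP specification is the authoritative source for the exact schema.

```python
import json

# Rough shape of an MCP tool-invocation request (JSON-RPC 2.0).
# The "order_status" tool and its arguments are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "order_status",
        "arguments": {"order_id": "A-1001"},
    },
}

wire = json.dumps(request)  # what actually crosses the transport
print(wire)
```

A cognitive core that can emit this shape can reach any MCP-compliant endpoint; one that can't is, as the paragraph says, a reasoning engine with nothing to reason about.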

Consider the tools themselves. The evolution from dumb API wrappers to specialized SLMs—models like ReaderLM for HTML compression, FireRed-OCR for structural document extraction—to fully self-optimizing sub-agents was already underway. In a world of self-hosted cognitive cores, it becomes mandatory. When the reasoning engine has 4 billion parameters and no embedded knowledge, every tool it calls must be excellent. A bad search tool doesn't just return poor results—it degrades the entire chain of reasoning built on top of those results. The quality floor for tools rises dramatically because the core has no fallback. It can't compensate with memorized knowledge. It can only work with what its tools give it.

The Self-Hosting Inflection

The enterprise implications are concrete. For the overwhelming majority of production workloads—customer service chatbots, internal knowledge assistants, document classification, order tracking, FAQ routing—a self-hosted 4B reasoning core paired with well-engineered tools meets or exceeds the performance of what was a frontier API call six months ago, at a fraction of the cost. The calculus has shifted:

  • Data sovereignty. Customer conversations, proprietary documents, and transaction data never leave the firewall. No third-party processing agreements. No residual training risk.
  • Latency. Local inference eliminates the network round-trip to a frontier provider. For real-time applications—live chat, voice assistants, in-session recommendations—this is the difference between fluid and laggy.
  • Cost at scale. Frontier API pricing is per-token. A self-hosted 4B model on a single A100 (or even a consumer RTX 4090) serves thousands of concurrent requests at a fixed infrastructure cost. At enterprise scale, the savings compound rapidly.
  • Customization. Fine-tuning a 4B model on domain-specific data is a weekend project, not a quarter-long engagement. The model adapts to your terminology, your product catalogue, your customer interaction patterns.
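A back-of-envelope comparison makes the unit economics of the cost bullet concrete. Every figure below is a placeholder assumption chosen for illustration, not a quoted price from any provider.

```python
# Back-of-envelope daily cost: per-token API pricing vs. fixed self-hosted GPU.
# All numbers are illustrative assumptions.
tokens_per_request = 2_000
requests_per_day = 500_000

api_price_per_million_tokens = 3.00   # hypothetical frontier API rate, USD
gpu_cost_per_day = 30.00              # hypothetical amortized GPU cost, USD

api_daily = (tokens_per_request * requests_per_day / 1_000_000
             * api_price_per_million_tokens)
self_hosted_daily = gpu_cost_per_day  # fixed, independent of token volume

print(f"API:         ${api_daily:,.2f}/day")
print(f"Self-hosted: ${self_hosted_daily:,.2f}/day")
```

The structural point survives any choice of numbers: API cost scales linearly with token volume, while self-hosted cost is flat, so the gap compounds with scale.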

But—and this is the critical pivot—self-hosting the model is the easy part. The hard part is building the tool layer that makes the model useful. A self-hosted cognitive core without robust retrieval, without structured APIs, without MCP-compliant tool endpoints, is just a chatbot that confidently hallucinates. The competitive question is no longer "which model do you use?" It is "how good are your tools?"

Agents That Build Their Own Tools

When you combine a cheap, self-hostable reasoning core with self-optimizing sub-agents, something new emerges. The core doesn't just use tools—it can create them.

Anthropic's "skills" framework (Anthropic, 2026) demonstrates this concretely: Claude identifies a recurring multi-step workflow, generates a reusable tool definition, tests it against observed outcomes, and persists it for future invocations. The agent crystallizes a repeated sequence—parsing a specific vendor's invoice format, for instance—into a named, parameterized skill that subsequent runs invoke directly. No human wrote the tool.

The mechanism scales down to self-hosted hardware. A reasoning-dense 4B model tasked with extracting data from an undocumented system:

  1. Analyzes the structure and determines it lacks a native extraction tool.
  2. Generates an MCP-compliant extraction protocol—typed schemas, error handling, retry logic.
  3. Validates the tool in an RLVR (Reinforcement Learning from Verifiable Rewards) loop, iterating on structural accuracy.
  4. Registers the working tool in its external memory for future use.
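The four steps can be sketched as a generate-verify-register loop. The generator, the verifiable check, and the registry below are all stubs; in a real system the core would emit actual tool code and an MCP server would hold the registry.

```python
# Sketch of the tool-bootstrapping loop from the numbered steps above.
# Everything here is a stub for illustration.

TOOL_REGISTRY = {}

def generate_candidate(attempt: int):
    """Stub for the core authoring a tool; later attempts improve on earlier ones."""
    def extract(record: str) -> dict:
        parts = record.split(";")
        # First attempt "forgets" a field, so the verifier rejects it.
        return {"id": parts[0], "total": parts[1]} if attempt > 0 else {"id": parts[0]}
    return extract

def verifiable_reward(tool) -> bool:
    """RLVR-style pass/fail signal: extraction must recover both fields."""
    return tool("INV-7;42.50") == {"id": "INV-7", "total": "42.50"}

def bootstrap(name: str, max_attempts: int = 5) -> bool:
    for attempt in range(max_attempts):
        candidate = generate_candidate(attempt)   # steps 1-2: notice gap, author tool
        if verifiable_reward(candidate):          # step 3: verify against ground truth
            TOOL_REGISTRY[name] = candidate       # step 4: register for future use
            return True
    return False

print(bootstrap("invoice_extractor"))  # True
print(sorted(TOOL_REGISTRY))           # ['invoice_extractor']
```

The loop terminates only when the candidate passes a check the system can verify mechanically, which is what makes hundreds of parallel bootstrapping experiments safe to run unattended.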

Because the core runs on commodity hardware, these bootstrapping loops can execute hundreds of parallel experiments at negligible marginal cost. The cognitive core doesn't just depend on tools—it manufactures them. The standard protocols provide the interfaces; specialized models provide the intelligent tool layer; and the cognitive core, self-hosted and cheap, autonomously extends that layer by authoring new tools to fill its own gaps.

Tools that rely on cores that build tools. The system becomes self-extending—but only if the foundational tool infrastructure is solid. A cognitive core can generate a new MCP tool, but only if there is an MCP server to register it against. It can spawn a specialized retrieval sub-agent, but only if a delegation protocol like A2A is in place. Every layer of self-extension depends on the protocol substrate being there first.

The Completed Architecture

The agentic stack has three layers, and all three have matured simultaneously:

  1. Protocols provide the connective tissue. MCP standardized tool invocation. A2A enabled peer-to-peer agent delegation. ACP and UCP codified commerce negotiation. AG-UI synchronized agents with visual interfaces. Together, they solved the n×m integration problem and created a universal substrate for agentic communication.

  2. Specialized models populate the wiring with intelligence. ReaderLM compresses HTML. FireRed-OCR enforces structural precision via GRPO. Relace.ai parallelizes code search through RL. Each replaces a "dumb pipe" with a narrow cognitive specialist—a tool that pre-digests the world's noise into clean, compact signals.

  3. The cognitive core makes tools the entire product. Dense models like Qwen 3.5 4B and lean-activation MoE architectures like MiniMax M2 (10B active), GLM-4.5V (12B active), and Kimi K2 (32B active) deliver self-hostable, enterprise-grade reasoning. But stripped of memorized knowledge, they are entirely dependent on the tool layer beneath them. The model is a commodity. The tools are the moat.

The competitive landscape has inverted. In the scaling-law era, organizations competed on model access—who had the biggest, most capable LLM. In the cognitive core era, everyone has access to last-quarter's frontier-class reasoning on commodity hardware. The differentiator is what that reasoning can reach: the depth of your knowledge layer, the density of your tool surface, the quality of your retrieval, and the completeness of your protocol-compliant endpoints. A competitor with a superior model but shallow tools loses to a competitor with a commodity core and a rich, structured environment.

The scaling-law era promised that a single, ever-larger model would solve everything. The physics disagreed. The cognitive core has arrived—lean, self-hostable, and utterly dependent on its tools. The organizations that win are the ones that build the best toolbox, not the biggest brain.

Anthropic. (2025). Claude Opus 4.1. https://www.anthropic.com/news/claude-opus-4-1
Anthropic. (2026). Anthropic Skills Framework. https://github.com/anthropics/skills
Karpathy, A., & Patel, D. (2024). Andrej Karpathy - The Optimal Cognitive Core. https://www.youtube.com/watch?v=UldqWmyUap4
Kimi Team. (2025). Kimi K2: Open Agentic Intelligence. https://github.com/MoonshotAI/Kimi-K2
MiniMax Team. (2026). MiniMax-M2: A Mini Model Built for Max Coding & Agentic Workflows. https://github.com/MiniMax-AI/MiniMax-M2
Qwen Team. (2026). Qwen3.5-4B Release. https://huggingface.co/Qwen/Qwen3.5-4B
Unsloth. (2026). Unsloth Qwen 3.5 Performance Tuning. https://unsloth.ai/docs/models/qwen3.5
V Team, Hong, W., Yu, W., Gu, X., & Wang, G. (2025). GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. https://huggingface.co/zai-org/GLM-4.5V