The Merchant's AI Dilemma
E-commerce merchants now face an existential discovery threat in a token- and GPU-scarce world 1, but the obvious response — hyperscaler chat wrapped around legacy search — forces them to play by the economics of the LLM labs, not the economics of retail.
That is the problem.
The Problem Is Anthropic Have No Moat
This argument has been made about the Metaverse before 2 - that "[Meta are] 'guests' that reach users through the browsers, app stores, devices and search engines owned by Apple, Microsoft and Google." For Anthropic, xAI, DeepSeek and MiniMax this presents a significant threat — not only to their pricing power, but to their route to users. Apple does not hold this position passively: it is currently petitioning the Supreme Court to overturn a ruling that would require it to let App Store developers link to external payment options — litigating to the highest level to keep that toll gate intact 3.
This applies not just to pricing but to feature access. When Apple or Microsoft integrates an LLM natively into iOS, iPadOS and macOS — deciding which capabilities surface through Siri or Copilot, which models get on-device privilege, and which stay in a browser tab — third-party labs lose the ability to ship features, not just charge for tokens 4.
That is why Claude Desktop exists — a search for surface — because if Microsoft, Apple, Google and Amazon own the place where work and shopping happen, the model provider becomes an expensive component inside someone else's interface. The search for surface is not free — and thankfully Anthropic has the pockets for it: a 9 billion to $47 billion ARR in a single year 5.
As many developers have noted, Claude Code's $200/month Max subscription delivers compute equivalent to thousands of dollars at API rates — a structural subsidy Anthropic pays to own the developer surface 6. Famously, when third-party tools like OpenCode discovered they could tap that subsidised budget via subscription OAuth tokens, Anthropic blocked them overnight in January 2026 with no warning and no migration path 7, 8.
For rivals paying API-rate compute — Kiro, Cursor, Windsurf — the margin arithmetic doesn't close at frontier prices, which is why Cursor built Composer: a purpose-trained model designed to match Opus-level benchmarks at a fraction of the cost, severing the dependency on Anthropic's API entirely 6.
For merchants, the upstream drama is almost beside the point. Whether Anthropic wins its surface war or loses it, the merchant building on a hosted wrapper ends up paying a distribution tax to whoever controls the pipe — cross-subsidising user capture on their platforms, if they can get compute at all.
We Have Entered a Compute-Constrained World
This is no longer theoretical. Anthropic planned for a 10× expansion and reportedly got 80× annualised growth in Q1. SemiAnalysis confirms the supply side is equally strained: on-demand GPU prices have risen even for two-generation-old Hopper hardware, every neocloud cluster contacted is locked up, and N3 wafer capacity — the bottleneck for next-generation accelerators — is fully allocated through 2027. Anthropic added $6B of ARR in a single month; in SemiAnalysis's words, "if Anthropic had more compute they would have added more." Demand for Claude has already produced peak-hour reliability and performance strain. The most revealing detail is not the growth number; it is the response: Anthropic struck a deal with SpaceX/xAI to use the capacity of Colossus 1 — a competitor's data centre — with reporting around more than 300MW of capacity and 220,000 Nvidia GPUs. Simon Willison's read is blunt: Anthropic are severely compute-constrained, and leasing from a direct model-layer competitor creates a new kind of supply-chain risk.
This is not a temporary inconvenience; it is the operating environment. In Google's largest GPU data centre in Europe, even L4 capacity can be impossible to provision.Google will not give up compute needed for their own products to subsidise a third-party model provider. This means a world in which OpenRouter models are priced out of the main funnel, and a world in which the "long context" promise is a demo-friendly headline, not a production reality.
Anthropic is not alone. Every frontier lab faces the same wall. The hardware answer is Nvidia's Vera Rubin — due to ship through 2026, promising significant efficiency gains in inference throughput per watt over Blackwell 9. But throughput per watt is not total cost of ownership: N3 wafer capacity remains fully allocated, data centre build-out cycles run two to four years, and any efficiency gain will be competed away across the entire industry before it relieves the scarcity that labs are already inside. Rubin Ultra, the higher-performance variant, is not expected until 2027.
The deeper problem is structural. Dwarkesh Patel notes that AI training compute has scaled at over 4× per year for the last decade 10. Each jump in capability does not suppress demand — it creates it. A model that can do more things is useful to more businesses, in more contexts, more often. In this sense, frontier AI compute behaves like a Giffen good: as the models become more capable and more expensive to serve at scale, the number of viable applications expands and total demand rises with them. Efficiency improvements do not relieve the constraint; they enlarge the market that consumes it.
The Paradox: Models Commoditise Before Compute Does
This is where the argument can look contradictory. If models are becoming commodities, why does compute scarcity matter so much?
Because models and compute are not the same thing.
The frontier has not commoditised. It is becoming more specialised, more capital-intensive, and more detached from ordinary commercial software. The frontier labs are not spending billions so every retailer can answer product-filtering questions more cheaply. They are chasing hard maths, code, physics, biology, planning and super-intelligence. That work will stay expensive because it exists to move the boundary of what machines can do.
But the middle of the market has changed. The Models as Tools and Cognitive Core arguments still hold: useful models have commoditised. Qwen-class cores, lean MoEs, Flash/Mini/Haiku tiers and specialist SLMs mean many commercial tasks no longer require the best model in the world. For a large share of enterprise work, several models are now good enough.
The mistake is to conclude that good enough means abundant. It does not. A cheap model still consumes tokens, HBM, memory bandwidth, power, quota, latency budget and accelerator time. Run it once and it looks trivial. Run it behind every product grid, recommendation rail, search box, chat turn and checkout-adjacent workflow, across every merchant, every day, and it becomes a capacity problem again.
OpenRouter-style access helps with substitution, not abundance. It can route from Claude to Gemini, from Gemini to Qwen, or from a premium model to a cheaper one. But it cannot create new GPUs. It cannot guarantee that the cheapest capable model stays cheap when everyone discovers it. It cannot stop a provider from throttling, repricing, deprecating, region-limiting or reserving capacity for its own products.
Gemini Flash is the clearest illustration. It was introduced as the affordable, high-throughput tier — the model you were supposed to run at scale without worrying about margin. Across four generations the output token price has gone from 9.00 per million — a 30× increase in roughly two years, even as the model nominally stays in the "Flash" commodity bracket.
That is not a price anomaly. It is the model. When a commodity-tier model becomes genuinely useful at scale, demand concentrates, capacity tightens, and the provider reprices. The incentive is not to keep the floor cheap for everyone; it is to move the floor upward as the ceiling rises.
That is the paradox: model choice becomes more liquid at exactly the moment compute becomes more strategic. The weights may be replaceable, but the serving budget is not. The labs and hyperscalers will spend their best capacity on their own products first. What remains for everyone else will be rationed by price, quota, latency and region.
So the question is not whether models commoditise. Many already have. The question is what happens when commodity intelligence has to run on finite global compute.
As The New Stack puts it, "the tokenmaxxing era — spending tokens as a badge of being AI-forward — looks like it's starting to end. The skill that replaces it is token discipline: the right model, in the right amount, for the right job. The workers and companies who learn this process will win. The ones who don't will discover that the AI budget eventually comes out of something else." For a merchant, that discipline cannot live in the prompt. It has to live in the tool that decides what the model sees before it spends a token.
Long Context Is Not a Strategy
Once compute is finite, the next temptation is to buy simplicity with context.
Why build retrieval, ranking, compaction and routing if the model can just see everything? Put the catalogue in context. Add the session. Add behavioural history. Add price, stock, variants, images, policies, reviews and merchandising rules. Let the LLM reason over the whole thing.
That is the seductive version of long context. It turns architecture into prompt size.
For e-commerce, the numbers break immediately. A normal catalogue is not three passages. It is 20,000 to 200,000 SKUs, each with titles, descriptions, facets, prices, variants, stock, behavioural signals and images.
| Catalogue payload | 20k SKUs | 200k SKUs |
|---|---|---|
| Minimal text only (50 tokens/SKU) | 1,000,000 tokens | 10,000,000 tokens |
| Rich metadata (250–500 tokens/SKU) | 5M–10M tokens | 50M–100M tokens |
| One 512×512 image/SKU (~255 vision tokens) | 5.1M tokens | 51M tokens |
| One 1024×1024 image/SKU (~765 vision tokens) | 15.3M tokens | 153M tokens |
So even before reasoning, history, session context, instructions or output tokens, a realistic multimodal catalogue is already one to two orders of magnitude larger than the largest advertised context windows.
The marketing page says million-token context. Production says something else.
| Model | Theoretical max | Practical soft limit | Hard/API ceiling |
|---|---|---|---|
| Qwen 3.6 Plus/Max | 1,000,000 | 262,144 | 1,000,000, often downscaled by API hosts |
| Gemini 3.1 Pro | 2,000,000 | 128,000 | 1,000,000–2,000,000, tier-dependent |
| OpenAI GPT-5.5 | 1,050,000 | 272,000 | 1,050,000 combined prompt/output |
| Claude Opus 4.7 | 1,000,000 | 200,000 | 1,000,000 message ceiling |
But even those practical soft limits overstate what actually works. Epoch AI's context window benchmarks measure effective context length: the point at which a model still scores above 80% on average across Fiction.liveBench and MRCR (2-needle) — tasks that require the model to reliably retrieve and reason over information placed anywhere in the window. The gap between the advertised number and the effective one is not a footnote.
Gemini 2.5 Flash is the illustrative case: 1,000,000 tokens advertised, 18,000 tokens effective. A model that looks like it can hold an entire catalogue in mind can reliably use roughly 1.8% of its claimed window. GPT-4.1 (128k theoretical) reaches 23k effective. Claude Opus 4 (1M theoretical) reaches 103k — the best in the table, and still a fraction of the headline. At longer distances, the model does not simply degrade gracefully; it hallucinates, drops needles, and confuses position with salience.
The gap matters. Past the effective limit you hit context rot, position bias, KV-cache crowding, and failure modes that are hard to debug without dedicated benchmarking infrastructure. A million tokens in a press release is not a million reliable tokens in a shopping session.
This is why RankGPT is the useful reference point. It is not a proof that LLMs should ingest the whole corpus. It is the opposite. RankGPT retrieves a candidate set first, then asks the LLM to produce a better ordering over that small set. When the candidate list is too large, it uses sliding windows. We have seen this in our own experiments, and in small-scale experiments on benchmark datasets as well.
Agentic Discovery Needs Different Tools
Our competitors Solved the Previous Interface problem. This is not because competitors are weak. The opposite. The legacy players are smart, trusted and deeply distributed. They built excellent systems for human search: typo tolerance, fast responses, merchandised browsing, global caching and enterprise SLAs. Those are real strengths.
But those strengths complement human weaknesses: slow typing, short queries, typos and impatience. A 10ms response time matters when the user accidentally types "shoez" and needs to fix their typo quickly. It matters less when an agent is reasoning over a conversation.
The interface changed. McKinsey puts the stakes plainly: by 2028, $750 billion in US revenue will funnel through AI-powered search, half of consumers already prefer it over traditional search engines, and brands unprepared for this shift face a 20–50% decline in traffic from legacy channels. The question is not whether the interface has changed — it is what the new interface actually needs from the tools behind it.
Tools should complement model capabilities and patch model weaknesses.
LLMs are good at language, semantic intent, multimodal perception, flexible reasoning and preference following. They are weak at persistent memory, fresh market state, calibrated uncertainty, exploration and cheap long-horizon context. : a prompt can adapt a response, but it does not create an exploration policy, a regret bound or a persistent preference model.
That changes the job of retrieval. It is no longer enough to be fast and typo-tolerant. The tool has to decide which product, user, market, uncertainty and multimodal context the model should see. It should preserve the visual signal, not force image taste through caption → keyword → bad candidate set. It should make the model's weaknesses smaller before the model spends tokens.
Bad tools turn model strengths into cost. A smart LLM can compensate for bad retrieval by filling context, calling tools and reasoning for longer. The problem is latency, hallucination and margin. A merchant cannot afford that everywhere.
Solenya's Wedge
Solenya is the merchant-native context layer for agentic discovery.
The wedge is not "better search". It is better context for models: product context, user context, uncertainty and multimodal grounding, selected before the model sees a token. Legacy search has accepted a 23% zero-results rate as a cost of doing business. Any question more complex than "blue pants" is a zero-results query. For agent-mediated discovery experiences, that results in repeated calls, high latency, resource contention, timeouts and bad user experience. While the industry has focused on optimizing CLIP models, with hybrid search for better latency, we have preference optimized larger models with better multimodal perception, and worked to scale Solenya to take on the exploration (which LLMs cannot do) and the main-funnel browse experience, readying agentic product discovery for the main-experience - not just the chat box.