The Tokenomics of the Product Listing Page: Why Agentic Commerce is Confined to Chat Bubbles
Chat Bubbles Nobody Asked For
If you've been browsing the internet recently, you've probably had this experience: you've gone to an online store, and a little text bubble has popped up with a message along the lines of "Hello, I'm your friendly AI assistant, what can I help you with today?". The curious among you have probably tried a few of these chat agents, and discovered that whilst their capability can be somewhat impressive, the overall experience is underwhelming. One reason is that the results often appear as a short scrollable carousel of products, or sometimes a small grid, rather than the full product listing pages we are accustomed to.
The reason for this decision is not that product listing pages are unpopular - the fact that desktop conversion rates consistently outperform mobile, with tablets coming in between, is consistent with the fact that larger product listing screens are a better experience 1. Furthermore, while search is a much higher intent channel than browse, browsing makes up a far larger percentage of activity on e-commerce websites 2. So the question is, if large, scrollable product listing pages are the preferred experience, why are e-commerce agents tucked away in the corner of the screen?
The answer is cost.
Tokenomics of a 50-Item Grid
Let's first take a look at the input side. In order to return a 50-item grid, an agent will need to retrieve a larger candidate list of products, either by searching using the e-commerce store's existing search APIs (if they are exposed via an agent tool call) or by retrieving from a vector store. If we assume the agent needs to fetch 150 products, which it will whittle down to the final 50 shown items, we can estimate the number of input tokens required to process this.
The text input is dominated by product metadata. A typical enterprise SKU consisting of title, description, structured attributes, top reviews etc. will be in the order of 250 to 500 tokens 3. That equates to a text payload of 37,500 to 75,000 input tokens, similar to a 40-page essay.
Add product images and things get worse. Frontier vision-language models tile images: GPT-4o, for instance, charges a base of 85 tokens plus 170 tokens per 512×512 tile after scaling to fit a 2048×2048 box 4. This results in the following:
| Image size (px) | Tiles | Vision tokens |
|---|---|---|
| 512 × 512 | 1 | 85 + 170 = 255 |
| 1024 × 1024 | 4 | 85 + 680 = 765 |
| 2048 × 2048 | 16 | 85 + 2,720 = 2,805 |
Therefore, including one high-detail image for each product results in an additional 115,000 input tokens.
The more expensive side of the equation is the output tokens. Due to the sequential and memory-bandwidth-bound nature of auto-regressive decoding, providers typically charge 5× higher for output tokens 5, 6. Rendering the structured JSON of a 50-item product grid, including titles, attributes, prices, variant options etc. will be around 400 output tokens per product, or 20,000 tokens in total. At $15 per million output tokens (Sonnet 4.6 and GPT 5.4 pricing at time of writing), that's $0.30 per session for the rendering step alone, before we count any input tokens.
To put that number in context, consider the expected gross profit of a typical e-commerce session. Take a $90 average order value, a 50% gross margin, and a 2% conversion rate: the expected gross profit per session is $0.90. A full inference bill of around $0.40 consumes over 40% of that. The economics don't hold.
This is why the chat bubble exists - a short carousel of 4-6 products is far more affordable. The chat window wasn't popularised because it's the best UX, it's just the largest result set the unit economics can afford.
Fixing the Output Side: MCP-UI
The output token problem has a clean architectural fix, which is to stop asking the agent to render the product cards.
MCP-UI is a standard that does exactly this. Rather than
generating HTML or structured JSON for each product in the grid, an
MCP-UI-enabled agent returns a resource URI pointing to a pre-compiled
component, which the host application renders locally 7. A tool
response references the UI via a resourceUri in its metadata, something like
ui://catalog/product-card?product_id=SKU_789. The host picks that up and draws
the card, meaning the agent never has to generate a single line of UI.
The components render in sandboxed iframes, so the agent and the frontend stay properly separated. User interactions such as picking a size or clicking add to cart fire structured callbacks back to the agent, which handles them via tool calls. The reasoning lives in the model and the rendering lives in the host, so the two never share a token budget.
The cost effect is significant. Pointing at 50 product cards instead of generating their content collapses those 20,000 output tokens to a handful of URIs and tool calls, around 800 tokens total for the same 50-item grid. At $15 per million output tokens, the output bill drops from $0.30 to a cent per session.
Compressing the Input Bill
MCP-UI solves the output side. The 37,500 to 75,000 token text payload on the input side is a separate problem with three practical levers.
Use a smaller model for the filtering step. A frontier vision-language model is the wrong tool for whittling 150 candidates down to 50 based on text similarity. Small language models - Llama 3.2 3B at roughly $0.06 per million tokens, or Gemini Flash-Lite at around $0.075 per million 8 - are well suited to structured filtering tasks like this. Reserve the frontier model for the final curation pass where the multimodal reasoning justifies the cost. A well-tuned two-step cascade brings the input bill into sub-cent territory per session.
Cache the shared prefix. Most of the input prompt is identical across sessions: system instructions, catalogue context, intent templates. Provider APIs charge cached input tokens at up to 90% off the uncached rate 9. If 80% of the prompt is cacheable, which is typical for a well-structured e-commerce agent, the effective input cost drops by around 70% with no change to the model.
Strip the product metadata. Rather than feeding the model a full product blob for each candidate, extract only the fields that carry information relevant to the ranking decision: title, the top few structured attributes, price, and one image. Irrelevant text in the context window is noise, not signal, but it costs tokens to process regardless.
These three levers compound. Starting from a ~$0.40 per session baseline, a well-optimised stack - small model for filtering, prefix caching, lean metadata - can get the full inference cost for a 50-item grid down to a few cents.
The Unit Economics
Whether a few cents per session makes sense depends on the value of the session. The table below shows the conversion rate uplift required to break even on a $0.05 per session inference cost, using AOV and CVR figures from Triple Whale's 2025 e-commerce benchmarks 10.
| Vertical | Median AOV | Typical CVR | Gross profit / session | Required CVR uplift |
|---|---|---|---|---|
| Pet Supplies | $58.85 | 2.45% | $0.72 | +6.9% |
| Health & Beauty | $60.29 | 2.49% | $0.75 | +6.7% |
| Apparel | $85.45 | 1.82% | $0.78 | +6.4% |
| Consumer Electronics | $106.05 | 1.58% | $0.84 | +6.0% |
| Food & Beverage | $61.99 | 2.73% | $0.85 | +5.9% |
| Home & Garden | $110.24 | 1.68% | $0.93 | +5.4% |
Methodology: Required uplift = inference cost / (AOV × gross margin × CVR). Gross margin assumed at 50% throughout - actual margins vary considerably by category and business model.
What stands out from these figures is how tight the spread is, with all six categories falling between 5.4% and 6.9% required uplift. The vertical matters less than one might expect.
That makes the inference cost the critical variable. At $0.40 per session - roughly what an unoptimised frontier model costs - none of these verticals produce a viable business case at these uplift levels. At $0.05 per session, all of them do, provided the agent is delivering real conversion improvement.
The question is less about whether your category can afford agentic curation on the listing page, and more about whether the inference stack is optimised enough to bring the cost into range.
Closing
The chat bubble was not a considered UX decision. It was the answer to a simple question: how large a result set can an agent render before the inference cost exceeds the expected value of the session? For a while, the answer was four to six items in a carousel.
The combination of output decoupling via MCP-UI, small-model filtering, and prefix caching changes that answer to fifty. Once the rendering is separated from the reasoning, the cost of composing a full product listing page falls into a range where the unit economics can work - which is significant, because the listing page is where most of the e-commerce funnel actually lives.