
Product Search Needs Better Evals

Marcus Gawronsky

Historically, the e-commerce retrieval community has assessed search models using static text-based benchmarks. While much of this is changing, deep issues remain in the way e-commerce evaluations are run and constructed.

Three Ways Evals Are Run and Constructed

Modern product-search modelling has made real progress. Marqo's FashionCLIP and FashionSigLIP models move beyond generic image-text matching by treating fashion products as structured, multimodal records: image, title, description, colour, category, keywords, details, and materials 1, 3. RexRerankers and RexBERT push in a different direction, using cross-encoders trained on enormous commerce corpora: Amazebay-Catalog with 37 million products, Amazebay-Relevance with 6 million query-product pairs across roughly 364k queries, and ERESS with 4.7k queries and 72k labelled pairs 4. These systems are not the old keyword stack wearing a new coat. They are meaningfully better machinery.

But their evaluations and training sets still rely on the same three dataset-making tricks. The field can build better models, but it has only three practical ways to decide what counts as relevant:

| Dataset style | How it is made | Examples | What it buys | What it smuggles in |
| --- | --- | --- | --- | --- |
| Clickstream | Convert impressions, clicks, carts, purchases, and dwell time into implicit labels. | Production search logs; learning-to-rank from biased feedback 5. | Real users, real query distribution, huge scale. | Exposure bias: behaviour is conditional on what the previous ranker showed. |
| Manual annotation | Ask humans to grade a selected query-product pool. | ESCI/KDD Cup and WANDS 6, 8. | Inspectable relevance grades, not just behaviour. | Selection bias: humans can only label the candidate pool placed in front of them. |
| Synthetic positive pairs | Manufacture positives from product fields, image-text pairs, generated queries, or LLM-labelled relevance pairs. | Marqo FashionCLIP/FashionSigLIP; RexRerankers / RexBERT 1, 4. | Cheap scale, multimodal coverage, controllable tasks. | False negatives: neighbouring valid products become negatives because they were not the named pair. |

Generalisation Bounds: What a Valid Benchmark Would Need

Before diagnosing what goes wrong in practice, it helps to be precise about what a trustworthy benchmark would require. Classical learning theory gives a clean answer. For a hypothesis class $\mathcal{H}$ with VC-dimension $d$, evaluated over $n$ samples drawn independently and identically distributed (iid) from the true query-product distribution, the gap between empirical and population risk is bounded with probability at least $1 - \delta$ by:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| \le \mathcal{O}\left(\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}\right)$$

If the labelled set $O_q$ were a genuinely random sample of the catalogue $P$, the benchmark would be a noisy but bounded proxy for catalogue-level relevance. Generalisation bounds would hold, and a score on the benchmark would reliably predict behaviour in production search. More labels would tighten the bound. The benchmark would be slow and incomplete, but trustworthy in direction.
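To get a feel for how slowly the bound tightens, the right-hand side can be evaluated numerically. This is a minimal sketch: the constant hidden inside the big-O is not given in the text, so the parameter `c` (set to 1 here) and the example values of `d`, `n`, and `delta` are assumptions chosen purely for illustration.

```python
import math

def vc_gap_bound(d: int, n: int, delta: float, c: float = 1.0) -> float:
    """Right-hand side of the VC bound, with the hidden O(.) constant set to c."""
    if n <= d:
        raise ValueError("the bound is only informative when n > d")
    return c * math.sqrt((d * math.log(n / d) + math.log(1.0 / delta)) / n)

# Hypothetical values: VC-dimension 1,000 and 95% confidence (delta = 0.05).
for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}  gap <= {vc_gap_bound(d=1_000, n=n, delta=0.05):.4f}")
```

Under these assumed values the gap shrinks roughly with the square root of the number of iid labels. The calculation says nothing about labels that were not drawn iid.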

The requirement is precisely the iid condition. Both $\pi_{\text{pool}}(p \mid q)$ - the distribution from which the observed pool is drawn - and $\pi_{\text{catalogue}}(p \mid q)$ - the distribution we actually care about - must be the same. Define the two risks:

$$R_{\text{pool}}(M) = \mathbb{E}_{q} \mathbb{E}_{p \sim \pi_{\text{pool}}(\cdot \mid q)}[\operatorname{loss}(M, q, p)]$$

$$R_{\text{catalogue}}(M) = \mathbb{E}_{q} \mathbb{E}_{p \sim \pi_{\text{catalogue}}(\cdot \mid q)}[\operatorname{loss}(M, q, p)]$$

If $\pi_{\text{pool}} = \pi_{\text{catalogue}}$, a small $R_{\text{pool}}$ implies a small $R_{\text{catalogue}}$ and the benchmark is valid. If they diverge, the guarantee no longer transfers: the bound still controls $R_{\text{pool}}$, but it says nothing about $R_{\text{catalogue}}$. VC theory gives you a precise ruler. Whether the measurement is about the catalogue, the old ranking system, or a synthetic pairing policy depends entirely on how the dataset was assembled.

Clickstream: Behaviour Is Not Ground Truth

The first shortcut is clickstream data. It is tempting because it looks like reality: users searched, saw products, clicked, carted, purchased, or bounced. At sufficient scale, these traces feel more honest than a labelling spreadsheet. They are not invented. They happened.

The problem is that clickstream data is never a random sample of product relevance. It is a sample of behaviour after a ranking system has already decided what the user was allowed to see. A product that appears in position one receives more attention than a product buried on page six. A cold-start product that was never surfaced cannot be clicked. A niche product that the old ranker never trusted has no behavioural evidence, not because shoppers rejected it, but because shoppers were never offered it.

This is the classic exposure-bias problem in learning to rank 5. Clicks can be useful telemetry, but they are conditional telemetry. They answer: given yesterday's retrieval policy, how did users behave? They do not answer: which products in the catalogue would satisfy this query if a different model had surfaced them? In commerce, that distinction is fatal. A clickstream benchmark can reward a model for reconstructing the old storefront layout while missing products the old layout suppressed.
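The distinction can be made concrete with a toy calculation. The sketch below assumes a simple position-based click model (a click requires that the shopper examines the slot and that the product is relevant) and an examination curve of 1/position. The products, relevance values, and positions are all hypothetical. It compares naive click-through rate with an inverse-propensity correction in the spirit of the unbiased learning-to-rank literature 5.

```python
# Hypothetical ground truth (unknown in practice) and the slot each product
# held under the old ranker. None means the product was never shown.
true_relevance = {"A": 0.30, "B": 0.60, "C": 0.90}
old_position = {"A": 1, "B": 4, "C": None}
impressions = 10_000  # number of result pages served for the query

def examination_prob(position: int) -> float:
    """Assumed position-bias curve: attention decays as 1 / position."""
    return 1.0 / position

def expected_clicks(product: str) -> float:
    position = old_position[product]
    if position is None:
        return 0.0  # never exposed, so it can never be clicked
    return impressions * examination_prob(position) * true_relevance[product]

def naive_ctr(product: str) -> float:
    """Clicks per impression, ignoring where the product was displayed."""
    return expected_clicks(product) / impressions

def ips_estimate(product: str):
    """Inverse-propensity estimate; undefined for products never exposed."""
    position = old_position[product]
    if position is None:
        return None
    return expected_clicks(product) / (impressions * examination_prob(position))

for product in true_relevance:
    print(product, naive_ctr(product), ips_estimate(product))
```

In this toy setting naive CTR ranks the products in exactly the reverse of their true relevance. Propensity weighting repairs the estimate for products that were shown, but no reweighting can recover a product the old ranker never surfaced.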

Manual Annotation: The Cost of Truth

The second shortcut is manual annotation. This is the most respectable version of evaluation: pay humans to inspect query-product pairs and assign relevance labels. It is also where the same selection problem reappears with cleaner stationery. The observed pool $O_q$ is usually assembled by the retrieval system that ran before the model under evaluation - the exact system the new model is supposed to outperform. That is not uniform random sampling. It is, structurally, the predecessor's opinion about what is worth judging.

In machine learning, high-quality, human-labelled data is gold. In e-commerce search, that gold is expensive to mine at the scale a valid benchmark would require. For any query $q$, an ideal evaluation would want a relevance label $y_{qp}$ for every product $p$ in the full catalogue $P$. In practice, human relevance feedback is bounded by annotator budget and attention.

This constraint produces severe label sparsity. Consider Wayfair's WANDS dataset, a clear example of an expensive human annotation effort 8. WANDS contains 42,994 products and 480 queries, resulting in 233,448 query-product judgements graded as Exact, Partial, or Irrelevant. Out of over 20 million possible pairs in $Q \times P$, only about 1.1% are actually judged. Even the massive ESCI dataset, with its 1.1 million Task 1 judgements, covers only a tiny, highly sparse fraction of the wider search space 6.
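The WANDS coverage figure follows directly from the three counts quoted above. A short check of the arithmetic (variable names are illustrative):

```python
# Counts quoted in the text for WANDS.
n_products = 42_994
n_queries = 480
n_judged = 233_448

possible_pairs = n_products * n_queries   # size of the full query-product grid
coverage = n_judged / possible_pairs      # fraction of the grid with a label
judged_per_query = n_judged / n_queries   # average judged pool size per query

print(f"possible pairs:   {possible_pairs:,}")
print(f"coverage:         {coverage:.2%}")
print(f"judged per query: {judged_per_query:.0f}")
```

That works out to roughly 486 judged products per query against a catalogue of nearly 43,000, so about 99% of each query's row in the relevance matrix is blank.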

Formally, instead of observing the complete matrix, a benchmark only observes:

$$Y_{\text{observed}} = \{(q, p, y_{qp}) : p \in O_q\}$$

where $O_q \subset P$ is the highly restricted subset of observed, judged products for query $q$.

Figure (label_distribution): judged relevance labels, shown as share of labels per grade on a shared y-axis from 0 to 70%.

ESCI label distribution (test set, all locales, n = 336,373 judgements):

  • Exact: 44.1% (148,379)
  • Substitute: 34.4% (115,576)
  • Irrelevant: 16.5% (55,384)
  • Complement: 5.1% (17,034)

E+S = 78.5%: the pool skews toward items the historical retriever already surfaced.

Source: ESCI reduced Task 1 test split.

WANDS label distribution (full dataset, n = 233,448 judgements):

  • Partial: 62.8% (146,633)
  • Irrelevant: 26.2% (61,201)
  • Exact: 11.0% (25,614)

Partial dominates at 62.8%: annotator uncertainty propagates directly into model training noise.

Source: Wayfair WANDS label.csv.

The core methodological problem is that $O_q$ is not a uniform random sample of $P$. It is produced by previous retrieval systems, historical click exposure in search logs, sampling heuristics, or academic challenge pipelines. This means that relevance labels are missing-not-at-random (MNAR). If a retailer has a 1-million-product catalogue, judging 40 pre-filtered products for a query is feasible; judging all 1 million is impossible. Labels are like gold leaf, not paint: you apply them sparingly where you expect them to matter. You do not wallpaper the entire warehouse.

Because the observed labelled pool $O_q$ is constructed by historical candidate generators, it suffers from a systematic bias: the best products in the catalogue frequently lie entirely outside the judged set. When an evaluation framework forces a model to rank products within a closed pool, it measures performance conditional on that pool rather than open-ended discovery across the entire catalogue:

$$\mathbb{E}[\operatorname{metric}(M, q) \mid p \in O_q] \ne \mathbb{E}[\operatorname{metric}(M, q) \mid p \in P]$$

This creates a structural penalty for advanced systems. As search algorithms have progressed from exact-token lexical matching to dense semantic vector spaces, multimodal representations 9, and commerce-aligned outcomes 6, they escape the bounds of historical candidate pools. When a modern model retrieves an unlabelled, high-quality "green shirt" from the catalogue, the offline benchmark treats it as irrelevant because there is no label for it.

The shopper finds the perfect item, yet the model is penalized in the spreadsheet.
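A toy nDCG calculation shows the penalty. The sketch assumes the common convention that unjudged products count as irrelevant (gain 0) and that the ideal ranking is built only from the labels supplied. The SKUs, labels, and rankings are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the usual log2 position discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranking, labels, k):
    """nDCG@k where any product missing from `labels` is scored as gain 0."""
    gains = [labels.get(product, 0.0) for product in ranking[:k]]
    ideal_gains = sorted(labels.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical judged pool for the query "green shirt".
judged = {"sku_1": 1.0, "sku_2": 1.0, "sku_3": 0.0}
# Hypothetical full truth: sku_9 is an excellent match that nobody judged.
full_truth = {**judged, "sku_9": 1.0}

old_model = ["sku_1", "sku_2", "sku_3"]  # stays inside the judged pool
new_model = ["sku_9", "sku_1", "sku_2"]  # surfaces the unjudged match first

for name, ranking in (("old", old_model), ("new", new_model)):
    print(name,
          round(ndcg_at_k(ranking, judged, 3), 3),
          round(ndcg_at_k(ranking, full_truth, 3), 3))
```

Scored against the judged pool, the old model looks perfect and the new model looks clearly worse. Scored against the hypothetical full truth, the ordering flips.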

Synthetic Positive Pairs and the Muddled Modality Pitfall

The third shortcut is synthetic positive-pair construction. To work around static label sparsity, practitioners have historically relied on data augmentation techniques to construct artificial training signals. The most common motif splits a single product record into positive elements for self-supervised contrastive learning 10. These patterns typically pair attributes such as:

  1. A product image and its corresponding title or short caption.
  2. An online search query and its eventually clicked or purchased product SKU.
  3. A product title and its detailed descriptive bullets or attributes.
  4. A synthetically generated query and its matching source product.

These pairs are fed into contrastive objectives like the InfoNCE loss, pulling matched positive pairs closer and pushing unlabelled random items away as negatives:

$$\mathcal{L}(q, p^+) = -\log \frac{\exp\left(\frac{s(q, p^+)}{\tau}\right)}{\exp\left(\frac{s(q, p^+)}{\tau}\right) + \sum_{j} \exp\left(\frac{s(q, p_j^-)}{\tau}\right)}$$
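For a single query, the loss above can be computed in a few lines. This is a minimal sketch in plain Python rather than any particular library's implementation. The similarity scores and the temperature value are assumed for illustration, and a log-sum-exp shift is used only for numerical stability.

```python
import math

def info_nce(pos_score: float, neg_scores: list, tau: float = 0.07) -> float:
    """InfoNCE loss for one query: -log softmax of the positive's logit."""
    logits = [pos_score / tau] + [s / tau for s in neg_scores]
    shift = max(logits)  # subtract the max before exponentiating
    log_denominator = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return -(logits[0] - log_denominator)

# Hypothetical cosine similarities between a query and four products.
easy_negatives = info_nce(0.90, [0.10, 0.20, 0.15])
near_duplicate = info_nce(0.90, [0.10, 0.20, 0.85])  # one negative is a close substitute

print(f"easy negatives:        {easy_negatives:.4f}")
print(f"one near-duplicate:    {near_duplicate:.4f}")
```

The loss rises sharply as soon as one sampled negative scores close to the positive, regardless of whether that negative is genuinely irrelevant.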

While this generates clean gradient signals, simple image-text partitioning is highly limited in practice. Standard studio and posed garment images are inherently ambiguous. For instance, a high-fashion photo of a "green cotton t-shirt" might look visually identical to a performance sports jersey or an olive linen top. It represents an ambiguous scene where the textual metadata fields (brand name, specific weave, category, or fabric texture) disambiguate the exact product.

Forcing a model to learn from isolated image-caption splits mangles features by treating different views of the same product as separate, disconnected entities, or worse, treating other valid substitutes as hard negatives. Marqo-FashionCLIP and Marqo-FashionSigLIP attempt to address the first issue by optimising a Generalised Contrastive Learning (GCL) loss across seven distinct metadata aspects, replacing binary labels with continuous GPT-4-derived ranking scores 3. This is a genuine improvement: richer positives, graded signals, multiple product facets.

But GCL does not solve the off-diagonal problem. In any training batch of $N$ query-product pairs, the full relevance matrix is $N \times N$. GCL assigns a learned score $r_{qp}$ to the named positive along the diagonal - but every off-diagonal cell, query $q_i$ against product $p_j$ from a different pair, is still implicitly assigned zero weight. Formally, the in-batch loss is:

$$\mathcal{L}_{\text{GCL}} = -\sum_{(q,p) \in \mathcal{B}} r_{qp} \cdot \log \frac{\exp\!\left(\frac{s(q,p)}{\tau}\right)}{\displaystyle\sum_{p' \in \mathcal{B}} \exp\!\left(\frac{s(q,p')}{\tau}\right)}$$

The denominator sweeps over the entire batch. Any product $p'$ that is genuinely relevant to $q$ but was not explicitly paired with it contributes to the denominator with no compensating numerator weight - it is treated identically to a truly irrelevant product.
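The off-diagonal effect is easy to reproduce on a two-pair batch. The sketch below implements the in-batch loss exactly as written in the formula above, not Marqo's actual training code. The score matrices, weights, and temperature are hypothetical.

```python
import math

def weighted_in_batch_loss(scores, weights, tau=0.1):
    """scores[i][j] = s(q_i, p_j); weights[i] = r for the named pair (q_i, p_i)."""
    total = 0.0
    for i, row in enumerate(scores):
        logits = [s / tau for s in row]
        shift = max(logits)  # log-sum-exp shift for numerical stability
        log_denominator = shift + math.log(sum(math.exp(x - shift) for x in logits))
        # Only the diagonal entry appears in the numerator; every other
        # product in the batch appears in the denominator alone.
        total += -weights[i] * (logits[i] - log_denominator)
    return total

weights = [1.0, 1.0]

# Model A pushes the off-diagonal product far away from q_0.
scores_a = [[0.8, 0.1],
            [0.1, 0.8]]
# Model B recognises that p_1 is also a good match for q_0.
scores_b = [[0.8, 0.7],
            [0.1, 0.8]]

print(f"model A loss: {weighted_in_batch_loss(scores_a, weights):.4f}")
print(f"model B loss: {weighted_in_batch_loss(scores_b, weights):.4f}")
```

If the second product really is a valid substitute for the first query, model B is the better retriever, yet it receives the higher loss because that similarity only ever appears in the denominator.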

This issue becomes even more acute when we consider actual shopper behaviour. Real-world search logs are highly skewed. They follow a classic power-law distribution (Zipf's law), where the frequency of the query at popularity rank $r$ scales as:

$$\operatorname{frequency}(r) \propto r^{-\alpha}$$

A small cluster of highly popular, ambiguous head queries ("shoes", "dress", "lamp", "Nike") accounts for a disproportionate share of search traffic, while an incredibly long, sparse tail represents highly specific individual searches 11, 12.
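How much traffic the head captures depends strongly on the exponent. As a rough illustration, the sketch below computes the share of traffic going to the top-ranked queries under an idealised power law. The number of distinct queries, the head size, and the exponent values are all assumptions, not measurements from any search log.

```python
def head_share(n_queries: int, head_size: int, alpha: float) -> float:
    """Fraction of total traffic captured by the `head_size` most popular queries."""
    weights = [rank ** -alpha for rank in range(1, n_queries + 1)]
    return sum(weights[:head_size]) / sum(weights)

# Hypothetical log with 100,000 distinct queries; the head is the top 100.
for alpha in (0.8, 1.0, 1.2):
    share = head_share(n_queries=100_000, head_size=100, alpha=alpha)
    print(f"alpha = {alpha:.1f}  top-100 share = {share:.1%}")
```

Even in this idealised setting, 0.1% of the distinct queries carry a large fraction of the volume, and the concentration grows quickly as the exponent increases.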


Synthetic generators ground their queries on verbose, highly descriptive product titles. This produces hyper-specific, long-tail queries like "navy Adidas Ultraboost men's running shoe size 10".

These synthetic queries are easy to evaluate because they point directly to a single, unambiguous SKU. But they are completely unrepresentative of real shopper demand. A real, high-traffic head query like "shoes" can comfortably accept hundreds of sneakers, boots, and loafers. Generating data solely from product titles teaches the model that only one exact shoe is relevant, while turning every other valid footwear option into an apparent error.

The fundamental mathematical issue with binary positive-pair training in e-commerce search is that discovery relevance is multi-positive and graded, not single-positive and binary. In the real market, a query $q$ has a set of multiple highly relevant products $\mathcal{R}(q) = \{p \in P : y(q, p) > 0\}$.

When we train contrastive models, we pair query $q$ with source product $p^+$ and sample a random subset of negative candidates $p^-$ from $P \setminus \{p^+\}$. This formulation systematically introduces severe label noise:

Whenever $p^- \in \mathcal{R}(q)$, the loss function penalises the model for scoring a relevant product highly.
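The chance of hitting such a false negative can be estimated with a short combinatorial calculation. The sketch assumes negatives are drawn uniformly without replacement from the catalogue minus the named positive. The catalogue size, the number of relevant products, and the number of negatives are hypothetical.

```python
import math

def prob_false_negative(catalogue_size: int, n_relevant: int, n_negatives: int) -> float:
    """P(at least one sampled negative is actually relevant to the query)."""
    candidates = catalogue_size - 1            # catalogue minus the named positive
    irrelevant = catalogue_size - n_relevant   # products that are safe to use as negatives
    if n_negatives > candidates:
        raise ValueError("cannot sample more negatives than there are candidates")
    # Probability that every sampled negative comes from the irrelevant products.
    p_clean = math.comb(irrelevant, n_negatives) / math.comb(candidates, n_negatives)
    return 1.0 - p_clean

# Hypothetical catalogue of 10,000 products and 63 sampled negatives per query.
for n_relevant in (1, 20, 200):
    p = prob_false_negative(catalogue_size=10_000, n_relevant=n_relevant, n_negatives=63)
    print(f"relevant products = {n_relevant:>3}  P(false negative) = {p:.1%}")
```

Under these assumptions a query with a single relevant product is never contaminated, while a broad head query with a couple of hundred valid matches picks up at least one false negative in most training steps.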

Because the generated query inherits the exact title specificity of the source product, the model learns brittle, overly sensitive boundaries rather than robust commercial intent. If the source item is a "green cotton t-shirt," and the model returns an identical, premium cotton green shirt from a different manufacturer, the system is punished.

The Closed Loop and Its Bias

The pathologies described above do not exist in isolation. They reinforce each other inside a closed circuit that the industry has largely accepted rather than dismantled.

The pool $O_q$ is assembled by a historical retrieval system. Human annotators label what that system surfaced. The benchmark then scores new models against those labels. Models that resemble the historical retriever - staying close to its token overlap, its candidate ordering, its implicit definition of relevance - are structurally rewarded. Models that escape it, retrieving genuine catalogue-level hits the historical system never considered, are penalised for finding things that nobody labelled.

This is the admissions-policy problem made systemic. Lexical overlap is not merely a feature; on a pool built by a lexical retriever, it is part of the admission criteria. The 0.720 random-in-pool nDCG on ESCI is the fingerprint: the candidate generator did most of the relevant filtering before any ranking model arrived. Hybrid methods that stay close to that filter look strong. Models that diverge from it look weak, even when they are finding better products.

The result is a coherent but self-confirming story. The benchmark is not lying. It is answering a real question about performance within the labelled pool. The problem is that the labelled pool is a strict subset of the catalogue, assembled under the prior system's assumptions about what products are worth inspecting - assumptions that multimodal, semantic, and preference-aligned models are specifically designed to challenge.

Public benchmarks are useful lab equipment. They expose whether a model can rank pre-curated candidates sensibly. What they cannot measure is whether the candidates were the right ones to curate in the first place - and that is precisely the question that matters for commerce.

  1. Marqo. Marqo FashionCLIP [Internet]. 2024. Available from: https://github.com/marqo-ai/marqo-FashionCLIP
  2. Marqo. Marqo-FashionSigLIP [Internet]. 2024. Available from: https://huggingface.co/Marqo/marqo-fashionSigLIP
  3. Zhu T, Jung MC, Clark J. Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking. In: Proceedings of the ACM Web Conference 2025 (WWW 2025) Industry Track [Internet]. 2025. Available from: https://arxiv.org/abs/2404.08535
  4. RexRerankers Authors. RexRerankers: A Family of Cross-Encoders for E-commerce Search [Internet]. 2025. Available from: https://huggingface.co/blog/thebajajra/rexrerankers
  5. Joachims T, Swaminathan A, Schnabel T. Unbiased Learning-to-Rank with Biased Feedback. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 2017.
  6. Amazon Science. Amazon Science ESCI Dataset for Improving Product Search [Internet]. 2022. Available from: https://github.com/amazon-science/esci-data
  7. KDD Cup. KDD Cup 2022: ESCI Challenge for Improving Product Search [Internet]. 2022. Available from: https://www.aicrowd.com/challenges/esci-challenge-for-improving-product-search
  8. Wayfair. Smarter Shopping Starts Here: How AI Understands What You’re Looking For [Internet]. 2024. Available from: https://www.aboutwayfair.com/careers/tech-blog/smarter-shopping-starts-here-how-ai-understands-what-youre-looking-for
  9. Crossing Minds. Shopping Queries Image Dataset (SQID): An Image-Enriched ESCI Dataset for Exploring Multimodal Learning in Product Search [Internet]. 2024. Available from: https://arxiv.org/abs/2405.15190
  10. Oord A van den, Li Y, Vinyals O. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748 [Internet]. 2018. Available from: https://arxiv.org/abs/1807.03748
  11. Newman MEJ. Power Laws, Pareto Distributions and Zipf’s Law. Contemporary Physics. 2005;46(5):323–351.
  12. Anderson C. The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion; 2006.