A New Age in Evaluation of e-Commerce Search: The LLM Judge
Almost every commercial search team reports the same pattern: an offline metric improves, a model ships, and the revenue line does not move. We argue that this is a measurement-validity problem, not a measurement-noise problem. That is, the offline measurement was never a faithful proxy for the commercial outcome to begin with.
The pattern is consistent enough to have a name. A team's embedding model improves NDCG@10 by 6% in offline tests. Confident in the result, they ship. Two months later the business metrics are flat. In the post-mortem, someone traces the labels: the offline eval was built from click logs on the previous retrieval system. Products that ranked highly before received more clicks, so clicks confirmed that ranking, so the new model was rewarded for recovering the old ranking. The measurement was internally consistent and entirely circular. Nobody was careless. The instrument was wrong.
Relevance Is Commercial Intent, Not Similarity
The products a search model retrieves are not plain text documents. They are rich, multimodal catalogue objects - titles, descriptions, brands, categories, images - where some information is explicit in text and some is only visible in the image. The search model is ultimately part of the product, and eventually it must move a business metric. Business metrics are noisy, lagging, and evaluated online. We need an offline evaluation proxy we can trust well enough to ship confidently - which means the labels feeding that proxy must actually reflect commercial relevance.
Before asking how to evaluate product search, we have to ask what product relevance actually is. A lot of retrieval evaluation still behaves as if relevance means semantic similarity: query and result are close in language or embedding space, therefore the result must be good. That assumption breaks quickly in e-commerce.
A user searching for running shoes is not asking for the document closest to the phrase "running shoes". They are asking for a product that fits a commercial intent.
That intent has structure. Trail running shoes may be a good substitute for running shoes. Running socks are related, but they are a complement, not the thing requested. A product photo of someone running is semantically close but commercially useless if the actual product is not a shoe.
This is why product search cannot rely on generic embedding similarity alone. E-commerce relevance is query-conditional, asymmetric, and taxonomy-sensitive. A query for running shoes may tolerate trail variants; a query for trail running shoes should not return every generic running shoe with equal confidence. That asymmetry - broad queries are tolerant, specific queries are not - is a property any valid evaluation rubric must encode.
The Metric Is Fine. The Labels Are the Problem.
In search, offline metrics are used to evaluate search relevancy by measuring how reliably a system surfaces documents (products) that satisfy a user's intent. For example, a ranking metric like Normalized Discounted Cumulative Gain (NDCG) scores a result page based on how well it packs the most relevant products at the very top. The aim of these metrics is to provide a fast, repeatable proxy for live business outcomes - giving engineering teams the confidence that a lift in NDCG measured today will translate to clicks and engagement.
The mathematical formulas driving these standard offline metrics are not broken by themselves. Metrics like NDCG@k are in-fact a powerful way to calculate that top-of-page reward 1. However, they carry an important assumption: that you know the ideal ranked list - all relevant products, all grades, the full ideal DCG denominator. For a query like "trainers" with thousands of acceptable products across a live catalogue, that list does not exist. Teams respond by stacking metrics - NDCG@1, @5, @10, @25, Recall, Precision, MRR - and the resulting dashboard no longer tells you anything about user experience; it tells you about the gap between your annotation coverage and the actual search space. The problem is what we feed them.
In e-commerce, relevance labels are often improvised. A click becomes a relevance label. A historical purchase becomes a relevance label. Each shortcut brings its own bias.
Clicks are not pure judgements of relevance; they reflect what the old system chose to show. Products at the top get more attention because they are at the top. New products, niche products, and long-tail catalogue items often have little or no exposure, so click-log estimators such as IPS and SNIPS have almost nothing reliable to say about them 2, 3. Browsing metrics such as RBP and ERR improve the model of user attention 4, 5, but they still need trustworthy relevance grades underneath.
To bypass the exposure bias of click logs, teams often adopt hand-labelled or synthetic positive pairs, where a query and a product (or set of products) are explicitly labelled as relevant. While this is less noisy than click data, we've found that it has a major drawback: it fails to capture real-world queries, or, if it does, it fails to capture the full range of relevant products.
For example, users often use general queries like "shoes" or "Adidas", which could be satisfied by thousands of products, but most labelled datasets only contain annotations for a handful of them. To get around this, researchers often use synthetic data, where specific queries are generated from product metadata, producing a more manageable annotation task but an unrealistic distribution of queries.
The problem is structural: e-commerce queries are short. The modal search is one or two words. A query like "shoes" or "Adidas" may have hundreds of commercially acceptable products, yet annotation exercises almost always cover only a handful. Synthetic query generation inverts this - it produces hyper-specific queries like "navy blue slim-fit Adidas running shoe size 10" that resolve annotator ambiguity but distort the distribution so badly that a model tuned to them can regress on the actual traffic it will face.
The public benchmark shelf is equally limited. General-purpose suites measure retrieval across diverse domains; they say nothing about your catalogue or your users' query distribution. Commerce-specific datasets are more relevant but bounded: Amazon ESCI and Wayfair WANDS are impressive feats of manual labelling - WANDS alone required an estimated 3,500 hours for 233,000 query-product pairs 6, 8 - and that effort cannot be replicated at the pace of catalogue change. Eventually the public eval shelf runs dry, and it runs dry faster than a live product catalogue does.
Models as Judges
Language-model-as-judge evaluation is a recent habit with an older measurement problem underneath it. By 2022 and 2023, generative systems had moved beyond tasks where BLEU, ROUGE, or exact-match scores were good proxies for quality. The hard questions were open-ended: is this answer useful, faithful, safe, complete, and actually responsive to the user's intent? Human review could answer those questions, but not at the pace needed for model iteration.
The pattern became visible in 2023, when strong chat models were used to score or compare other model outputs. MT-Bench and Chatbot Arena made the method feel almost obvious: ask a capable model to act as a grader, critic, or pairwise preference judge, then check whether its decisions track human preferences 9.
A practical, modern example of that idea is GDPval, an evaluation dataset for real-world, economically valuable knowledge-work tasks. OpenAI needed a way to grade complex deliverables, like financial analysis, without reducing them to a single "looks good" judgement. Their answer was to make the judging task inspectable: each deliverable is scored against explicit criteria. One spreadsheet task is graded by an itemised rubric that reads more like an audit checklist than a vibe check.
In GDPval, a model solves a particular problem set and then a judge model is prompted to grade against a rubric that reads like an audit checklist:
[+2] the submitted deliverable is an Excel workbook named "Sample".
[+2] the workbook contains a worksheet named exactly "Sample Size Calculation".
[+2] the worksheet states a confidence level of 90% and a tolerable error of 10%.
...The rubric is the contract: a judge that cannot explain its deductions is not a measuring instrument.
There is a deeper reason why prompting generative models with strict rubrics is effective. Large language models - and increasingly vision-language models aligned with human preferences via RLHF or DPO - already possess the necessary latent knowledge from their pretraining corpora. A rubric does not teach the model new reasoning capabilities; it simply constrains that existing knowledge to match a strict definition of quality. A well-designed rubric turns a general conversational agent into a targeted measurement device.
Even with this latent capability, the field has become less enchanted and more engineering-shaped since 2023. The useful question is no longer "can an LLM judge?" but "under what rubric, with which biases measured, against which human labels, and with what re-test schedule?" Surveys and bias studies make the warning plain: judges can be sensitive to prompt wording, verbosity, ordering, brand familiarity, and model-family preferences 10, 11.
Prior Work in Product Search
In e-commerce, product search is not generic text generation. Products are structured, multimodal objects where small attribute differences - brand, size, material, compatibility - determine whether a result is an exact match, a substitute, or irrelevant. Prior work confirms LLM judges can substitute for human labels in this domain - but each result is a bespoke artefact. Soviero et al. 8 established feasibility on WANDS; Mehrdad et al. 12 validated ship/no-ship substitution at Walmart scale (89% agreement across 718 experiments, trained on 6M Walmart pairs); Dey et al. 13 showed that LLM-only signals outperformed click-trained models on eBay sales outcomes.
The pattern is consistent: LLM judges work. But none of these systems are zero-shot - each fine-tunes on millions of labelled pairs from a single retailer's category taxonomy, whether fashion, homeware, toys, or big-box. The judge is the weights, not the rubric, which means when the retailer changes, the labelling exercise starts over.
Two design constraints emerge from this literature that point toward what a portable, rubric-based approach would need. First, prompt specificity is decisive: adding domain-specific criteria, few-shot examples, and reference values lifts Spearman correlation with human labels from ρ ≈ 0.38 to ρ ≈ 0.90 14 - a generic prompt is not an instrument. Second, multimodality is not free: vision-language judges are unreliable for absolute scoring without calibration, exhibiting position bias, verbosity bias, and hallucination 15.
Meeting both constraints requires no fine-tuning on retailer-specific data - but it does require something the literature consistently underspecifies: calibration against human judgement. An LLM judge is not a neutral measuring device; without that calibration step, replacing click logs with LLM scores risks trading one flavour of measurement noise for another.
A further set of practical questions goes unanswered. The studies above mostly rely on GPT-3.5, GPT-4, or purpose-built fine-tunes from 2023 and 2024 - closed-source models whose weights, versions, and pricing can change without notice. None of the literature establishes a scaling curve: at what model size does rubric-calibrated agreement with human labels become acceptable, and where does it plateau? This matters enormously in practice. Running a frontier model over every top-k result set for every query, continuously, is expensive and slow - latency and compute cost are the real bottleneck to large-scale offline evaluation, not the quality ceiling of the judge. A viable production eval system must work well at small model sizes too. Here is how we addressed that.
Calibration
We initially used a simple system prompt — following the pattern of early LLM-as-judge work 8, 9 — which contextualised the problem (scoring products in the e-commerce domain) and defined what the judge should return: a score between 0 and 9, where 0 is completely irrelevant and 9 is a perfect match. We quickly found that this approach had some major problems when it comes to calibration and overall reliability.
You are a highly skilled information retrieval annotator for the e-commerce domain.
Your task is to evaluate the relevance of a given product to a specific query.
You will be provided with a query and a product description.
Your goal is to determine how relevant the product is to the query
on a scale from 0 to 9, where 0 means 'not relevant at all' and 9 means 'perfectly relevant'.
You are evaluating *direct relevance*.
In your rating, prioritize first whether it's an exact
match of the product type, then brand, then features.
Your response should be formatted as JSON as follows:
{
"rating": <integer from 0 to 9>,
}Two problems became clear immediately. First, without explicit score definitions, the judge produced high variance across queries - low inter-annotator agreement with human labels and no way to diagnose where it was going wrong. Second, the scores were uninterpretable: no explanation, no audit trail, no way to identify systematic disagreements.
For product search, we realised we needed an equivalent contract to the strict rubrics seen in modern benchmarks: a rubric defining constraints for exact matches, substitutes, complements, visual evidence, metadata evidence, head-query breadth, and long-tail specificity.
In our second iteration, we introduced a rubric-based approach with a tighter 1-4 scale, clear definitions for each score, and a requirement for the judge to provide an explanation. It looked like this:
...
- A rating of 4 indicates that the product perfectly matches the query i.e. the product is exactly what the user is looking for: it is exactly the right type of product, it is exactly the right brand, and it has the right features.
- A rating of 3 indicates that the product is mostly relevant but may have minor differences from what the user is searching for: e.g. it is the right type of product (e.g. the same type of shoe as requested - 'sneakers' for a `high heel` query would not be a 3) and has the right features but is a different brand.
- A rating of 2 indicates that the product is somewhat relevant but has significant differences from what the user is searching for e.g. the wrong brand.
- A rating of 1 indicates that the product is irrelevant e.g. if the product is in a completely different category than what the user is searching for.
...
{
"rating": <integer from 1 to 4>,
"explanation": "<brief explanation>"
}While this was a major improvement over the blind 0-9 scale, it was not our final rubric. It still failed on a problem none of the published rubrics in this space address directly: query hierarchy. Product search queries are asymmetrically specific - a broad query like "sweater" can be satisfied by a "Red Knitted Sweater", but a narrow query for "Red Knitted Sweater" is not satisfied by a generic "Sweater". An LLM without explicit instruction on this point scores both cases the same way. Our rubric introduces an explicit asymmetric-specificity constraint - broad queries tolerate specific results, specific queries do not tolerate overly general ones - which resolved the largest single category of disagreement between the judge and human annotators. After further iterations-tuning constraints around exact matches, substitutes, and head-query breadth-we arrived at our current approach.
Your task is to evaluate the textual and visual relevance of a given product to a specific query.
# Rubric
You will be provided with a query, product metadata, and optionally a product image. Your goal is to determine how relevant the product is to the query based on the following criteria.
...
1. Use a hierarchical approach to ratings, where the query defines the top-level criteria. For example, if a query is for "Running Shoes", and the product is "Trail Running Shoes", you can reasonably say that trail running is a subset of running, and is therefore a match based on the information available in the query. On the other hand, if the query is for "Trail Running Shoes" and the product is "Running Shoes", you should rate it lower, since the query defines a more specific requirement than what is in the product.
2. Important: Your scores for textual relevance and visual relevance should be independent, not correlated!
3. Provide a brief explanation for each of your ratings.
4. Your response should be formatted as JSON as follows:
{
"textual_relevance": <integer from 1 to 4>,
"textual_relevance_explanation": "<brief explanation for textual relevance>",
"visual_relevance": <integer from 1 to 4 or null>,
"visual_relevance_explanation": "<brief explanation for visual relevance, or null if no image>"
}This continual refinement, which would not have been possible without calibration against human labels, resulted in an evaluation that:
- is trustworthy (through calibration with human judgements),
- is interpretable (through brief explanations for each score),
- is configurable (through prompting),
- is aligned (through a e-commerce-specific rubric), and
- reflects real query demand (through weighted sampling of production queries).
Ablation
Spearman ρ between judge scores and human labels, plotted against active parameter count on a log₂ scale. The dashed orange line marks the ~9B reliability floor. MoE models are positioned by active parameters; their total and active sizes are shown in the model labels.
With the rubric in place, we ran the calibrated judge across Qwen3.5 open-weights checkpoints: dense models from 0.8B to 27B, plus MoE variants up to 397B total parameters. For each size, we measured Spearman correlation with human labels. The choice of open-weights models is deliberate on two counts. First, reproducibility: any team can run the same experiment against the same public checkpoints. Second, interpretability of the scaling behaviour itself: because the architectures are fully documented - including MoE sparsity ratios, activation counts, and training methodology - we can draw architectural inferences about why the numbers look the way they do, rather than attributing everything to opaque scaling.
The results answer two questions the literature leaves open.
There is a reliability floor at roughly 9B active parameters. Below it, Spearman correlation collapses - 0.106 at 2B, 0.449 at 4B - regardless of rubric quality. Above it, agreement stabilises in the 0.65-0.70 range. The floor is set by active compute per forward pass, not nominal parameter count (a 35B Qwen3.5 MoE model with 3B active parameters performs worse than a 9B dense model). Headline parameter count is a misleading proxy for judge capability: what matters is how much of the model actually participates in scoring each query-product pair.
Above the 9B floor, rubric specification dominates. Prior work either required fine-tuning on millions of retailer-specific pairs 12 or relied implicitly on large proprietary models 8, 14 - and through mid-2024, that dependency on frontier scale was arguably justified. The alignment gap between open and proprietary models was wide enough that rubric-following at 9-27B was unreliable. In our results, every Qwen3.5 checkpoint above the 9B active-parameter floor achieves Spearman correlation above 0.65 with no fine-tuning - only a calibrated rubric. Whether this reflects better open-weights alignment quality or simply better rubric design is difficult to isolate cleanly, but the operational consequence is the same either way: a well-specified rubric now extracts reliable scoring from self-hosted models with manageable inference costs.
What this makes possible is a confident ship/no-ship decision on a retrieval change before it touches production - evaluated against your actual query distribution, your live catalogue, and cold-start products that click logs have never reliably covered. Not a fine-tuned model trained once on someone else's inventory, but a judgement instrument calibrated to your relevance definition, your category taxonomy, and your exposure problem.
Evaluation
Once the judge is calibrated, it becomes the labelling layer in a repeatable offline search evaluation.
The pipeline is simple:
- Sample real production queries, weighted by how often they occur, so the evaluation reflects actual demand rather than an artificial uniform sample.
- Run every candidate retrieval model against the same query set.
- Ask the calibrated judge to score the returned products.
- Finally, compute aggregate ranking metrics such as NDCG@k from those judged relevance grades.
This gives us a way to compare search models before shipping them, using the same queries, the same catalogue, the same rubric, and the same calibrated judge. It is still offline, but it is no longer pretending that noisy clicks or mined positive pairs are the same thing as product relevance.
Because the annotation happens after the retrieval, we don't need to annotate every product in the catalogue, just the top-k candidates for each query.
Better Search Starts with Better Measurement
The defensible claim is not "our retrieval model is better". The defensible claim is "our evaluation is good enough to tell when a retrieval model is better".
That is the shift. Offline evaluation should not be a checkbox before shipping. It should be a validated measurement system: human-calibrated where it matters, automated where it scales, and honest about the cases where clicks and rankings are too blunt to trust.
Cold-Start and Long-Tail Discovery
For a discovery system, that matters most on the hard parts of the catalogue: cold-start products, long-tail products, and queries where the difference between an exact match, a substitute, and a complement is commercially meaningful.
To make this concrete: a retailer adding 5,000 new seasonal products has no click history on any of them. A click-log estimator has nothing to say - the products have never been surfaced, so they have never been clicked. An embedding model trained on historical pairs may not have encountered similar items at query time. The calibrated judge, however, scores any cold-start product against any production query using the same rubric it applies to established catalogue items. In a typical evaluation run, NDCG on cold-start queries differs materially from NDCG on head-catalogue queries - a gap that click-based offline metrics are structurally blind to. That gap is where unrealised revenue lives.
Zero-Shot Deployment
The same logic extends to the retailer level. A new deployment has no click history, no labelled dataset, and no prior model to calibrate against. That is precisely the situation where click-based evaluation produces nothing and fine-tuned judges require millions of labelled pairs before they are useful. A prompt-based judge calibrated against a small human-annotated sample of the new catalogue can produce a valid evaluation baseline from day one - because the measurement instrument is separable from the retrieval model it evaluates. Rubric-first evaluation is not just a better approach for mature deployments; it is the only viable approach at the start.
For product search, especially in cold-start and long-tail discovery, that is the difference between optimising the metric and improving the shop. What we built is that measurement system: a calibrated evaluation layer grounded in a product-relevance rubric, validated against human judgement, and weighted by real production queries - not a benchmark leaderboard position, but a valid instrument for deciding when search actually got better.