Beyond the Search Bar: The Multimodal Future of E-Commerce
In the world of e-commerce, discovery has always been a journey of translation. A shopper has an idea, a feeling, or a fleeting image in their mind, and they must translate it into a language the machine understands: the keyword. For years, this has been the primary language of online retail. It’s functional, familiar, and often profoundly limited.
At Solenya, we believe discovery should be a conversation, not a translation. A shopper should be able to bring their full context to the table (what they’ve seen, what they like, what they’re looking for) without being forced to distill it into a few precise words. This requires moving beyond a single point of input and embracing a richer, more intuitive approach: multimodality.
For too long, the different “senses” of e-commerce (text, images, user behavior, and product context) have been treated as separate tools, siloed into different products with their own APIs and integrations. But the real magic, the truly novel user experiences, happens in the overlaps, where these modalities are not just combined, but unified.
The Evolution from Keywords to Context
Product discovery didn’t get here overnight. It evolved in layers, each new layer adding sophistication but also complexity.
Phase 1: The Reign of the Keyword
The first phase was purely lexical. You typed “red running shoes,” and the engine hunted for those exact words. It was a rigid system of matching strings. A simple misspelling, a synonym, or a more conceptual query could easily lead to a dead end, leaving shoppers frustrated and products unseen.
Phase 2: The Dawn of Understanding (and Silos)
The industry’s first major leap was from lexical to semantic search. Instead of matching words, we started matching meaning. Around the same time, other powerful, but separate, tools began to emerge. However, adopting each of these required retailers to take on new integrations, manage new API endpoints, and accept a longer time-to-value as they stitched together disparate systems.
- Semantic Search: Platforms like Constructor and Bloomreach began indexing catalogs into vector spaces, allowing a search for a “breathable running jacket” to return a “lightweight trail windbreaker” (a minimal sketch of the idea follows this list). This was a huge step forward, but it meant adding a completely separate search vendor with its own complex data requirements.
- Reverse-Image Search: At the forefront of innovation, vendors like Algolia and Syte enabled shoppers to bridge the physical and digital worlds. Snapping a photo of a lamp in a café to find similar items became possible: a game-changer for visually driven domains, but one that required yet another specialized integration.
- Item-Item Recommendations: “People who viewed this also viewed that.” These recommendations were often static, even hard-coded. While useful, they lacked dynamic context and were a blunt instrument, typically managed by a third system.
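To make the lexical-to-semantic shift concrete, here is a minimal sketch of embedding-based retrieval. It assumes the open-source sentence-transformers library and a generic model checkpoint; the catalog and scoring are illustrative, not a description of any particular vendor’s implementation.

```python
# Minimal sketch of semantic retrieval: encode the catalog and the query into
# the same vector space, then rank by cosine similarity.
# Assumes the open-source `sentence-transformers` package; the model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

catalog = [
    "lightweight trail windbreaker",
    "insulated down parka",
    "breathable mesh running cap",
]
catalog_embeddings = model.encode(catalog, convert_to_tensor=True)

# A lexical engine sees no shared keywords between "breathable running jacket"
# and "lightweight trail windbreaker"; in vector space they land close together.
query_embedding = model.encode("breathable running jacket", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, catalog_embeddings)[0]

best = int(scores.argmax())
print(catalog[best], float(scores[best]))
```

The entire leap from Phase 1 to Phase 2 is in that one change of representation: strings become vectors, and “close in meaning” becomes a distance you can rank by.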
Each of these innovations was powerful, but each was treated as a distinct product. You bought a “search solution” or a “recommendations engine.” The modalities remained in their lanes, and the engineering overhead grew with each new addition.
Phase 3: The Blurring of the Lines
More recently, the most forward-thinking retailers realized that these separate senses needed to work together. This led to the first generation of true multimodal experiences:
- Text + User → Hyper-Personalized Search: What if the search bar knew who you were? By blending a shopper’s preferences and history with a text query, search results became tailor-made. A search for “running shoes” could surface trail-ready models for an outdoor enthusiast, while a city commuter would see fashion-forward sneakers first.
- Image + Text → Composite Image Retrieval: The next frontier was refining an image with text, a concept that until recently lived mostly in academic research. True composite retrieval, like asking for “this sofa, but in green,” requires a deep, nuanced understanding of how visual and textual features relate. Implementing it from scratch demands custom hosting, complex indexing, and significant development effort. As a result, many vendors opt for simpler workarounds like “averaging” or “late fusion” (sketched below), where image and text embeddings are blurred together rather than truly composed, often failing to capture the specific modification the user intended. Developer-focused platforms like Marqo have shown it’s possible, but making it accessible and performant for mainstream e-commerce has remained a major challenge.
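To show what that workaround looks like in practice, here is a minimal sketch of late-fusion composite retrieval using an open-source CLIP checkpoint. The weighted average is exactly the “blurring” described above: it nudges the query toward “green” without modeling how that word should modify this particular sofa. The model name, file paths, and weighting factor are illustrative assumptions.

```python
# Sketch of the "late fusion" workaround: average an image embedding and a text
# embedding, then rank catalog images against the blended query vector.
# Assumes the open-source CLIP checkpoint below; file paths and weights are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path: str) -> torch.Tensor:
    inputs = processor(images=Image.open(path), return_tensors="pt")
    return torch.nn.functional.normalize(model.get_image_features(**inputs), dim=-1)

def embed_text(text: str) -> torch.Tensor:
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    return torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)

# "This sofa, but in green": blend the reference image with the text modifier.
alpha = 0.5  # weight given to the text modifier (an arbitrary choice)
query = torch.nn.functional.normalize(
    (1 - alpha) * embed_image("reference_sofa.jpg") + alpha * embed_text("a green sofa"),
    dim=-1,
)

catalog = torch.cat([embed_image(p) for p in ["sofa_green.jpg", "sofa_grey.jpg"]])
scores = query @ catalog.T  # cosine similarity on normalized vectors
print(scores)
```

Because the two embeddings are simply averaged, the blended query tends to drift toward generically green items rather than this exact sofa in green, which is the failure mode described above.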
These hybrid experiences are incredibly powerful, but they also reveal the cracks in a siloed architecture. Bolting a personalization engine onto a search index or piping a visual model into a text query creates immense integration complexity. Each new combination requires more engineering effort, more data mapping, and another API to wrangle.
Phase 4: Starting to Imagine New Recipes
When the friction of integration is removed, developers and UX engineers can start to create entirely new discovery experiences that were previously too complex to build.
- Composite ‘Product’ Retrieval: Imagine being on a product page for a jacket and being able to ask for “this, but waterproof” or “something like this, but in navy.” This takes the context of a specific product and uses natural language to modify it, retrieving variants or complementary items that perfectly match the user’s intent.
- Hyper-personalized Item-Item Recommendations: The traditional “Similar Products” carousel is impersonal. What if it were personalized to you? By blending the current product’s attributes with your unique browsing history and style preferences, the recommendations become dynamically tailored, feeling serendipitous yet deeply relevant (sketched just below).
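As a sketch of that blending idea, the snippet below mixes the anchor product’s embedding with the average of a shopper’s recent-history embeddings before ranking the catalog. The embeddings are random stand-ins and the weighting is an illustrative assumption, not a description of Solenya’s model.

```python
# Sketch of personalized item-item recommendations: blend the anchor product's
# embedding with the shopper's recent browsing history, then rank the catalog.
# Random vectors stand in for real product embeddings; weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
dim = 64

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

catalog = normalize(rng.normal(size=(1000, dim)))   # one vector per product
anchor = catalog[42]                                # the product being viewed
history = catalog[[7, 19, 101]]                     # products this shopper browsed recently

# Blend "similar to this product" with "similar to what this shopper likes".
beta = 0.3  # weight on the personal-taste signal (an arbitrary choice)
query = normalize((1 - beta) * anchor + beta * history.mean(axis=0))

scores = catalog @ query        # cosine similarity against every product
scores[42] = -np.inf            # never recommend the anchor product itself
top = np.argsort(-scores)[:5]
print("personalized similar products:", top)
```

With beta at zero this collapses to the classic impersonal “Similar Products” carousel; raising it lets the shopper’s own history bend the results without abandoning the anchor product.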
These aren’t just incremental improvements; they represent a fundamental shift in how shoppers can interact with a product catalog.
Solenya’s Kitchen: A Unified Foundation
This is where the old model breaks down. Most vendors excel in one or two of these overlaps, forcing you to stitch together a patchwork of solutions. Solenya’s approach is fundamentally different. We don’t just combine modalities; we unify them in a single, coherent space.
In our multimodal embedding space, Image, Text, User, and Product context all sit at the same table. This isn’t a patchwork of different services; it’s one integrated “kitchen” that understands the relationship between all of them.
This unified foundation enables Multimodal Zero-Shot Discovery. It’s a mouthful, but it means three simple things:
- It works across all discovery surfaces: Search, recommendations, and product listings are all powered by the same core intelligence.
- It understands all context: User intent is understood across images, text, and product interactions simultaneously.
- It works from day one: Our models deliver relevant experiences for new products and anonymous users in real-time, solving the “cold-start” problem that plagues traditional systems.
By building on a single, flexible API, developers are no longer forced to choose between search, recommendations, or visual discovery. They can create entirely new experiences that seamlessly blend all of them. From plain toast to a full-course fusion, it’s all multimodality, and Solenya serves it from a single kitchen.
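To make “one integrated kitchen” concrete, here is a purely hypothetical sketch of what a single blended request could look like. The endpoint, field names, and response shape are invented for illustration only; they are not Solenya’s actual API.

```python
# Purely hypothetical sketch of a single multimodal discovery request.
# The endpoint, field names, and response shape are invented for illustration;
# they do not describe Solenya's actual API.
import requests

payload = {
    "query_text": "something like this, but waterproof",  # natural-language modifier
    "anchor_product_id": "JKT-1042",                       # product-page context
    "user_id": "anon-7f3a",                                # may be brand new or anonymous
    "surface": "product_detail",                           # search, recs, listings, ...
    "limit": 12,
}

response = requests.post("https://api.example.com/v1/discover", json=payload, timeout=5)
for item in response.json().get("results", []):
    print(item.get("product_id"), item.get("score"))
```

The point is the shape of the request, not the fields themselves: one call that carries text, product context, and user identity together, instead of three separate integrations stitched across vendors.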