tokensarchitecturemultimodaldesign

Models are Markup, Tokens are Features...

Marcus Gawronsky

Introduction

While we often speak about 'the great convergence' in machine learning as a convergence in architecture, rather than blocks, it is crucial to recognize that this convergence has not simplified model variation entirely. Despite the shift from specialized architectures such as LSTMs, CNNs, and RNNs towards the transformer block, transformers themselves have varied significantly both in terms of architecture and pre-training methodology. Early architectures included the 1 encoder-only masked language models approach, 2, 3 GAN-based discriminative training, and 4 span corruption methods. However, recent developments have moved predominantly towards GPT-style next-token-prediction models, where embedding, vision, and image generation architectures converge around a decoder-only autoregressive design, leading to shared block structures across LLMs, diffusion transformers (DiTs), and vision transformers (ViTs).

Inductive Biases

When discussing inductive biases, it's critical to acknowledge that all models inherently possess biases suited to specific data or tasks. CNNs exhibit translation equivariance, RNNs leverage Markovian dependencies, and modern LLMs introduce a different set of biases entirely. Modern transformer models predominantly rely on an inductive bias grounded in tokenization: they implicitly assume text comprises whitespace-delimited sub-word units (tokens), upon which statistical relationships like chi-squared statistics can be computed, and sequential token predictions can be performed.

An illustrative example is observed in the embedding layers (model.embed_tokens.weight) and prediction heads (lm_head.weight) across model evolutions:

  • 'OPT 6.7B' (released 3 years ago) 5 comprised 🤏 6% of the model parameters,
  • 'Meta-Llama-3-8B' (released 11 months ago) 6 comprised 📏 13% of the model parameters, and
  • 'Gemma-2-2B' (released 5 months ago) 7 comprises 22% of model parameters-even with weight-sharing!

The reason behind this steady increase, as shown by ByteDance’s analysis on over-tokenization 8, is simple: performance. The the larger the token vocabulary, the smaller the context length, the lower the compute and latency for a given user query. The more tokens, the fewer word-pieces, and the higher the likelihood that - given a large enough pre-training corpora - the model will learn rare and domain-specific words which it can use to generate more accurate and relevant responses - rather than relying on multi-head attention or other attention mechanisms to learn the combinatorial relationships between sub-world tokens.

This is a crucial point for LLM product managers to understand: the tokenization process is not just a preprocessing step, but a fundamental part of the model's architecture and performance. The choice of tokenization method can significantly impact the model's ability to understand and generate text, in-line with key application or business use cases.

cls➡️instruct➡️user➡️think

Loading diagram…

The story of LLM advancement, has been a story of TOKENS 9, 15.

Token(s)First Prominent UseKey Literature / ReleaseFeature(s) Unlocked
<cls>BERT (2018)Devlin et al. 9Sequence‑level tasks: sentiment, intent, similarity
<instruct>InstructGPT (2022)Ouyang et al. 10Instruction‑following chatbots (ChatGPT)
<user> / <assistant>ChatGPT (2022)OpenAI 11Multi‑turn, context‑aware conversational agents
Image tokens (<img>…>)CLIP → Flamingo / LLaVA (2021‑23)Radford et al.; Alayrac et al. 12Multimodal search, VQA, GPT‑4V
Code‑aware delimitersCodex / GPT‑4 (2021‑)Chen et al. 13Copilot, code generation & refactoring
"Thinking" tokens (<think>)R1, Chain‑of‑Thought (2023)Narang et al. 14Intermediate reasoning, improved math & planning
Tool‑use tokens (<tool>…>)Toolformer, Gorilla (2023‑24)Schick et al.; Patel et al. 15Agents that call APIs, browse, execute code

As models introduce new tokens, they unlock new features. For example, the <cls> token in BERT 9 sequence-level tasks like sentiment analysis and intent classification. The <instruct> token in InstructGPT 10 allowed for instruction-following chatbots like ChatGPT 11. The <user> and <assistant> tokens in ChatGPT enabled multi-turn, context-aware conversations. Image tokens in CLIP and Flamingo unlocked multimodal search and visual question answering. Code-aware delimiters in Codex and GPT-4 facilitated code generation and refactoring. "Thinking" tokens in R1 and Chain-of-Thought improved intermediate reasoning, math, and planning. Finally, tool-use tokens in Toolformer and Gorilla enabled agents to call APIs, browse the web, and execute code.

For GPT-style models, the space of inputs tokens defines the sensory landscape of the model, and the world-model that is developed. For text-only GPT-style pre-trained models, the token-space covers only Unicode or ASCII characters, and the model is trained to predict the next token in a sequence. For these models the sensory landscape is limited to the text and the model's world-model covers only the space of entities and actions which that sensory landscape can describe. Given that world-model, the model is able to embody an agent, but cannot interpret a meta-narrative about it's purpose or behaviour.

With the introduction of InstructGPT, the <instruct> token, and the <user> and <assistant> tokens in ChatGPT, the sensory landscape of the model is expanded to include instructions and conversational context. This allows the model to understand and respond to user queries in a more meaningful way, effectively embodying an agent that can interpret a meta-narrative about its purpose and behaviour. With the introduction of these tokens and post-training, the world-model of the LLM is expanded to include the space of entities and actions that can be described by the instructions and conversational context. This allows the model to perform tasks and satisfy features that allow it to be used in a wider range of applications, such as chatbots, virtual assistants, copilots and various agentic workflows.

The introduction of image tokens in CLIP and LLava further expands the sensory landscape of the model to include visual information, re-orienting the model's world-model to include the space of entities and actions that can be described by visual information. As these images map onto a pre-defined token embedding space, a local minima may be found that limits the model's ability to understand and interpolate between colors, textures and shapes - as explored in 16. However, in staged fine-tuning, the model is able to learn to interpolate between these tokens and adapt its internal representation to accommodate the new sensory landscape. Using this landscape, the model is able to perform tasks such as visual question answering and classification, and use that information to inform its responses in a conversational context. To users and product managers, this unlocks new features and applications, which may solve business problems in OCR, scene understanding, and image captioning.

Similar to <user> and <assistant> tokens, <think> and <tool> look to develop a world-model that may be self-referential, and allows the model to understand and respond to its own reasoning process. The <think> token is used in R1 and Chain-of-Thought prompting to improve intermediate reasoning, math, and planning. This allows the model to break down complex tasks into smaller steps, improving its ability to solve problems and perform reasoning tasks. The <tool> token is used in Toolformer and Gorilla to enable agents to call APIs, browse the web, and execute code. This expands the model's capabilities beyond just text generation, allowing it to interact with external tools (or entities) and resources to perform tasks that require real-time information or complex computations.

Throughout these developments, the introduction of new tokens serves to expand the sensory landscape of the model, this new landscape defines a new world-model that might understand objectness (<user>), shades (<img>), or reasoning (<think>). This new world-model allows the model to perform tasks and satisfy features that were previously impossible, such as multi-turn conversations, visual question answering, and API interactions. As models continue to evolve, we can expect to see even more innovative uses of tokens to enhance their performance and capabilities.

Conclusion

Ultimately, the evolution of large language models has hinged fundamentally on advancements in tokenization. Each new token introduced reshapes the sensory landscape, redefining the model's internal world-model, and unlocking new functionalities. From simple sequence-level embeddings to multimodal tokens and self-referential reasoning capabilities, tokens continue to shape LLMs’ developmental trajectory profoundly. Looking forward, innovations like Byte Latent Transformers 15 and Super-BPE 5 future tokens might transcend traditional limitations, driving further innovation in machine learning applications and expanding their business potential.

For LLM product managers, comprehending token-driven innovation is crucial-not only for recognizing present limitations but also for harnessing the full potential of emerging technological capabilities.

1.
Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: North American Chapter of the Association for Computational Linguistics [Internet]. 2019. Available from: https://arxiv.org/pdf/1810.04805
2.
Xiao S, Liu Z, Shao Y, Cao Z. RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder. arXiv preprint arXiv:220512035 [Internet]. 2022; Available from: https://arxiv.org/abs/2205.12035
3.
Clark K, Luong M-T, Le QV, Manning CD. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. International Conference on Learning Representations [Internet]. 2020; Available from: https://huggingface.co/google/electra-small-generator
4.
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research [Internet]. 2020; Available from: https://arxiv.org/pdf/1910.10683
5.
Meta AI. OPT 6.7B Model Weights [Internet]. 2022. Available from: https://huggingface.co/pkarypis/opt-6.7b-sft
6.
Meta AI. Meta-Llama-3-8B Model Weights [Internet]. 2024. Available from: https://huggingface.co/meta-llama/Meta-Llama-3-8B
7.
Google DeepMind. Gemma-2-2B-it Model Weights [Internet]. 2024. Available from: https://huggingface.co/google/gemma-2-2b-it
8.
Zhu W, others. Over-Tokenization and Its Impacts on Large Language Models [Internet]. arXiv preprint arXiv:2501.16975. 2025. Available from: https://arxiv.org/abs/2501.16975
9.
ML6. The Art of Pooling Embeddings [Internet]. 2023. Available from: https://blog.ml6.eu/the-art-of-pooling-embeddings-c56575114cf8
10.
Ouyang L, others. Aligning Language Models to Follow Instructions [Internet]. 2022. Available from: https://openai.com/index/instruction-following/
11.
OpenAI. Introducing ChatGPT [Internet]. 2022. Available from: https://openai.com/index/chatgpt/
12.
DeepSeek AI. DeepSeek-R1 Tokenizer [Internet]. 2025. Available from: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/raw/main/tokenizer.json
13.
NexaAI. Octo-Net Tokenizer [Internet]. 2024. Available from: https://huggingface.co/NexaAIDev/octo-net/raw/main/tokenizer.json
14.
Li XL, Liang P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In: Association for Computational Linguistics [Internet]. 2021. Available from: https://arxiv.org/abs/2101.00190
15.
Meta AI. Byte Latent Transformer [Internet]. 2024. Available from: https://www.youtube.com/watch?v=loaTGpqfctI
16.
Hu Y, others. ColorBench: Can VLMs See and Understand Colors? [Internet]. 2025. Available from: https://huggingface.co/papers/2504.10514