
The Five Technologies driving AI 3.0

Marcus Gawronsky

A new, third era of AI has started, led by five AI revolutions that will increase global GDP by 55% 1. This will lead to significant creation, capture, and consolidation of seemingly unrelated markets, geographies, and customer segments, reshaping proptech and driving cross-industry consolidation 2.

Multimodal, Multi-task, Zero-shot, and Open-Vocabulary


A line chart showing the growth in model parameters over time.

The era of AI 1.0 was narrow. AI 1.0 models were small, trained on hundreds of thousands of private training examples in order to classify or predict a narrow set of predefined labels. AI 1.0 was driven by the development of new hardware and model architectures which demonstrated state-of-the-art results in image classification, object detection, and semantic segmentation.

In 2018, Howard and Ruder 3 introduced domain-agnostic pretraining, a technique which allows researchers to train models on a mix of cheap, ubiquitous domain-agnostic data and expensive, limited datasets covering a single task and a narrow set of defined labels. This marked AI 2.0, which used larger datasets to enable the development of larger, more accurate models, well suited to their narrow, predefined tasks. AI 2.0 allowed researchers to scale models using large datasets while reducing the marginal cost of the data needed for training.

AI 3.0 isn’t the era of Large Language Models (LLMs) but the era of Multimodal, Multi-task, Zero-shot and Open-Vocabulary models. These pre-trained models can be designed to take inputs from images, text, video, depth-maps, or point-clouds (multimodal), and to solve a broad range of tasks without a limited set of classification labels (open-vocabulary) or task-specific training data (zero-shot). In AI 3.0, new techniques allow researchers to amortize the cost of model development and serving across tasks and modalities, reducing the need for task-specific infrastructure or labelled training data.
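As a rough illustration of what "open-vocabulary" and "zero-shot" can mean in practice, the sketch below picks a label by comparing an input embedding with the embeddings of free-text label prompts. It is a minimal sketch only: the three-dimensional vectors are made-up placeholders for the outputs of a real multimodal encoder, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(input_embedding, label_embeddings):
    """Return the best-matching label and the score of every candidate label."""
    scores = {
        label: cosine(input_embedding, vector)
        for label, vector in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Assumed toy embeddings: in a real system a text encoder would produce these
# from the label prompts, so new labels can be added without retraining.
label_embeddings = {
    "a photo of a kitchen": [0.9, 0.1, 0.0],
    "a photo of a bedroom": [0.1, 0.9, 0.1],
    "a floor plan": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]  # assumed output of an image encoder

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

Because the labels are just text, extending the label set means adding another entry to the dictionary rather than collecting task-specific training data.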

Architecture, Quality and Inverse-Scaling


In AI 2.0, modalities were tied to particular model architectures: LSTMs became popular for language modelling, CNNs for image modelling, FFNs for tabular data, and so on. In AI 3.0, the Transformer architecture 4 has allowed the same architecture to be reused across increasingly large datasets, spanning a diverse set of modalities from text to images, and even video.

A venn diagram showing the relationship between common model architectures.

Transformers are not without flaws, however: they are very memory intensive and hard to train. These memory and training requirements have demanded increasingly large datasets and have practically limited the size of the inputs Transformers can ingest. Despite these challenges, Transformers have changed the economics of innovation, as innovations like FlashAttention 5, ALiBi 6 and Multi-Query Attention 7 which benefit one modality benefit all modalities. This is deeply profound and has largely characterized the ‘arms race’ which took place between 2017 and 2022, as industrial labs sought to acquire increasingly large data centers in order to scale up their Transformer models to larger and larger datasets.
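One way to see why input size is limited in practice: a naive self-attention implementation stores one score for every pair of input tokens, so memory grows with the square of the sequence length. The sketch below does that arithmetic. The head count and 16-bit values are assumptions chosen for illustration, and the function name is hypothetical.

```python
def attention_score_bytes(seq_len, num_heads, bytes_per_value=2):
    """Bytes needed to hold one layer's attention scores for one sequence.

    A naive implementation stores one seq_len x seq_len matrix per head.
    """
    return num_heads * seq_len * seq_len * bytes_per_value


# Assumed configuration: 32 attention heads, 16-bit (2-byte) values.
for seq_len in (1024, 4096, 16384):
    mib = attention_score_bytes(seq_len, num_heads=32) / 2**20
    print(f"{seq_len:>6} tokens -> {mib:,.0f} MiB per layer")
```

Under these assumptions, a 4x longer input needs 16x the memory for attention scores. Work such as FlashAttention targets exactly this cost by avoiding storing the full score matrix.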

The growth in model parameters for language models over time.

While these increases in model size, data and compute have all driven progress in the past, it’s not obvious that scale is still the answer. Recent works like Chinchilla 8 on model size, Galactica 9, LLaMA 10, and Phi-1 11 on pretraining, and Alpaca 12, LIMA 13 and Orca 14 on fine-tuning all point to the importance of quality over quantity. Furthermore, beyond the practical limits to data acquisition 15, papers like Schaeffer et al. 16 and McKenzie et al. 17 demonstrate the limits and harms of scale: given the capacity, models tend to memorize responses rather than understand their inputs.
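The Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per model parameter, with training compute approximated as 6 x parameters x tokens. The sketch below applies those two constants. Both are commonly quoted approximations assumed here for illustration, not figures taken from this article, and the function names are hypothetical.

```python
def compute_optimal_tokens(num_params, tokens_per_param=20):
    """Training tokens suggested by the assumed tokens-per-parameter rule of thumb."""
    return num_params * tokens_per_param


def approx_training_flops(num_params, num_tokens):
    """Common approximation: about 6 floating-point operations per parameter per token."""
    return 6 * num_params * num_tokens


params = 70_000_000_000  # a 70B-parameter model
tokens = compute_optimal_tokens(params)
flops = approx_training_flops(params, tokens)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```

Read this way, a model that is much larger than its data budget justifies is not "more capable for free": the same compute could train a smaller model on more, or better, data.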

Retrieval and Prompting


An illustration showing the intermediate representations of a vision model.

Deep Learning models are simply stacks of smaller, shallow models, which are optimized jointly during the training process to minimize the discrepancy between the model’s final predictions and some labels. Each layer in a Deep Learning model extracts increasingly abstract features from the input data, gradually transforming it into a more meaningful representation. The depth of the model allows for hierarchical learning, where lower layers capture low-level patterns, and higher layers capture more abstract and semantically meaningful mathematical representations.
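A toy version of "stacked layers optimized jointly" can be written in a few lines. The sketch below assumes a deliberately tiny two-layer model with one weight per layer and a single training example; the starting weights, learning rate and step count are arbitrary illustrative choices.

```python
def relu(x):
    """Simple non-linearity applied between the two layers."""
    return x if x > 0 else 0.0


def forward(w1, w2, x):
    """Layer 1 produces an intermediate representation; layer 2 maps it to a prediction."""
    hidden = relu(w1 * x)
    return hidden, w2 * hidden


def train(x, label, w1=0.5, w2=0.5, lr=0.1, steps=50):
    """Gradient descent on the squared error, updating both layers together."""
    losses = []
    for _ in range(steps):
        hidden, prediction = forward(w1, w2, x)
        losses.append((prediction - label) ** 2)
        error = 2 * (prediction - label)  # derivative of the loss w.r.t. the prediction
        grad_w2 = error * hidden
        grad_w1 = error * w2 * (x if w1 * x > 0 else 0.0)  # chain rule through ReLU
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2, losses


w1, w2, losses = train(x=1.0, label=2.0)
print(f"loss before: {losses[0]:.4f}, loss after: {losses[-1]:.2e}")
```

Neither layer is trained in isolation: the gradient for the first layer depends on the current weight of the second, which is what "optimized jointly" refers to.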

With the development of modern vector databases 18, in 2019 semantic search 19 came to disrupt almost 20 years of search stagnation ruled by the BM25 20 and PageRank 21 algorithms. Now, AI 3.0 is again disrupting search, powering new experiences like multimodal and generative search 22.
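To show what lexical ranking of the BM25 kind does and does not capture, here is a minimal sketch of the standard BM25 scoring formula. The tokenizer (lowercase, split on whitespace), the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant and the three example documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    scores = []
    for doc in tokenized:
        counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for d in tokenized if term in d)
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
            tf = counts[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    "two bedroom apartment with balcony",
    "spacious flat near the station",
    "office space for rent",
]
scores = bm25_scores("apartment", documents)
print(scores)
```

The second document describes the same kind of property but shares no words with the query, so it scores zero. Semantic search addresses this gap by comparing embeddings rather than exact terms.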

While Large AI 3.0 Models can often complete tasks without task-specific training data, examples are often necessary to reach the levels of performance and reliability needed in end-user applications. Here, while AI models are disrupting search, search is empowering AI models with live knowledge bases, and ‘textbooks’ of example responses. These examples prompt the models with context on the style of answer required and provide the up-to-date information needed to provide answers which are factually correct.
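The pattern described above, retrieving relevant examples and placing them in the prompt, can be sketched as follows. This is a minimal illustration only: word overlap stands in for a real vector search, the knowledge base entries are invented, and `build_prompt` is a hypothetical helper rather than any particular library's API.

```python
def overlap(a, b):
    """Jaccard word overlap, used here as a stand-in for embedding similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def build_prompt(question, knowledge_base, k=2):
    """Retrieve the k most similar examples and place them before the question."""
    ranked = sorted(
        knowledge_base,
        key=lambda example: overlap(question, example["question"]),
        reverse=True,
    )
    parts = ["Answer in the style of the examples, using only the facts shown."]
    for example in ranked[:k]:
        parts.append(f"Q: {example['question']}\nA: {example['answer']}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Invented entries standing in for a live knowledge base.
knowledge_base = [
    {"question": "What is the deposit for a rental apartment?",
     "answer": "The deposit is two months of rent."},
    {"question": "When is the office open?",
     "answer": "The office is open on weekdays."},
]

prompt = build_prompt("How much is the deposit for an apartment?", knowledge_base, k=1)
print(prompt)
```

The retrieved example supplies both the expected answer style and the up-to-date fact, so neither has to be stored in the model's weights.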

Parameter-Efficient Fine-tuning (PEFT), Adaptation and Pretrained Foundation Models


The trajectory of AI is a trajectory of economics: how can we minimize our cost per unit of accuracy? In AI 2.0 we managed to reduce the marginal cost of data using large domain-agnostic datasets for unsupervised pretraining, and we managed to better amortize the cost of pretraining across tasks using techniques like transfer learning and fine-tuning to repurpose the low and intermediate layers of pre-trained AI models. This unlocked fertile ground for pre-trained model repositories like TensorFlow Hub, PyTorch Hub 23, HuggingFace, PaddleHub 24, timm and Kaggle Models 25, and later, in 2020, AdapterHub 26, to share and compare pre-trained and fine-tuned models.

In AI 3.0 we have not only amortized the cost of pre-training but also reduced and amortized the cost of fine-tuning models across modalities and tasks. This next shift is the reason we are seeing the explosion of AI-as-a-service platforms like OpenAI’s API, Replicate and OctoML, which allow users to share large serverless pre-trained model endpoints.
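To give a sense of how much parameter-efficient fine-tuning can reduce the cost of adapting a model, the sketch below counts trainable parameters for one well-known approach, low-rank adaptation (LoRA), where a frozen weight matrix is adjusted by the product of two small matrices. The layer size and rank are assumptions picked for illustration, and LoRA is used here as an example of the family rather than as the specific method this section has in mind.

```python
def full_finetune_params(d_out, d_in):
    """Trainable values when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def low_rank_adapter_params(d_out, d_in, rank):
    """Trainable values for a low-rank update B @ A.

    B has shape d_out x rank and A has shape rank x d_in; the original
    weight matrix stays frozen and can be shared across tasks.
    """
    return rank * (d_out + d_in)


# Assumed layer size and rank, for illustration only.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
adapter = low_rank_adapter_params(d_out, d_in, rank)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {full // adapter}x")
```

Under these assumptions each task needs only a small adapter on top of a shared frozen model, which is what makes it practical to serve many fine-tuned variants from one pre-trained endpoint.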

Quantization, Acceleration and Cost


A diagram showing the size of LLMs relative to other code bases or software services.

In the 2000s, the Cloud, Microservices, and Serverless changed the economics of the web, unlocking tremendous value for hardware vendors, big tech and small startups. The Cloud reduced the fixed and upfront costs of web hosting, Microservices reduced the unit of development and Serverless reduced the unit of scaling.

Large Language Models (LLMs), however, cannot work with Serverless! Serverless is driven by cold start times. The cold start of a typical AWS Lambda function is around 250ms 27; the cold start of Banana.dev (a serverless AI hosting platform) is around 14 seconds 28, or roughly 50x longer. This is obvious and unavoidable when we think about the size, complexity and dependencies of modern AI models. While the MPT 7B LLM is roughly a third the size of the Yandex codebase, a user query may only require 0.1% of that codebase at a given time. To generate text from MPT 7B, all 13GB are needed, multiple times, for each word.

Here, recent innovations in Sparsification (with SparseGPT 29), Runtime Compilation (with ONNX 30 and TorchScript), Quantization (with GPTQ 31, QLoRA 32 and FP8 33), Hardware (with Nvidia Hopper 34) and Frameworks (with Triton 35 and PyTorch 2.0 36) serve to reduce model size and latency by more than 8x while preserving 98% of model performance on downstream tasks, much as Pruning, Neural Architecture Search (NAS) and Distillation did in AI 2.0. This radically changes the economics of model serving and may be the driver for OpenAI reducing their API prices twice in one year 37.
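The size figures above follow from simple arithmetic, and quantization attacks them directly by storing each weight in fewer bits. The sketch below assumes a model of about 7 billion parameters (read off the "7B" name) and shows a basic symmetric 8-bit round trip. It illustrates the general idea only and is not the GPTQ, QLoRA or FP8 procedure; the example weights and function names are made up.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone at a given precision."""
    return num_params * bits_per_weight // 8


def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


num_params = 7_000_000_000  # assumed from the "7B" in the model name
for bits in (16, 8, 4):
    gib = weight_bytes(num_params, bits) / 2**30
    print(f"{bits:>2}-bit weights: {gib:.1f} GiB")

weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"largest rounding error: {max_error:.4f}")

# Cold-start figures quoted in the text: 14 seconds versus 250 ms.
cold_start_ratio = 14 / 0.25
```

At 16 bits the assumed 7 billion parameters come to about 13 GiB, in line with the figure quoted above; halving the bits halves the bytes that must be loaded at cold start and read for every generated token, at the price of a small, bounded rounding error per weight.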

1.
GitHub. Research: Quantifying GitHub Copilot’s Impact on Developer Productivity and Happiness [Internet]. 2022. Available from: https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
2.
Forbes. Four Leading Trends in the Proptech Industry [Internet]. 2022. Available from: https://www.forbes.com/sites/forbesbusinesscouncil/2022/11/29/four-leading-trends-in-the-proptech-industry/?sh=38cfa9d4122a
3.
Howard J, Ruder S. Universal Language Model Fine-tuning for Text Classification. In 2018. Available from: https://scholar.google.co.za/citations?view_op=view_citation&hl=en&user=ZWdEJ54AAAAJ&citation_for_view=ZWdEJ54AAAAJ:u5HHmVD_uO8C
4.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. In 2017. Available from: https://arxiv.org/abs/1706.03762
5.
Dao T, Fu DY, Ermon S, Rudra A, Ré C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In 2022. Available from: https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf
6.
Press O, Smith NA, Lewis M. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. 2021; Available from: https://arxiv.org/abs/2108.12409
7.
Shazeer N. Fast Transformer Decoding: One Write-Head is All You Need. 2019; Available from: https://arxiv.org/abs/1911.02150
8.
Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, et al. Training Compute-Optimal Large Language Models. 2022; Available from: https://arxiv.org/abs/2203.15556
9.
Taylor R, Kardas M, Cucurull G, Scialom T, Hartshorn A, others. Galactica: A Large Language Model for Science. 2022; Available from: https://galactica.org/static/paper.pdf
10.
Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, others. LLaMA: Open and Efficient Foundation Language Models. 2023; Available from: https://arxiv.org/abs/2302.13971
11.
Gunasekar S, Zhang Y, Anber J, Hejazinia CCT, others. Textbooks Are All You Need. 2023; Available from: https://arxiv.org/abs/2306.11644
12.
Taori R, Gulrajani I, Zhang T, Dubois Y, Li X, others. Stanford Alpaca: An Instruction-following LLaMA Model. 2023; Available from: https://arxiv.org/pdf/2303.16199
13.
Zhou C, Liu P, Xu P, Iyer S, Sun J, others. LIMA: Less Is More for Alignment. 2023; Available from: https://arxiv.org/abs/2305.11206
14.
Mukherjee S, Mitra A, Jawahar G, Agarwal S, Palangi H, Awadallah A. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. 2023; Available from: https://arxiv.org/abs/2306.02707
15.
Villalobos P, Sevilla J, Heim L, Besiroglu T, Hobbhahn M, Ho A. Will We Run Out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning. 2022; Available from: https://arxiv.org/abs/2211.04325
16.
Schaeffer R, Miranda B, Koyejo S. Are Emergent Abilities of Large Language Models a Mirage? 2023; Available from: https://arxiv.org/abs/2304.15004
17.
McKenzie IR, Lyzhov A, Pieler M, others. Inverse Scaling: When Bigger Isn’t Better. 2023; Available from: https://arxiv.org/pdf/2306.09479.pdf
18.
Bridgwater A. The Rise of Vector Databases [Internet]. 2023. Available from: https://www.forbes.com/sites/adrianbridgwater/2023/05/19/the-rise-of-vector-databases/
19.
Nayak P. Understanding Searches Better Than Ever Before [Internet]. 2019. Available from: https://blog.google/products/search/search-language-understanding-bert/
20.
Robertson S, Walker S, Jones S, Hancock-Beaulieu MM. Okapi at TREC-3. 2000; Available from: https://www.sciencedirect.com/science/article/abs/pii/S0306457300000157
21.
Brin S, Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. 1998; Available from: https://www.sciencedirect.com/science/article/abs/pii/S016975529800110X
22.
Armano D. More Than Search: The AI Arms Race Is Also About the Tech Stack [Internet]. 2023. Available from: https://www.forbes.com/sites/davidarmano/2023/02/14/more-than-search-the-ai-arms-race-is-also-about-the-tech-stack/
23.
PyTorch. PyTorch Hub [Internet]. 2019. Available from: https://pytorch.org/hub/
24.
PaddlePaddle. PaddleHub: Pre-trained Models Toolkit [Internet]. 2020. Available from: https://github.com/PaddlePaddle/PaddleHub
25.
Kaggle. Kaggle Models [Internet]. 2023. Available from: https://www.kaggle.com/discussions/product-feedback/391200
26.
Pfeiffer J, Rücklé A, Poth C, Kamath A, Vulić I, Ruder S, et al. AdapterHub: A Framework for Adapting Transformers. In 2020. Available from: https://adapterhub.ml/blog/2020/11/adapting-transformers-with-adapterhub/
27.
Shilkov M. AWS Lambda Cold Starts [Internet]. 2023. Available from: https://mikhail.io/serverless/coldstarts/aws/
28.
Banana.dev. Serverless AI Model Hosting Platform [Internet]. 2023. Available from: https://www.banana.dev/
29.
Frantar E, Alistarh D. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. 2023; Available from: https://arxiv.org/abs/2301.00774
30.
Hugging Face. Accelerate Transformers Training with ONNX Runtime [Internet]. 2022. Available from: https://huggingface.co/blog/optimum-onnxruntime-training
31.
Frantar E, Ashkboos S, Hoefler T, Alistarh D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. 2022; Available from: https://arxiv.org/abs/2210.17323
32.
Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient Finetuning of Quantized Language Models. 2023; Available from: https://github.com/artidoro/qlora
33.
Micikevicius P, Stosic D, Burgess N, Cornea M, others. FP8 Formats for Deep Learning. 2022; Available from: https://arxiv.org/abs/2209.05433
34.
Lambda Labs. NVIDIA Hopper H100 and FP8 Support [Internet]. 2022. Available from: https://lambdalabs.com/blog/nvidia-hopper-h100-and-fp8-support
35.
Tillet P, Kung HT, Cox D. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In 2019. Available from: https://openai.com/research/triton
36.
PyTorch. PyTorch 2.0 Release [Internet]. 2023. Available from: https://pytorch.org/blog/pytorch-2.0-release/
37.
Wiggers K. OpenAI Intros New Generative Text Features While Reducing Pricing [Internet]. 2023. Available from: https://techcrunch.com/2023/06/13/openai-intros-new-generative-text-features-while-reducing-pricing/