A new, third era has started, led by five AI revolutions that will increase global GDP by 55% ¹. These will drive significant creation, capture, and consolidation of seemingly unrelated markets, geographies, and customer segments, reshaping proptech and cross-industry consolidation ².

Multimodal, Multi-task, Zero-shot, and Open-Vocabulary

A line chart showing the growth in model parameters over time

The era of AI 1.0 was narrow. AI 1.0 Models were small, trained on hundreds of thousands of private training examples in order to classify or predict a narrow set of predefined labels. AI 1.0 was driven by the development of new hardware and model architectures which demonstrated state-of-the-art results in image classification, object detection, and semantic segmentation.

In 2018, Howard and Ruder ³ introduced domain-agnostic pretraining, a technique which allowed researchers to train models on a mix of cheap, ubiquitous domain-agnostic data and expensive, limited datasets covering a single task and a narrow set of defined labels. This marked the start of AI 2.0, in which larger datasets enabled the development of larger, more accurate models, well suited to their narrow, predefined tasks. AI 2.0 allowed researchers to scale models using large datasets while reducing the marginal cost of the training data needed.

AI 3.0 isn’t the era of Large Language Models (LLMs) but the era of Multimodal, Multi-task, Zero-shot and Open-Vocabulary models. These pre-trained models can be designed to take inputs from images, text, video, depth-maps, or point-clouds (multimodal), and to solve a broad range of tasks without a limited set of classification labels (open-vocabulary) or task-specific training data (zero-shot). In AI 3.0, new techniques allow researchers to amortize the cost of model development and serving across tasks and modalities, reducing the need for task-specific infrastructure or labelled training data.
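To make "zero-shot" and "open-vocabulary" concrete, here is a minimal sketch of one common pattern: inputs and free-text labels are embedded into a shared vector space, and the closest label wins. The three-dimensional vectors and label strings below are hypothetical stand-ins for the outputs of real encoders, which are not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(input_embedding, label_embeddings):
    """Pick the label whose embedding is closest to the input embedding.

    No classifier is trained here: any label that can be embedded
    can be scored, which is what makes the label set open.
    """
    scores = {label: cosine(input_embedding, emb)
              for label, emb in label_embeddings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical embeddings; a real system would get these from
# pre-trained text and image encoders sharing one vector space.
toy_labels = {
    "a photo of a kitchen": [0.9, 0.1, 0.0],
    "a photo of a bedroom": [0.1, 0.9, 0.1],
    "a floor plan": [0.0, 0.2, 0.9],
}
toy_image = [0.8, 0.2, 0.1]

best_label, all_scores = zero_shot_classify(toy_image, toy_labels)
print(best_label)

# Open vocabulary: a new label can be added at query time without retraining.
toy_labels["a photo of a garden"] = [0.3, 0.3, 0.3]
best_label_after, _ = zero_shot_classify(toy_image, toy_labels)
```

The design choice worth noting is that the label set is just data passed in at query time, so extending it costs one extra embedding rather than a new training run.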

Architecture, Quality and Inverse-Scaling

In AI 2.0, modalities were tied to particular model architectures: LSTMs became popular for language modelling, CNNs for image modelling, FFNs for tabular data, and so on. In AI 3.0, the Transformer architecture ⁴ has allowed the same architecture to be reused across increasingly large datasets, spanning a diverse set of modalities from text to images, and even video.

A Venn diagram showing the relationship between common model architectures.

Transformers are not without flaws, however: they are very memory intensive and hard to train. These memory and training requirements have demanded increasingly large datasets and have practically limited the size of input Transformers can ingest. Despite these challenges, Transformers have changed the economics of innovation, as innovations like FlashAttention ⁵, ALiBi ⁶ and Multi-Query Attention ⁷, which benefit one modality, benefit all modalities. This is profound, and has largely characterized the ‘arms race’ which took place between 2017 and 2022 as industrial labs sought to acquire increasingly large data centers in order to scale up their Transformer models to larger and larger datasets.
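One way to see why input size is limited in practice: a naive attention implementation stores a score for every pair of tokens, so memory grows with the square of the sequence length. The sketch below is back-of-the-envelope arithmetic only; the head count and 2-byte values are illustrative assumptions, not figures from any specific model.

```python
def attention_score_bytes(seq_len, num_heads, bytes_per_value=2):
    """Bytes needed to hold the seq_len x seq_len attention scores
    for one layer and one sequence, if they are fully materialized."""
    return num_heads * seq_len * seq_len * bytes_per_value

# Assumed configuration: 32 heads, 16-bit values.
for n in (2048, 4096, 8192):
    gib = attention_score_bytes(n, num_heads=32) / 2**30
    print(f"{n:>5} tokens -> {gib:.2f} GiB of attention scores per layer")
```

Doubling the sequence length quadruples this cost, which is why methods that avoid materializing the full score matrix, such as FlashAttention, matter for every modality at once.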

The growth in model parameters for language models over time.

While these increases in model size, data and compute have all driven progress in the past, it’s not obvious that scale is still the answer. Recent works like Chinchilla ⁸ on model size, Galactica ⁹, LLaMA ¹⁰, and Phi-1 ¹¹ on pretraining, and Alpaca ¹², LIMA ¹³ and Orca ¹⁴ on fine-tuning all point to the importance of quality over quantity. Furthermore, beyond the practical limits to data acquisition ¹⁵, papers like Schaeffer et al. ¹⁶ and McKenzie et al. ¹⁷ demonstrate the limits and harms of scale: given the capacity, models tend to memorize responses rather than understand their inputs.
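The Chinchilla result on model size can be illustrated with two widely used approximations, both treated here as assumptions rather than exact laws: training compute is roughly 6 FLOPs per parameter per token, and a compute-optimal run uses on the order of 20 training tokens per parameter.

```python
TOKENS_PER_PARAM = 20       # assumed rule of thumb read off the Chinchilla results
FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: C ~ 6 * N * D

def compute_optimal_tokens(n_params):
    """Rough number of training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Rough training compute for n_params parameters and n_tokens tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

n_params = 70_000_000_000                   # a 70B-parameter model
n_tokens = compute_optimal_tokens(n_params)
flops = training_flops(n_params, n_tokens)
print(f"tokens: {n_tokens:.2e}, training FLOPs: {flops:.2e}")
```

Under these assumptions, a fixed compute budget is better spent on a smaller model trained on more tokens than on a larger model that is starved of data.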

Retrieval and Prompting

An illustration showing the intermediate representations of a vision model.

Deep Learning models are simply stacks of smaller, shallow models, which are optimized jointly during the training process to minimize the discrepancy between the model’s final predictions and some labels. Each layer in a Deep Learning model extracts increasingly abstract features from the input data, gradually transforming it into a more meaningful representation. The depth of the model allows for hierarchical learning, where lower layers capture low-level patterns and higher layers capture more abstract and semantically meaningful mathematical representations.

With the development of modern vector databases ¹⁸, in 2019 semantic search ¹⁹ came to disrupt almost 20 years of search stagnation ruled by the BM25 ²⁰ and PageRank ²¹ algorithms. Now, AI 3.0 is again disrupting search, powering new experiences like multimodal and generative search ²².

While Large AI 3.0 Models can often complete tasks without task-specific training data, examples are often necessary to reach the levels of performance and reliability needed in end-user applications. Here, while AI models are disrupting search, search is empowering AI models with live knowledge bases, and ‘textbooks’ of example responses. These examples prompt the models with context on the style of answer required and provide the up-to-date information needed to provide answers which are factually correct.
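The interplay between search and prompting can be sketched in a few lines. The example below is a toy under stated assumptions: a bag-of-words vector stands in for a learned embedding, an in-memory list stands in for a vector database, and the knowledge-base sentences and prompt wording are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Prepend retrieved context so the model can answer from up-to-date facts."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Use the context to answer.\nContext:\n{context}\nQuestion: {query}\nAnswer:"

# Hypothetical knowledge base.
knowledge_base = [
    "The lease for unit 4B expires on 30 June.",
    "Unit 4B has two bedrooms and one bathroom.",
    "The building lobby was renovated last year.",
]
question = "When does the lease for unit 4B expire?"
prompt = build_prompt(question, knowledge_base, k=1)
print(prompt)
```

The resulting string would then be sent to a pre-trained model; updating the knowledge base changes the answers without retraining anything.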

Parameter-Efficient Fine-Tuning (PEFT), Adaptation and Pretrained Foundation Models

The trajectory of AI is a trajectory of economics: how can we minimize our cost per unit of accuracy? In AI 2.0 we managed to reduce the marginal cost of data by using large domain-agnostic datasets for unsupervised pretraining, and we managed to better amortize the cost of pretraining across tasks by using techniques like transfer learning and fine-tuning to repurpose the lower and intermediate layers of pre-trained AI models. This unlocked fertile ground for pre-trained model repositories like TensorFlow Hub, PyTorch Hub ²³, Hugging Face, PaddleHub ²⁴, timm and Kaggle Models ²⁵, and later, in 2020, AdapterHub ²⁶, to share and compare pre-trained and fine-tuned models.

In AI 3.0 we have not only amortized the cost of pre-training but also reduced and amortized the cost of fine-tuning models across modalities and tasks. This next shift is the reason we are seeing the explosion of AI-as-a-service platforms like OpenAI’s API, Replicate and OctoML, which allow users to share large, serverless, pre-trained model endpoints.
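As a rough illustration of how fine-tuning cost can shrink, consider a low-rank adapter, one well-known parameter-efficient technique. The text above does not name a specific method, so the adapter form, the layer size and the rank below are all assumptions chosen for the arithmetic.

```python
def full_finetune_params(d_in, d_out):
    """Trainable weights when a dense d_in x d_out layer is fully fine-tuned."""
    return d_in * d_out

def low_rank_adapter_params(d_in, d_out, rank):
    """Trainable weights for a low-rank update B @ A, with
    A of shape (rank, d_in) and B of shape (d_out, rank);
    the original layer weights stay frozen."""
    return rank * (d_in + d_out)

d = 4096   # assumed hidden size
r = 8      # assumed adapter rank
full = full_finetune_params(d, d)
adapter = low_rank_adapter_params(d, d, r)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.4%}")
```

Because the frozen base weights are shared, many small task-specific adapters can sit on top of one pre-trained model, which is the amortization described above.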

Quantization, Acceleration and Cost

A diagram showing the size of LLMs relative to other code bases or software services.

From the 2000s onward, the Cloud, Microservices, and Serverless changed the economics of the web, unlocking tremendous value for hardware vendors, big tech and small startups. The Cloud reduced the fixed and upfront costs of web hosting, Microservices reduced the unit of development, and Serverless reduced the unit of scaling.

Large Language Models (LLMs), however, do not work well with Serverless. Serverless economics are driven by cold start times: the cold start of a typical AWS Lambda function is around 250ms ²⁷, while the cold start of Banana.dev (a serverless AI hosting platform) is around 14 seconds ²⁸, more than 50x longer. This is unsurprising, and hard to avoid, given the size, complexity and dependencies of modern AI models. While the MPT 7B LLM is roughly a third the size of the Yandex codebase, a user query may only require 0.1% of that codebase at a given time. To generate text from MPT 7B, all 13GB of weights are needed, multiple times, for each word.

Here, recent innovations in Sparsification (with SparseGPT ²⁹), Runtime Compilation (with ONNX ³⁰ and TorchScript), Quantization (with GPTQ ³¹, QLoRA ³² and FP8 ³³), Hardware (with Nvidia Hopper ³⁴) and Frameworks (with Triton ³⁵ and PyTorch 2.0 ³⁶) serve to reduce model size and latency by more than 8x while preserving 98% of model performance on downstream tasks, much as Pruning, Neural Architecture Search (NAS) and Distillation did in AI 2.0. This radically changes the economics of model serving and may be the driver for OpenAI reducing their API costs twice in one year ³⁷.
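The serving numbers above can be checked with simple arithmetic. The sketch below assumes MPT 7B has about 6.7 billion parameters stored as 16-bit values, which is an assumption used to reproduce the roughly 13GB figure. It counts weight storage only and ignores activations, caches and runtime overhead.

```python
def weights_gb(n_params, bits_per_param):
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

MPT_7B_PARAMS = 6.7e9   # assumed parameter count

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weights_gb(MPT_7B_PARAMS, bits):.2f} GB")

# Cold start comparison using the figures quoted in the text.
lambda_cold_start_s = 0.250
banana_cold_start_s = 14.0
cold_start_ratio = banana_cold_start_s / lambda_cold_start_s
print(f"cold start ratio: {cold_start_ratio:.0f}x")
```

Moving from 16-bit to 4-bit weights alone cuts storage by 4x under these assumptions; larger reductions would have to come from combining quantization with the other techniques listed.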

1. GitHub. Research: Quantifying GitHub Copilot’s Impact on Developer Productivity and Happiness [Internet]. 2022. Available from: https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/

2. Forbes. Four Leading Trends in the Proptech Industry [Internet]. 2022. Available from: https://www.forbes.com/sites/forbesbusinesscouncil/2022/11/29/four-leading-trends-in-the-proptech-industry/?sh=38cfa9d4122a

3. Howard J, Ruder S. Universal Language Model Fine-tuning for Text Classification. 2018; Available from: https://scholar.google.co.za/citations?view_op=view_citation&hl=en&user=ZWdEJ54AAAAJ&citation_for_view=ZWdEJ54AAAAJ:u5HHmVD_uO8C

4. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. 2017; Available from: https://arxiv.org/abs/1706.03762

5. Dao T, Fu DY, Ermon S, Rudra A, Ré C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. 2022; Available from: https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf

6. Press O, Smith NA, Lewis M. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. 2021; Available from: https://arxiv.org/abs/2108.12409

7. Shazeer N. Fast Transformer Decoding: One Write-Head is All You Need. 2019; Available from: https://arxiv.org/abs/1911.02150

8. Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, et al. Training Compute-Optimal Large Language Models. 2022; Available from: https://arxiv.org/abs/2203.15556

9. Taylor R, Kardas M, Cucurull G, Scialom T, Hartshorn A, others. Galactica: A Large Language Model for Science. 2022; Available from: https://galactica.org/static/paper.pdf

10. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, others. LLaMA: Open and Efficient Foundation Language Models. 2023; Available from: https://arxiv.org/abs/2302.13971

11. Gunasekar S, Zhang Y, Aneja J, Mendes CCT, others. Textbooks Are All You Need. 2023; Available from: https://arxiv.org/abs/2306.11644

12. Taori R, Gulrajani I, Zhang T, Dubois Y, Li X, others. Stanford Alpaca: An Instruction-following LLaMA Model. 2023; Available from: https://arxiv.org/pdf/2303.16199

13. Zhou C, Liu P, Xu P, Iyer S, Sun J, others. LIMA: Less Is More for Alignment. 2023; Available from: https://arxiv.org/abs/2305.11206

14. Mukherjee S, Mitra A, Jawahar G, Agarwal S, Palangi H, Awadallah A. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. 2023; Available from: https://arxiv.org/abs/2306.02707

15. Villalobos P, Sevilla J, Heim L, Besiroglu T, Hobbhahn M, Ho A. Will We Run Out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning. 2022; Available from: https://arxiv.org/abs/2211.04325

16. Schaeffer R, Miranda B, Koyejo S. Are Emergent Abilities of Large Language Models a Mirage? 2023; Available from: https://arxiv.org/abs/2304.15004

17. McKenzie IR, Lyzhov A, Pieler M, others. Inverse Scaling: When Bigger Isn’t Better. 2023; Available from: https://arxiv.org/pdf/2306.09479.pdf

18. Bridgwater A. The Rise of Vector Databases [Internet]. 2023. Available from: https://www.forbes.com/sites/adrianbridgwater/2023/05/19/the-rise-of-vector-databases/

19. Nayak P. Understanding Searches Better Than Ever Before [Internet]. 2019. Available from: https://blog.google/products/search/search-language-understanding-bert/

20. Robertson S, Walker S, Jones S, Hancock-Beaulieu MM. Okapi at TREC-3. 2000; Available from: https://www.sciencedirect.com/science/article/abs/pii/S0306457300000157

21. Brin S, Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. 1998; Available from: https://www.sciencedirect.com/science/article/abs/pii/S016975529800110X

22. Armano D. More Than Search: The AI Arms Race Is Also About the Tech Stack [Internet]. 2023. Available from: https://www.forbes.com/sites/davidarmano/2023/02/14/more-than-search-the-ai-arms-race-is-also-about-the-tech-stack/

23. PyTorch. PyTorch Hub [Internet]. 2019. Available from: https://pytorch.org/hub/

24. PaddlePaddle. PaddleHub: Pre-trained Models Toolkit [Internet]. 2020. Available from: https://github.com/PaddlePaddle/PaddleHub

25. Kaggle. Kaggle Models [Internet]. 2023. Available from: https://www.kaggle.com/discussions/product-feedback/391200

26. Pfeiffer J, Rücklé A, Poth C, Kamath A, Vulić I, Ruder S, et al. AdapterHub: A Framework for Adapting Transformers. 2020; Available from: https://adapterhub.ml/blog/2020/11/adapting-transformers-with-adapterhub/

27. Shilkov M. AWS Lambda Cold Starts [Internet]. 2023. Available from: https://mikhail.io/serverless/coldstarts/aws/

28. Banana.dev. Serverless AI Model Hosting Platform [Internet]. 2023. Available from: https://www.banana.dev/

29. Frantar E, Alistarh D. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. 2023; Available from: https://arxiv.org/abs/2301.00774

30. Hugging Face. Accelerate Transformers Training with ONNX Runtime [Internet]. 2022. Available from: https://huggingface.co/blog/optimum-onnxruntime-training

31. Frantar E, Ashkboos S, Hoefler T, Alistarh D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. 2022; Available from: https://arxiv.org/abs/2210.17323

32. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient Finetuning of Quantized Language Models. 2023; Available from: https://github.com/artidoro/qlora

33. Micikevicius P, Stosic D, Burgess N, Cornea M, others. FP8 Formats for Deep Learning. 2022; Available from: https://arxiv.org/abs/2209.05433

34. Lambda Labs. NVIDIA Hopper H100 and FP8 Support [Internet]. 2022. Available from: https://lambdalabs.com/blog/nvidia-hopper-h100-and-fp8-support

35. Tillet P, Kung HT, Cox D. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. 2019; Available from: https://openai.com/research/triton

36. PyTorch. PyTorch 2.0 Release [Internet]. 2023. Available from: https://pytorch.org/blog/pytorch-2.0-release/

37. Wiggers K. OpenAI Intros New Generative Text Features While Reducing Pricing [Internet]. 2023. Available from: https://techcrunch.com/2023/06/13/openai-intros-new-generative-text-features-while-reducing-pricing/