Cerebras Systems Sets Record for Largest AI Models Ever Trained on a Single Device

Single CS-2 system trains multi-billion parameter NLP models including GPT-3XL 1.3B, GPT-J 6B, GPT-3 13B and GPT-NeoX 20B; provides simple setup, faster training and enables switching between models with a few keystrokes

Cerebras Systems, the pioneer in high-performance artificial intelligence (AI) computing, today announced, for the first time ever, the ability to train models with up to 20 billion parameters on a single CS-2 system – a feat not possible on any other single device. By enabling a single CS-2 to train these models, Cerebras reduces the system engineering time necessary to run large natural language processing (NLP) models from months to minutes. It also eliminates one of the most painful aspects of NLP: the partitioning of the model across hundreds or thousands of small graphics processing units (GPUs).

“In NLP, bigger models are shown to be more accurate. But traditionally, only a very select few companies had the resources and expertise necessary to do the painstaking work of breaking up these large models and spreading them across hundreds or thousands of graphics processing units,” said Andrew Feldman, CEO and Co-Founder of Cerebras Systems. “As a result, only very few companies could train large NLP models – it was too expensive, time-consuming and inaccessible for the rest of the industry. Today we are proud to democratize access to GPT-3XL 1.3B, GPT-J 6B, GPT-3 13B and GPT-NeoX 20B, enabling the entire AI ecosystem to set up large models in minutes and train them on a single CS-2.”

“GSK generates extremely large datasets through its genomic and genetic research, and these datasets require new equipment to conduct machine learning,” said Kim Branson, SVP of Artificial Intelligence and Machine Learning at GSK. “The Cerebras CS-2 is a critical component that allows GSK to train language models using biological datasets at a scale and size previously unattainable. These foundational models form the basis of many of our AI systems and play a vital role in the discovery of transformational medicines.”

These world-first capabilities are made possible by a combination of the size and computational resources available in the Cerebras Wafer Scale Engine-2 (WSE-2) and the Weight Streaming software architecture extensions available via the release of version R1.4 of the Cerebras Software Platform, CSoft.

When a model fits on a single processor, AI training is easy. But when a model has more parameters than can fit in memory, or a layer that requires more compute than a single processor can handle, complexity explodes. The model must be broken up and spread across hundreds or thousands of GPUs. This process is painful, often taking months to complete. To make matters worse, the process is unique to each pairing of neural network and compute cluster, so the work is not portable to different compute clusters or to other neural networks. It is entirely bespoke.

The Cerebras WSE-2 is the largest processor ever built. It is 56 times larger, has 2.55 trillion more transistors, and has 100 times as many compute cores as the largest GPU. The size and computational resources of the WSE-2 enable every layer of even the largest neural networks to fit. The Cerebras Weight Streaming architecture disaggregates memory and compute, allowing memory (which is used to store parameters) to grow separately from compute. Thus a single CS-2 can support models with hundreds of billions, even trillions, of parameters.
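To make the idea of separating parameter storage from compute concrete, here is a toy Python sketch. It is not Cerebras code and does not use the CSoft API; the function names (`make_weight_store`, `streamed_forward`) and the tiny layer sizes are hypothetical, chosen only to illustrate the general pattern of holding all weights in an external store and bringing one layer's weights to the compute unit at a time.

```python
# Toy illustration only: NOT the Cerebras implementation or API.
# All names and sizes below are hypothetical.

def make_weight_store(layer_sizes):
    """Stand-in for external parameter memory: one weight matrix per layer."""
    store = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        store.append([[0.01 * (i + j + 1) for j in range(n_out)]
                      for i in range(n_in)])
    return store

def apply_layer(x, w):
    """Compute relu(x @ w) using plain Python lists."""
    n_out = len(w[0])
    return [max(0.0, sum(x[i] * w[i][j] for i in range(len(x))))
            for j in range(n_out)]

def streamed_forward(x, store):
    """Run a forward pass while holding only one layer's weights at a time."""
    peak_resident = 0
    for w in store:                      # "stream" one layer's weights in
        peak_resident = max(peak_resident, len(w) * len(w[0]))
        x = apply_layer(x, w)            # compute with the resident weights
        # the reference to w is dropped before the next layer is fetched
    return x, peak_resident

layer_sizes = [4, 8, 8, 2]
store = make_weight_store(layer_sizes)
total_params = sum(len(w) * len(w[0]) for w in store)
output, peak = streamed_forward([1.0, 2.0, 3.0, 4.0], store)
print(f"total parameters: {total_params}, peak resident on compute: {peak}")
```

In this sketch the compute side never holds more than the largest single layer, so the external store can grow with the model while the compute side stays the same size.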

Graphics processing units, on the other hand, have a fixed amount of memory per GPU. If the model has more parameters than fit in that memory, one needs to buy more graphics processors and then spread the work over multiple GPUs. The result is an explosion of complexity. The Cerebras solution is far simpler and more elegant: by disaggregating compute from memory, the Weight Streaming architecture allows models with any number of parameters to run on a single CS-2.
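A back-of-the-envelope calculation shows why fixed per-device memory forces models to be split. The Python sketch below rests on two assumptions that are not from this announcement: roughly 16 bytes of training state per parameter (a common rule of thumb for mixed-precision training with the Adam optimizer) and a hypothetical device with 80 GB of memory. It ignores activations, so the result is only a lower bound on the number of devices.

```python
import math

# Assumption (not from the announcement): mixed-precision Adam training keeps
# about 16 bytes per parameter (fp16 weights and gradients, fp32 master
# weights, and two fp32 optimizer moments).
BYTES_PER_PARAM_TRAINING = 16

# Assumption: a hypothetical accelerator with 80 GB of on-device memory.
DEVICE_MEMORY_GB = 80

# Parameter counts taken from the model names in the announcement.
MODELS = {
    "GPT-3XL 1.3B": 1.3e9,
    "GPT-J 6B": 6e9,
    "GPT-3 13B": 13e9,
    "GPT-NeoX 20B": 20e9,
}

def training_state_gb(n_params, bytes_per_param=BYTES_PER_PARAM_TRAINING):
    """Approximate memory for weights, gradients and optimizer state, in GB."""
    return n_params * bytes_per_param / 1e9

def min_devices(n_params, device_gb=DEVICE_MEMORY_GB):
    """Lower bound on devices needed just to hold the training state."""
    return math.ceil(training_state_gb(n_params) / device_gb)

for name, n_params in MODELS.items():
    print(f"{name}: ~{training_state_gb(n_params):.0f} GB of training state, "
          f"at least {min_devices(n_params)} device(s) of {DEVICE_MEMORY_GB} GB")
```

Under these assumptions, every model beyond the smallest one already exceeds a single device's memory before activations are counted, which is the point at which the model has to be partitioned.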

Powered by the computational capacity of the WSE-2 and the architectural elegance of the Weight Streaming architecture, Cerebras is able to support the largest NLP networks on a single system. By supporting these networks on a single CS-2, Cerebras reduces setup time to minutes and enables model portability. One can switch between GPT-J and GPT-Neo, for example, with a few keystrokes, a task that would take months of engineering time to achieve on a cluster of hundreds of GPUs.

“Cerebras’ ability to bring large language models to the masses with cost-efficient, easy access opens up an exciting new era in AI. It gives organizations that can’t spend tens of millions an easy and inexpensive on-ramp to major league NLP,” said Dan Olds, Chief Research Officer, Intersect360 Research. “It will be interesting to see the new applications and discoveries CS-2 customers make as they train GPT-3 and GPT-J class models on massive datasets.”

With customers in North America, Asia, Europe and the Middle East, Cerebras is delivering industry-leading AI solutions to a growing roster of customers in the enterprise, government, and high-performance computing (HPC) segments, including GlaxoSmithKline, AstraZeneca, TotalEnergies, nference, Argonne National Laboratory, Lawrence Livermore National Laboratory, Pittsburgh Supercomputing Center, Leibniz Supercomputing Centre, National Center for Supercomputing Applications, Edinburgh Parallel Computing Centre (EPCC), National Energy Technology Laboratory, and Tokyo Electron Devices.

For more information about the Cerebras Software Platform, please visit https://www.cerebras.net/product-software/.

About Cerebras Systems

Cerebras Systems is a team of pioneering computer architects, computer scientists, deep learning researchers, and engineers of all types. We have come together to build a new class of computer system, designed for the singular purpose of accelerating AI and changing the future of AI work forever. Our flagship product, the CS-2 system, is powered by the world’s largest processor, the 850,000-core Cerebras WSE-2, and enables customers to accelerate their deep learning work by orders of magnitude over graphics processing units.

Contacts