By mid-2022, Meta expects to operate what it believes will be the world's fastest artificial intelligence (AI) supercomputer. Dubbed the AI Research SuperCluster (RSC), the system is already running and ranks among the fastest AI supercomputers in operation today, the company said in a blog post on 24 January.
Development of RSC is ongoing. Once the second phase is completed, which Meta expects by mid-2022, the system will deliver nearly 5 exaflops of mixed-precision compute.
Meta, formerly known as Facebook, is already using the supercomputer to train large models in natural language processing (NLP) and computer vision for research. The company uses large-scale AI models for ongoing priorities, such as detecting harmful content on its social platforms. Ultimately, though, it wants to train models with trillions of parameters to help it power the metaverse, the virtual world that Meta intends to support with its platforms and products.
"The experiences we're building for the metaverse require enormous compute power (quintillions of operations/second!) and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more," Meta CEO Mark Zuckerberg said in a statement.
Currently, RSC comprises 760 Nvidia DGX A100 systems as its compute nodes, for a total of 6,080 GPUs. The GPUs communicate via an Nvidia Quantum 200Gb/s InfiniBand two-level Clos fabric that has no oversubscription. RSC's storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
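The node and storage figures above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the quoted numbers, not anything published by Meta. It assumes eight A100 GPUs per DGX A100 system, which is the standard configuration for that product, and the function names are hypothetical.

```python
# Figures quoted in the article.
DGX_NODES = 760

# Assumption: each DGX A100 system carries 8 A100 GPUs (standard configuration).
GPUS_PER_NODE = 8

# Storage tiers in petabytes, as listed in the article.
STORAGE_PB = {
    "Pure Storage FlashArray": 175,
    "Penguin Computing Altus cache": 46,
    "Pure Storage FlashBlade": 10,
}


def total_gpus(nodes: int = DGX_NODES, per_node: int = GPUS_PER_NODE) -> int:
    """Return the GPU count implied by the number of compute nodes."""
    return nodes * per_node


def total_storage_pb(tiers: dict = STORAGE_PB) -> int:
    """Return the combined capacity of all storage tiers, in petabytes."""
    return sum(tiers.values())


if __name__ == "__main__":
    print(f"GPUs: {total_gpus():,}")
    print(f"Storage: {total_storage_pb()} PB")
```

Under that assumption the node count reproduces the 6,080 GPUs quoted above, and the three storage tiers add up to 231 petabytes.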
By comparison, the US Energy Department's Perlmutter AI supercomputer, unveiled last summer as the world's fastest AI supercomputer, delivers nearly 4 exaflops of mixed-precision performance with 6,159 Nvidia A100 Tensor Core GPUs.
By the time Meta's RSC is complete, the InfiniBand network fabric will connect 16,000 GPUs as endpoints, making it one of the largest such networks deployed to date. Additionally, Meta designed a caching and storage system that can serve 16 TB/s of training data. The company plans to scale the system's capacity up to 1 exabyte, equivalent to 36,000 years of high-quality video.
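To put the headline numbers in perspective, the rough sketch below derives a few implied quantities from them. It is a back-of-the-envelope illustration, not a benchmark. It assumes decimal units (1 exabyte = 10^18 bytes, 1 TB = 10^12 bytes), treats "nearly 5" and "nearly 4" exaflops as exactly 5 and 4, and uses a 365.25-day year. The function names are hypothetical.

```python
# Assumption: a year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def implied_tflops_per_gpu(exaflops: float, gpus: int) -> float:
    """Average teraflops per GPU implied by a system-wide exaflops figure."""
    return exaflops * 1e18 / gpus / 1e12


def hours_to_stream(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to read the full capacity once at the given rate."""
    return capacity_bytes / rate_bytes_per_s / 3600


def implied_video_mbit_per_s(capacity_bytes: float, years: float) -> float:
    """Average video bitrate (Mbit/s) if the capacity holds that many years."""
    return capacity_bytes * 8 / (years * SECONDS_PER_YEAR) / 1e6


if __name__ == "__main__":
    EXABYTE = 1e18   # bytes, decimal units (assumption)
    TERABYTE = 1e12  # bytes, decimal units (assumption)

    print(f"RSC (planned): {implied_tflops_per_gpu(5, 16_000):.1f} TFLOPS per GPU")
    print(f"Perlmutter:    {implied_tflops_per_gpu(4, 6_159):.1f} TFLOPS per GPU")
    print(f"Full read of 1 EB at 16 TB/s: {hours_to_stream(EXABYTE, 16 * TERABYTE):.1f} hours")
    print(f"Implied video bitrate: {implied_video_mbit_per_s(EXABYTE, 36_000):.1f} Mbit/s")
```

On these assumptions, the planned RSC works out to roughly 312 teraflops per GPU and Perlmutter to roughly 650, which suggests the two mixed-precision figures may not be quoted on the same basis. A full pass over 1 exabyte at 16 TB/s would take about 17 hours, and 36,000 years of video in 1 exabyte corresponds to an average bitrate of around 7 Mbit/s.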