|Meta Builds AI Supercomputer|
|Written by Lucy Black|
|Thursday, 27 January 2022|
Meta, formerly known as Facebook, has announced that its researchers have designed and built an AI Research SuperCluster (RSC) that they believe is among the fastest AI supercomputers running today and will be the fastest in the world once it is fully built in mid-2022.
Announcing the new supercomputer, Kevin Lee, Technical Program Manager, and Shubho Sengupta, Software Engineer at Meta, said that Meta researchers have already started using RSC to train large models in natural language processing (NLP) and computer vision for research, with the aim of one day training models with trillions of parameters.
The need for the supercomputer is driven by the increasingly large, complex, and adaptable models being trained in areas such as vision, speech, and language, and for critical use cases like identifying harmful content.
Like other AI supercomputers, the Meta machine has been built by combining multiple GPUs into compute nodes, which are then connected by a high-performance network fabric to allow fast communication between those GPUs. RSC today comprises a total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
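The hardware figures above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not anything published by Meta: the node count and storage sizes come from the article, while the figure of 8 GPUs per DGX A100 system is an assumption taken from NVIDIA's published configuration, and the names `total_gpus` and `total_storage_pb` are hypothetical helpers introduced only for illustration.

```python
# Figures quoted in the article.
DGX_A100_NODES = 760
STORAGE_PB = {
    "Pure Storage FlashArray": 175,
    "Penguin Computing Altus (cache)": 46,
    "Pure Storage FlashBlade": 10,
}

# Assumption (not stated in the article): each DGX A100 system holds 8 GPUs.
GPUS_PER_NODE = 8


def total_gpus(nodes: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Return the GPU count for a cluster of identical compute nodes."""
    return nodes * gpus_per_node


def total_storage_pb(tiers: dict[str, int]) -> int:
    """Return the combined capacity, in petabytes, of all storage tiers."""
    return sum(tiers.values())


if __name__ == "__main__":
    print("GPUs:", total_gpus(DGX_A100_NODES))
    print("Storage (PB):", total_storage_pb(STORAGE_PB))
```

Under that assumption, 760 nodes give the 6,080 GPUs quoted in the article, and the three storage tiers add up to 231 petabytes.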
The researchers say that early benchmarks on RSC, compared with Meta’s legacy production and research infrastructure, show it runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster. That means a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before.
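The link between the threefold NLP speedup and the three-week training time is a simple division. A minimal sketch, assuming the speedup applies uniformly to the whole training run (the function name `weeks_after_speedup` is hypothetical, introduced only for illustration):

```python
def weeks_after_speedup(baseline_weeks: float, speedup: float) -> float:
    """Estimate training time after a uniform speedup factor is applied."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_weeks / speedup


if __name__ == "__main__":
    # Article figures: nine weeks on the legacy infrastructure, three times faster on RSC.
    print(weeks_after_speedup(9, 3))
```

With the article's figures, nine weeks divided by a factor of three gives three weeks.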
Training such models requires real-world data from Meta’s production systems, which raises questions about privacy and security. The researchers say these concerns are handled by isolating RSC from the larger internet: it has no direct inbound or outbound connections, and traffic can flow only from Meta’s production data centers.
"To meet our privacy and security requirements, the entire data path from our storage systems to the GPUs is end-to-end encrypted"
The data is also anonymized, and only decrypted at one endpoint.
|Last Updated ( Thursday, 27 January 2022 )|