|Microsoft Releases DeepSpeed For PyTorch|
|Written by Kay Ewbank|
|Thursday, 13 February 2020|
Microsoft Research has released an open source library that's compatible with PyTorch. DeepSpeed is a deep learning optimization library that makes it easier to train large models, making it possible to train models with 100 billion parameters.
Microsoft says the new library uses memory optimization technology to improve PyTorch model training, meaning researchers can use more parameters. The library makes better use of memory that is local to the GPU, and can be used with existing PyTorch applications with only minor changes to the application code.
Advantages offered by DeepSpeed include distributed training, mixed precision, and checkpointing, through lightweight APIs that are compatible with PyTorch.
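DeepSpeed's behavior is driven by a JSON configuration file passed to the library at startup. As a hedged illustration of how features such as mixed precision and gradient accumulation are switched on, the fragment below sketches a plausible configuration; the key names follow DeepSpeed's documented config schema, but the specific values are invented for illustration (and assume a job where micro-batch size × accumulation steps × GPU count equals the train batch size, e.g. 8 GPUs here):

```json
{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0001
    }
  }
}
```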
One part of DeepSpeed, the Zero Redundancy Optimizer (ZeRO), is a parallelized optimizer responsible for much of the reduction in resource use. ZeRO can train deep learning models with 100 billion parameters on the current generation of GPU clusters at three to five times the throughput of the current best system. Microsoft says its researchers used these techniques to create Turing Natural Language Generation (Turing-NLG), at 17 billion parameters the largest publicly known language model.
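To see why partitioning optimizer state helps, consider some back-of-the-envelope arithmetic. The sketch below is illustrative, not DeepSpeed's exact accounting: it assumes roughly 16 bytes of parameter, gradient, and optimizer state per parameter (a commonly cited figure for mixed-precision Adam training), and that ZeRO-style partitioning spreads that footprint evenly across the data-parallel ranks:

```python
def per_gpu_gigabytes(params_billions, bytes_per_param=16, zero_ranks=1):
    """Rough per-GPU memory (GB) for parameters, gradients, and optimizer
    states. Assumes ~16 bytes per parameter (fp16 weights and gradients plus
    fp32 master weights and Adam moments), partitioned across `zero_ranks`
    data-parallel GPUs. Illustrative only, not DeepSpeed's exact accounting."""
    return params_billions * bytes_per_param / zero_ranks

# A 100-billion-parameter model: ~1600 GB of state on a single GPU,
# far beyond any one device, but ~12.5 GB per GPU across 128 ranks.
baseline = per_gpu_gigabytes(100)
sharded = per_gpu_gigabytes(100, zero_ranks=128)
```

This is the intuition behind training 100-billion-parameter models on existing GPU clusters: no single GPU holds the full optimizer state.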
Other optimization techniques included in DeepSpeed are constant buffer optimization and smart gradient accumulation. Constant Buffer Optimization (CBO) enables high network and memory throughput while restricting memory usage to a constant size. For most memory- and network-bound operations, performance depends on the size of the operand, so CBO fuses smaller operands into a buffer of a pre-defined size that is big enough to improve performance without unnecessary memory overhead.
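The fusion idea can be sketched in a few lines of plain Python. This is a minimal illustration of the packing step, not DeepSpeed's implementation; the buffer size and data are hypothetical:

```python
# Illustrative sketch of constant-buffer fusion: many small operands are
# packed into fixed-size buffers so one fused operation (e.g. a single
# all-reduce) replaces many tiny ones.

BUFFER_ELEMS = 8  # pre-defined constant buffer size (hypothetical)

def fuse_into_buffers(tensors, buffer_elems=BUFFER_ELEMS):
    """Greedily pack small flat tensors into constant-size buffers.
    Returns a list of buffers, each holding at most buffer_elems values."""
    buffers, current = [], []
    for t in tensors:
        for x in t:
            if len(current) == buffer_elems:
                buffers.append(current)
                current = []
            current.append(x)
    if current:
        buffers.append(current)
    return buffers

# Twelve 2-element gradients would mean 12 tiny operations; after fusion
# they fill 3 buffers of 8 elements, so only 3 fused operations run.
grads = [[1.0, 2.0]] * 12
buffers = fuse_into_buffers(grads)
```

The performance win comes from amortizing each operation's fixed launch and latency cost over a larger operand, while the buffer's constant size caps the memory overhead.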
The third optimization technique is Smart Gradient Accumulation. This can be used to run larger batch size with limited memory by breaking an effective batch into several sequential micro-batches, and averaging the parameter gradients across these micro-batches.
The researchers say DeepSpeed supports all forms of model parallelism, including tensor-slicing approaches such as Megatron-LM and pipelined parallelism approaches such as PipeDream and GPipe. It does so by requiring only that the model parallelism framework provide a model parallelism unit (mpu) implementing a few bookkeeping functionalities.
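A minimal sketch of what such a bookkeeping object might look like is shown below. The method names follow the convention used by Megatron-style frameworks; the class itself and its flat rank layout (consecutive ranks grouped into model-parallel groups) are assumptions for illustration, not DeepSpeed's required interface:

```python
class SimpleMPU:
    """Sketch of a model parallelism unit (mpu): the bookkeeping object a
    model-parallel framework hands to the training engine. Assumes
    `world_size` ranks split into consecutive model-parallel groups of
    `mp_size` ranks each (a hypothetical layout for illustration)."""

    def __init__(self, rank, world_size, mp_size):
        assert world_size % mp_size == 0
        self.rank, self.world_size, self.mp_size = rank, world_size, mp_size

    def get_model_parallel_rank(self):
        """This rank's position within its model-parallel group."""
        return self.rank % self.mp_size

    def get_model_parallel_world_size(self):
        """Number of ranks in each model-parallel group."""
        return self.mp_size

    def get_data_parallel_rank(self):
        """Index of this rank's model-parallel group among all groups."""
        return self.rank // self.mp_size

    def get_data_parallel_world_size(self):
        """Number of data-parallel replicas of the model."""
        return self.world_size // self.mp_size

# Rank 5 in an 8-GPU job with 2-way model parallelism:
mpu = SimpleMPU(rank=5, world_size=8, mp_size=2)
```

With this bookkeeping in place, the data-parallel machinery (such as ZeRO's partitioning) can operate across model-parallel groups without knowing how the model itself is split.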
DeepSpeed is available for download on GitHub.