PyTorch Team Introduces Cluster Programming
Written by Kay Ewbank
Tuesday, 04 November 2025
|
The developers of PyTorch have introduced Monarch, a distributed programming framework that lets you program a cluster in much the same way you'd program a single machine. Standard PyTorch uses an HPC-style multi-controller model, in which multiple identical copies of the same script are launched across different machines, each running its own instance of the application. This approach doesn't map easily onto machine learning workflows.
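For context, this is roughly what that conventional multi-controller workflow looks like: a launcher such as torchrun starts many identical copies of a script, and each copy joins a process group and coordinates with its peers. This is a minimal sketch with the training loop itself omitted.

# Conventional multi-controller PyTorch: launch N identical copies of this
# script, e.g. with  torchrun --nproc_per_node=8 train.py
# Each copy only knows its own rank and coordinates through the process group.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")   # every replica joins the same group
    rank = dist.get_rank()                    # each copy discovers its own rank
    torch.cuda.set_device(rank % torch.cuda.device_count())
    # ... each replica runs its own copy of the training loop here ...
    dist.destroy_process_group()


if __name__ == "__main__":
    main()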
To provide a better model, the PyTorch team has created a framework that brings the simplicity of single-machine PyTorch to entire clusters. Monarch provides a single-controller programming model, in which one script orchestrates all distributed resources, making them feel almost local. This simplifies distributed programming because code looks and feels like a single-machine Python program, yet can scale across thousands of GPUs. It also means developers can directly use Pythonic constructs such as classes, functions, loops, tasks, and futures to express complex distributed algorithms.

Monarch organizes hosts, processes, and actors into scalable meshes that can be manipulated directly. You can operate on entire meshes (or slices of them) with simple APIs, and Monarch handles the distribution and vectorization automatically. It offers progressive fault handling: by default, when something fails, Monarch stops the whole program, just as an uncaught exception would in a simple local script. Developers can then add fine-grained fault handling exactly where it is needed, catching and recovering from failures just as they would catch exceptions.

Monarch splits the control plane (messaging) from the data plane (RDMA transfers), enabling direct GPU-to-GPU memory transfers across the cluster. Commands travel through one path and data through another, each optimized for what it does best. Monarch also integrates with PyTorch to provide tensors that are sharded across clusters of GPUs. Monarch tensor operations look local but execute across large distributed clusters, with Monarch handling the complexity of coordinating thousands of GPUs.

There are two key APIs, for process and actor meshes, alongside two more advanced APIs for the tensor engine and RDMA buffers. Monarch organizes resources into multidimensional arrays, or meshes. A process mesh is an array of processes spread across many hosts; an actor mesh is an array of actors, each running inside a separate process. The launch version of Monarch supports process meshes over GPU clusters, typically one process per GPU, onto which you can spawn actors to form actor meshes, as sketched in the example below.

Monarch's tensor engine brings distributed tensors to process meshes, letting you write PyTorch programs as if the entire cluster of GPUs were attached to the machine running the script. For bulk data movement, Monarch also provides an RDMA buffer API, enabling direct, high-throughput transfers between processes on supported NICs.

Monarch is available now on GitHub.
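To give a flavour of the process and actor mesh APIs described above, here is a rough sketch of a single controlling script spawning an actor mesh and calling an endpoint across it. The names used here (proc_mesh, Actor, endpoint, and the .call()/.get() future style) are based on Monarch's published examples and may differ in the current release, so treat this as illustrative rather than definitive.

# Illustrative sketch only: API names are based on Monarch's examples and
# may not match the current release exactly.
from monarch.actor import Actor, endpoint, proc_mesh


class Trainer(Actor):
    # One Trainer actor runs inside each process of the mesh.
    @endpoint
    def step(self, lr: float) -> float:
        # A real actor would run a forward/backward pass on its GPU here.
        return lr * 2


# A process mesh: eight processes, typically one per GPU.
procs = proc_mesh(gpus=8)

# An actor mesh: one Trainer spawned into every process of the mesh.
trainers = procs.spawn("trainers", Trainer)

# The single controller script drives the whole mesh at once; .call()
# returns a future covering all eight actors, and a failure in any of
# them surfaces here as an ordinary Python exception you can catch.
results = trainers.step.call(0.01).get()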


