RoCE networks for distributed AI coaching at scale
- AI networks play an vital function in interconnecting tens of hundreds of GPUs collectively, forming the foundational infrastructure for coaching, enabling massive fashions with a whole lot of billions of parameters akin to LLAMA 3.1 405B.
- This week at ACM SIGCOMM 2024 in Sydney, Australia, we’re sharing particulars on the community we’ve constructed at Meta over the previous few years to help our large-scale distributed AI coaching workload.
- Our paper, “RDMA over Ethernet for Distributed AI Training at Meta Scale,” supplies the main points on how we design, implement, and function one of many world’s largest AI networks at scale.
The rising prevalence of AI has launched a brand new period of communication calls for. Distributed coaching, particularly, imposes essentially the most vital pressure on information heart networking infrastructure. As an example, a typical generative AI (GenAI) job could necessitate tight coordination of tens of hundreds of GPUs over the course of a number of weeks. Developing a dependable, high-performance community infrastructure able to accommodating this burgeoning demand necessitates a reevaluation of information heart community design.
When Meta launched distributed GPU-based coaching, we determined to assemble specialised information heart networks tailor-made for these GPU clusters. We opted for RDMA Over Converged Ethernet model 2 (RoCEv2) because the inter-node communication transport for almost all of our AI capability.
We have now efficiently expanded our RoCE networks, evolving from prototypes to the deployment of quite a few clusters, every accommodating hundreds of GPUs. These RoCE clusters help an intensive vary of manufacturing distributed GPU coaching jobs, together with rating, content material suggestion, content material understanding, pure language processing, and GenAI mannequin coaching, amongst different workloads.
Topology
We constructed a devoted backend community particularly for distributed coaching. This allowed us to evolve, function, and scale independently from the remainder of the info heart community. To help massive language fashions (LLMs), we expanded the backend community in the direction of the DC-scale, e.g., incorporating topology-awareness into the coaching job scheduler.
The separation
The coaching cluster depends on two unbiased networks: the frontend (FE) community for duties akin to information ingestion, checkpointing, and logging, and the backend (BE) community for coaching, as depicted beneath.
A coaching rack is related to each the FE and BE of the info heart community. The FE has a hierarchy of community layers – rack switches (RSWs), material switches (FSWs), and better – that homes the storage warehouse, which supplies GPUs with the mandatory enter information for coaching workloads. We guarantee that there’s sufficient ingress bandwidth on the rack swap to not hinder the coaching workload.
The BE is a specialised material that connects all RDMA NICs in a non-blocking structure, offering excessive bandwidth, low latency, and lossless transport between any two GPUs within the cluster, no matter their bodily location. This backend material makes use of the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets for transport over the community.
AI Zone
Our BE networks have undergone a number of transformations. Initially, our GPU clusters used a easy star topology with a number of AI racks related to a central Ethernet swap working the non-routable RoCEv1 protocol. This setup had clear limitations in GPU scale and swap redundancy. Due to this fact, we swiftly transitioned to a fabric-based structure for prolonged scalability and better availability.
We designed a two-stage Clos topology for AI racks, often known as an AI Zone. The rack coaching swap (RTSW), serving because the leaf swap, provides scale-up connectivity for GPUs inside the rack utilizing copper-based DAC cables. The backbone tier, composed of modular cluster coaching switches (CTSW), supplies scale-out connectivity amongst all racks within the cluster. The CTSW has deep buffers statically divided over the ports within the chassis. The RTSWs hook up with CTSWs through single-mode fiber and 400G pluggable transceivers.
The AI Zones are designed to help numerous interconnected GPUs in a non-blocking method. Nonetheless, rising AI developments, akin to LLMs like Llama, demand a GPU scale bigger than what a single AI zone supplies. To accommodate this, we designed an aggregator coaching swap (ATSW) layer that connects the CTSWs in an information heart constructing, increasing the RoCE area past a single AI Zone.
Observe, the cross-AI Zone connectivity is oversubscribed by design, with community site visitors balanced utilizing ECMP. To mitigate the efficiency bottleneck for cross-AI Zone site visitors, we enhanced the coaching job scheduler to discover a “minimal reduce” when dividing the coaching nodes into totally different AI Zones, decreasing the cross-AI Zone site visitors and thus collective completion time. The scheduler does this by studying the place of GPU servers within the logical topology to advocate a rank project.
Routing
The scaling of compute energy and community topology mentioned above led to the query of the right way to effectively stability and route the large coaching site visitors. Particularly, the AI coaching workloads had a number of difficult traits:
- Low entropy: In comparison with conventional information heart workloads, the quantity and the variety of flows for AI workloads are a lot smaller and the stream patterns are normally repetitive and predictable.
- Burstiness: On the time dimension, the flows normally exhibit the “on and of”’ nature within the time granularity of milliseconds.
- Elephant flows: For every burst, the depth of every stream might attain as much as the road price of NICs.
ECMP and path pinning
We initially thought-about the broadly adopted ECMP, which locations flows randomly based mostly on the hashes on the five-tuple: supply and vacation spot IPs, supply and vacation spot UDP ports, and protocol. Nonetheless, and as anticipated, ECMP rendered poor efficiency for the coaching workload as a result of low stream entropy.
Alternatively, we designed and deployed a path-pinning scheme within the preliminary years of our deployment. This scheme routed packets to particular paths based mostly on the vacation spot “slice” (the index of the RTSW downlink). This labored properly if every rack was absolutely assigned to the identical job and there was no failure within the community. Nonetheless, this was seldom true. We noticed that the rack might be partially allotted to a job, with solely one of many two hosts within the rack utilizing the uplink bandwidth. This fragmented job placement brought about uneven site visitors distribution and congestion on the uplinks of the actual RTSW and degraded the coaching efficiency as much as greater than 30%. Additional, community failures on a uplink or a CTSW brought about the affected flows to be erratically reassigned to different CTSWs by ECMP. These reassigned flows collided with different present flows and slowed down the entire coaching job.
We mitigated the rapid influence of those stream collisions by upgrading the bandwidth of the RTSW uplinks bandwidth by 2x. Therefore we allowed for the RTSW uplink capability to be 1:2 under-subscribed in comparison with the RTSW downlink capability. Whereas this mitigated the rapid efficiency influence, this was an costly resolution because it required 2x community capability. Thus, we acknowledged this as a short-term mitigation and proceeded to additional levels of routing evolution.
Queue pair scaling
We subsequent revisited ECMP with an intent to extend the variety of flows for hierarchical collectives by the queue pair (QP) scaling software program function within the collective library.
To account for this, we configured switches to carry out Enhanced ECMP (E-ECMP) to moreover hash on the vacation spot QP subject of a RoCE packet utilizing the UDF functionality of the swap ASIC. This elevated entropy and, in comparison with baseline ECMP with out QP scaling, we noticed that E-ECMP together with QP scaling confirmed efficiency enchancment of as much as 40% for the AllReduce collective.
We evaluated two QP scaling methods. The primary concerned splitting every message meant to be posted over a single QP, as a substitute onto a number of QPs leading to a number of flows. Nevertheless it additionally produced smaller message sizes on material in addition to a number of ACKs. The second method concerned posting every message to a distinct queue, in a round-robin style. For the NIC message sizes demonstrated in our manufacturing with NCCL, we noticed the latter to be performing properly. This function has been vital for ECMP scalability by rising the community flows for hierarchical collectives like AllReduce.
Whereas we improved ECMP efficiency with QP scaling, the underlying probabilistic nature of hashing was a persistent draw back of this routing scheme. Additionally, the necessity to customise the QP scaling issue and methodology based mostly on the workload sort, whereas workable within the short-term, offered long-term operational complexity.
Congestion management
As we transitioned to 400G deployments, we tried to tune DCQCN to adapt to new community speeds and topology. Nonetheless, with default DCQCN settings and doubled ECN thresholds in comparison with 200G networks, efficiency was degraded. Additional investigation revealed that DCQCN implementation in firmware has modified, introducing bugs and lowered visibility with issues referring to right CNP counting.
We proceeded with out DCQCN for our 400G deployments. Right now, we’ve had over a yr of expertise with simply PFC for stream management, with out some other transport-level congestion management. We have now noticed secure efficiency and lack of persistent congestion for coaching collectives.
Receiver-driven site visitors admission
To mitigate the congestion for 400G and past, we co-designed the collective library and RoCE transport to implement receiver-driven site visitors admission for higher efficiency. The diagram beneath exhibits that the GPU-to-GPU communication structure in our manufacturing coaching clusters predominantly makes use of two-stage copy and receiver-initiated communication through the NCCL collective library. Every GPU’s excessive bandwidth reminiscence (HBM) maintains a number of channels for parallel transmission of chunked collective messages. The sender GPU threads first copy information from the compute buffer to an accessible channel buffer. The sender CPU proxy thread can solely publish an RDMA write request after receiving a clear-to-send (CTS) packet from the receiver, which incorporates the dimensions and reminiscence data. The receiver’s GPU threads then copy the channel buffer contents to the vacation spot compute buffer. Lastly, CPU proxy threads on each side recycle the channel buffer, and the receiver CPU proxy sends one other CTS packet as soon as the channel buffer is prepared.
We successfully leverage this mechanism as a receiver-driven site visitors admission to restrict the quantity of in-flight site visitors on the community, particularly when congestion begins to construct up. Nonetheless, configuring the best setting might be difficult as:
- The variety of channels is proscribed as a result of useful resource competition on GPU threads with concurrent compute operations;
- Setting the channel buffer dimension requires a extra cautious stability between congestion spreading and bandwidth under-utilization than Infiniband as a consequence of RoCE’s extra coarse-grained stream management and attainable end-host slowness.
Thus, we took two steps to enhance the efficiency. First, we experimentally decided the best parameter settings for the variety of channels and channel buffer dimension throughout varied coaching job sizes and collective varieties. Second, we carried out excessive precedence queuing at switches for CTS packets to expedite the notifications and mitigate potential bandwidth hunger.
Congestion management has been a focus of analysis in RDMA networks. DCQCN has been the gold commonplace for storage-focused networks. Nonetheless, our expertise with distributed AI coaching workloads supplies a distinct perspective on tailoring the congestion management algorithms. Regardless of turning off DCQCN and a number of cases of RTSW sending PFC to a deep-buffer CTSW, we’ve not encountered a situation during the last 4 years the place manufacturing AI coaching site visitors causes the CTSW to ship PFCs to RTSWs persistently.
Our present resolution depends upon cautious coordination between the collective communication library and the community. It might depend upon the relative throughput between GPU and community, which is probably not relevant to all situations. We encourage the analysis group to place extra deal with this matter.
Shifting ahead
The design and operation of large-scale RoCE networks for distributed AI coaching workloads have advanced to fulfill the rising calls for of computational density and scale. By segregating FE and BE networks, using varied routing schemes, and optimizing collective site visitors patterns, we’ve been in a position to construct a performant and dependable community infrastructure. These designs and insights underline the significance of deeply understanding the coaching workload and translating these implications into community part design, in the end contributing to the development of distributed AI coaching infrastructure.
With the quick rising development of GenAI workload, our community infrastructure will evolve quickly.
Learn the paper
RDMA over Ethernet for Distributed AI Training at Meta Scale
Acknowledgements
We want to thank all contributors to the paper, together with Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Adi Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashi Gandham, Omar Baldonado. Many present and former individuals within the Community Infrastructure staff at Meta have contributed to productionizing RoCE networks for AI coaching through the years. Specifically, we want to acknowledge Srinivas Sridharan, Petr Lapukhov, Jose Leitao, and Brandon Taylor. This work is an in depth collaboration with our companions in Meta’s AI Manufacturing Engineering, AI and Methods Co-design, and AI {Hardware} Methods groups.