By David Wong
Founder of Claruspon
Preface
The continued expansion of cloud and AI computing is driving change in datacenter networks, because these workloads require a set of features beyond what the Clos topology alone can provide, notably low congestion, high throughput, and good scalability.
In a series of articles, I will show you the fundamentals of a leading-edge technology with more desirable networking characteristics than the Clos network alone, resulting in a more effective network that supports various business goals.
InfiniBand and Ethernet
Many large AI models are trained on expensive InfiniBand-based systems, but this may not be the only option. A comparison of the two shows significant differences: InfiniBand and Ethernet are fundamentally different technologies and do not interoperate with each other.
Table 1 – Comparison of InfiniBand and Ethernet Technologies
InfiniBand is a low-level interconnect technology designed for high-performance computing (HPC) and data center environments. It is commonly deployed in Clos networks with adaptive routing for AI computing; examples include the Nvidia DGX A100 server, DGX POD, DGX SuperPOD, and Nvidia Selene (a supercomputer based on DGX SuperPOD). Because InfiniBand has not gone through Ethernet's long evolution and accumulation of features, it does not natively support technologies such as VPN, VXLAN, VRF-lite, or EVPN, which makes it hard to integrate with existing cloud technology. Supporting business models such as multi-tenancy is also hard with InfiniBand.
Ethernet technology plays a vital role in data centers: it provides the primary connectivity and networking infrastructure for modern data center environments, and it is the foundation for network virtualization. Technologies like Virtual Extensible LAN (VXLAN) and Network Virtualization using Generic Routing Encapsulation (NVGRE) allow virtual networks to be overlaid on the physical infrastructure, enabling more efficient use of resources and better network segmentation. These are the underlying technologies of the Cloud. Overall, Ethernet's scalability, high performance, broad industry support, and much wider interoperability with other networking devices make it the backbone of modern data center networks, providing the connectivity required to support the ever-increasing demands of cloud computing, big data, artificial intelligence, and other data-intensive workloads. Ethernet is commonly deployed in Clos networks with BGP routing for the Cloud and for AI/HPC computing.
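To make the overlay idea concrete, here is a minimal sketch of the 8-byte VXLAN header defined in RFC 7348, which tags each encapsulated Ethernet frame with a 24-bit VXLAN Network Identifier (VNI). The function names build_vxlan_header and parse_vni, and the VNI value 5001, are illustrative assumptions and not part of any particular library; a real deployment also wraps this header in outer UDP, IP, and Ethernet headers, which are omitted here.

```python
import struct

# "I" flag from RFC 7348: set when the header carries a valid VNI.
VXLAN_FLAG_VNI_VALID = 0x08


def build_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header for the given 24-bit VNI (hypothetical helper)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: 8 bits of flags followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header (hypothetical helper)."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8


# Example: tag traffic for an assumed tenant segment 5001.
header = build_vxlan_header(5001)
tenant_vni = parse_vni(header)
```

Because the VNI is 24 bits wide, a single physical Ethernet/IP fabric can carry roughly 16 million isolated virtual segments, which is the kind of segmentation that multi-tenant cloud designs rely on.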
InfiniBand has detailed flow control and congestion control, as well as various routing mechanisms that support topologies matched to specific workloads, which makes it a good fit for AI/HPC computing. However, this flexibility also adds complexity to network operation and administration. Ethernet is playing catch-up in all dimensions in AI and supercomputing. However, it is the author’s belief and opinion that advanced technologies can enable Ethernet/IP to perform detailed routing similar to what InfiniBand does (for example, non-minimal routing), and that the Ethernet industry’s collaborative efforts can bring the cost of AI and supercomputing down.
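As a rough illustration of what non-minimal routing means, the sketch below compares a minimal-hop path with a Valiant-style detour through an intermediate switch. Valiant routing is one well-known non-minimal scheme and is used here only as an example; it is not a description of how InfiniBand or any Ethernet product implements routing. The four-switch ring RING and the functions shortest_path and valiant_path are assumptions made up for this example.

```python
from collections import deque


def shortest_path(graph, src, dst):
    """Breadth-first search; returns one minimal-hop path from src to dst."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    raise ValueError(f"no path from {src} to {dst}")


def valiant_path(graph, src, dst, intermediate):
    """Non-minimal route: minimal path to an intermediate, then on to dst."""
    first_leg = shortest_path(graph, src, intermediate)
    second_leg = shortest_path(graph, intermediate, dst)
    return first_leg + second_leg[1:]  # drop the duplicated intermediate


# Hypothetical ring of four switches: A-B-C-D-A.
RING = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}

minimal = shortest_path(RING, "A", "B")      # uses the direct A-B link
detour = valiant_path(RING, "A", "B", "D")   # avoids the A-B link entirely
```

The detour costs more hops, but it keeps traffic off the direct A-B link. When that link is congested, trading path length for load balance in this way can improve overall throughput.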