Technology

Cloud and AI Computing – the Cross Road

Apr 16, 2024

By David Wong
Founder of Claruspon

Preface

The continued expansion of cloud and AI computing is driving change in datacenter networks. These workloads require a set of features beyond what the Clos topology alone can provide, notably low congestion, high throughput and good scalability.

In this series of articles, I will present the fundamentals of a leading-edge technology whose networking characteristics are more desirable than those of the Clos network alone, resulting in a more effective network that supports a variety of business goals.

InfiniBand and Ethernet

Many large AI models are trained on expensive InfiniBand-based systems, but this may not be the only option. A comparison of InfiniBand and Ethernet shows that they are two drastically different technologies, and they do not interoperate with each other.


Table 1 – Comparison of InfiniBand and Ethernet Technologies

InfiniBand is a low-level interconnect technology designed for high-performance computing (HPC) and datacenter environments. For AI computing, it is commonly deployed in a Clos network with adaptive routing. Examples include the Nvidia DGX A100 server, DGX POD, DGX SuperPOD, and Nvidia Selene (a supercomputer based on the DGX SuperPOD). Because InfiniBand lacks Ethernet's long evolution and accumulated ecosystem, it does not natively support technologies such as VPN, VXLAN, VRF-lite and EVPN, which makes it hard to interoperate with existing cloud technology. Supporting business models such as multi-tenancy is also hard with InfiniBand.

Ethernet plays a vital role in datacenters, providing the primary connectivity and networking infrastructure for modern datacenter environments. It is also the foundation for network virtualization. Technologies like Virtual Extensible LAN (VXLAN) and Network Virtualization using Generic Routing Encapsulation (NVGRE) create virtual networks that overlay the physical infrastructure, enabling more efficient use of resources and better network segmentation; they are the underlying technologies of the Cloud. Overall, Ethernet's scalability, high performance, broad interoperability with networking devices, and wide industry support make it the backbone of modern datacenter networks, providing the connectivity required to support the ever-increasing demands of cloud computing, big data, artificial intelligence, and other data-intensive workloads. Ethernet is commonly deployed in a Clos network with BGP routing for cloud and AI/HPC computing.
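To make the overlay idea concrete, here is a minimal sketch of how VXLAN tags a tenant's frame with a 24-bit VXLAN Network Identifier (VNI). The header layout (an 8-bit flags field with the "I" bit set, 24 reserved bits, the 24-bit VNI, and 8 reserved bits) and UDP port 4789 follow RFC 7348. The function names and the sample frame are hypothetical and serve only as illustration; a real deployment would also add the outer Ethernet, IP and UDP headers.

```python
import struct

VXLAN_UDP_PORT = 4789          # IANA-assigned destination UDP port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08    # the "I" flag: the VNI field carries a valid value


def build_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header for the given 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) followed by 24 reserved bits.
    # Word 1: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN payload."""
    word0, word1 = struct.unpack("!II", header[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word1 >> 8


def encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prefix an inner Ethernet frame with a VXLAN header (outer headers omitted)."""
    return build_vxlan_header(vni) + inner_frame


# Hypothetical usage: two tenants send an identical inner frame, but the
# different VNIs keep their traffic in separate virtual networks.
frame = b"tenant-frame"
tenant_a = encapsulate(100, frame)
tenant_b = encapsulate(200, frame)
```

Because the VNI is 24 bits wide, a single physical fabric can carry roughly 16 million isolated segments, compared with the 4094 usable IDs of a traditional VLAN tag. This is one reason overlays of this kind suit multi-tenant clouds.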

InfiniBand has detailed flow control and congestion control, as well as various routing mechanisms that support topologies matched to specific workloads, which makes it a good fit for AI/HPC computing. However, this flexibility also adds complexity to network operation and administration. Ethernet is playing catch-up in all dimensions of AI and supercomputing. Nevertheless, it is the author's belief that advanced technologies can enable Ethernet/IP to perform detailed routing similar to what InfiniBand does, such as non-minimal routing, and that the Ethernet industry's collaborative efforts can bring down the cost of AI/supercomputing.
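For readers unfamiliar with the term, the sketch below illustrates non-minimal routing in its simplest generic form. It is a textbook-style illustration, not the author's patented method and not InfiniBand's actual adaptive routing algorithm. It assumes a hypothetical full mesh of four switches, a made-up per-link load metric, and a cost function of one unit per hop plus the load on each link. The shortest (minimal) path is used when it is lightly loaded, and a longer detour through an intermediate switch is used when the direct link is congested.

```python
def path_cost(path, load):
    """Cost of a path: one unit per hop plus the current load on each link."""
    return sum(1 + load.get((a, b), 0) for a, b in zip(path, path[1:]))


def choose_path(src, dst, nodes, load):
    """Pick the cheapest path among the direct link and all two-hop detours.

    Assumes a full mesh, so every pair of nodes is directly connected.
    The direct path is listed first, so it wins any tie.
    """
    minimal = [src, dst]
    detours = [[src, mid, dst] for mid in nodes if mid not in (src, dst)]
    return min([minimal] + detours, key=lambda p: path_cost(p, load))


# Hypothetical four-switch full mesh.
switches = ["A", "B", "C", "D"]

# Idle network: the minimal (direct) path is chosen.
idle_choice = choose_path("A", "B", switches, load={})

# Congested direct link: a non-minimal path via an intermediate switch is cheaper.
busy_choice = choose_path("A", "B", switches, load={("A", "B"): 5})
```

The design trade-off is that a non-minimal path consumes more total link capacity per packet, so schemes of this kind typically fall back to the minimal path whenever it is not congested.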

Biography

David Wong is the founder of Claruspon Systems, Inc., a California company. He is the inventor of 5 US patents in non-minimal path routing and mesh topology networks for datacenters, and the designer of datacenter switch and modular fabric chassis products. He built a datacenter mini-POD using datacenter switches and modular fabric chassis as a development and validation platform. David is currently building a corporate go-to-market strategy and a venture financing plan. He believes in the can-do spirit, the spirit of innovation, rewards and motivation. David can be contacted at email: [email protected]