Innovation

Cloud and AI Computing - the Roman Market Place

Apr 11, 2024

By David Wong
Founder of Claruspon

Preface

The continued expansion of cloud and AI computing is driving change in datacenter networks, because these workloads require more than the Clos topology alone can provide, notably low congestion, high throughput and good scalability.

In a series of articles, I will show you the fundamentals of a leading-edge technology whose networking characteristics are more desirable than those of the Clos network alone, resulting in a more effective network that supports various business goals.

The Roman Market Place

A long time ago, there was only one road to Rome. This road connected to a village in the suburbs of Rome, and travelers from other villages had to pass through this village before getting to Rome. When the market opened, this road became particularly congested, and it was difficult for those who lived far from Rome to arrive at the market in time. Eventually, everyone agreed on a solution: set a traffic restriction at each village entrance. The restrictions reduced the flow of traffic from each village to a level low enough that the only road to Rome was no longer congested. This change improved fairness, allowing some people from each village to reach the market in time. Still, most people could not arrive in time because of the traffic control.


Figure 1 – Early Roman roadway system connecting suburban villages

The approach in today's data centers is no different: congestion control reduces the source traffic (application throughput) to a reasonably low level to avoid network congestion. But this comes at the price of sacrificing application throughput. Is there a better way?

In the case of the Roman market place, the goal was to ensure that more people from each village could reach the market in time. Eventually, people came up with a solution. Every village built a road to Rome, and villages built roads between each other. Traffic from villages with a lot of people going to the market was first diverted to other villages and then transferred to Rome.


Figure 2 – A 4-node Mesh roadway system

More and more villages joined the Mesh roadway system. As total traffic volume kept increasing, it was found that diverting traffic to other villages and then transferring it to Rome was the mechanism for carrying volume traffic and alleviating burst traffic. With the improved roadway system and business, the Roman villagers lived happily ever after. This is the story of the "Roman Market Place".


Figure 3 – A multi-node Mesh roadway system

Technical Talk

To appreciate how well the Clos and Mesh carry volume traffic and alleviate burst traffic, let’s consider their abstract path models.

Those who understand the TCP transport layer know that a data packet must be acknowledged by a returning ACK packet. When one TCP endpoint communicates with another over a network, the data packet traverses a path of two interconnect segments (2-hop), and the same is true for the ACK packet. The path one data packet traverses does not overlap with the path another data packet traverses: these are non-overlapping data paths. Likewise, the path one ACK packet traverses does not overlap with the path another ACK packet traverses: these are non-overlapping ACK paths. So, there is no congestion in the topologies shown, and the throughput simply depends on the round-trip time.

The question we should ask: How many such non-overlapping (non-congested) paths are there in a (p,n)-node Clos network and in an n-node Mesh network?


Figure 4 – Max. number of non-overlapping paths, and diversity ratios of Clos and Mesh topologies.

Let’s take an abstract approach: count the number of interconnects a topology has and divide it by 2. The reason is that two interconnects form a path (in a 2-hop network; this discussion concerns 2-hop routing for both Clos and Mesh networks). Since each interconnect is duplex, this holds for both the data path and the ACK path. We then compare the Mesh and Clos topologies by dividing the number of non-overlapping paths in the Mesh by the number in the Clos, creating a metric called the diversity ratio. A diversity ratio larger than 1 means that the Mesh has more non-overlapping paths than its Clos counterpart.
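The counting argument above can be written out as a short derivation. This is a sketch under two assumptions that the text does not state explicitly: the n-node Mesh is a full mesh (every node linked to every other node), and the (p,n)-node Clos has p spine switches each linked to each of n leaf switches. The symbols L (number of interconnects) and P (maximum number of non-overlapping paths) are introduced here only for illustration. Under these assumptions the result agrees with the formula given in the caption of Table 1.

```latex
\begin{align*}
L_{\mathrm{Mesh}} &= \frac{n(n-1)}{2}
  &\Rightarrow\quad P_{\mathrm{Mesh}} &= \frac{L_{\mathrm{Mesh}}}{2} = \frac{n(n-1)}{4} \\
L_{\mathrm{Clos}} &= p\,n
  &\Rightarrow\quad P_{\mathrm{Clos}} &= \frac{L_{\mathrm{Clos}}}{2} = \frac{p\,n}{2}
\end{align*}
\[
\text{Diversity ratio}
  = \frac{P_{\mathrm{Mesh}}}{P_{\mathrm{Clos}}}
  = \frac{n(n-1)/4}{p\,n/2}
  = \frac{n-1}{2p}
  = \frac{0.5\,(n-1)}{p}
\]
```

The ratio reaches 1 when n = 2p + 1, so under these assumptions a Mesh with more than twice as many nodes as the Clos has spines offers more non-overlapping paths.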

Across a large, practical range of Clos network deployments in datacenters worldwide, the Mesh has more non-overlapping paths than its Clos counterpart.

Those who are experienced with ECMP multipathing know that a random number generator (in practice, typically a hash of packet header fields) works behind the scenes to distribute the large number of TCP connections over the finite number of multipaths. With a small set of non-overlapping paths (a small diversity ratio), more TCP connections get mapped onto the same path, creating overlapping paths and inducing network congestion.
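The effect of spreading many TCP connections over a limited number of paths can be illustrated with a small simulation. This is a minimal sketch, not a model of any real switch: it assumes each connection picks a path uniformly at random (real ECMP implementations usually hash header fields, which only approximates this), and the function names and the flow and path counts are hypothetical.

```python
import random
from collections import Counter


def overlap_fraction_simulated(num_flows, num_paths, seed=0):
    """Fraction of flows that share their path with at least one other flow.

    Assumption: each flow is placed on a path uniformly at random.
    """
    rng = random.Random(seed)
    # Count how many flows land on each path.
    load = Counter(rng.randrange(num_paths) for _ in range(num_flows))
    # Every flow on a path carrying more than one flow is "overlapped".
    shared = sum(count for count in load.values() if count > 1)
    return shared / num_flows


def overlap_fraction_expected(num_flows, num_paths):
    """Probability that a given flow shares its path with some other flow."""
    return 1.0 - (1.0 - 1.0 / num_paths) ** (num_flows - 1)


if __name__ == "__main__":
    flows = 1000  # hypothetical number of TCP connections
    for paths in (64, 256, 1024, 4096):  # hypothetical path counts
        print(paths,
              round(overlap_fraction_simulated(flows, paths), 3),
              round(overlap_fraction_expected(flows, paths), 3))
```

In this model the overlapped fraction falls as the number of available paths grows, which is the intuition behind preferring the topology with the larger count of non-overlapping paths.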

A diversity matrix is shown in Table 1, where a large range of datacenter Clos networks (in green, where the diversity ratio >= 1) can be replaced by their Mesh counterparts to benefit from improved congestion characteristics.

Table 1 – Diversity Matrix where Diversity Ratio=0.5(n-1)/p
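The formula in the caption of Table 1 is easy to tabulate. The sketch below evaluates it for a few combinations of p and n; the specific values of p and n are hypothetical examples and are not taken from Table 1, and the function names are illustrative only.

```python
def diversity_ratio(p, n):
    """Diversity ratio from the Table 1 caption: 0.5 * (n - 1) / p."""
    return 0.5 * (n - 1) / p


def min_mesh_nodes_for_parity(p):
    """Smallest n with diversity ratio >= 1, since 0.5*(n-1)/p >= 1 iff n >= 2p + 1."""
    return 2 * p + 1


if __name__ == "__main__":
    p_values = (2, 4, 8, 16)    # hypothetical values of p
    n_values = (8, 16, 32, 64)  # hypothetical values of n
    # Header row: one column per n.
    print("p".rjust(4) + "".join(f"{n:>8}" for n in n_values))
    # One row per p; entries >= 1 correspond to the green region of the matrix.
    for p in p_values:
        print(f"{p:>4}" + "".join(f"{diversity_ratio(p, n):>8.2f}" for n in n_values))
```

For a given p, every n at or above `min_mesh_nodes_for_parity(p)` falls in the region where the diversity ratio is at least 1.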

Conclusion

Today, we have inherited the wisdom of the early Roman villagers in what we do in the data center. We have made breakthroughs in network topology and routing technology, and proposed two schemes for the data center: the Mesh+Mesh and Clos+Mesh topologies. Like the ultimate solution in the Roman market story, Mesh networks inherently have many non-overlapping paths and are superior in carrying volume traffic and alleviating burst traffic. If a cloud architecture is built on top of a Mesh network, its ability to carry volume traffic and alleviate burst traffic can be greatly enhanced. If kernel bypass and congestion control are implemented on each computing node interconnected by a Mesh network, zero congestion, ultra-low latency and higher throughput than the Clos counterpart can be achieved for AI computing. Details of the Mesh+Mesh and the Clos+Mesh will appear in future conference papers.

Biography

David Wong is the founder of Claruspon Systems, Inc., a California company. He is the inventor of 5 US patents in non-minimal path routing and Mesh topological networks for datacenters, and the designer of datacenter switch and modular fabric chassis products. He built a datacenter mini-POD using datacenter switches and modular fabric chassis as a development and validation platform. David is currently building a corporate go-to-market strategy and a venture financing plan. He believes in the can-do spirit, the spirit of innovation, rewards & motivations. David can be contacted at email: [email protected]