TCP, the ubiquitous IP transport protocol used for virtually all data exchanges in a storage context, is designed to probe endlessly for higher bandwidth. Eventually, between the vast number of hosts all trying to deliver data ever faster and the network you operate, something has to give.
With today's deployed networking gear, which is hardly ever configured any differently from its factory defaults in this regard, the dire consequence is packet loss. It is the only option your switches, routers and WLAN access points have to shed some load and give themselves a (very) short break.
However, not every lost packet is created equal. From the viewpoint of the switch, any packet loss (euphemistically also called a drop or discard, with the relevant counters often hidden from plain sight) merely takes away a tiny fraction of the possible bandwidth – and congestion only happens when the load on a link sits constantly at 100%, correct?
Not so fast – this simplistic view misses the bigger picture. As hinted above, the prevailing protocol nowadays for connecting networked devices to each other is TCP. And TCP not only delivers data in order and reliably (though not necessarily in a timely fashion), it has also co-evolved over the last 30 years to deal with the harsh realities of packet networks.
Today, packet loss is virtually always a sign of network congestion; even low-layer WiFi links have sophisticated mechanisms that try very hard to get a packet delivered. Even there, packets typically get discarded not while “on the air”, but while waiting for some earlier packet to be properly delivered, e.g. at a much lower transmission rate. Again, the incoming data has to queue up, and eventually the queue is full – packet loss and congestion caught in the act.
Now think of your datacenter, where only two servers need to write data to your NetApp storage system of choice. As soon as these two hosts each transmit at 501 MB/s towards the same 10G LIF in the very same microsecond, the switch and the link bandwidth towards the storage are overloaded, and at least some data has to queue up again.
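To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The numbers are illustrative: the buffer size is hypothetical, and "10G ≈ 1,000 MB/s of usable payload throughput" is the rule of thumb the example above implies (line rate is 1,250 MB/s before framing and protocol overhead).

```python
# Back-of-the-envelope for the two-writer example above (illustrative numbers).
link_capacity_MBps = 1_000            # usable payload rate of the 10G LIF (rule of thumb)
offered_load_MBps  = 2 * 501          # two hosts writing at 501 MB/s each
excess_MBps        = offered_load_MBps - link_capacity_MBps   # 2 MB/s must queue up

switch_buffer_MB   = 16               # hypothetical per-port buffer on the switch
seconds_until_full = switch_buffer_MB / excess_MBps

print(f"Excess load: {excess_MBps} MB/s")
print(f"A {switch_buffer_MB} MB buffer fills (and tail drop begins) "
      f"after ~{seconds_until_full:.1f} s of sustained overload")
```

Even a tiny 2 MB/s overshoot fills a generous buffer within seconds – and until it does, every queued byte adds latency.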
So, isn't more queueing buffer in all participating devices the solution? Network technology certainly has its fashions – some time ago, deep-buffered switches were all the rage; then shallow-buffered switches had their moment. Nowadays, neither seems to be a significant talking point any more – the fashion train has moved on to different marketing statements (while switches still come in shallow- and deep-buffered varieties).
But again, this simplistic view – more buffers mitigate packet losses, and all is good – misses the bigger picture.
A short primer on how TCP works, in very broad strokes: unless TCP senses that there may be an issue with the bandwidth towards the other end host, it keeps increasing the sending rate – always. So while your network device buffers more and more data, the sender only ramps up further, filling the buffer ever more quickly. Until, that is, an indication of network overload (yes, this is an allusion to packet loss) arrives back at the sender. But typically, the loss happens on enqueue – it hits the freshest packet, the one that arrives just as the queue has filled up. And the sender will only learn about this *after* all the packets already sitting in the queue have been delivered. With a huge queue, it takes that much longer until the receiver notices the missing packet, and only then can it inform the sender – which has kept on increasing its sending rate the entire time…
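The feedback loop is easier to see with a toy simulation. This is a deliberately crude sketch, not a real TCP model: the sender adds a made-up amount of bandwidth every round trip, excess packets pile up in the switch buffer, the newest packet is tail-dropped when the buffer is full, and the loss is only noticed after everything queued ahead of it has drained. All numbers are invented for illustration.

```python
def time_to_congestion_signal(buffer_pkts, link_pps=100_000, rtt_s=0.001,
                              probe_pps=1_000):
    """How far the sender overshoots, and how late the loss signal arrives."""
    rate_pps = link_pps               # sender already saturates the link
    queue_pkts = 0.0
    t = 0.0
    while queue_pkts < buffer_pkts:   # until tail drop on enqueue
        rate_pps += probe_pps         # keep probing for more bandwidth
        queue_pkts += (rate_pps - link_pps) * rtt_s
        t += rtt_s
    drain_s = buffer_pkts / link_pps  # loss is noticed only after the queue drains
    return rate_pps / link_pps, t + drain_s

for buf in (100, 1_000, 10_000):      # shallow vs. deep per-port buffers
    overshoot, delay = time_to_congestion_signal(buf)
    print(f"{buf:>6}-packet buffer: sender at {overshoot:.2f}x link rate, "
          f"loss signalled after ~{delay * 1000:.0f} ms")
```

The trend is the point, not the exact figures: the deeper the buffer, the further the sender overshoots before it hears anything, and the longer every packet sits in the queue in the meantime.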
In summary, let me conclude with the following observations: it's a false goal to try to avoid packet loss at all costs (deep-buffered switches, priority or link-layer flow control) when you are running TCP. TCP will just try to go even faster, inducing unnecessary buffering delays. Instead, do away with the legacy drop-tail queueing discipline that is the factory default everywhere and interacts poorly with the latency-sensitive, yet reliable data transfers we have in storage. Moving to an AQM like RED / WRED (random-detect) is also the first step towards enabling truly lossless networks with today's technology. But more about Explicit Congestion Notification (enabled by default on more hosts in your environment than you are aware of) in a later installment.
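Before that, here is a minimal sketch of the classic RED idea behind random-detect. The thresholds and probabilities are illustrative only, not any vendor's actual defaults: instead of waiting for the queue to overflow, the switch starts dropping (or ECN-marking) a small, growing fraction of packets once the average queue depth exceeds a minimum threshold, so senders back off long before the buffer is full.

```python
import random

MIN_TH, MAX_TH, MAX_P = 20, 80, 0.1   # packets, packets, max drop probability (illustrative)

def red_should_drop(avg_queue_pkts):
    """Classic RED gist: drop probability rises with the average queue depth."""
    if avg_queue_pkts < MIN_TH:
        return False                  # plenty of headroom: never drop
    if avg_queue_pkts >= MAX_TH:
        return True                   # persistently deep queue: always drop
    # in between, the drop probability grows linearly towards MAX_P
    p = MAX_P * (avg_queue_pkts - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p

for depth in (10, 30, 50, 70, 90):
    drops = sum(red_should_drop(depth) for _ in range(10_000))
    print(f"avg queue {depth:>3} pkts -> ~{drops / 100:.1f}% of packets dropped/marked")
```

The contrast with drop-tail is the gentle slope: a few early, randomly spread drops (or ECN marks) nudge individual senders to slow down, instead of one full buffer punishing everybody at once.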
