Investigation of a Cross-regional Network Performance Issue | by Netflix Technology Blog

10 min read

Apr 24, 2024

Hechao Li, Roger Cruz

Netflix operates a highly efficient cloud computing infrastructure that supports a wide array of applications essential for our SVOD (Subscription Video on Demand), live streaming and gaming services. Utilizing Amazon AWS, our infrastructure is hosted across multiple geographic regions worldwide. This global distribution allows our applications to deliver content more effectively by serving traffic closer to our customers. Like any distributed system, our applications occasionally require data synchronization between regions to maintain seamless service delivery.

The following diagram shows a simplified cloud network topology for cross-region traffic.

Our Cloud Network Engineering on-call team received a request to address a network issue affecting an application with cross-region traffic. Initially, it appeared that the application was experiencing timeouts, likely due to suboptimal network performance. As we all know, the longer the network path, the more devices the packets traverse, increasing the likelihood of issues. For this incident, the client application is located in an internal subnet in the US region while the server application is located in an external subnet in a European region. Therefore, it is natural to blame the network, since packets have to travel long distances through the internet.

As network engineers, our initial reaction when the network is blamed is usually, "No, it can't be the network," and our task is to prove it. Given that there were no recent changes to the network infrastructure and no reported AWS issues impacting other applications, the on-call engineer suspected a noisy neighbor issue and sought assistance from the Host Network Engineering team.

In this context, a noisy neighbor issue occurs when a container shares a host with other network-intensive containers. These noisy neighbors consume excessive network resources, causing other containers on the same host to suffer from degraded network performance. Despite each container having bandwidth limitations, oversubscription can still lead to such issues.

Upon investigating other containers on the same host, most of which were part of the same application, we quickly eliminated the possibility of noisy neighbors. The network throughput for both the problematic container and all others was significantly below the set bandwidth limits. We attempted to resolve the issue by removing these bandwidth limits, allowing the application to utilize as much bandwidth as necessary. However, the problem persisted.

We observed some TCP packets in the network marked with the RST flag, a flag indicating that a connection should be immediately terminated. Although the frequency of these packets was not alarmingly high, the presence of any RST packets still raised suspicion on the network. To determine whether this was indeed a network-induced issue, we conducted a tcpdump on the client. In the packet capture file, we spotted one TCP stream that was closed after exactly 30 seconds.

SYN at 18:47:06

After the 3-way handshake (SYN, SYN-ACK, ACK), the traffic started flowing normally. Nothing strange until FIN at 18:47:36 (30 seconds later).

The packet capture results clearly indicated that it was the client application that initiated the connection termination by sending a FIN packet. Following this, the server continued to send data; however, since the client had already decided to close the connection, it responded with RST packets to all subsequent data from the server.

To ensure that the client wasn't closing the connection due to packet loss, we also conducted a packet capture on the server side to verify that all packets sent by the server were received. This task was complicated by the fact that the packets passed through a NAT gateway (NGW), which meant that on the server side, the client's IP and port appeared as those of the NGW, differing from those seen on the client side. Consequently, to accurately match TCP streams, we needed to identify the TCP stream on the client side, locate the raw TCP sequence number, and then use this number as a filter on the server side to find the corresponding TCP stream.

With packet capture results from both the client and server sides, we confirmed that all packets sent by the server were correctly received before the client sent a FIN.

Now, from the network point of view, the story is clear. The client initiated the connection requesting data from the server. The server kept sending data to the client with no problem. However, at a certain point, despite the server still having data to send, the client chose to terminate the reception of data. This led us to suspect that the issue might be related to the client application itself.

In order to fully understand the problem, we now need to understand how the application works. As shown in the diagram below, the application runs in the us-east-1 region. It reads data from cross-region servers and writes the data to consumers within the same region. The client runs as containers, whereas the servers are EC2 instances.

Notably, the cross-region read was problematic while the write path was smooth. Most importantly, there is a 30-second application-level timeout for reading the data. The application (client) errors out if it fails to read an initial batch of data from the servers within 30 seconds. When we increased this timeout to 60 seconds, everything worked as expected. This explains why the client initiated a FIN: it lost patience waiting for the server to transfer data.

Could it be that the server was updated to send data more slowly? Could it be that the client application was updated to receive data more slowly? Could it be that the data volume became too large to be completely sent out within 30 seconds? Unfortunately, we received negative answers to all 3 questions from the application owner. The server had been operating without changes for over a year, there were no significant updates in the latest rollout of the client, and the data volume had remained consistent.

If neither the network nor the application was changed recently, then what changed? In fact, we discovered that the issue coincided with a recent Linux kernel upgrade from version 6.5.13 to 6.6.10. To test this hypothesis, we rolled back the kernel upgrade, and doing so did restore normal operation to the application.

Honestly speaking, at that time I didn't believe it was a kernel bug, because I assumed the TCP implementation in the kernel should be solid and stable (spoiler alert: how wrong was I!). But we were also out of ideas from other angles.

There were about 14k commits between the good and bad kernel versions. Engineers on the team methodically and diligently bisected between the two versions. When the bisecting was narrowed down to a few commits, a change with "tcp" in its commit message caught our attention. The final bisecting confirmed that this commit was our culprit.

Interestingly, while reviewing the email history related to this commit, we found that another user had reported a Python test failure following the same kernel upgrade. Although their fix was not directly applicable to our situation, it suggested that a simpler test might also reproduce our problem. Using strace, we observed that the application configured the following socket options when communicating with the server:

[pid 1699] setsockopt(917, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
[pid 1699] setsockopt(917, SOL_TCP, TCP_NODELAY, [1], 4) = 0

We then developed a minimal client-server C application that transfers a file from the server to the client, with the client configuring the same set of socket options. During testing, we used a 10M file, which represents the volume of data typically transferred within 30 seconds before the client issues a FIN. On the old kernel, this cross-region transfer completed in 22 seconds, whereas on the new kernel, it took 39 seconds to finish.
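For illustration, below is a minimal sketch of what the client side of such a reproduction could look like. It is not the exact program we used: the command-line handling and error handling are simplifications, and it uses a plain IPv4 socket for brevity (the real application used an IPv6 socket with IPV6_V6ONLY=0, as seen in the strace output), but the remaining socket options match what strace showed.

/* Sketch of a reproduction client: apply the application's socket options,
 * read everything the server sends, and report the elapsed time. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <server-ip> <port>\n", argv[0]);
        return 1;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Same options the application sets (values taken from the strace output). */
    int keepalive = 1, nodelay = 1, sndbuf = 131072, rcvbuf = 65536;
    setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &keepalive, sizeof(keepalive));
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay));

    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(atoi(argv[2])) };
    if (inet_pton(AF_INET, argv[1], &addr.sin_addr) != 1) { fprintf(stderr, "bad ip\n"); return 1; }
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("connect"); return 1; }

    /* Read until the server closes the connection, then report the elapsed time. */
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    char buf[65536];
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("received %lld bytes in %.1f seconds\n", total, secs);
    close(fd);
    return 0;
}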

With the help of this minimal reproduction setup, we were eventually able to pinpoint the root cause of the problem. In order to understand the root cause, it's essential to have a grasp of the TCP receive window.

TCP Receive Window

Simply put, the TCP receive window is how the receiver tells the sender "This is how many bytes you can send me without me ACKing any of them". Assuming the sender is the server and the receiver is the client, then we have:

The Window Size

Now that we know the TCP receive window size can affect throughput, the question is: how is the window size calculated? As an application writer, you can't decide the window size; however, you can decide how much memory you want to use for buffering received data. This is configured using the SO_RCVBUF socket option we saw in the strace result above. However, note that the value of this option determines how much application data can be queued in the receive buffer. In man 7 socket, there is

SO_RCVBUF

Sets or gets the maximum socket receive buffer in bytes.
The kernel doubles this value (to allow space for
bookkeeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2). The
default value is set by the
/proc/sys/net/core/rmem_default file, and the maximum
allowed value is set by the /proc/sys/net/core/rmem_max
file. The minimum (doubled) value for this option is 256.

This means that when the user supplies a value X, the kernel stores 2X in the variable sk->sk_rcvbuf. In other words, the kernel assumes that the bookkeeping overhead is as much as the actual data (i.e., 50% of sk_rcvbuf).
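You can see this doubling for yourself with a few lines of C (a quick sketch, assuming a typical Linux host where the doubled value does not exceed net.core.rmem_max):

/* Set SO_RCVBUF to 65536 and read it back; on Linux, getsockopt()
 * reports the doubled value the kernel stored in sk->sk_rcvbuf. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int requested = 65536, actual = 0;
    socklen_t len = sizeof(actual);
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);

    /* Typically prints "requested 65536, kernel stored 131072". */
    printf("requested %d, kernel stored %d\n", requested, actual);
    close(fd);
    return 0;
}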

sysctl_tcp_adv_win_scale

However, the assumption above may not be true, because the actual overhead really depends on many factors such as the Maximum Transmission Unit (MTU). Therefore, the kernel provides sysctl_tcp_adv_win_scale, which you can use to tell the kernel what the actual overhead is. (I believe 99% of people also don't know how to set this parameter correctly, and I'm definitely one of them. You're the kernel; if you don't know the overhead, how can you expect me to know?)

According to the sysctl doc,

tcp_adv_win_scale - INTEGER

Obsolete since linux-6.6. Count buffering overhead as bytes/2^tcp_adv_win_scale (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), if it is <= 0.

Possible values are [-31, 31], inclusive.

Default: 1

For 99% of people, we're just using the default value of 1, which in turn means the overhead is calculated as rcvbuf/2^tcp_adv_win_scale = 1/2 * rcvbuf. This matches the assumption made when setting the SO_RCVBUF value.

Let's recap. Assume you set SO_RCVBUF to 65536, which is the value set by the application as shown in the setsockopt syscall. Then we have:

  • SO_RCVBUF = 65536
  • rcvbuf = 2 * 65536 = 131072
  • overhead = rcvbuf / 2 = 131072 / 2 = 65536
  • receive window size = rcvbuf - overhead = 131072 - 65536 = 65536

(Note, this calculation is simplified. The real calculation is more complex.)
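The same simplified arithmetic can be written as a tiny helper for the positive tcp_adv_win_scale case (a sketch only; the kernel's actual code involves additional rounding and clamping):

/* Simplified pre-6.6 formula: rcvbuf = 2 * SO_RCVBUF,
 * window = rcvbuf - rcvbuf / 2^tcp_adv_win_scale (positive scale only). */
#include <stdio.h>

static long old_kernel_window(long so_rcvbuf, int tcp_adv_win_scale)
{
    long rcvbuf = 2 * so_rcvbuf;                 /* kernel doubles the value  */
    long overhead = rcvbuf >> tcp_adv_win_scale; /* default scale of 1 => 50% */
    return rcvbuf - overhead;
}

int main(void)
{
    /* Prints 65536 for the application's setting of SO_RCVBUF = 65536. */
    printf("window = %ld\n", old_kernel_window(65536, 1));
    return 0;
}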

In short, the receive window size before the kernel upgrade was 65536. With this window size, the application was able to transfer 10M of data within 30 seconds.

The Change

This commit obsoleted sysctl_tcp_adv_win_scale and introduced a scaling_ratio that can calculate the overhead or window size more accurately, which is the right thing to do. With the change, the window size is now rcvbuf * scaling_ratio.

So how is scaling_ratio calculated? It is calculated using skb->len/skb->truesize, where skb->len is the length of the TCP data in an skb and truesize is the total size of the skb. This is surely a more accurate ratio based on real data rather than a hardcoded 50%. Now, here is the next question: during the TCP handshake, before any data is transferred, how do we decide the initial scaling_ratio? The answer is that a magic and conservative ratio was chosen, with the value being roughly 0.25.

Now we have:

  • SO_RCVBUF = 65536
  • rcvbuf = 2 * 65536 = 131072
  • receive window size = rcvbuf * 0.25 = 131072 * 0.25 = 32768

In short, the receive window size halved after the kernel upgrade. Hence the throughput was cut in half, causing the data transfer time to double.
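For comparison, here is the same kind of back-of-the-envelope helper for the simplified post-change behavior (again a sketch; the kernel works with integer shift arithmetic rather than floating point):

/* Simplified post-change formula: window = rcvbuf * scaling_ratio,
 * with the initial scaling_ratio being roughly 0.25. */
#include <stdio.h>

static long new_kernel_window(long so_rcvbuf, double scaling_ratio)
{
    long rcvbuf = 2 * so_rcvbuf;           /* kernel still doubles the value */
    return (long)(rcvbuf * scaling_ratio); /* initial scaling_ratio ~= 0.25  */
}

int main(void)
{
    /* Prints 32768 for SO_RCVBUF = 65536: half of the pre-upgrade 65536. */
    printf("window = %ld\n", new_kernel_window(65536, 0.25));
    return 0;
}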

Naturally, you may ask: I understand that the initial window size is small, but why doesn't the window grow once we have a more accurate ratio of the payload later (i.e., skb->len/skb->truesize)? With some debugging, we eventually found out that scaling_ratio does get updated to the more accurate skb->len/skb->truesize, which in our case is around 0.66. However, another variable, window_clamp, is not updated accordingly. window_clamp is the maximum receive window allowed to be advertised, and it is also initialized to 0.25 * rcvbuf using the initial scaling_ratio. As a result, the receive window size is capped at this value and can't grow bigger.
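As a side note, one way to watch this ceiling from user space is to poll TCP_INFO on the client socket; this is a sketch of the idea rather than the tooling we actually used during debugging. tcpi_rcv_ssthresh reflects the kernel's current estimate of the receive window it is willing to advertise, which is bounded by window_clamp, so on an affected kernel it stops growing early.

/* Helper to print the kernel's receive window estimate for a connected
 * TCP socket; call it periodically (e.g., from the reproduction client's
 * read loop) and watch whether the value keeps growing or plateaus. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

void print_rcv_window_estimate(int fd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
        printf("rcv_ssthresh=%u advmss=%u\n",
               info.tcpi_rcv_ssthresh, info.tcpi_advmss);
}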

In theory, the fix is to update window_clamp along with scaling_ratio. However, in order to have a simple fix that doesn't introduce other unexpected behaviors, our final fix was to increase the initial scaling_ratio from 25% to 50%. This makes the receive window size backward compatible with the original default sysctl_tcp_adv_win_scale.

Meanwhile, note that the problem is not caused solely by the changed kernel behavior but also by the fact that the application sets SO_RCVBUF and has a 30-second application-level timeout. In fact, the application is Kafka Connect, and both settings are the default configurations (receive.buffer.bytes=64k and request.timeout.ms=30s). We also created a Kafka ticket to change receive.buffer.bytes to -1 to allow Linux to auto-tune the receive window.

This was a very interesting debugging exercise that covered many layers of Netflix's stack and infrastructure. While it technically wasn't the "network" to blame, this time it turned out that the culprit was the software components that make up the network (i.e., the TCP implementation in the kernel).

If tackling such technical challenges excites you, consider joining our Cloud Infrastructure Engineering teams. Explore opportunities by visiting Netflix Jobs and searching for Cloud Engineering positions.

Special thanks to our stunning colleagues Alok Tiagi, Artem Tkachuk, Ethan Adams, Jorge Rodriguez, Nick Mahilani, Tycho Andersen and Vinay Rayini for investigating and mitigating this issue. We would also like to thank Linux kernel network expert Eric Dumazet for reviewing and applying the patch.