Inside a data center network, the round-trip time (RTT) between machines is generally under 10 ms, so timeouts for calls to internal services are commonly set to values such as 50 ms, 200 ms, or 500 ms. If a packet is lost in transit, do such timeouts give the TCP layer a chance to detect the loss and retransmit once? If the timeout is within 200 ms, the answer is no. On Linux, the first retransmission timeout equals the round-trip time plus a predicted deviation of at least 200 ms. With an RTT of 7 ms, for example, the first retransmission fires no earlier than 207 ms. So an interface whose timeout is within 200 ms cannot tolerate a single lost packet, however small the RTT is: the call has already timed out before TCP notices the loss.
This article looks into how Linux computes the first retransmission time of a TCP packet, and the result may come as a surprise. The tuning method proposed here should, in theory, reduce both the latency and the error count of internal service calls.
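After TCP sends a packet it starts a timer. If no acknowledgment (ACK) has arrived from the peer when the timer expires, the packet is retransmitted. The interval between sending a packet and its first retransmission is called the retransmission timeout (RTO). The RTO is computed from the packet's round-trip time (RTT) plus the predicted deviation (fluctuation) of the RTT.
That is, rto = srtt + rttvar, where srtt is the smoothed RTT and rttvar is the fluctuation term, representing the likely prediction error.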
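To make the timeout argument concrete, here is a minimal C sketch. It rests on a simplified model that is an assumption, not kernel behavior: a lost segment is resent when the RTO fires, and the reply then needs roughly one more round trip. The helper names rto_ms and timeout_survives_one_loss are hypothetical, and all values are in milliseconds.

```c
#include <stdbool.h>

/* RTO as described in the text: smoothed RTT plus the variation term. */
unsigned rto_ms(unsigned srtt_ms, unsigned rttvar_ms)
{
    return srtt_ms + rttvar_ms;
}

/*
 * Assumed model (not taken from the kernel): the lost segment is resent
 * when the RTO fires, and the reply needs about one more round trip.
 * Returns true if the application timeout leaves room for that.
 */
bool timeout_survives_one_loss(unsigned timeout_ms,
                               unsigned srtt_ms, unsigned rttvar_ms)
{
    return timeout_ms >= rto_ms(srtt_ms, rttvar_ms) + srtt_ms;
}
```

With srtt = 7 and rttvar = 200, rto_ms gives 207, so a 200 ms timeout fails the check while a 500 ms timeout passes.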
Next, let's run an experiment.
First ping the host to see the packet round-trip time:
[xiaohong@localhost ~]$ ping
PING (123.125.104.197) 56(84) bytes of data.
64 bytes from 123.125.104.197: icmp_seq=1 ttl=55 time=3.65 ms
64 bytes from 123.125.104.197: icmp_seq=2 ttl=55 time=3.38 ms
64 bytes from 123.125.104.197: icmp_seq=3 ttl=55 time=4.34 ms
64 bytes from 123.125.104.197: icmp_seq=4 ttl=55 time=7.82 ms
Now look at the RTT data TCP has recorded for this destination. The command below is for CentOS 7 (on earlier versions, run ip route list tab cache instead):
[xiaohong@localhost ~]$ sudo ip tcp_metrics
123.125.104.197 age 22.255sec rtt 7375us rttvar 7250us cwnd 10
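As shown above, the smoothed RTT is about 7 ms and rttvar is about 7 ms, so the RTO should presumably be around 14 ms: if no response arrives within 14 ms, the data is retransmitted. Is that what actually happens?
In one terminal window, run the following command: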
[xiaohong@localhost ~]$ nc 80
GET / HTTP/1.1
Host:
Connection:
At the same time, open another terminal window and run the following command:
[xiaohong@localhost iproute2-3.19.0]$ ss -eipn '( dport = :www )'
tcp ESTAB 0 0 10.209.80.111:56486 123.125.104.197:80 users:(("nc",1713,3)) uid:1000 ino:14243 sk:ffff88002c992d00 <->
ts sack cubic wscale:0,7 rto:207 rtt:7.375/7.25 mss:1448 cwnd:10 send 15.7Mbps rcv_space:14600
The output shows that the actual RTO is 207 ms, which amounts to the RTT plus 200 ms. Why?
Let's look for the reason in the kernel's TCP source code.
The function that sets the timeout is tcp_set_rto, in net/ipv4/tcp_input.c:
static inline void tcp_set_rto(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
    tcp_bound_rto(sk);
}
As you can see, the retransmission timer value icsk_rto is obtained by calling __tcp_set_rto. Its source is in include/net/tcp.h:
static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
{
    return (tp->srtt >> 3) + tp->rttvar;
}
To avoid floating-point arithmetic, the RTT is stored in the socket structure multiplied by 8, which is why the code shifts srtt right by 3. The code therefore confirms:
icsk_rto = srtt + rttvar
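The fixed-point trick can be checked against the ss output above with a small sketch. It assumes one kernel time unit equals 1 ms (HZ=1000), which the source does not state; the function names are hypothetical. A stored srtt of 59 corresponds to 59/8 = 7.375, the rtt value that ss printed.

```c
#include <stdint.h>

/* Same shape as __tcp_set_rto(): srtt is stored multiplied by 8. */
uint32_t rto_from_scaled(uint32_t srtt_x8, uint32_t rttvar)
{
    return (srtt_x8 >> 3) + rttvar;
}

/* Convert the scaled srtt to thousandths of a time unit for display. */
uint32_t srtt_x8_to_milli(uint32_t srtt_x8)
{
    return srtt_x8 * 1000 / 8;
}
```

Under that assumption, rto_from_scaled(59, 200) is 7 + 200 = 207, which matches the rto:207 seen earlier if rttvar is 200.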
The function that computes and updates srtt and rttvar is tcp_rtt_estimator, in net/ipv4/tcp_input.c:
static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
{
    struct tcp_sock *tp = tcp_sk(sk);
    long m = mrtt; /* RTT */
    /* The following amusing code comes from Jacobson's
     * article in SIGCOMM '88. Note that rtt and mdev
     * are scaled versions of rtt and mean deviation.
     * This is designed to be as fast as possible
     * m stands for "measurement".
     *
     * On a 1990 paper the rto value is changed to:
     * RTO = rtt + 4 * mdev
     *
     * Funny. This algorithm seems to be very broken.
     * These formulae increase RTO, when it should be decreased, increase
     * too slowly, when it should be increased quickly, decrease too quickly
     * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
     * does not matter how to _calculate_ it. Seems, it was trap
     * that VJ failed to avoid. 8)
     */
    if (m == 0)
        m = 1;
    if (tp->srtt != 0) {
        m -= (tp->srtt >> 3); /* m is now error in rtt est */
        tp->srtt += m; /* rtt = 7/8 rtt + 1/8 new */
        if (m < 0) {
            m = -m; /* m is now abs(error) */
            m -= (tp->mdev >> 2); /* similar update on mdev */
            /* This is similar to one of Eifel findings.
             * Eifel blocks mdev updates when rtt decreases.
             * This solution is a bit different: we use finer gain
             * for mdev in this case (alpha*beta).
             * Like Eifel it also prevents growth of rto,
             * but also it limits too fast rto decreases,
             * happening in pure Eifel.
             */
            if (m > 0)
                m >>= 3;
        } else {
            m -= (tp->mdev >> 2); /* similar update on mdev */
        }
        tp->mdev += m; /* mdev = 3/4 mdev + 1/4 new */
        if (tp->mdev > tp->mdev_max) {
            tp->mdev_max = tp->mdev;
            if (tp->mdev_max > tp->rttvar)
                tp->rttvar = tp->mdev_max;
        }
        if (after(tp->snd_una, tp->rtt_seq)) {
            if (tp->mdev_max < tp->rttvar)
                tp->rttvar -= (tp->rttvar - tp->mdev_max) >> 2;
            tp->rtt_seq = tp->snd_nxt;
            tp->mdev_max = tcp_rto_min(sk);
        }
    } else {
        /* no previous measure. */
        tp->srtt = m << 3; /* take the measured time to be rtt */
        tp->mdev = m << 1; /* make sure rto = 3*rtt */
        tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
        tp->rtt_seq = tp->snd_nxt;
    }
}
From the code above, srtt = 7/8 old srtt + 1/8 new rtt, which matches the RFC, so there is nothing more to say about it.
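The srtt update can be isolated into a few lines of C to see why adding the raw error to the scaled value yields the 7/8 and 1/8 weights. This is a sketch of that single step only; the function name update_srtt_x8 is hypothetical.

```c
/*
 * srtt_x8 holds 8 * smoothed RTT; m is a new RTT sample in the same unit.
 * Adding the raw error to the scaled value gives
 *   new_srtt_x8 = 7/8 * srtt_x8 + m,
 * which is srtt = 7/8 * srtt + 1/8 * m once divided by 8.
 */
long update_srtt_x8(long srtt_x8, long m)
{
    long err = m - (srtt_x8 >> 3); /* error of the current estimate */
    return srtt_x8 + err;
}
```

For example, starting from a smoothed RTT of 7 (stored as 56), a sample of 15 moves the stored value to 64, that is a smoothed RTT of 8, which equals 7/8 * 7 + 1/8 * 15.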
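When the first round-trip sample is obtained (usually when connection setup completes: for the client, after it has sent the SYN and receives the server's reply; for the server, after it has sent the SYN+ACK and receives the client's ACK), the calculation works as follows: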
} else {
    /* no previous measure. */
    /* There is no earlier RTT data; this is the logic for the first RTT sample */
    /* m is the current RTT sample; it is stored in srtt multiplied by 8 */
    tp->srtt = m << 3; /* take the measured time to be rtt */
    /* the initial deviation mdev is twice the RTT */
    tp->mdev = m << 1; /* make sure rto = 3*rtt */
    /* set the initial values of rttvar and of mdev_max, the maximum deviation */
    /* whichever is larger of twice the RTT and tcp_rto_min is chosen */
    tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
    tp->rtt_seq = tp->snd_nxt;
}
Now look at the code of tcp_rto_min, in include/net/tcp.h:
static inline u32 tcp_rto_min(struct sock *sk)
{
    struct dst_entry *dst = __sk_dst_get(sk);
    u32 rto_min = TCP_RTO_MIN; /* 200ms */
    if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
        rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
    return rto_min;
}
Putting these together: if the first packet's round-trip time is under 100 ms, the initial RTT deviation is fixed at 200 ms. If the round-trip time exceeds 100 ms, the initial deviation is twice the RTT. In other words, the minimum value of rttvar is 200 ms.
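The first-sample branch is small enough to restate as a standalone sketch. The struct and function names below are hypothetical, and the unit is assumed to be milliseconds (HZ=1000); only the arithmetic follows the kernel excerpt.

```c
/* Hypothetical container for the three values set on the first sample. */
struct rtt_state {
    unsigned srtt_x8; /* smoothed RTT, stored multiplied by 8 */
    unsigned mdev;    /* deviation, in the kernel's scaled form */
    unsigned rttvar;  /* variation term that enters the RTO */
};

/* Mirrors the "no previous measure" branch of tcp_rtt_estimator(). */
struct rtt_state first_sample(unsigned m, unsigned rto_min)
{
    struct rtt_state s;

    if (m == 0)
        m = 1;
    s.srtt_x8 = m << 3;
    s.mdev = m << 1;
    s.rttvar = s.mdev > rto_min ? s.mdev : rto_min;
    return s;
}

/* RTO right after the first sample: (srtt >> 3) + rttvar. */
unsigned initial_rto(unsigned m, unsigned rto_min)
{
    struct rtt_state s = first_sample(m, rto_min);

    return (s.srtt_x8 >> 3) + s.rttvar;
}
```

With rto_min = 200, a first sample of 7 gives an initial RTO of 207, a sample of 100 gives 300, and a sample of 150 gives 450, where the 2 * RTT term takes over.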
Next, continue with the code of tcp_rtt_estimator, the function that computes and updates srtt and rttvar:
if (tp->mdev > tp->mdev_max) {
    /* track the RTT deviation and record its maximum in mdev_max */
    tp->mdev_max = tp->mdev;
    if (tp->mdev_max > tp->rttvar) /* when the maximum deviation exceeds rttvar, rttvar grows with it */
        tp->rttvar = tp->mdev_max;
}
if (after(tp->snd_una, tp->rtt_seq)) {
    /* when the maximum deviation is below rttvar, rttvar shrinks accordingly */
    if (tp->mdev_max < tp->rttvar)
        tp->rttvar -= (tp->rttvar - tp->mdev_max) >> 2;
    tp->rtt_seq = tp->snd_nxt;
    /* at the end of each send window, reset mdev_max to tcp_rto_min */
    tp->mdev_max = tcp_rto_min(sk);
}
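In other words, rttvar follows the actual deviation of the RTT prediction: it grows when the fluctuation grows and shrinks when the fluctuation shrinks. But because the maximum deviation mdev_max is reset to tcp_rto_min in every send window, rttvar never drops below 200 ms. This would also explain the roughly 7 ms deviation reported by ss and ip tcp_metrics earlier: as far as I can tell, those tools report the mean deviation (mdev), not the floored rttvar that actually enters the RTO.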
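The floor can be observed by replaying the estimator in user space. The sketch below copies the arithmetic of the excerpt but makes one simplifying assumption: every sample also closes a send window, so the kernel's after(tp->snd_una, tp->rtt_seq) test is treated as always true. The struct est and the function names are hypothetical, and values are assumed to be in milliseconds.

```c
struct est {
    long srtt;     /* 8 * smoothed RTT */
    long mdev;     /* deviation, in the kernel's scaled form */
    long mdev_max; /* largest mdev seen in the current window */
    long rttvar;   /* variation term used in the RTO */
};

/* One estimator step; assumes each sample also ends a send window. */
void est_sample(struct est *e, long m, long rto_min)
{
    if (m == 0)
        m = 1;
    if (e->srtt == 0) { /* first sample */
        e->srtt = m << 3;
        e->mdev = m << 1;
        e->mdev_max = e->rttvar = e->mdev > rto_min ? e->mdev : rto_min;
        return;
    }
    m -= e->srtt >> 3; /* m is now the error of the estimate */
    e->srtt += m;
    if (m < 0) {
        m = -m;
        m -= e->mdev >> 2;
        if (m > 0)
            m >>= 3;
    } else {
        m -= e->mdev >> 2;
    }
    e->mdev += m;
    if (e->mdev > e->mdev_max) {
        e->mdev_max = e->mdev;
        if (e->mdev_max > e->rttvar)
            e->rttvar = e->mdev_max;
    }
    /* assumed window end: decay rttvar toward mdev_max, then reset */
    if (e->mdev_max < e->rttvar)
        e->rttvar -= (e->rttvar - e->mdev_max) >> 2;
    e->mdev_max = rto_min;
}

long est_rto(const struct est *e)
{
    return (e->srtt >> 3) + e->rttvar;
}

/* Feed n identical RTT samples and return the resulting RTO. */
long rto_after_steady_samples(long rtt, long rto_min, int n)
{
    struct est e = {0, 0, 0, 0};
    int i;

    for (i = 0; i < n; i++)
        est_sample(&e, rtt, rto_min);
    return est_rto(&e);
}
```

Under this simplification, a perfectly steady 7 ms RTT with rto_min = 200 still yields an RTO of 207 after any number of samples: mdev shrinks toward zero, but mdev_max is reset to rto_min each window, so rttvar has nothing smaller to decay toward.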
So is there a simple way to adjust this 200 ms limit? Look again at the code of tcp_rto_min, already shown above:
static inline u32 tcp_rto_min(struct sock *sk)
{
    struct dst_entry *dst = __sk_dst_get(sk);
    u32 rto_min = TCP_RTO_MIN; /* 200ms */
    if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
        rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
    return rto_min;
}
The code shows that if an rto_min value is set on the route entry for the destination, that value is used instead. It can be changed through the netlink mechanism. Concretely, this is done by adding the rto_min option to an ip route command.
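The selection logic in tcp_rto_min can be modeled without kernel types. In this sketch, struct route_metric is a hypothetical stand-in for the route's dst entry, a null pointer stands for a socket with no route attached, and DEFAULT_RTO_MIN_MS stands in for TCP_RTO_MIN under the assumption that HZ=1000. Note that the kernel excerpt honors the route value only when the metric is locked.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEFAULT_RTO_MIN_MS 200u /* stands in for TCP_RTO_MIN, assuming HZ=1000 */

/* Hypothetical stand-in for the route metric consulted by tcp_rto_min(). */
struct route_metric {
    bool rto_min_locked; /* the rto_min metric carries the "lock" flag */
    uint32_t rto_min_ms; /* value configured on the route */
};

/* Route value wins only if a route exists and its rto_min is locked. */
uint32_t effective_rto_min(const struct route_metric *r)
{
    if (r && r->rto_min_locked)
        return r->rto_min_ms;
    return DEFAULT_RTO_MIN_MS;
}
```

A route with a locked rto_min of 20 yields 20, while an unlocked metric or a missing route falls back to the 200 ms default.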
With the source analysis done, let's try it out.
Run the following command to change the value to 20 ms:
sudo ip route add 123.125.104.197/32 via 10.209.83.254 rto_min 20
Check the result of the change:
[xiaohong@localhost ~]$ ip route list
default via 10.209.83.254 dev enp0s3 proto static metric 1024
10.209.80.0/22 dev enp0s3 proto kernel scope link src 10.209.80.111
123.125.104.197 via 10.209.83.254 dev enp0s3 rto_min lock 20ms
Clear the cached TCP metrics so that the effect shows up immediately:
sudo ip tcp_metrics flush
Then test access to weibo.com again:
[xiaohong@localhost ~]$ nc 80
GET /
Confirm the result in another terminal:
[xiaohong@localhost iproute2-3.19.0]$ ss -eipn '( dport = :www )'
tcp ESTAB 0 0 10.209.80.111:56487 123.125.104.197:80 users:(("nc",1786,3)) uid:1000 ino:14606 sk:ffff88002c992d00 <->
ts sack cubic wscale:0,7 rto:22 rtt:2/1 mss:1448 cwnd:10 send 57.9Mbps rcv_space:14600
As you can see, the RTT this time is 2 ms and the RTO is 22 ms, so the setting has taken effect.
Discussion is welcome, and so is criticism. Heh.