Wireshark & Packetdrill | Silly Window Syndrome
Original
7ACE
Echo Reply
January 5, 2026, 08:08, Jiangsu
Know when to play dumb.
Objective
Building on a packetdrill TCP three-way-handshake script, this experiment emulates a server-side scenario to study how the receiver avoids Silly Window Syndrome.
Base script
# cat tcp_sws_000.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4
#
Silly Window Syndrome
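Silly Window Syndrome (SWS) describes the two endpoints of a connection (sender and receiver) working in an inefficient way, so that the network carries a large number of TCP segments with tiny payloads (for example, a single byte). A vivid analogy: picture a huge factory (the sender) and a huge warehouse (the receiver) connected by a truck (a TCP segment) that carries only one part (1 byte of data) per trip. The truck itself is big (40 bytes of IP and TCP headers), yet its cargo is tiny. Transport efficiency is therefore extremely low, with most of the fuel and resources spent on moving the truck rather than the cargo. In short, the root cause of SWS is an application that frequently reads or writes small amounts of data, combined with a TCP that is too "eager" to react to each of those operations immediately.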
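To put a number on the truck analogy, here is a minimal sketch of the payload-to-wire ratio. It assumes the 40-byte figure means a 20-byte IPv4 header plus a 20-byte TCP header without options, and it ignores link-layer framing; the function name is made up for illustration.

```python
def payload_efficiency(payload_bytes: int, header_bytes: int = 40) -> float:
    """Fraction of the bytes in one packet that are application payload."""
    if payload_bytes < 0 or header_bytes < 0:
        raise ValueError("sizes must be non-negative")
    total = payload_bytes + header_bytes
    return payload_bytes / total if total else 0.0


# Compare a 1-byte "silly" segment with a full 1460-byte segment.
for size in (1, 100, 1460):
    print(f"{size:>5} B payload -> {payload_efficiency(size):.1%} of the packet is data")
```

With these assumptions a 1-byte segment is roughly 2.4% data, while a full 1460-byte segment is roughly 97.3% data.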
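The receiver avoids the problem with Clark's algorithm, whose core idea is to never advertise a small window: when replying with an ACK, if the available receive buffer is too small, it advertises a zero window outright; when the application reads data, a new window size is computed from the free buffer only once the free receive buffer exceeds a certain threshold, and a window update is sent immediately only if that new window size is at least twice the current receive window. The sender avoids the problem with Nagle's algorithm, whose core idea is to coalesce small writes: while previously sent data is still unacknowledged, the sender buffers subsequent small writes until they can be merged into a larger segment or until an ACK arrives, which reduces the number of small packets on the network. The two work together, one by "suppressing the trigger" and the other by "actively delaying", to keep TCP transfers efficient and keep the network from being flooded with tiny packets.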
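The sender-side rule described above can be sketched as a single predicate. This is a simplified model of the basic Nagle idea only, not the Linux implementation: it ignores TCP_NODELAY, PSH handling, and the peer's window, and the function and parameter names are hypothetical.

```python
def nagle_should_send(queued_bytes: int, mss: int, unacked_bytes: int) -> bool:
    """Return True if a basic Nagle sender would transmit right now."""
    if queued_bytes <= 0:
        return False               # nothing queued
    if queued_bytes >= mss:
        return True                # a full-sized segment can always go out
    return unacked_bytes == 0      # small data waits while anything is in flight


# A 1-byte write goes out at once on an idle connection,
# but is held back while earlier data is still unacknowledged.
print(nagle_should_send(1, 1460, 0))
print(nagle_should_send(1, 1460, 100))
```

The receiver-side counterpart (Clark's rule) is what the remaining experiments exercise.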
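Basic tests
The sender-side Nagle algorithm was covered earlier in the two articles "TCP Nagle Algorithm" and "TCP Nagle Algorithm (Continued)", so it is not repeated here; the experiments below only study how the receiver avoids Silly Window Syndrome.
The first case is the baseline scenario. The script below sets the receive buffer via SO_RCVBUF.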
# cat tcp_sws_001.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3000], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4

+0.01 < P. 1:1461(1460) ack 1 win 10000
+0 < P. 1461:2921(1460) ack 1 win 10000

+0 `sleep 1`
#
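The value passed to SO_RCVBUF is not necessarily the value the kernel ends up using. On Linux, socket(7) documents that the kernel doubles the requested value to leave room for bookkeeping overhead and that getsockopt() returns the doubled value. The sketch below only reads the setting back; the helper name is hypothetical, other operating systems may report the value unchanged, and how the result maps to the window advertised in the SYN-ACK depends on kernel version and sysctl settings, which this sketch does not model.

```python
import socket


def effective_rcvbuf(requested: int) -> int:
    """Set SO_RCVBUF on a fresh TCP socket and return what the kernel reports."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


print("requested 3000, kernel reports", effective_rcvbuf(3000))
```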
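The packets captured with tcpdump are shown below. After the first data segment arrives, the server responds with an ACK immediately (a Quick ACK); for the second data segment, the server's ACK becomes a Delayed ACK, sent a little over 40 ms later.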
# packetdrill tcp_sws_001.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
11:57:58.192829 tun0  In  IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [S], seq 0, win 10000, options [mss 1460], length 0
11:57:58.192936 tun0  Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [S.], seq 1285251059, ack 1, win 2920, options [mss 1460], length 0
11:57:58.203116 tun0  In  IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [.], ack 1, win 10000, length 0
11:57:58.213220 tun0  In  IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [P.], seq 1:1461, ack 1, win 10000, length 1460: HTTP
11:57:58.213249 tun0  Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [.], ack 1461, win 1460, length 0
11:57:58.213262 tun0  In  IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [P.], seq 1461:2921, ack 1, win 10000, length 1460: HTTP
11:57:58.256181 tun0  Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [.], ack 2921, win 0, length 0
11:57:59.216422 ?     Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [R.], seq 1, ack 2921, win 2920, length 0
11:57:59.216449 ?     In  IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [R.], seq 2921, ack 1, win 10000, length 0
#
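The Quick ACK versus Delayed ACK timing can be read off the timestamps directly. The sketch below parses the four data and ACK lines from the capture above and pairs each outbound ACK with the inbound segment just before it. The regular expression is an assumption that only covers the tcpdump line shape seen in this article, and the function names are made up for illustration.

```python
import re
from datetime import datetime, timedelta

LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6})\s+\S+\s+(?P<dir>In|Out)\s+IP\s.*?"
    r"Flags \[(?P<flags>[^\]]+)\].*?(?:ack (?P<ack>\d+), )?win (?P<win>\d+)"
)

CAPTURE = """\
11:57:58.213220 tun0  In  IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [P.], seq 1:1461, ack 1, win 10000, length 1460: HTTP
11:57:58.213249 tun0  Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [.], ack 1461, win 1460, length 0
11:57:58.213262 tun0  In  IP 192.0.2.1.56939 > 192.168.63.69.8080: Flags [P.], seq 1461:2921, ack 1, win 10000, length 1460: HTTP
11:57:58.256181 tun0  Out IP 192.168.63.69.8080 > 192.0.2.1.56939: Flags [.], ack 2921, win 0, length 0
"""


def parse(line: str) -> dict:
    """Extract time, direction, ack and win from one tcpdump line."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return {
        "time": datetime.strptime(m["ts"], "%H:%M:%S.%f"),
        "dir": m["dir"],
        "ack": int(m["ack"]) if m["ack"] else None,
        "win": int(m["win"]),
    }


def ack_delays_us(records: list) -> list:
    """For each outbound packet, microseconds since the previous inbound one."""
    delays, last_in = [], None
    for r in records:
        if r["dir"] == "In":
            last_in = r
        elif last_in is not None:
            delta = (r["time"] - last_in["time"]) // timedelta(microseconds=1)
            delays.append((r["ack"], r["win"], delta))
    return delays


records = [parse(line) for line in CAPTURE.splitlines()]
for ack, win, delay in ack_delays_us(records):
    print(f"ack {ack} win {win}: sent {delay} us after the data segment")
```

The first ACK follows its segment by tens of microseconds, the second by about 43 ms, which matches the Quick ACK and Delayed ACK reading of the capture.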
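The following experiments distinguish the behavior by how much data the application reads. The script is modified as below, first reading 1460 bytes.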
# cat tcp_sws_002.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3000], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4

+0.01 < P. 1:1461(1460) ack 1 win 10000
+0 < P. 1461:2921(1460) ack 1 win 10000

+0.01 read(4,...,1460) = 1460

+0 `sleep 1`
#
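The packets captured with tcpdump are shown below. After the first data segment arrives, the server responds with an ACK immediately (a Quick ACK); for the second data segment, the server's ACK becomes a Delayed ACK and is held back. 10 ms later the application reads 1460 bytes, at which point the kernel decides an ACK is needed and sends it. That ACK carries win 1460, meaning the receive window has been updated to 1460.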
# packetdrill tcp_sws_002.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
11:53:14.992640 tun0  In  IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [S], seq 0, win 10000, options [mss 1460], length 0
11:53:14.992715 tun0  Out IP 192.168.234.100.8080 > 192.0.2.1.43459: Flags [S.], seq 3858090574, ack 1, win 2920, options [mss 1460], length 0
11:53:15.002951 tun0  In  IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [.], ack 1, win 10000, length 0
11:53:15.013220 tun0  In  IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [P.], seq 1:1461, ack 1, win 10000, length 1460: HTTP
11:53:15.013303 tun0  Out IP 192.168.234.100.8080 > 192.0.2.1.43459: Flags [.], ack 1461, win 1460, length 0
11:53:15.013334 tun0  In  IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [P.], seq 1461:2921, ack 1, win 10000, length 1460: HTTP
11:53:15.023573 tun0  Out IP 192.168.234.100.8080 > 192.0.2.1.43459: Flags [.], ack 2921, win 1460, length 0
11:53:16.052373 ?     Out IP 192.168.234.100.8080 > 192.0.2.1.43459: Flags [R.], seq 1, ack 2921, win 2920, length 0
11:53:16.052406 ?     In  IP 192.0.2.1.43459 > 192.168.234.100.8080: Flags [R.], seq 2921, ack 1, win 10000, length 0
#
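One way to see what the window update means is to track the right edge of the advertised window, that is, ack + win. The sketch below compares the server's ACKs from the baseline capture (tcp_sws_001, no read) with those from this capture (tcp_sws_002, read 1460). It assumes no window scaling, which matches the handshake here because the SYN carries only an mss option; the helper names are made up for illustration.

```python
def right_edge(ack: int, win: int, wscale: int = 0) -> int:
    """First sequence number beyond what the receiver currently allows."""
    return ack + (win << wscale)


def edge_growth(acks: list) -> int:
    """How far the right edge moved between the first and last ACK."""
    edges = [right_edge(ack, win) for ack, win in acks]
    return edges[-1] - edges[0]


# (ack, win) pairs of the server's two data ACKs, copied from the captures.
no_read = [(1461, 1460), (2921, 0)]       # tcp_sws_001
read_1460 = [(1461, 1460), (2921, 1460)]  # tcp_sws_002

print("no read:   right edge moved by", edge_growth(no_read))
print("read 1460: right edge moved by", edge_growth(read_1460))
```

Without a read the right edge stays at 2921, so the sender has no room left; after the 1460-byte read it moves to 4381, exactly one more MSS.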
Modify the script as follows, changing the read from 1460 bytes to 1459 bytes.
# cat tcp_sws_003.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3000], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4

+0.01 < P. 1:1461(1460) ack 1 win 10000
+0 < P. 1461:2921(1460) ack 1 win 10000

+0.01 read(4,...,1459) = 1459

+0 `sleep 1`
#
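The packets captured with tcpdump are shown below. After the first data segment arrives, the server responds with an ACK immediately (a Quick ACK); for the second data segment, the server's ACK becomes a Delayed ACK. 10 ms later the application reads 1459 bytes, but this time the condition for sending an ACK is not met, so no window-update packet is triggered. The ACK still goes out as a delayed ACK a little over 40 ms later, and it carries win 0, meaning the receive window is 0, even though the application has read 1459 bytes and freed that much space.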
# packetdrill tcp_sws_003.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:01:05.644565 tun0  In  IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [S], seq 0, win 10000, options [mss 1460], length 0
12:01:05.644642 tun0  Out IP 192.168.180.116.8080 > 192.0.2.1.46221: Flags [S.], seq 1090302568, ack 1, win 2920, options [mss 1460], length 0
12:01:05.654805 tun0  In  IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [.], ack 1, win 10000, length 0
12:01:05.664970 tun0  In  IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [P.], seq 1:1461, ack 1, win 10000, length 1460: HTTP
12:01:05.664998 tun0  Out IP 192.168.180.116.8080 > 192.0.2.1.46221: Flags [.], ack 1461, win 1460, length 0
12:01:05.665013 tun0  In  IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [P.], seq 1461:2921, ack 1, win 10000, length 1460: HTTP
12:01:05.708229 tun0  Out IP 192.168.180.116.8080 > 192.0.2.1.46221: Flags [.], ack 2921, win 0, length 0
12:01:06.688384 ?     Out IP 192.168.180.116.8080 > 192.0.2.1.46221: Flags [R.], seq 1, ack 2921, win 2920, length 0
12:01:06.688413 ?     In  IP 192.0.2.1.46221 > 192.168.180.116.8080: Flags [R.], seq 2921, ack 1, win 10000, length 0
#
The different outcomes of these two read scenarios come mainly from the tcp_cleanup_rbuf() and __tcp_select_window() functions; the key is the value computed by new_window = __tcp_select_window(sk).
- read 1460 scenario
In __tcp_select_window(), because tp->rx_opt.rcv_wscale is 0, window = tp->rcv_wnd, whose value is 1460;
In tcp_cleanup_rbuf(), new_window is 1460, which satisfies new_window && new_window >= 2 * rcv_window_now, so time_to_ack = true and tcp_send_ack() is called to send an ACK. The Window value in that ACK is in turn determined by tcp_select_window() and is 1460.
- read 1459 scenario
On the surface the difference is only 1 byte, but in __tcp_select_window() free_space is significantly lower: it is not below 1/16 of the total receive buffer, but it is below mss, so the function returns 0 directly;
In tcp_cleanup_rbuf(), new_window is 0, which does not satisfy new_window && new_window >= 2 * rcv_window_now, so time_to_ack = false and tcp_send_ack() is not called. The Delayed ACK stays pending and is sent only when the delayed-ACK timer expires. The Window value in that ACK is likewise determined by tcp_select_window() and is 0.
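The decision in both bullets can be sketched in a few lines. This is a simplified Python model of the window-update test only: it leaves out the delayed-ACK branch and the window_clamp pre-check of tcp_cleanup_rbuf(). The values of rcv_wup, rcv_wnd and rcv_nxt are inferred from the captures (last ACK sent was ack 1461 with win 1460, and 1460 more bytes have arrived since), not read from the kernel, and the first helper assumes tcp_receive_window() computes rcv_wup + rcv_wnd - rcv_nxt floored at zero.

```python
def receive_window_now(rcv_wup: int, rcv_wnd: int, rcv_nxt: int) -> int:
    """Window still open from the last advertisement, floored at 0."""
    return max(rcv_wup + rcv_wnd - rcv_nxt, 0)


def should_send_window_update(new_window: int, rcv_window_now: int) -> bool:
    """The non-zero and 'at least twice the current window' test."""
    return new_window != 0 and new_window >= 2 * rcv_window_now


# State inferred from the captures just before read() is called.
now = receive_window_now(rcv_wup=1461, rcv_wnd=1460, rcv_nxt=2921)
print("rcv_window_now =", now)

# read(1460): __tcp_select_window() yields 1460 -> immediate window update.
print("read 1460:", should_send_window_update(1460, now))
# read(1459): __tcp_select_window() yields 0 -> nothing sent, delayed ACK remains.
print("read 1459:", should_send_window_update(0, now))
```

Under these inferred values rcv_window_now is 0, so any non-zero new_window passes the "twice" test; the read 1459 case fails only because new_window itself is 0.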
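The scenarios above show how different read sizes lead to different behavior: one triggers an immediate window-update ACK, the other leaves a delayed ACK. The next scenario continues with a small read after the Delayed ACK and simulates a TCP ZeroWindow Probe that triggers an ACK reply, which is in effect a TCP ZeroWindow Probe ACK.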
# cat tcp_sws_004.pkt
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3000], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 10000 <mss 1460>
+0 > S. 0:0(0) ack 1 <...>
+0.01 < . 1:1(0) ack 1 win 10000
+0 accept(3, ..., ...) = 4

+0.01 < P. 1:1461(1460) ack 1 win 10000
+0 < P. 1461:2921(1460) ack 1 win 10000

+0.1 read(4,...,1000) = 1000

+0.01 < . 2920:2920(0) ack 1 win 10000

+0 `sleep 1`
#
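The packets captured with tcpdump are shown below. After the first data segment arrives, the server responds with an ACK immediately (a Quick ACK); for the second data segment, the server's ACK becomes a Delayed ACK, sent a little over 40 ms later. The server then reads 1000 bytes and, as in the previous experiment, the condition is not met, so no window-update ACK is triggered.
After that, the client sends a TCP ZeroWindow Probe, which makes the server respond with an ACK whose Window value is 0. This step has nothing to do with tcp_cleanup_rbuf(): the ACK is sent because a TCP ZeroWindow Probe was received. While building that ACK, the kernel still calls tcp_select_window() and __tcp_select_window(), which again decide based on free_space: it is not below 1/16 of the total receive buffer, but it is below mss, so 0 is returned.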
# packetdrill tcp_sws_004.pkt
#
# tcpdump -i any -nn port 8080
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
16:06:10.232572 tun0  In  IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [S], seq 0, win 10000, options [mss 1460], length 0
16:06:10.232600 tun0  Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [S.], seq 2494482824, ack 1, win 2920, options [mss 1460], length 0
16:06:10.242747 tun0  In  IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [.], ack 1, win 10000, length 0
16:06:10.252826 tun0  In  IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [P.], seq 1:1461, ack 1, win 10000, length 1460: HTTP
16:06:10.252850 tun0  Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [.], ack 1461, win 1460, length 0
16:06:10.252870 tun0  In  IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [P.], seq 1461:2921, ack 1, win 10000, length 1460: HTTP
16:06:10.296221 tun0  Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [.], ack 2921, win 0, length 0
16:06:10.362940 tun0  In  IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [.], ack 1, win 10000, length 0
16:06:10.362958 tun0  Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [.], ack 2921, win 0, length 0
16:06:11.367787 ?     Out IP 192.168.218.189.8080 > 192.0.2.1.38475: Flags [R.], seq 1, ack 2921, win 2920, length 0
16:06:11.367817 ?     In  IP 192.0.2.1.38475 > 192.168.218.189.8080: Flags [R.], seq 2920, ack 1, win 10000, length 0
#
Code reference
/* Clean up the receive buffer for full frames taken by the user,
 * then send an ACK if necessary. COPIED is the number of bytes
 * tcp_recvmsg has given to the user so far, it speeds up the
 * calculation of whether or not we must ACK for the sake of
 * a window update.
 */
void tcp_cleanup_rbuf(struct sock *sk, int copied)
{
	struct tcp_sock *tp = tcp_sk(sk);
	bool time_to_ack = false;

	struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);

	WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
	     "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
	     tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);

	if (inet_csk_ack_scheduled(sk)) {
		const struct inet_connection_sock *icsk = inet_csk(sk);

		if (/* Once-per-two-segments ACK was not sent by tcp_input.c */
		    tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
		    /*
		     * If this read emptied read buffer, we send ACK, if
		     * connection is not bidirectional, user drained
		     * receive buffer and there was a small segment
		     * in queue.
		     */
		    (copied > 0 &&
		     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
		      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
		       !inet_csk_in_pingpong_mode(sk))) &&
		      !atomic_read(&sk->sk_rmem_alloc)))
			time_to_ack = true;
	}

	/* We send an ACK if we can now advertise a non-zero window
	 * which has been raised "significantly".
	 *
	 * Even if window raised up to infinity, do not send window open ACK
	 * in states, where we will not receive more. It is useless.
	 */
	if (copied > 0 && !time_to_ack && !(sk->sk_shutdown & RCV_SHUTDOWN)) {
		__u32 rcv_window_now = tcp_receive_window(tp);

		/* Optimize, __tcp_select_window() is not cheap. */
		if (2*rcv_window_now <= tp->window_clamp) {
			__u32 new_window = __tcp_select_window(sk);

			/* Send ACK now, if this read freed lots of space
			 * in our buffer. Certainly, new_window is new window.
			 * We can advertise it now, if it is not less than current one.
			 * "Lots" means "at least twice" here.
			 */
			if (new_window && new_window >= 2 * rcv_window_now)
				time_to_ack = true;
		}
	}
	if (time_to_ack)
		tcp_send_ack(sk);
}

u32 __tcp_select_window(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	/* MSS for the peer's data. Previous versions used mss_clamp
	 * here. I don't know if the value based on our guesses
	 * of peer's MSS is better for the performance. It's more correct
	 * but may be worse for the performance because of rcv_mss
	 * fluctuations. --SAW 1998/11/1
	 */
	int mss = icsk->icsk_ack.rcv_mss;
	int free_space = tcp_space(sk);
	int allowed_space = tcp_full_space(sk);
	int full_space, window;

	if (sk_is_mptcp(sk))
		mptcp_space(sk, &free_space, &allowed_space);

	full_space = min_t(int, tp->window_clamp, allowed_space);

	if (unlikely(mss > full_space)) {
		mss = full_space;
		if (mss <= 0)
			return 0;
	}
	if (free_space < (full_space >> 1)) {
		icsk->icsk_ack.quick = 0;

		if (tcp_under_memory_pressure(sk))
			tcp_adjust_rcv_ssthresh(sk);

		/* free_space might become our new window, make sure we don't
		 * increase it due to wscale.
		 */
		free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);

		/* if free space is less than mss estimate, or is below 1/16th
		 * of the maximum allowed, try to move to zero-window, else
		 * tcp_clamp_window() will grow rcv buf up to tcp_rmem[2], and
		 * new incoming data is dropped due to memory limits.
		 * With large window, mss test triggers way too late in order
		 * to announce zero window in time before rmem limit kicks in.
		 */
		if (free_space < (allowed_space >> 4) || free_space < mss)
			return 0;
	}

	if (free_space > tp->rcv_ssthresh)
		free_space = tp->rcv_ssthresh;

	/* Don't do rounding if we are using window scaling, since the
	 * scaled window will not line up with the MSS boundary anyway.
	 */
	if (tp->rx_opt.rcv_wscale) {
		window = free_space;

		/* Advertise enough space so that it won't get scaled away.
		 * Import case: prevent zero window announcement if
		 * 1<<rcv_wscale > mss.
		 */
		window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
	} else {
		window = tp->rcv_wnd;
		/* Get the largest window that is a nice multiple of mss.
		 * Window clamp already applied above.
		 * If our current window offering is within 1 mss of the
		 * free space we just keep it. This prevents the divide
		 * and multiply from happening most of the time.
		 * We also don't do any window rounding when the free space
		 * is too small.
		 */
		if (window <= free_space - mss || window > free_space)
			window = rounddown(free_space, mss);
		else if (mss == full_space &&
			 free_space > window + (full_space >> 1))
			window = free_space;
	}

	return window;
}
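To experiment with the zero-window rule without a kernel, the no-window-scaling path of __tcp_select_window() can be mirrored in Python. This is a simplified sketch, not the kernel function: it drops MPTCP, memory-pressure handling, and the rcv_wscale branch, and it takes every input as a plain integer. The numbers in the example calls are illustrative assumptions chosen to land on either side of the mss test. The real free_space comes from tcp_space(), which depends on socket-buffer memory accounting, so actual kernel values will differ.

```python
def select_window(mss: int, free_space: int, allowed_space: int,
                  window_clamp: int, rcv_ssthresh: int, rcv_wnd: int) -> int:
    """Simplified model of __tcp_select_window() with rcv_wscale == 0."""
    if mss <= 0:
        raise ValueError("mss must be positive")  # guard added for this sketch
    full_space = min(window_clamp, allowed_space)
    if mss > full_space:
        mss = full_space
        if mss <= 0:
            return 0
    if free_space < (full_space >> 1):
        # Less than half the buffer is free: prefer a zero window over a
        # tiny one if free space is below 1/16 of the buffer or below one MSS.
        if free_space < (allowed_space >> 4) or free_space < mss:
            return 0
    free_space = min(free_space, rcv_ssthresh)
    window = rcv_wnd
    if window <= free_space - mss or window > free_space:
        window = (free_space // mss) * mss   # round down to a multiple of MSS
    elif mss == full_space and free_space > window + (full_space >> 1):
        window = free_space
    return window


# Illustrative inputs only: one full MSS free versus one byte short of it.
common = dict(mss=1460, allowed_space=3000, window_clamp=2920,
              rcv_ssthresh=2920, rcv_wnd=1460)
print(select_window(free_space=1460, **common))
print(select_window(free_space=1459, **common))
```

With these assumed inputs the first call keeps the 1460-byte window and the second returns 0, the same one-byte cliff the read 1460 and read 1459 experiments show.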
Reposted from: Echo Reply, 7ACE, "Wireshark & Packetdrill | Silly Window Syndrome"