Linux TSO (TCP segmentation offload) - what it means and how to enable/disable it
Why am I seeing packets larger than the MTU in tcpdump?
You have probably seen at least one Wireshark or tcpdump capture showing strange packet sizes, well above the usual MSS and MTU values (1460 and 1500 bytes for Ethernet).
Before you begin, read:
How the Linux TCP output engine works.
Here is an example:
Code:
03:52:23.511915 IP 1.1.1.1.31586 > 2.2.2.2.80: Flags [S], seq 92802589, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 3052623680 ecr 0], length 0
03:52:23.511952 IP 2.2.2.2.80 > 1.1.1.1.31586: Flags [S.], seq 512157486, ack 92802590, win 14480, options [mss 1460,sackOK,TS val 303700 ecr 3052623680,nop,wscale 9], length 0
03:52:23.614824 IP 1.1.1.1.31586 > 2.2.2.2.80: Flags [.], ack 1, win 4117, options [nop,nop,TS val 3052623784 ecr 303700], length 0
03:52:23.615236 IP 1.1.1.1.31586 > 2.2.2.2.80: Flags [P.], seq 1:116, ack 1, win 4117, options [nop,nop,TS val 3052623784 ecr 303700], length 115
03:52:23.615256 IP 2.2.2.2.80 > 1.1.1.1.31586: Flags [.], ack 116, win 29, options [nop,nop,TS val 303725 ecr 3052623784], length 0
03:52:23.615428 IP 2.2.2.2.80 > 1.1.1.1.31586: Flags [.], seq 1:14481, ack 116, win 29, options [nop,nop,TS val 303726 ecr 3052623784], length 14480
03:52:23.720612 IP 1.1.1.1.31586 > 2.2.2.2.80: Flags [.], ack 2897, win 4077, options [nop,nop,TS val 3052623888 ecr 303726], length 0
03:52:23.721956 IP 1.1.1.1.31586 > 2.2.2.2.80: Flags [.], ack 5793, win 4095, options [nop,nop,TS val 3052623889 ecr 303726], length 0
03:52:23.721982 IP 2.2.2.2.80 > 1.1.1.1.31586: Flags [.], seq 14481:23169, ack 116, win 29, options [nop,nop,TS val 303752 ecr 3052623889], length 8688
03:52:23.722413 IP 1.1.1.1.31586 > 2.2.2.2.80: Flags [.], ack 8689, win 4095, options [nop,nop,TS val 3052623890 ecr 303726], length 0
03:52:23.722431 IP 1.1.1.1.31586 > 2.2.2.2.80: Flags [.], ack 11585, win 4095, options [nop,nop,TS val 3052623890 ecr 303726], length 0
The packet at timestamp 03:52:23.615236 comes from the client and carries the TCP push flag (it asks the receiving host to hand the data up the stack, from kernel to application, without waiting for more). This is most probably the GET request sent by the browser, 115 bytes long.
The packet at 03:52:23.615256 is an empty ACK: the server kernel acknowledges the previous segment sent by the client. The next packet, at timestamp 03:52:23.615428, is a TCP packet from server to client with the actual payload. Note the length: 14480.
A packet of this length cannot be sent over an Ethernet link with an MTU of 1500. Here is the thing: tcpdump and Wireshark capture packets in the kernel's capture infrastructure (the BPF filter on Linux), so they show what the kernel handed down, not what is actually sent by the network card. So what does this mean?
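To pick such packets out of a saved text capture, a small script can compare the "length" field of each line against the MSS. This is only a sketch: the function name oversized_segments is made up for this example, and it assumes the default tcpdump text format shown above.

```python
import re

# tcpdump prints the TCP payload size at the end of each line as "length N".
LENGTH_RE = re.compile(r"length (\d+)\s*$")


def oversized_segments(lines, mss=1460):
    """Return (timestamp, length) for every line whose payload exceeds mss."""
    hits = []
    for line in lines:
        match = LENGTH_RE.search(line)
        if match and int(match.group(1)) > mss:
            timestamp = line.split()[0]
            hits.append((timestamp, int(match.group(1))))
    return hits


# Three lines copied from the capture above.
capture = [
    "03:52:23.615236 IP 1.1.1.1.31586 > 2.2.2.2.80: Flags [P.], seq 1:116, ack 1, win 4117, options [nop,nop,TS val 3052623784 ecr 303700], length 115",
    "03:52:23.615428 IP 2.2.2.2.80 > 1.1.1.1.31586: Flags [.], seq 1:14481, ack 116, win 29, options [nop,nop,TS val 303726 ecr 3052623784], length 14480",
    "03:52:23.721982 IP 2.2.2.2.80 > 1.1.1.1.31586: Flags [.], seq 14481:23169, ack 116, win 29, options [nop,nop,TS val 303752 ecr 3052623889], length 8688",
]

for timestamp, length in oversized_segments(capture):
    print(timestamp, length)
```

Only the two server packets (14480 and 8688 bytes) are reported; the 115-byte request fits in a single segment.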
TCP Segmentation offload - TSO
To save CPU time, the Linux/FreeBSD/Windows kernel looks at the client's receive window and at its own send window for the connection, then pushes down as much data as these limits permit in one large buffer.
TCP segmentation offload lets the system do the TCP segmentation in the NIC (driver and hardware) instead of on the main CPU in the kernel's TCP stack.
In this case the client's receive window is 4117*2^6 = 263488 bytes. The server's initial send window is 10 TCP segments, so the kernel prepares a buffer of at most 10*1460 bytes. The segments end up carrying 1448 bytes each, because the 12 bytes of the TCP timestamp option (with padding) come out of the 1460-byte MSS, which gives 10*1448 = 14480 bytes. This buffer is passed from the Linux kernel to the interface driver for the actual segmentation, along with the parameters the NIC needs to cut it into wire-sized segments.
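The arithmetic behind these numbers can be checked with a few lines of Python. This is a worked example under assumptions, not kernel code: the helper names are invented here, and the 12 option bytes are the TCP timestamp option (10 bytes) plus the two NOP padding bytes visible in the capture.

```python
def scaled_window(win_field, wscale):
    """Receive window in bytes: the advertised field shifted by the scale factor."""
    return win_field << wscale


def payload_per_segment(negotiated_mss, tcp_option_bytes):
    """Bytes of payload left in a segment once TCP options are subtracted."""
    return negotiated_mss - tcp_option_bytes


# Timestamp option (10 bytes) plus two NOPs of padding, as seen in the capture.
TIMESTAMP_OPTION_BYTES = 12

client_window = scaled_window(4117, 6)           # win 4117, wscale 6
segment_payload = payload_per_segment(1460, TIMESTAMP_OPTION_BYTES)
initial_burst = 10 * segment_payload             # initial window of 10 segments

print(client_window)    # 263488
print(segment_payload)  # 1448
print(initial_burst)    # 14480
```

The last value matches the 14480-byte packet in the capture, and it is far below the client's receive window, so the initial window of 10 segments is the limit that applies here.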
Check if TSO is disabled/enabled
To confirm that this is TCP segmentation offload (the NIC, not the kernel, is performing the TCP segmentation), use "ethtool":
Code:
root@server:~# ethtool --show-offload eth0    # short form: ethtool -k eth0
Features for eth0:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-unneeded: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
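When many hosts or interfaces have to be checked, the same output can be parsed by a script. The sketch below assumes the "name: on|off [fixed]" layout shown above; the function names and the shortened sample text are made up for this example. On a real system the text could come from subprocess.run(["ethtool", "-k", "eth0"], capture_output=True, text=True).stdout.

```python
def parse_offload_features(ethtool_output):
    """Map each feature name to {'enabled': bool, 'fixed': bool}."""
    features = {}
    for line in ethtool_output.splitlines():
        name, separator, value = line.partition(":")
        words = value.split()
        if not separator or not words:
            continue  # skips the "Features for eth0:" header and blank lines
        features[name.strip()] = {
            "enabled": words[0] == "on",
            "fixed": "[fixed]" in words,
        }
    return features


def large_segments_expected(features):
    """True if TSO or GSO is on, i.e. tcpdump may show packets above the MTU."""
    return any(
        features.get(name, {}).get("enabled", False)
        for name in ("tcp-segmentation-offload", "generic-segmentation-offload")
    )


# A few lines taken from the output above.
SAMPLE_OUTPUT = """Features for eth0:
rx-checksumming: off [fixed]
tcp-segmentation-offload: on
generic-segmentation-offload: on
large-receive-offload: off [fixed]
"""

features = parse_offload_features(SAMPLE_OUTPUT)
print(features["tcp-segmentation-offload"])
print(large_segments_expected(features))
```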
Disable TCP segmentation offload:
Code:
root@server:~# ethtool -K eth0 tso off
root@server:~# ethtool -K eth0 gso off
Check that TCP segmentation offload is disabled:
Code:
root@server:~# ethtool -k eth0
Features for eth0:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-unneeded: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: on
generic-segmentation-offload: off
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
Let's confirm with tcpdump. The 14480-byte burst now shows up as ten separate 1448-byte segments:
Code:
04:28:07.291329 IP 1.1.1.1.18236 > 2.2.2.2.80: Flags [S], seq 2840543612, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 3054767697 ecr 0], length 0
04:28:07.291383 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [S.], seq 186047052, ack 2840543613, win 14480, options [mss 1460,sackOK,TS val 118394 ecr 3054767697,nop,wscale 9], length 0
04:28:07.397891 IP 1.1.1.1.18236 > 2.2.2.2.80: Flags [.], ack 1, win 4117, options [nop,nop,TS val 3054767805 ecr 118394], length 0
04:28:07.398712 IP 1.1.1.1.18236 > 2.2.2.2.80: Flags [P.], seq 1:116, ack 1, win 4117, options [nop,nop,TS val 3054767805 ecr 118394], length 115
04:28:07.398735 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 0
04:28:07.398896 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 1:1449, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
04:28:07.398962 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 1449:2897, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
04:28:07.398968 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 2897:4345, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
04:28:07.398975 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 4345:5793, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
04:28:07.398981 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 5793:7241, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
04:28:07.398988 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 7241:8689, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
04:28:07.398995 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 8689:10137, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
04:28:07.399000 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 10137:11585, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
04:28:07.399006 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 11585:13033, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
04:28:07.399012 IP 2.2.2.2.80 > 1.1.1.1.18236: Flags [.], seq 13033:14481, ack 116, win 29, options [nop,nop,TS val 118421 ecr 3054767805], length 1448
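A quick sanity check shows that these ten segments carry exactly the data that appeared as one 14480-byte packet in the first capture. The seq ranges below are copied from the capture above; the helper names are invented for this sketch.

```python
# (start, end) sequence ranges of the ten server segments in the capture above.
segments = [
    (1, 1449), (1449, 2897), (2897, 4345), (4345, 5793), (5793, 7241),
    (7241, 8689), (8689, 10137), (10137, 11585), (11585, 13033), (13033, 14481),
]


def is_contiguous(ranges):
    """True if every segment starts where the previous one ended."""
    return all(prev[1] == cur[0] for prev, cur in zip(ranges, ranges[1:]))


def total_payload(ranges):
    """Sum of the payload bytes carried by all segments."""
    return sum(end - start for start, end in ranges)


print(is_contiguous(segments))   # True
print(total_payload(segments))   # 14480
```

Same bytes, same sequence numbers: the only difference is that the kernel now does the segmentation itself, so tcpdump sees the wire-sized segments.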
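Note: on Linux, turning off TSO alone is not enough. As long as generic segmentation offload (GSO) is enabled, the kernel keeps building oversized segments and splits them in software just before they reach the driver, so tcpdump still shows them. Disable GSO as well (ethtool -K eth0 gso off).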
Disabling TSO persistently on Debian:
There are two possible ways:
1. Add the ethtool commands to /etc/rc.local (straightforward).
2. Use /etc/network/interfaces and add the "pre-up" lines below "iface eth0 inet static":
Code:
root@server:~# cat /etc/network/interfaces
# Network interface for debian
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
allow-hotplug eth0
auto eth0
iface eth0 inet static
pre-up /sbin/ethtool -K eth0 tso off
pre-up /sbin/ethtool -K eth0 gso off
...
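If several interfaces need the same treatment, the "pre-up" lines can be generated instead of typed by hand. This is a minimal sketch: pre_up_lines is a hypothetical helper, and the ethtool path and feature names are simply the ones used in the file above.

```python
def pre_up_lines(interface, features=("tso", "gso"), ethtool_path="/sbin/ethtool"):
    """Build the pre-up lines that turn the given offload features off."""
    return [
        f"pre-up {ethtool_path} -K {interface} {feature} off"
        for feature in features
    ]


for line in pre_up_lines("eth0"):
    print(line)
```

The printed lines go below the matching "iface ... inet static" line of each interface.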