Re: >=dhcpcd-7.0.0 makes interface hang on high traffic
Remy Blank
Sun Sep 09 22:26:59 2018

> At least on 4.14 it recovers.

Well, yes, I agree it's better than hanging indefinitely, but it prevents
all downloads that are larger than a few MiB, so it's not really that much
of an improvement :)

> A quick google for this issue shows the e1000e Detected Hardware Unit
> Hang error being reported since 2012, which pre-dates dhcpcd-7 by quite
> some way.
> Here's an informative one I looked at:
> https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang
>
> This led me to:
> https://serverfault.com/questions/193114/linux-e1000e-intel-networking-driver-problems-galore-where-do-i-start
>
> Those two threads should have enough information to fix the problem,
> ranging from using ethtool to disable various features on the card,
> trying a newly released driver (was tested on linux-4.15 to work by the
> poster), upgrading the EEPROM from a buggy version or disabling advanced
> power management on the card by passing in a kernel directive.

Thanks for the pointers. I tried each of the suggested fixes.

- My card isn't one of those affected by the EEPROM issue.

- Setting pcie_aspm=off doesn't change anything.

- I don't have a BIOS option to disable C1E.

- Updating the e1000e driver to 3.4.2.1-NAPI doesn't change anything
  (4.14.65 has 3.2.6-k).

- ethtool -K int gso off gro off tso off

  This seems to fix the issue, but it comes with a significant reduction
  in throughput (~20%). I tried disabling the options individually, but
  all combinations other than "all on" or "all off" result in weird
  behavior.

- ethtool -K int tx off rx off

  This also seems to fix the issue, without any noticeable throughput
  reduction. Actually, disabling "tx" is enough. Disabling only "rx"
  doesn't fix the issue.

>> The issue happens even if I kill dhcpcd after it has set up the network
>> interface, so it must be something in the way it configures the
>> interface, and not the process itself.
>
> If this is true, it must be a kernel bug.
> Once dhcpcd exits, it closes all BPF sockets which will remove all BPF
> setup in the kernel which dhcpcd created.

I agree that everything points at the kernel. But I'm still wondering why
this particular dhcpcd commit triggers the issue. If killing the process
clears the BPF setup, why would the issue persist after it exits?

> For reference, could you post ip output of addresses and routes (both
> inet and inet6) from dhcpcd-6 and dhcpcd-7 please? There might be a
> difference there - in fact I know there is one but I'd still like to see
> it please.

Here's the IP configuration with dhcpcd-7 (actually, at exactly the
culprit commit, which was before dhcpcd-7), slightly redacted (removed
non-relevant interfaces, obfuscated addresses).

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ext: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0a:cd:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 178.xxx.xxx.xxx/21 brd 255.255.255.255 scope global ext
       valid_lft forever preferred_lft forever
    inet6 fe80::20a:cdff:fexx:xxxx/64 scope link
       valid_lft forever preferred_lft forever
4: int: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 50:7b:9d:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 172.16.xxx.xxx/24 brd 172.16.xxx.255 scope global int
       valid_lft forever preferred_lft forever
    inet6 2001:470:26:xxx::2/64 scope global
       valid_lft forever preferred_lft forever
    inet6 2001:470:26:xxx::1/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::527b:9dff:fexx:xxxx/64 scope link
       valid_lft forever preferred_lft forever

$ ip -4 route
default via 178.xxx.xxx.1 dev ext src 178.xxx.xxx.xxx metric 10
127.0.0.0/8 via 127.0.0.1 dev lo
172.16.xxx.0/24 dev int proto kernel scope link src 172.16.xxx.xxx
178.xxx.xxx.0/21 dev ext proto kernel scope link src 178.xxx.xxx.xxx metric 10

$ ip -6 route
2001:470:25:xxx::/64 dev he6 proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
fe80::/64 dev ext proto kernel metric 256 pref medium
fe80::/64 dev he6 proto kernel metric 256 pref medium
fe80::/64 dev int proto kernel metric 256 pref medium
default via 2001:470:25:xxx::1 dev he6 metric 1007 pref medium

The IP configuration with dhcpcd-6.11.5 is almost identical. The only
difference is that the link-local route (fe80::/64) for "ext" comes last
among the link-local routes instead of first.

$ ip -6 route
2001:470:25:xxx::/64 dev he6 proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
fe80::/64 dev he6 proto kernel metric 256 pref medium
fe80::/64 dev int proto kernel metric 256 pref medium
fe80::/64 dev ext proto kernel metric 256 pref medium
default via 2001:470:25:xxx::1 dev he6 metric 1007 pref medium

>> I have three other identical machines that have only a single network
>> interface, and those run dhcpcd-7.0.1 without any issues. So this must
>> be related to the routing between interfaces.
>
> These machines also run e1000e interfaces?

Yes, they do. In fact, I run the speed-test from one of them, located on
the internal network. Their network configuration is different, though:
they use bonding between the Ethernet and WiFi interfaces (to seamlessly
switch between cable and wireless).

> It's possible it could be exposing a kernel bug in Linux BPF, but I do
> doubt it. BPF only runs for a specific interface. dhcpcd will only set up
> BPF for the interfaces it opens sockets on.

So that only leaves the e1000e driver, I guess.
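
Since the driver is the remaining suspect, the "tx off" workaround found
earlier in this message can be sketched as a small script. This is only an
illustration under stated assumptions: the function names and the DRY_RUN
switch are hypothetical, and "int" is simply the interface name used in this
thread. The "ethtool -K" call is the one from the thread; "ethtool -k" is
ethtool's option for listing the current offload settings.

```shell
#!/bin/sh
# Sketch only: workaround_cmds, apply_workaround and DRY_RUN are made-up
# names for illustration, not part of ethtool or dhcpcd.

# Print the commands for the "tx off" workaround on interface $1:
# first disable TX checksum offload, then list the offload settings.
workaround_cmds() {
    printf 'ethtool -K %s tx off\n' "$1"
    printf 'ethtool -k %s\n' "$1"
}

# Run the commands, or only print them when DRY_RUN is 1 (the default).
apply_workaround() {
    workaround_cmds "$1" | while IFS= read -r cmd; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            printf '%s\n' "$cmd"
        else
            $cmd    # intentionally unquoted: split into command and arguments
        fi
    done
}
```

Calling "apply_workaround int" prints the two commands; setting DRY_RUN=0
and running as root would execute them. Settings changed with ethtool are
generally not kept across reboots, so the call would have to be repeated
whenever the interface is brought up.
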
> I don't have the hardware to investigate this as it sounds like a
> hardware specific issue. So I can't do anything other than offer some
> advice - try disabling parts of the interface as noted in the above two
> URLs.

I'm now running dhcpcd-7.0.1 (current Gentoo stable) with "tx off", and
things look fine so far.

I'm still puzzled why this specific commit triggers the issue. I took a
closer look at the changes, and they really only seem to be BPF-related.
The only other change I see is opening the PF_PACKET socket with SOCK_RAW
instead of SOCK_DGRAM, but that too should not be relevant once the
process exits. Something modified in this commit must be persisting in
the kernel (and not specifically for the interface that was configured).

Anyway, thanks a lot for your help :)

-- Remy