dhcpcd-discuss

Re: >=dhcpcd-7.0.0 makes interface hang on high traffic

Remy Blank

Sun Sep 09 22:26:59 2018

> At least on 4.14 it recovers.

Well, yes, I agree it's better than hanging indefinitely, but it
prevents all downloads that are larger than a few MiB, so it's not
really that much of an improvement :)

> A quick google for this issue shows the e1000e Detected Hardware Unit 
> Hang error being reported since 2012, which pre-dates dhcpcd-7 by quite some way.
> Here's an informative one I looked at:
> https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang
> 
> This led me to:
> https://serverfault.com/questions/193114/linux-e1000e-intel-networking-driver-problems-galore-where-do-i-start
> 
> Those two threads should have enough information to fix the problem, 
> ranging from using ethtool to disable various features on the card, 
> trying a newly released driver (was tested on linux-4.15 to work by the 
> poster), upgrading the EEPROM from a buggy version or disabling advanced 
> power management on the card by passing in a kernel directive.

Thanks for the pointers. I tried each of the suggested fixes.

 - My card isn't one of those affected by the EEPROM issue.
 - Setting pcie_aspm=off doesn't change anything.
 - I don't have a BIOS option to disable C1E.
 - Updating the e1000e driver to 3.4.2.1-NAPI doesn't change anything
   (4.14.65 has 3.2.6-k).

 - ethtool -K int gso off gro off tso off

   This seems to fix the issue, but it comes with a significant
   reduction in throughput (~20%). I tried disabling the options
   individually, but all combinations other than "all on" or "all off"
   result in weird behavior.

 - ethtool -K int tx off rx off

   This also seems to fix the issue, without any noticeable throughput
   reduction. Actually, disabling "tx" alone is enough. Disabling only
   "rx" doesn't fix the issue.
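
To work through the gso/gro/tso combinations systematically rather than
by hand, the eight on/off permutations can be generated up front. This is
only a minimal sketch, assuming Python 3 is available on the router; the
helper names offload_commands and apply_offloads are made up for
illustration, and only the "ethtool -K" invocation itself comes from the
commands above.

```python
import itertools
import subprocess

FEATURES = ("gso", "gro", "tso")  # the three offloads toggled above


def offload_commands(ifname, features=FEATURES):
    """Yield one `ethtool -K` argument list per on/off combination."""
    for states in itertools.product(("on", "off"), repeat=len(features)):
        cmd = ["ethtool", "-K", ifname]
        for feature, state in zip(features, states):
            cmd += [feature, state]
        yield cmd


def apply_offloads(cmd):
    """Run one generated command (hypothetical helper).

    Needs root and a real interface, so nothing calls it automatically.
    """
    subprocess.run(cmd, check=True)
```

Each generated list can be printed with " ".join(cmd) and applied one at a
time, re-running the speed test between steps to see which combination
triggers the hang.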

>> The issue happens even if I kill dhcpcd after it has set up the network
>> interface, so it must be something in the way it configures the
>> interface, and not the process itself.
> 
> If this is true, it must be a kernel bug.
> Once dhcpcd exits, it closes all BPF sockets which will remove all BPF 
> setup in the kernel which dhcpcd created.

I agree that everything points at the kernel. But I'm still wondering
why this particular dhcpcd commit triggers the issue. If killing the
process clears the BPF setup, why would the issue persist after it exits?

> For reference, could you post ip output of addresses and routes (both 
> inet and inet6) from dhcpcd-6 and dhcpcd-7 please? There might be a 
> difference there - in fact I know there is one but I'd still like to see 
> it please.

Here's the IP configuration with dhcpcd-7 (actually, at exactly the
culprit commit, which was before dhcpcd-7), slightly redacted (removed
non-relevant interfaces, obfuscated addresses).

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ext: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0a:cd:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 178.xxx.xxx.xxx/21 brd 255.255.255.255 scope global ext
       valid_lft forever preferred_lft forever
    inet6 fe80::20a:cdff:fexx:xxxx/64 scope link
       valid_lft forever preferred_lft forever
4: int: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 50:7b:9d:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 172.16.xxx.xxx/24 brd 172.16.xxx.255 scope global int
       valid_lft forever preferred_lft forever
    inet6 2001:470:26:xxx::2/64 scope global
       valid_lft forever preferred_lft forever
    inet6 2001:470:26:xxx::1/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::527b:9dff:fexx:xxxx/64 scope link
       valid_lft forever preferred_lft forever

$ ip -4 route
default via 178.xxx.xxx.1 dev ext src 178.xxx.xxx.xxx metric 10
127.0.0.0/8 via 127.0.0.1 dev lo
172.16.xxx.0/24 dev int proto kernel scope link src 172.16.xxx.xxx
178.xxx.xxx.0/21 dev ext proto kernel scope link src 178.xxx.xxx.xxx metric 10

$ ip -6 route
2001:470:25:xxx::/64 dev he6 proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
fe80::/64 dev ext proto kernel metric 256 pref medium
fe80::/64 dev he6 proto kernel metric 256 pref medium
fe80::/64 dev int proto kernel metric 256 pref medium
default via 2001:470:25:xxx::1 dev he6 metric 1007 pref medium

The IP configuration with dhcpcd-6.11.5 is almost identical. The only
difference is that the link-local (fe80::/64) route for "ext" is listed
last among the link-local routes instead of first.

$ ip -6 route
2001:470:25:xxx::/64 dev he6 proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
fe80::/64 dev he6 proto kernel metric 256 pref medium
fe80::/64 dev int proto kernel metric 256 pref medium
fe80::/64 dev ext proto kernel metric 256 pref medium
default via 2001:470:25:xxx::1 dev he6 metric 1007 pref medium
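
The claim that the two IPv6 route listings differ only in ordering can be
checked mechanically. A minimal sketch, assuming Python 3; the route lines
are copied from the two redacted listings above, and the helper names are
made up for illustration.

```python
DHCPCD_7 = """\
2001:470:25:xxx::/64 dev he6 proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
fe80::/64 dev ext proto kernel metric 256 pref medium
fe80::/64 dev he6 proto kernel metric 256 pref medium
fe80::/64 dev int proto kernel metric 256 pref medium
default via 2001:470:25:xxx::1 dev he6 metric 1007 pref medium
"""

DHCPCD_6 = """\
2001:470:25:xxx::/64 dev he6 proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
2001:470:26:xxx::/64 dev int proto kernel metric 256 pref medium
fe80::/64 dev he6 proto kernel metric 256 pref medium
fe80::/64 dev int proto kernel metric 256 pref medium
fe80::/64 dev ext proto kernel metric 256 pref medium
default via 2001:470:25:xxx::1 dev he6 metric 1007 pref medium
"""


def routes(text):
    """Split `ip route` output into a list of non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def same_routes(a, b):
    """True if both listings hold the same routes, ignoring order."""
    return sorted(routes(a)) == sorted(routes(b))


def order_changes(a, b):
    """Return (index, line_a, line_b) for each position where the listings differ."""
    pairs = zip(routes(a), routes(b))
    return [(i, x, y) for i, (x, y) in enumerate(pairs) if x != y]
```

same_routes(DHCPCD_7, DHCPCD_6) confirms the route sets match, and
order_changes(DHCPCD_7, DHCPCD_6) lists only the three fe80::/64 positions.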

>> I have three other identical machines that have only a single network
>> interface, and those run dhcpcd-7.0.1 without any issues. So this must
>> be related to the routing between interfaces.
> 
> These machines also run e1000e interfaces?

Yes, they do. In fact, I run the speed-test from one of them, located on
the internal network. Their network configuration is different, though:
they use bonding between the Ethernet and WiFi interfaces (to seamlessly
switch between cable and wireless).

> It's possible it could be exposing a kernel bug in Linux BPF, but I do 
> doubt it. BPF only runs for a specific interface. dhcpcd will only set up 
> BPF for the interfaces it opens sockets on.

So that only leaves the e1000e driver, I guess.

> I don't have the hardware to investigate this as it sounds like a 
> hardware specific issue. So I can't do anything other than offer some 
> advice - try disabling parts of the interface as noted in the above two 
> URLs.

I'm now running dhcpcd-7.0.1 (current Gentoo stable) with "tx off", and
things look fine so far.

I'm still puzzled why this specific commit triggers the issue. I took a
closer look at the changes, and they really only seem to be BPF-related.
The only other change I see is opening the PF_PACKET socket with
SOCK_RAW instead of SOCK_DGRAM, but that too should not be relevant once
the process exits. Something modified in this commit must be persisting
in the kernel (and not specifically for the interface that was configured).
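
For reference, the SOCK_RAW versus SOCK_DGRAM distinction looks roughly
like this from user space. This is a hedged sketch in Python rather than
dhcpcd's actual C code: the function name open_packet_socket is
hypothetical, and ETH_P_ALL is chosen purely for illustration, not because
dhcpcd is known to use it. With SOCK_RAW the frames passed over the socket
include the link-layer header; with SOCK_DGRAM the header is stripped.

```python
import socket

ETH_P_ALL = 0x0003  # from <linux/if_ether.h>; illustrative protocol choice


def open_packet_socket(ifname, raw=True):
    """Open a PF_PACKET socket bound to ifname, or return None on failure.

    raw=True  -> SOCK_RAW: frames include the link-layer header.
    raw=False -> SOCK_DGRAM: the link-layer header is stripped.
    """
    family = getattr(socket, "AF_PACKET", None)  # only defined on Linux
    if family is None:
        return None
    kind = socket.SOCK_RAW if raw else socket.SOCK_DGRAM
    try:
        sock = socket.socket(family, kind, socket.htons(ETH_P_ALL))
    except OSError:  # includes PermissionError without CAP_NET_RAW
        return None
    try:
        sock.bind((ifname, 0))  # protocol 0 keeps the socket's own protocol
    except OSError:  # e.g. the interface does not exist
        sock.close()
        return None
    return sock
```

Closing the returned socket also releases any filter attached to it, which
is why neither socket type would be expected to leave state behind once
the process has exited.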

Anyway, thanks a lot for your help :)

-- Remy


References:
>=dhcpcd-7.0.0 makes interface hang on high traffic (Remy Blank)
Re: >=dhcpcd-7.0.0 makes interface hang on high traffic (Roy Marples)