Re: >=dhcpcd-7.0.0 makes interface hang on high traffic
Roy Marples
Sun Sep 09 09:28:48 2018
Hi Remy
Let me start off by saying thank you for a very detailed explanation of
the issue.
On 08/09/2018 22:55, Remy Blank wrote:
I would like to report an issue that I have been experiencing since
dhcpcd-7.0.0, where one interface of a machine acting as a router hangs
on high traffic. While this may not seem related to dhcpcd, it is fully
reproducible, and I have bisected the issue to a specific dhcpcd commit.
Here are details of my setup:
- ThinkPad P70, running Gentoo, with 2 network interfaces.
- The machine's internal network interface, called "int", is an "Intel
Corporation Ethernet Connection (2) I219-LM (rev 31)", driven by the
e1000e driver. It is connected to the internal network.
This is an important detail.
On Linux, the e1000 driver resets the PHY when interface MTU changes.
This is a bug in the driver/hardware that was reported years ago when
dhcpcd changed the interface MTU.
One of the goals of dhcpcd-7 was to not set the interface MTU at all -
rather the MTU of each route it creates, which has the same effect for
our purposes. The bonus here is that the e1000 no longer resets and we
can use the MTU.
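
To illustrate the difference between the two approaches, here is a rough
sketch using iproute2 command lines. It is not what dhcpcd does internally
(dhcpcd talks to the kernel directly rather than calling ip), and the gateway
address and MTU value are made-up examples.

```python
# Sketch only: two ways of applying an MTU with iproute2.
# The gateway and MTU value used below are assumptions for illustration.

def link_mtu_cmd(ifname, mtu):
    """Set the MTU on the interface itself (what dhcpcd-7 avoids doing)."""
    return ["ip", "link", "set", "dev", ifname, "mtu", str(mtu)]


def route_mtu_cmd(dest, gateway, ifname, mtu):
    """Attach the MTU to one route instead, leaving the link MTU untouched."""
    return ["ip", "route", "replace", dest, "via", gateway,
            "dev", ifname, "mtu", str(mtu)]


if __name__ == "__main__":
    # Only prints the commands; nothing is changed on the system.
    print(" ".join(link_mtu_cmd("ext", 1400)))
    print(" ".join(route_mtu_cmd("default", "192.0.2.1", "ext", 1400)))
```

Only the first form touches the link itself, which is what triggers the PHY
reset described above.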
- An additional network interface, called "ext", is connected as an
ExpressCard. It's a "Realtek Semiconductor Co., Ltd. RTL8111/8168/8411
PCI Express Gigabit Ethernet Controller (rev 03)", driven by the r8169
driver. It is connected to the internet.
- The "int" network has a static configuration, and is IPv4+IPv6.
- The "ext" network runs dhcpcd to get its configuration from the ISP,
and is IPv4 only. An IPv6 SIT tunnel runs over it, though.
- The machine routes internet traffic to and from machines on the
internal network.
- The internet connection is 500 Mb/s in, 50 Mb/s out.
Now the symptoms. When I run >=dhcpcd-7.0.0 on the "ext" interface, any
sustained high-bandwidth traffic between "int" and "ext" makes the "int"
interface hang. I can reproduce the issue very easily, by running an
internet speed test (e.g. <http://www.speedtest.net/>) on a machine
connected to "int". When this happens, the kernel logs the following:
e1000e 0000:00:1f.6 int: Detected Hardware Unit Hang:
TDH <25>
TDT <60>
next_to_use <60>
next_to_clean <24>
buffer_info[next_to_clean]:
time_stamp <1002ebe6c>
next_to_watch <26>
jiffies <1002ec3c0>
next_to_watch.status <0>
MAC Status <80083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
e1000e 0000:00:1f.6 int: Reset adapter unexpectedly
e1000e: int NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
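
When comparing several of these dumps it can help to pull the fields out
programmatically. The following is a minimal sketch that parses the
"name <value>" lines of a dump like the one above. It assumes the values are
hexadecimal (the time_stamp and jiffies values clearly are; treating the
others the same way is an assumption), and the function names are
hypothetical.

```python
import re

# Matches lines such as "TDH <25>" or "PHY 1000BASE-T Status <3800>".
FIELD_RE = re.compile(r"^\s*([\w .\[\]-]+?)\s*<([0-9a-fA-F]+)>\s*$")


def parse_hang_dump(lines):
    """Return {field name: value string} for the first hang dump found."""
    fields = {}
    in_dump = False
    for line in lines:
        if "Detected Hardware Unit Hang" in line:
            in_dump = True
            continue
        if not in_dump:
            continue
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
        elif line.strip().endswith(":"):
            continue  # sub-heading such as "buffer_info[next_to_clean]:"
        else:
            break  # first line that is not part of the dump
    return fields


def stalled_jiffies(fields):
    """Jiffies between the stuck buffer's time stamp and the report.

    Assumes both values are hexadecimal.
    """
    return int(fields["jiffies"], 16) - int(fields["time_stamp"], 16)
```

Feeding it the kernel log lines quoted above gives a dictionary keyed by
field name, which makes it easier to compare one hang report with the next.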
The issue is reproducible across a wide range of kernel versions (tested
with 4.4.117, 4.9.90 and 4.14.65). On 4.4 and 4.9 kernels, the interface
would hang indefinitely until I restarted it. On 4.14, it resets and
recovers by itself, but all open connections break, and traffic is
halted until the interface resets.
At least on 4.14 it recovers.
A quick Google search for this issue shows the e1000e Detected Hardware Unit
Hang error being reported since 2012, which pre-dates dhcpcd-7 by quite some way.
Here's an informative one I looked at:
https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang
This led me to:
https://serverfault.com/questions/193114/linux-e1000e-intel-networking-driver-problems-galore-where-do-i-start
Those two threads should have enough information to fix the problem,
ranging from using ethtool to disable various features on the card,
trying a newly released driver (the poster tested it as working on
linux-4.15), upgrading the EEPROM from a buggy version, or disabling
advanced power management on the card by passing in a kernel directive.
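
As a starting point for the ethtool route, here is a small sketch that builds
the commands to inspect and switch off some offload features on "int". Which
features to disable is an assumption on my part (tso, gso and gro are the
usual suspects in threads like these); the helper names are hypothetical and
nothing runs unless dry_run is turned off.

```python
import subprocess


def show_features_cmd(ifname):
    """Command to list the current offload settings of an interface."""
    return ["ethtool", "-k", ifname]


def offload_cmd(ifname, features=("tso", "gso", "gro"), state="off"):
    """Command to switch the given offload features on or off."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    cmd = ["ethtool", "-K", ifname]
    for feature in features:
        cmd += [feature, state]
    return cmd


def apply(cmd, dry_run=True):
    """Print the command, and run it only when dry_run is False (needs root)."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply(show_features_cmd("int"))
    apply(offload_cmd("int"))
```

Disabling one feature at a time and re-running the speed test should show
which setting, if any, makes the hang go away.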
Now, the strange thing is that "int" is *not* the interface on which
dhcpcd runs, it's the other one.
The issue happens even if I kill dhcpcd after it has set up the network
interface, so it must be something in the way it configures the
interface, and not the process itself.
If this is true, it must be a kernel bug.
Once dhcpcd exits, it closes all BPF sockets which will remove all BPF
setup in the kernel which dhcpcd created.
For reference, could you post ip output of addresses and routes (both
inet and inet6) from dhcpcd-6 and dhcpcd-7 please? There might be a
difference there - in fact I know there is one but I'd still like to see
it please.
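
One way to gather and compare that output is sketched below. The helper names
are hypothetical and the snapshot layout is just a suggestion: take one
snapshot while running each dhcpcd version, then diff the two.

```python
import difflib
import subprocess


def ip_commands():
    """The address and route listings for both inet and inet6."""
    return [
        ["ip", "-4", "addr", "show"],
        ["ip", "-6", "addr", "show"],
        ["ip", "-4", "route", "show"],
        ["ip", "-6", "route", "show"],
    ]


def _run(cmd):
    """Run one command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def capture(run=_run):
    """Return one text snapshot with each command followed by its output."""
    parts = []
    for cmd in ip_commands():
        parts.append("$ " + " ".join(cmd))
        parts.append(run(cmd).rstrip("\n"))
    return "\n".join(parts) + "\n"


def diff_snapshots(old, new, old_name="dhcpcd-6", new_name="dhcpcd-7"):
    """Unified diff between two snapshots taken with capture()."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    ))
```

Save the result of capture() to a file under each version and pass the two
file contents to diff_snapshots() to see exactly what changed.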
dhcpcd-6.11.5 does not exhibit these symptoms, and even prolonged
high-bandwidth traffic doesn't cause any issues. I have also tried
dhcpcd-7.0.8, and the issue is still present there.
dhcpcd-7 does set up the interface, routing and addressing
differently from dhcpcd-6.
dhcpcd-6 has a fairly big hammer approach to setting up routing -
dhcpcd-7 does not, but this relies on a modern linux (4.13 or newer
IIRC) where we disable the kernel generating prefix/subnet routes per
address (older linux could do this for IPv6, 4.13 iirc allowed it for
IPv4 as well).
I have three other identical machines that have only a single network
interface, and those run dhcpcd-7.0.1 without any issues. So this must
be related to the routing between interfaces.
These machines also run e1000e interfaces?
I have bisected the range from 6.11.5 to 7.0.0, and git gave me the
following culprit:
Rename if_*raw functions to bpf_* so it's more descriptive and move
https://roy.marples.name/git/dhcpcd.git/commit/?id=88047988bb0055cbce02e4cece210c8289f4ffa6
This is a large commit making fairly major changes to the BPF code. This
might explain why dhcpcd affects another interface that it isn't
controlling directly. Maybe a bug in the BPF code causes an infinite
loop under certain conditions, making the interface hang. Or the new BPF
code triggers a kernel bug.
It's possible it could be exposing a kernel bug in Linux BPF, but I do
doubt it. BPF only runs for a specific interface. dhcpcd will only set up
BPF for the interfaces it opens sockets on.
Unfortunately, this is the limit of my investigation skills, as I have
zero knowledge of BPF "assembly", so I'm unable to pinpoint the cause
more precisely. But I'm happy to perform more tests, or try out patches
that could fix the issue or provide more information about it.
I don't have the hardware to investigate this as it sounds like a
hardware-specific issue. So I can't do anything other than offer some
advice - try disabling parts of the interface as noted in the above two
URLs.
Let me know how it works out for you.
Thanks for a great piece of software!
Thanks!
Roy