dhcpcd-discuss

>=dhcpcd-7.0.0 makes interface hang on high traffic

Remy Blank

Sat Sep 08 22:04:57 2018

Hello,

I would like to report an issue that I have been experiencing since
dhcpcd-7.0.0, where one interface of a machine acting as a router hangs
on high traffic. While this may not seem related to dhcpcd, it is fully
reproducible, and I have bisected the issue to a specific dhcpcd commit.


Here are details of my setup:

 - ThinkPad P70, running Gentoo, with 2 network interfaces.

 - The machine's internal network interface, called "int", is an "Intel
Corporation Ethernet Connection (2) I219-LM (rev 31)", driven by the
e1000e driver. It is connected to the internal network.

 - An additional network interface, called "ext", is connected as an
ExpressCard. It's a "Realtek Semiconductor Co., Ltd. RTL8111/8168/8411
PCI Express Gigabit Ethernet Controller (rev 03)", driven by the r8169
driver. It is connected to the internet.

 - The "int" network has a static configuration, and is IPv4+IPv6.

 - The "ext" network runs dhcpcd to get its configuration from the ISP,
and is IPv4 only. An IPv6 SIT tunnel runs over it, though.

 - The machine routes internet traffic to and from machines on the
internal network.

 - The internet connection is 500 Mb/s in, 50 Mb/s out.


Now the symptoms. When I run >=dhcpcd-7.0.0 on the "ext" interface, any
sustained high-bandwidth traffic between "int" and "ext" makes the "int"
interface hang. I can reproduce the issue very easily, by running an
internet speed test (e.g. <http://www.speedtest.net/>) on a machine
connected to "int". When this happens, the kernel logs the following:

e1000e 0000:00:1f.6 int: Detected Hardware Unit Hang:
  TDH                  <25>
  TDT                  <60>
  next_to_use          <60>
  next_to_clean        <24>
buffer_info[next_to_clean]:
  time_stamp           <1002ebe6c>
  next_to_watch        <26>
  jiffies              <1002ec3c0>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e 0000:00:1f.6 int: Reset adapter unexpectedly
e1000e: int NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
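The hang report can be read roughly as follows. TDH and TDT are the hardware head and driver tail indices of the transmit descriptor ring, so (TDT - TDH) modulo the ring size is the number of queued descriptors the NIC has not yet processed. Assuming time_stamp is the jiffies value recorded when the oldest pending buffer was queued, jiffies - time_stamp is how long that buffer has been waiting. This is a minimal sketch under stated assumptions: the values are hexadecimal (the 64-bit jiffies values clearly are), the transmit ring has 256 entries (believed to be the e1000e default, configurable with `ethtool -G`), and CONFIG_HZ is unknown, so several common HZ values are tried. The function names are hypothetical.

```python
# Rough reading of the e1000e "Detected Hardware Unit Hang" report.
# Assumptions (not taken from the report): values are hexadecimal, the
# transmit ring has 256 descriptors, and CONFIG_HZ is one of the common values.


def pending_descriptors(tdh: int, tdt: int, ring_size: int = 256) -> int:
    """Descriptors queued by the driver (tail, TDT) that the hardware has not
    yet processed (head, TDH), accounting for ring wrap-around."""
    return (tdt - tdh) % ring_size


def stall_seconds(time_stamp: int, jiffies: int, hz: int) -> float:
    """How long the oldest unfinished transmit buffer has been waiting,
    converted from jiffies to seconds for a given HZ."""
    return (jiffies - time_stamp) / hz


# Values copied from the kernel log above.
TDH, TDT = 0x25, 0x60
TIME_STAMP, JIFFIES = 0x1002EBE6C, 0x1002EC3C0

if __name__ == "__main__":
    print("descriptors pending:", pending_descriptors(TDH, TDT))
    for hz in (100, 250, 300, 1000):  # common CONFIG_HZ choices (assumed)
        secs = stall_seconds(TIME_STAMP, JIFFIES, hz)
        print(f"HZ={hz}: oldest buffer waiting {secs:.2f} s")
```

With these values, 0x60 - 0x25 = 59 descriptors are outstanding, and the oldest buffer is 0x554 = 1364 jiffies old: about 1.4 s at HZ=1000 or 5.5 s at HZ=250.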

The issue is reproducible across several kernel series (tested with
4.4.117, 4.9.90 and 4.14.65). On 4.4 and 4.9, the interface hangs
indefinitely until I restart it. On 4.14, the adapter resets and
recovers by itself, but traffic stops until the reset completes, and
all open connections break.

Now, the strange thing is that "int" is *not* the interface on which
dhcpcd runs; dhcpcd runs on "ext".

The issue happens even if I kill dhcpcd after it has configured "ext",
so it must be caused by the way dhcpcd configures the interface rather
than by the running process itself.

dhcpcd-6.11.5 does not exhibit these symptoms, and even prolonged
high-bandwidth traffic doesn't cause any issues. I have also tried
dhcpcd-7.0.8, and the issue is still present there.

I have three otherwise identical machines that have only a single
network interface, and those run dhcpcd-7.0.1 without any issues. So
this appears to be related to routing between interfaces.


I have bisected the range from 6.11.5 to 7.0.0, and git gave me the
following culprit:

Rename if_*raw functions to bpf_* so it's more descriptive and move
https://roy.marples.name/git/dhcpcd.git/commit/?id=88047988bb0055cbce02e4cece210c8289f4ffa6
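git bisect narrows a regression range by binary search: each step tests a commit halfway between the last known-good and the first known-bad one, so a range of N commits needs only about log2(N) builds and speed tests. The Python sketch below shows that idea under a simplifying assumption: history is a linear list of commits, and once the regression appears it stays (real git bisect also handles merges). `first_bad` and `is_bad` are hypothetical names; `is_bad` stands in for building dhcpcd at a commit and running the speed test through "int".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is True.

    Assumes commits[0] is known good, commits[-1] is known bad, and the
    history is linear (a simplification of what git bisect handles).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):  # e.g. build this commit, run the speed test
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

Each call to `is_bad` corresponds to one "git bisect good" or "git bisect bad" verdict.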

This is a large commit making fairly major changes to the BPF code. This
might explain why dhcpcd affects another interface that it isn't
controlling directly. Maybe a bug in the BPF code causes an infinite
loop under certain conditions, making the interface hang. Or the new BPF
code triggers a kernel bug.
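For readers who, like the reporter, have not worked with BPF "assembly": a classic BPF program is a short list of instructions that the kernel runs against each packet arriving on a socket, using an accumulator A and an index register X, and returning how many bytes of the packet to accept (0 drops it). dhcpcd presumably attaches filters of this kind to raw sockets to receive DHCP traffic. The Python sketch below interprets a tiny, made-up subset of classic BPF (symbolic mnemonics, not the kernel's numeric encoding) and runs an illustrative filter for unfragmented IPv4/UDP packets to the DHCP client port 68. The filter and the helpers `check`, `run_filter` and `make_packet` are hypothetical and are not dhcpcd's actual BPF code. `check` mirrors the loader rules that jumps go forward, stay within the program, and that the program ends with a return, so a classic BPF filter accepted by the kernel always runs to completion.

```python
import struct

ACCEPT_ALL = 0xFFFFFFFF  # socket filter return value: bytes of packet to keep


def check(prog):
    """Static checks modeled on the classic-BPF loader: jump offsets are
    relative to the next instruction, non-negative (forward only) and in
    bounds, and the program must end with a ret."""
    for i, (op, *args) in enumerate(prog):
        if op in ("jeq", "jset"):
            _, jt, jf = args
            for off in (jt, jf):
                if off < 0 or i + 1 + off >= len(prog):
                    raise ValueError(f"invalid jump at instruction {i}")
    if not prog or prog[-1][0] != "ret":
        raise ValueError("program must end with ret")


def run_filter(prog, pkt: bytes) -> int:
    """Interpret a small subset of classic BPF with accumulator A and index X."""
    a = x = 0
    pc = 0
    while pc < len(prog):
        op, *args = prog[pc]
        pc += 1
        try:
            if op == "ldh":         # A = 16-bit big-endian load at offset k
                (a,) = struct.unpack_from("!H", pkt, args[0])
            elif op == "ldh_ind":   # A = 16-bit load at offset X + k
                (a,) = struct.unpack_from("!H", pkt, x + args[0])
            elif op == "ldb":       # A = byte at offset k
                a = pkt[args[0]]
            elif op == "ldxb_msh":  # X = 4 * (pkt[k] & 0xf): IPv4 header length
                x = 4 * (pkt[args[0]] & 0x0F)
            elif op == "jeq":       # skip jt instructions if A == k, else jf
                k, jt, jf = args
                pc += jt if a == k else jf
            elif op == "jset":      # skip jt instructions if A & k, else jf
                k, jt, jf = args
                pc += jt if a & k else jf
            elif op == "ret":
                return args[0]
            else:
                raise ValueError(f"unsupported instruction {op!r}")
        except (IndexError, struct.error):
            return 0  # load past the end of the packet: drop it
    return 0


# Illustrative filter (not dhcpcd's): accept unfragmented IPv4/UDP to port 68.
DHCP_CLIENT_FILTER = [
    ("ldh", 12),               # 0: EtherType
    ("jeq", 0x0800, 0, 8),     # 1: IPv4? otherwise go to 10 (drop)
    ("ldb", 23),               # 2: IP protocol
    ("jeq", 17, 0, 6),         # 3: UDP? otherwise drop
    ("ldh", 20),               # 4: IP flags + fragment offset
    ("jset", 0x1FFF, 4, 0),    # 5: non-first fragment? drop
    ("ldxb_msh", 14),          # 6: X = IP header length
    ("ldh_ind", 16),           # 7: UDP destination port (14 + X + 2)
    ("jeq", 68, 0, 1),         # 8: BOOTP/DHCP client port?
    ("ret", ACCEPT_ALL),       # 9: accept
    ("ret", 0),                # 10: drop
]


def make_packet(ethertype=0x0800, proto=17, dport=68, frag=0):
    """Build a minimal Ethernet + IPv4 (no options) + UDP packet for testing."""
    eth = b"\xff" * 6 + bytes(6) + struct.pack("!H", ethertype)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, frag, 64, proto, 0,
                     bytes(4), b"\xff" * 4)
    udp = struct.pack("!HHHH", 67, dport, 8, 0)
    return eth + ip + udp


check(DHCP_CLIENT_FILTER)
print(run_filter(DHCP_CLIENT_FILTER, make_packet()) == ACCEPT_ALL)
```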


Unfortunately, this is the limit of my investigation skills, as I have
zero knowledge of BPF "assembly", so I'm unable to pinpoint the cause
more precisely. But I'm happy to perform more tests, or try out patches
that could fix the issue or provide more information about it.


Thanks for a great piece of software!

-- Remy



Follow-Ups:
Re: >=dhcpcd-7.0.0 makes interface hang on high traffic (Roy Marples)