Re: problem with "nobackground" mode on NetBSD current?
Rob Newberry
Fri Sep 11 01:20:23 2020
I've debugged this down to something unexpected happening in pidfile_lock, which gets called around line 2345 of dhcpcd.c.
But the unexpected part is happening inside the pidfile_lock code in NetBSD's libutil.
It LOOKS to me like pidfile_lock is being called early on (around line 2250).
But by the time we call it a second time, we've forked and (maybe?) switched users.
And this time, when we call pidfile_lock, it goes through a different code path. Looking at libutil's pidfile.c sources, it is finding that pidfile_fd is valid (i.e., it's not -1), and then trying to truncate it.
But this time (after debugging in the kernel), the pidfile_fd file descriptor's "write" permission is gone. And so the call to "ftruncate" fails (in the kernel check for "fp->f_flag & FWRITE"). And that bubbles up to cause the failure.
To "fix" the issue, I added a call to "pidfile_clean" shortly after the error handling in the first call to pidfile_lock. With that change, the second call to pidfile_lock no longer goes through the "it's already open" path, and since the pidfile isn't open, it opens and locks it and everything works fine.
But I don't know if that's the RIGHT fix. There's now a window of time where the original process (which checked and locked the pidfile) and the child (actually, grand-child I believe) process (re-)locks it. I don't know if that's bad or not.
Rob
> On Sep 10, 2020, at 3:37 PM, Rob Newberry <robthedude@xxxxxxx> wrote:
>
> Hi.
>
> I'm building my own "embedded" NetBSD system images (running on a Raspberry Pi), and running dhcpcd under the control of my own network management process.
>
> That network management process does the following:
>
> - runs (and allows to exit) "ifconfig" to enable bwfm0 (I haven't debugged why wpa_supplicant doesn't do this on its own, but if I don't do this, then I can't even scan)
>
> /sbin/ifconfig bwfm0 up
>
> - writes a simple /etc/wpa_supplicant.conf file
>
> - runs wpa_supplicant (which doesn't fork by default, and I capture it's output into a file)
>
> /sbin/wpa_supplicant -i bwfm0 -c /etc/wpa_supplicant.conf -d -d
>
> - runs dhcpcd in "nobackground" mode (I also capture it's output to a file)
>
> /sbin/dhcpcd -d -B bwfm0
>
> When it runs, dhcpcd dies with a few error messages:
>
> Sep 10 04:26:08 dhcpcd[824]: main: pidfile_lock -1: Invalid argument
> Sep 10 04:26:08 dhcpcd[824]: ps_sendpsmmsg: Bad file descriptor
> Sep 10 04:26:08 dhcpcd[824]: main: control_stop: Bad file descriptor
> Sep 10 04:26:08 dhcpcd[824]: ps_sendpsmmsg: Bad file descriptor
>
> Now, first off, this didn't happen "before" -- and by "before" I mean that I was working on this a few months ago, not long after NetBSD 9.0 shipped. Certainly before the new dhcpcd was integrated -- but I'm sure there are plenty of other changes in NetBSD-current right now.
>
> I've debugged a little bit, checking for all the obvious stuff -- "/var/run/dhcpcd" is present:
>
> # ls -l /var/run/dhcpcd
> drwxrwxrwx 3 root wheel 512 Sep 10 04:27 /var/run/dhcpcd
>
> Prior to starting up, the directory is empty. But at the time of death, the "pid" file that it's complaining about is there:
>
> # ls -l /var/run/dhcpcd/
> total 1
> -rw-r--r-- 1 root wheel 4 Sep 10 04:26 bwfm0.pid
>
> # cat /var/run/dhcpcd/bwfm0.pid
> 821
>
> # ps ax | grep dhcpcd
> 821 tty00 Z 0:00.00 (dhcpcd)
>
> And the "db" directory is there as well (but empty):
>
> # ls -ld /var/db/dhcpcd
> drwxrwxrwx 2 root wheel 512 Sep 10 04:25 /var/db/dhcpcd
>
> What's REALLY interesting to me, though, is that if, in this current state, I run dhcpcd AGAIN, directly from the command line, with the same arguments, it works just fine:
>
> # /sbin/dhcpcd -d -B bwfm0
> dhcpcd-9.2.0 starting
> spawned privileged actioneer on PID 938
> spawned network proxy on PID 1133
> spawned controller proxy on PID 934
> chrooting to `/var/chroot/dhcpcd' as _dhcpcd
> bwfm0: executing `/libexec/dhcpcd-run-hooks' PREINIT
> spawned listener fe80::ba27:ebff:fedb:539 on PID 986
> bwfm0: connected to Access Point `dudenet'
> bwfm0: executing `/libexec/dhcpcd-run-hooks' CARRIER
> DUID 00:01:00:01:26:ec:69:6f:b8:27:eb:db:05:39
> bwfm0: IAID eb:db:05:39
> bwfm0: delaying IPv6 router solicitation for 0.3 seconds
> bwfm0: delaying IPv4 for 0.9 seconds
> bwfm0: soliciting an IPv6 router
> bwfm0: sending Router Solicitation
> bwfm0: reading lease `/var/db/dhcpcd/bwfm0-dudenet.lease'
> bwfm0: soliciting a DHCP lease
> bwfm0: sending DISCOVER (xid 0xd86ffd41), next in 4.4 seconds
> bwfm0: offered 10.0.1.141 from 10.0.1.1
> bwfm0: sending REQUEST (xid 0xd86ffd41), next in 3.7 seconds
> bwfm0: acknowledged 10.0.1.141 from 10.0.1.1
> bwfm0: adding IP address 10.0.1.141/24 broadcast 10.0.1.255
> bwfm0: sending Router Solicitation
> bwfm0: DAD completed for 10.0.1.141
> bwfm0: leased 10.0.1.141 for 86400 seconds
> bwfm0: renew in 43200 seconds, rebind in 75600 seconds
> bwfm0: writing lease `/var/db/dhcpcd/bwfm0-dudenet.lease'
> bwfm0: IP address 10.0.1.141/24 already exists
> bwfm0: adding route to 10.0.1.0/24
> bwfm0: adding default route via 10.0.1.1
> bwfm0: ARP announcing 10.0.1.141 (1 of 2), next in 2.0 seconds
> bwfm0: executing `/libexec/dhcpcd-run-hooks' BOUND
> spawned listener 10.0.1.141 on PID 954
> bwfm0: sending Router Solicitation
> bwfm0: ARP announcing 10.0.1.141 (2 of 2)
> bwfm0: sending Router Solicitation
> bwfm0: no IPv6 Routers available
> ^Z[2] + Suspended /sbin/dhcpcd -d -B bwfm0
> # bg
> [2] /sbin/dhcpcd -d -B bwfm0
>
>
> # ps ax | grep dhcpc
> 934 ? I 0:00.01 dhcpcd: [control proxy] bwfm0 [ip4] [ip6]
> 936 ? I 0:00.03 dhcpcd: bwfm0 [ip4] [ip6]
> 938 ? I 0:00.02 dhcpcd: [privileged actioneer] bwfm0 [ip4] [ip6]
> 954 ? I 0:00.01 dhcpcd: [network proxy] 10.0.1.141
> 986 ? I 0:00.01 dhcpcd: [network proxy] fe80::ba27:ebff:fedb:539
> 1133 ? I 0:00.00 dhcpcd: [network proxy] bwfm0 [ip4] [ip6]
> 821 tty00 Z 0:00.00 (dhcpcd)
> 933 tty00 S 0:00.02 dhcpcd: [launcher]
>
>
> I'm currently trying to debug what's going on, basically adding some logging around pid_file lock (and I might even need to add some debugging inside libutil and pid_file lock).
>
> But before I dive in, I'm wondering if anyone here has any easier things to try. I'm not all that familiar with this code -- don't mind getting my hands dirty, but if there's any "did you check this?" or "it's probably this" ideas, I'm glad have them.
>
> Thanks a ton for any guidance.
>
> Rob
>
>
Archive administrator: postmaster@marples.name