dhcpcd-discuss

Re: problem with "nobackground" mode on NetBSD current?

Rob Newberry

Fri Sep 11 01:20:23 2020

I've debugged this down to something unexpected happening in pidfile_lock, which gets called around line 2345 of dhcpcd.c.  

But the unexpected part is happening inside the pidfile_lock code in NetBSD's libutil.

It LOOKS to me like pidfile_lock is being called early on (around line 2250).

But by the time we call it a second time, we've forked and (maybe?) switched users.

And this time, when we call pidfile_lock, it goes through a different code path.  Looking at libutil's pidfile.c sources, it is finding that pidfile_fd is valid (i.e., it's not -1), and then trying to truncate it.  

But this time (after debugging in the kernel), the pidfile_fd file descriptor's "write" permission is gone.  And so the call to "ftruncate" fails (in the kernel check for "fp->f_flag & FWRITE").  And that bubbles up to cause the failure.

To "fix" the issue, I added a call to "pidfile_clean" shortly after the error handling in the first call to pidfile_lock.  With that change, the second call to pidfile_lock no longer goes through the "it's already open" path, and since the pidfile isn't open, it opens and locks it and everything works fine.

But I don't know if that's the RIGHT fix.  There's now a window of time where the original process (which checked and locked the pidfile) and the child (actually, grand-child I believe) process (re-)locks it.  I don't know if that's bad or not.

Rob




> On Sep 10, 2020, at 3:37 PM, Rob Newberry <robthedude@xxxxxxx> wrote:
> 
> Hi.
> 
> I'm building my own "embedded" NetBSD system images (running on a Raspberry Pi), and running dhcpcd under the control of my own network management process.
> 
> That network management process does the following:
> 
> 	- runs (and allows to exit) "ifconfig" to enable bwfm0 (I haven't debugged why wpa_supplicant doesn't do this on its own, but if I don't do this, then I can't even scan)
> 
> 		/sbin/ifconfig bwfm0 up
> 
> 	- writes a simple /etc/wpa_supplicant.conf file
> 
> 	- runs wpa_supplicant (which doesn't fork by default, and I capture it's output into a file)
> 
> 		/sbin/wpa_supplicant -i bwfm0 -c /etc/wpa_supplicant.conf -d -d
> 
> 	- runs dhcpcd in "nobackground" mode (I also capture it's output to a file)
> 
> 		/sbin/dhcpcd -d -B bwfm0
> 
> When it runs, dhcpcd dies with a few error messages:
> 
> 	Sep 10 04:26:08  dhcpcd[824]: main: pidfile_lock -1: Invalid argument
> 	Sep 10 04:26:08  dhcpcd[824]: ps_sendpsmmsg: Bad file descriptor
> 	Sep 10 04:26:08  dhcpcd[824]: main: control_stop: Bad file descriptor
> 	Sep 10 04:26:08  dhcpcd[824]: ps_sendpsmmsg: Bad file descriptor
> 
> Now, first off, this didn't happen "before" -- and by "before" I mean that I was working on this a few months ago, not long after NetBSD 9.0 shipped.  Certainly before the new dhcpcd was integrated -- but I'm sure there are plenty of other changes in NetBSD-current right now.
> 
> I've debugged a little bit, checking for all the obvious stuff -- "/var/run/dhcpcd" is present:
> 
> 	# ls -l /var/run/dhcpcd
> 	drwxrwxrwx  3 root  wheel  512 Sep 10 04:27 /var/run/dhcpcd
> 
> Prior to starting up, the directory is empty.  But at the time of death, the "pid" file that it's complaining about is there:
> 
> 	# ls -l /var/run/dhcpcd/
> 	total 1
> 	-rw-r--r--  1 root  wheel  4 Sep 10 04:26 bwfm0.pid
> 
> 	# cat /var/run/dhcpcd/bwfm0.pid 
> 	821
> 
> 	# ps ax | grep dhcpcd
> 	821 tty00 Z    0:00.00 (dhcpcd)
> 
> And the "db" directory is there as well (but empty):
> 
> 	# ls -ld /var/db/dhcpcd
> 	drwxrwxrwx  2 root  wheel  512 Sep 10 04:25 /var/db/dhcpcd
> 
> What's REALLY interesting to me, though, is that if, in this current state, I run dhcpcd AGAIN, directly from the command line, with the same arguments, it works just fine:
> 
> 	# /sbin/dhcpcd -d -B bwfm0
> 	dhcpcd-9.2.0 starting
> 	spawned privileged actioneer on PID 938
> 	spawned network proxy on PID 1133
> 	spawned controller proxy on PID 934
> 	chrooting to `/var/chroot/dhcpcd' as _dhcpcd
> 	bwfm0: executing `/libexec/dhcpcd-run-hooks' PREINIT
> 	spawned listener fe80::ba27:ebff:fedb:539 on PID 986
> 	bwfm0: connected to Access Point `dudenet'
> 	bwfm0: executing `/libexec/dhcpcd-run-hooks' CARRIER
> 	DUID 00:01:00:01:26:ec:69:6f:b8:27:eb:db:05:39
> 	bwfm0: IAID eb:db:05:39
> 	bwfm0: delaying IPv6 router solicitation for 0.3 seconds
> 	bwfm0: delaying IPv4 for 0.9 seconds
> 	bwfm0: soliciting an IPv6 router
> 	bwfm0: sending Router Solicitation
> 	bwfm0: reading lease `/var/db/dhcpcd/bwfm0-dudenet.lease'
> 	bwfm0: soliciting a DHCP lease
> 	bwfm0: sending DISCOVER (xid 0xd86ffd41), next in 4.4 seconds
> 	bwfm0: offered 10.0.1.141 from 10.0.1.1
> 	bwfm0: sending REQUEST (xid 0xd86ffd41), next in 3.7 seconds
> 	bwfm0: acknowledged 10.0.1.141 from 10.0.1.1
> 	bwfm0: adding IP address 10.0.1.141/24 broadcast 10.0.1.255
> 	bwfm0: sending Router Solicitation
> 	bwfm0: DAD completed for 10.0.1.141
> 	bwfm0: leased 10.0.1.141 for 86400 seconds
> 	bwfm0: renew in 43200 seconds, rebind in 75600 seconds
> 	bwfm0: writing lease `/var/db/dhcpcd/bwfm0-dudenet.lease'
> 	bwfm0: IP address 10.0.1.141/24 already exists
> 	bwfm0: adding route to 10.0.1.0/24
> 	bwfm0: adding default route via 10.0.1.1
> 	bwfm0: ARP announcing 10.0.1.141 (1 of 2), next in 2.0 seconds
> 	bwfm0: executing `/libexec/dhcpcd-run-hooks' BOUND
> 	spawned listener 10.0.1.141 on PID 954
> 	bwfm0: sending Router Solicitation
> 	bwfm0: ARP announcing 10.0.1.141 (2 of 2)
> 	bwfm0: sending Router Solicitation
> 	bwfm0: no IPv6 Routers available
> 	^Z[2] + Suspended               /sbin/dhcpcd -d -B bwfm0
> 	# bg
> 	[2] /sbin/dhcpcd -d -B bwfm0
> 
> 
> 	# ps ax | grep dhcpc
> 	 934 ?     I    0:00.01 dhcpcd: [control proxy] bwfm0 [ip4] [ip6] 
> 	 936 ?     I    0:00.03 dhcpcd: bwfm0 [ip4] [ip6] 
> 	 938 ?     I    0:00.02 dhcpcd: [privileged actioneer] bwfm0 [ip4] [ip6] 
> 	 954 ?     I    0:00.01 dhcpcd: [network proxy] 10.0.1.141 
> 	 986 ?     I    0:00.01 dhcpcd: [network proxy] fe80::ba27:ebff:fedb:539 
> 	1133 ?     I    0:00.00 dhcpcd: [network proxy] bwfm0 [ip4] [ip6] 
> 	 821 tty00 Z    0:00.00 (dhcpcd)
> 	 933 tty00 S    0:00.02 dhcpcd: [launcher] 
> 
> 
> I'm currently trying to debug what's going on, basically adding some logging around pid_file lock (and I might even need to add some debugging inside libutil and pid_file lock).
> 
> But before I dive in, I'm wondering if anyone here has any easier things to try.  I'm not all that familiar with this code -- don't mind getting my hands dirty, but if there's any "did you check this?" or "it's probably this" ideas, I'm glad have them.
> 
> Thanks a ton for any guidance.
> 
> Rob
> 
> 


Follow-Ups:
Re: problem with "nobackground" mode on NetBSD current?Roy Marples
References:
problem with "nobackground" mode on NetBSD current?Rob Newberry
Archive administrator: postmaster@marples.name