dhcpcd-discuss

Re: problem with "nobackground" mode on NetBSD current?

Rob Newberry

Fri Sep 11 15:05:01 2020

I'll answer a couple questions here, but since you also responded to my second message (after I'd done some debugging), I'll save more for there.

> On 10/09/2020 23:37, Rob Newberry wrote:
>> I'm building my own "embedded" NetBSD system images (running on a Raspberry Pi), and running dhcpcd under the control of my own network management process.
>> That network management process does the following:
>> 	- runs (and allows to exit) "ifconfig" to enable bwfm0 (I haven't debugged why wpa_supplicant doesn't do this on its own, but if I don't do this, then I can't even scan)
>> 		/sbin/ifconfig bwfm0 up
> 
> There are reasons why wpa_supplicant does not do this, but that's fine. It should still work because by default dhcpcd *will* do this and wpa_supplicant will then start scanning.

Yes, I remember now.  I had this in my code before I got dhcpcd working last year, and then took it out -- but when I ran into this new issue, dhcpcd died before getting very far, so I had to put it back in.  I'll remove it once I get dhcpcd working.

(Would appreciate more insight on why wpa_supplicant does not do this, but it's fine offline.)

> 
>> 	- writes a simple /etc/wpa_supplicant.conf file
>> 	- runs wpa_supplicant (which doesn't fork by default, and I capture it's output into a file)
>> 		/sbin/wpa_supplicant -i bwfm0 -c /etc/wpa_supplicant.conf -d -d
>> 	- runs dhcpcd in "nobackground" mode (I also capture it's output to a file)
>> 		/sbin/dhcpcd -d -B bwfm0
> 
> 
> Why? Both wpa_supplicant and dhcpcd have options to redirect output to files and should work fine when backgrounded.
> I can see that you're using dhcpcd-9.2.0 so you already have the stdio fixes.
> Anyway....

Because I'm running my own "custom" embedded system, and I want my management process to know when either one of them dies.

>> When it runs, dhcpcd dies with a few error messages:
>> 	Sep 10 04:26:08  dhcpcd[824]: main: pidfile_lock -1: Invalid argument
>> 	Sep 10 04:26:08  dhcpcd[824]: ps_sendpsmmsg: Bad file descriptor
>> 	Sep 10 04:26:08  dhcpcd[824]: main: control_stop: Bad file descriptor
>> 	Sep 10 04:26:08  dhcpcd[824]: ps_sendpsmmsg: Bad file descriptor
>> Now, first off, this didn't happen "before" -- and by "before" I mean that I was working on this a few months ago, not long after NetBSD 9.0 shipped.  Certainly before the new dhcpcd was integrated -- but I'm sure there are plenty of other changes in NetBSD-current right now.
> 
> dhcpcd with priviledge separation will be new for NetBSD-10.
> I've comitted a fix here which tidies the errors up a litte, so you should just get the pidfile_lock error now.
> https://roy.marples.name/cgit/dhcpcd.git/commit/?id=f4b5b1b9497015888c9dc69dc6428fe71f983ab1
> 
>> I've debugged a little bit, checking for all the obvious stuff -- "/var/run/dhcpcd" is present:
>> 	# ls -l /var/run/dhcpcd
>> 	drwxrwxrwx  3 root  wheel  512 Sep 10 04:27 /var/run/dhcpcd
>> Prior to starting up, the directory is empty.  But at the time of death, the "pid" file that it's complaining about is there:
>> 	# ls -l /var/run/dhcpcd/
>> 	total 1
>> 	-rw-r--r--  1 root  wheel  4 Sep 10 04:26 bwfm0.pid
>> 	# cat /var/run/dhcpcd/bwfm0.pid
>> 	821
>> 	# ps ax | grep dhcpcd
>> 	821 tty00 Z    0:00.00 (dhcpcd)
>> And the "db" directory is there as well (but empty):
>> 	# ls -ld /var/db/dhcpcd
>> 	drwxrwxrwx  2 root  wheel  512 Sep 10 04:25 /var/db/dhcpcd
> 
> This looks wrong:
> $ ls -ld /var/db/dhcpcd
> drwxr-xr-x  2 root  wheel  4 Aug 30 08:54 /var/db/dhcpcd

I can change.  The reason I did this was this note in "dist/README.md":

	dhcpcd-9 requires this directory and contents to be writeable by the
	unprivileged user (default _dhcpcd, _dhcp or dhcpcd)

The unprivileged user isn't root, and I don't think it's a member of "wheel" either -- so is this comment inaccurate?


>> What's REALLY interesting to me, though, is that if, in this current state, I run dhcpcd AGAIN, directly from the command line, with the same arguments, it works just fine:
>> 	# /sbin/dhcpcd -d -B bwfm0
>> 	dhcpcd-9.2.0 starting
>> 	spawned privileged actioneer on PID 938
>> 	spawned network proxy on PID 1133
>> 	spawned controller proxy on PID 934
>> 	chrooting to `/var/chroot/dhcpcd' as _dhcpcd
>> 	bwfm0: executing `/libexec/dhcpcd-run-hooks' PREINIT
>> 	spawned listener fe80::ba27:ebff:fedb:539 on PID 986
>> 	bwfm0: connected to Access Point `dudenet'
>> 	bwfm0: executing `/libexec/dhcpcd-run-hooks' CARRIER
>> 	DUID 00:01:00:01:26:ec:69:6f:b8:27:eb:db:05:39
>> 	bwfm0: IAID eb:db:05:39
>> 	bwfm0: delaying IPv6 router solicitation for 0.3 seconds
>> 	bwfm0: delaying IPv4 for 0.9 seconds
>> 	bwfm0: soliciting an IPv6 router
>> 	bwfm0: sending Router Solicitation
>> 	bwfm0: reading lease `/var/db/dhcpcd/bwfm0-dudenet.lease'
>> 	bwfm0: soliciting a DHCP lease
>> 	bwfm0: sending DISCOVER (xid 0xd86ffd41), next in 4.4 seconds
>> 	bwfm0: offered 10.0.1.141 from 10.0.1.1
>> 	bwfm0: sending REQUEST (xid 0xd86ffd41), next in 3.7 seconds
>> 	bwfm0: acknowledged 10.0.1.141 from 10.0.1.1
>> 	bwfm0: adding IP address 10.0.1.141/24 broadcast 10.0.1.255
>> 	bwfm0: sending Router Solicitation
>> 	bwfm0: DAD completed for 10.0.1.141
>> 	bwfm0: leased 10.0.1.141 for 86400 seconds
>> 	bwfm0: renew in 43200 seconds, rebind in 75600 seconds
>> 	bwfm0: writing lease `/var/db/dhcpcd/bwfm0-dudenet.lease'
>> 	bwfm0: IP address 10.0.1.141/24 already exists
>> 	bwfm0: adding route to 10.0.1.0/24
>> 	bwfm0: adding default route via 10.0.1.1
>> 	bwfm0: ARP announcing 10.0.1.141 (1 of 2), next in 2.0 seconds
>> 	bwfm0: executing `/libexec/dhcpcd-run-hooks' BOUND
>> 	spawned listener 10.0.1.141 on PID 954
>> 	bwfm0: sending Router Solicitation
>> 	bwfm0: ARP announcing 10.0.1.141 (2 of 2)
>> 	bwfm0: sending Router Solicitation
>> 	bwfm0: no IPv6 Routers available
>> 	^Z[2] + Suspended               /sbin/dhcpcd -d -B bwfm0
>> 	# bg
>> 	[2] /sbin/dhcpcd -d -B bwfm0
>> 	# ps ax | grep dhcpc
>> 	 934 ?     I    0:00.01 dhcpcd: [control proxy] bwfm0 [ip4] [ip6]
>> 	 936 ?     I    0:00.03 dhcpcd: bwfm0 [ip4] [ip6]
>> 	 938 ?     I    0:00.02 dhcpcd: [privileged actioneer] bwfm0 [ip4] [ip6]
>> 	 954 ?     I    0:00.01 dhcpcd: [network proxy] 10.0.1.141
>> 	 986 ?     I    0:00.01 dhcpcd: [network proxy] fe80::ba27:ebff:fedb:539
>> 	1133 ?     I    0:00.00 dhcpcd: [network proxy] bwfm0 [ip4] [ip6]
>> 	 821 tty00 Z    0:00.00 (dhcpcd)
>> 	 933 tty00 S    0:00.02 dhcpcd: [launcher]
>> I'm currently trying to debug what's going on, basically adding some logging around pid_file lock (and I might even need to add some debugging inside libutil and pid_file lock).
>> But before I dive in, I'm wondering if anyone here has any easier things to try.  I'm not all that familiar with this code -- don't mind getting my hands dirty, but if there's any "did you check this?" or "it's probably this" ideas, I'm glad have them.
>> Thanks a ton for any guidance.
> 
> See process 821? Looks like it's in the zombie state, ie crashed.
> Is there a coredump for it? Maybe the outputn files you captures or syslog entries can give a further clue.

That's the original one I ran.  It died because of pidfile_lock failure and then exited.  It's stuck because the parent process (my custom "network" [really, embedded system] management process hasn't called waitpid to clean it up yet).  There's no core file because it didn't die from any kind of fault.

Rob


Follow-Ups:
Re: problem with "nobackground" mode on NetBSD current?Roy Marples
References:
problem with "nobackground" mode on NetBSD current?Rob Newberry
Re: problem with "nobackground" mode on NetBSD current?Roy Marples
Archive administrator: postmaster@marples.name