Re: problem with "nobackground" mode on NetBSD current?
Rob Newberry
Fri Sep 11 15:05:01 2020
I'll answer a couple questions here, but since you also responded to my second message (after I'd done some debugging), I'll save more for there.
> On 10/09/2020 23:37, Rob Newberry wrote:
>> I'm building my own "embedded" NetBSD system images (running on a Raspberry Pi), and running dhcpcd under the control of my own network management process.
>> That network management process does the following:
>> - runs (and allows to exit) "ifconfig" to enable bwfm0 (I haven't debugged why wpa_supplicant doesn't do this on its own, but if I don't do this, then I can't even scan)
>> /sbin/ifconfig bwfm0 up
>
> There are reasons why wpa_supplicant does not do this, but that's fine. It should still work because by default dhcpcd *will* do this and wpa_supplicant will then start scanning.
Yes, I remember now. I had this in my code before I got dhcpcd working last year, and then took it out -- but when I ran into this new issue, dhcpcd died before getting very far, so I had to put it back in. I'll remove it once I get dhcpcd working.
(Would appreciate more insight on why wpa_supplicant does not do this, but it's fine offline.)
>
>> - writes a simple /etc/wpa_supplicant.conf file
>> - runs wpa_supplicant (which doesn't fork by default, and I capture it's output into a file)
>> /sbin/wpa_supplicant -i bwfm0 -c /etc/wpa_supplicant.conf -d -d
>> - runs dhcpcd in "nobackground" mode (I also capture it's output to a file)
>> /sbin/dhcpcd -d -B bwfm0
>
>
> Why? Both wpa_supplicant and dhcpcd have options to redirect output to files and should work fine when backgrounded.
> I can see that you're using dhcpcd-9.2.0 so you already have the stdio fixes.
> Anyway....
Because I'm running my own "custom" embedded system, and I want my management process to know when either one of them dies.
>> When it runs, dhcpcd dies with a few error messages:
>> Sep 10 04:26:08 dhcpcd[824]: main: pidfile_lock -1: Invalid argument
>> Sep 10 04:26:08 dhcpcd[824]: ps_sendpsmmsg: Bad file descriptor
>> Sep 10 04:26:08 dhcpcd[824]: main: control_stop: Bad file descriptor
>> Sep 10 04:26:08 dhcpcd[824]: ps_sendpsmmsg: Bad file descriptor
>> Now, first off, this didn't happen "before" -- and by "before" I mean that I was working on this a few months ago, not long after NetBSD 9.0 shipped. Certainly before the new dhcpcd was integrated -- but I'm sure there are plenty of other changes in NetBSD-current right now.
>
> dhcpcd with priviledge separation will be new for NetBSD-10.
> I've comitted a fix here which tidies the errors up a litte, so you should just get the pidfile_lock error now.
> https://roy.marples.name/cgit/dhcpcd.git/commit/?id=f4b5b1b9497015888c9dc69dc6428fe71f983ab1
>
>> I've debugged a little bit, checking for all the obvious stuff -- "/var/run/dhcpcd" is present:
>> # ls -l /var/run/dhcpcd
>> drwxrwxrwx 3 root wheel 512 Sep 10 04:27 /var/run/dhcpcd
>> Prior to starting up, the directory is empty. But at the time of death, the "pid" file that it's complaining about is there:
>> # ls -l /var/run/dhcpcd/
>> total 1
>> -rw-r--r-- 1 root wheel 4 Sep 10 04:26 bwfm0.pid
>> # cat /var/run/dhcpcd/bwfm0.pid
>> 821
>> # ps ax | grep dhcpcd
>> 821 tty00 Z 0:00.00 (dhcpcd)
>> And the "db" directory is there as well (but empty):
>> # ls -ld /var/db/dhcpcd
>> drwxrwxrwx 2 root wheel 512 Sep 10 04:25 /var/db/dhcpcd
>
> This looks wrong:
> $ ls -ld /var/db/dhcpcd
> drwxr-xr-x 2 root wheel 4 Aug 30 08:54 /var/db/dhcpcd
I can change. The reason I did this was this note in "dist/README.md":
dhcpcd-9 requires this directory and contents to be writeable by the
unprivileged user (default _dhcpcd, _dhcp or dhcpcd)
The unprivileged user isn't root, and I don't think it's a member of "wheel" either -- so is this comment inaccurate?
>> What's REALLY interesting to me, though, is that if, in this current state, I run dhcpcd AGAIN, directly from the command line, with the same arguments, it works just fine:
>> # /sbin/dhcpcd -d -B bwfm0
>> dhcpcd-9.2.0 starting
>> spawned privileged actioneer on PID 938
>> spawned network proxy on PID 1133
>> spawned controller proxy on PID 934
>> chrooting to `/var/chroot/dhcpcd' as _dhcpcd
>> bwfm0: executing `/libexec/dhcpcd-run-hooks' PREINIT
>> spawned listener fe80::ba27:ebff:fedb:539 on PID 986
>> bwfm0: connected to Access Point `dudenet'
>> bwfm0: executing `/libexec/dhcpcd-run-hooks' CARRIER
>> DUID 00:01:00:01:26:ec:69:6f:b8:27:eb:db:05:39
>> bwfm0: IAID eb:db:05:39
>> bwfm0: delaying IPv6 router solicitation for 0.3 seconds
>> bwfm0: delaying IPv4 for 0.9 seconds
>> bwfm0: soliciting an IPv6 router
>> bwfm0: sending Router Solicitation
>> bwfm0: reading lease `/var/db/dhcpcd/bwfm0-dudenet.lease'
>> bwfm0: soliciting a DHCP lease
>> bwfm0: sending DISCOVER (xid 0xd86ffd41), next in 4.4 seconds
>> bwfm0: offered 10.0.1.141 from 10.0.1.1
>> bwfm0: sending REQUEST (xid 0xd86ffd41), next in 3.7 seconds
>> bwfm0: acknowledged 10.0.1.141 from 10.0.1.1
>> bwfm0: adding IP address 10.0.1.141/24 broadcast 10.0.1.255
>> bwfm0: sending Router Solicitation
>> bwfm0: DAD completed for 10.0.1.141
>> bwfm0: leased 10.0.1.141 for 86400 seconds
>> bwfm0: renew in 43200 seconds, rebind in 75600 seconds
>> bwfm0: writing lease `/var/db/dhcpcd/bwfm0-dudenet.lease'
>> bwfm0: IP address 10.0.1.141/24 already exists
>> bwfm0: adding route to 10.0.1.0/24
>> bwfm0: adding default route via 10.0.1.1
>> bwfm0: ARP announcing 10.0.1.141 (1 of 2), next in 2.0 seconds
>> bwfm0: executing `/libexec/dhcpcd-run-hooks' BOUND
>> spawned listener 10.0.1.141 on PID 954
>> bwfm0: sending Router Solicitation
>> bwfm0: ARP announcing 10.0.1.141 (2 of 2)
>> bwfm0: sending Router Solicitation
>> bwfm0: no IPv6 Routers available
>> ^Z[2] + Suspended /sbin/dhcpcd -d -B bwfm0
>> # bg
>> [2] /sbin/dhcpcd -d -B bwfm0
>> # ps ax | grep dhcpc
>> 934 ? I 0:00.01 dhcpcd: [control proxy] bwfm0 [ip4] [ip6]
>> 936 ? I 0:00.03 dhcpcd: bwfm0 [ip4] [ip6]
>> 938 ? I 0:00.02 dhcpcd: [privileged actioneer] bwfm0 [ip4] [ip6]
>> 954 ? I 0:00.01 dhcpcd: [network proxy] 10.0.1.141
>> 986 ? I 0:00.01 dhcpcd: [network proxy] fe80::ba27:ebff:fedb:539
>> 1133 ? I 0:00.00 dhcpcd: [network proxy] bwfm0 [ip4] [ip6]
>> 821 tty00 Z 0:00.00 (dhcpcd)
>> 933 tty00 S 0:00.02 dhcpcd: [launcher]
>> I'm currently trying to debug what's going on, basically adding some logging around pid_file lock (and I might even need to add some debugging inside libutil and pid_file lock).
>> But before I dive in, I'm wondering if anyone here has any easier things to try. I'm not all that familiar with this code -- don't mind getting my hands dirty, but if there's any "did you check this?" or "it's probably this" ideas, I'm glad have them.
>> Thanks a ton for any guidance.
>
> See process 821? Looks like it's in the zombie state, ie crashed.
> Is there a coredump for it? Maybe the outputn files you captures or syslog entries can give a further clue.
That's the original one I ran. It died because of pidfile_lock failure and then exited. It's stuck because the parent process (my custom "network" [really, embedded system] management process hasn't called waitpid to clean it up yet). There's no core file because it didn't die from any kind of fault.
Rob
Archive administrator: postmaster@marples.name