dhcpcd-discuss

Re: problem with "nobackground" mode on NetBSD current?

Rob Newberry

Fri Sep 11 20:28:06 2020

OK, I've at least PARTIALLY figured this out.

There's a bug in my "embedded system/network manager" code that is somehow closing STDIN_FILENO before forking and running dhcpcd.

I haven't yet figured out exactly where that bug is, but I will :-).
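
For illustration only, a launcher could guard against this class of bug by making sure descriptors 0-2 are open before it forks and execs anything. This is a minimal sketch, not what my code or dhcpcd actually does; ensure_std_fds is a hypothetical helper name, and it assumes /dev/null exists:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: make sure descriptors 0, 1 and 2 are all open.
 * Returns 0 on success, -1 if /dev/null could not be opened.
 */
int ensure_std_fds(void)
{
    int fd;

    /*
     * open() returns the lowest free descriptor, so any closed slot
     * in 0..2 is filled before a descriptor above 2 is handed out.
     */
    do {
        fd = open("/dev/null", O_RDWR);
        if (fd == -1)
            return -1;
    } while (fd <= STDERR_FILENO);

    close(fd);  /* first descriptor above 2: not needed, release it */
    return 0;
}
```

Calling something like this just before fork() means a child never starts with a hole in 0..2, whatever closed the descriptor earlier.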

It's a really cool bug, in that it's the kind of nastiness that few programs ever expect to deal with.  In this particular case, because STDIN_FILENO was closed, the open() inside pidfile_lock was handed descriptor 0 (STDIN_FILENO) for the lock file, since open() always returns the lowest free descriptor.
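
The descriptor reuse is easy to see in isolation. A small sketch, assuming a scratch path chosen by the caller; open_with_stdin_closed is a made-up name for the demo, not anything in dhcpcd or libutil:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Close stdin, then open `path` for writing and return the descriptor.
 * POSIX requires open() to return the lowest-numbered free descriptor,
 * so with descriptor 0 closed the new file lands on STDIN_FILENO.
 */
int open_with_stdin_closed(const char *path)
{
    close(STDIN_FILENO);  /* simulate the launcher's bug */
    return open(path, O_WRONLY | O_CREAT, 0644);
}
```

In a process where nothing else is racing to open files, the returned descriptor should compare equal to STDIN_FILENO.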

Then, when dhcpcd called "freopen" on stdin, it actually closed the pidfile and re-opened that descriptor read-only (appropriate for stdin), so the later pidfile_lock call failed because the descriptor was no longer open for writing.  A very weird repercussion of such an unexpected situation (I kinda like bugs like this, as they make me learn about other parts of the system in more detail).
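
The whole sequence can be reproduced outside dhcpcd. This is a rough sketch under assumptions: it uses plain open() and ftruncate() instead of libutil's pidfile_lock, it assumes freopen() targets /dev/null (the exact call in dhcpcd is not quoted here), and struct repro_result and reproduce are names invented for the demo:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

struct repro_result {
    int lock_fd;      /* descriptor handed out for the stand-in pidfile */
    int before;       /* ftruncate() result before freopen() */
    int after;        /* ftruncate() result after freopen() */
    int after_errno;  /* errno from the second ftruncate() */
};

struct repro_result reproduce(const char *path)
{
    struct repro_result r;

    close(STDIN_FILENO);                               /* the launcher's bug */
    r.lock_fd = open(path, O_WRONLY | O_CREAT, 0644);  /* lands on fd 0 */
    r.before = ftruncate(r.lock_fd, 0);                /* writable: works */

    /*
     * Re-pointing stdin closes whatever descriptor 0 referred to (the
     * stand-in pidfile) and re-opens it read-only. If freopen() fails,
     * descriptor 0 is simply left closed, which also breaks the next call.
     */
    (void)freopen("/dev/null", "r", stdin);

    errno = 0;
    r.after = ftruncate(r.lock_fd, 0);                 /* no longer writable */
    r.after_errno = errno;

    unlink(path);
    return r;
}
```

The first ftruncate() succeeds and the second returns -1, which mirrors the failure that showed up in the second pidfile_lock call.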

Anyway, very sorry to have bothered folks here, since it did indeed turn out to be a bug elsewhere (and completely my own), but I really, really appreciate the help debugging.

Cheers!

Rob



> On Sep 11, 2020, at 7:10 AM, Rob Newberry <robthedude@xxxxxxx> wrote:
> 
>>> I've debugged this down to something unexpected happening in pidfile_lock, which gets called around line 2345 of dhcpcd.c.
>>> But the unexpected part is happening inside the pidfile_lock code in NetBSD's libutil.
>>> It LOOKS to me like pidfile_lock is being called early on (around line 2250).
>>> But by the time we call it a second time, we've forked and (maybe?) switched users.
>> 
>> No.
>> We double fork and lock the pidfile right away - we are still the root process.
>> The master process switches user at line 2406.
>> 
>>> And this time, when we call pidfile_lock, it goes through a different code path.  Looking at libutil's pidfile.c sources, it is finding that pidfile_fd is valid (i.e., it's not -1), and then trying to truncate it.
>>> But this time (after debugging in the kernel), the pidfile_fd file descriptor's "write" permission is gone.  And so the call to "ftruncate" fails (in the kernel check for "fp->f_flag & FWRITE").  And that bubbles up to cause the failure.
>> 
>> So we need to track *why* permission was lost.
>> Certainly this privsep version of dhcpcd has been live in NetBSD since April 2nd this year and this is the first time this issue has been reported.
> 
> OK, I'll debug more today and see if I can figure out when the FWRITE permission is removed from the file descriptor (possibly tricky because that check is happening in the kernel -- not "hard", it just might take longer).
> 
>>> To "fix" the issue, I added a call to "pidfile_clean" shortly after the error handling in the first call to pidfile_lock.  With that change, the second call to pidfile_lock no longer goes through the "it's already open" path, and since the pidfile isn't open, it opens and locks it and everything works fine.
>>> But I don't know if that's the RIGHT fix.  There's now a window of time between the original process (which checked and locked the pidfile) releasing it and the child (actually, grand-child I believe) process (re-)locking it.  I don't know if that's bad or not.
>> 
>> Certainly this is not the right fix. The lock is there to ensure that another dhcpcd instance does not run at the same time.
>> 
>> What is the underlying filesystem for /var/run?
> 
> It's standard ffs, but running on a RAM disk:
> 
> 	# mount
> 	/dev/md0a on / type ffs (local)
> 	/dev/ld0e on /boot type msdos (local)
> 	ptyfs on /dev/pts type ptyfs (local)
> 
> Rob
> 
> 


Follow-Ups:
Re: problem with "nobackground" mode on NetBSD current? (Roy Marples)
References:
problem with "nobackground" mode on NetBSD current? (Rob Newberry)
Re: problem with "nobackground" mode on NetBSD current? (Rob Newberry)
Re: problem with "nobackground" mode on NetBSD current? (Roy Marples)
Re: problem with "nobackground" mode on NetBSD current? (Rob Newberry)