select / poll timeout is not accurate

Published: Friday, August 1, 2008
Tags: tech

So I’ve finally found out why the dhcpcd-4.0.0-rc series sometimes wedges itself: select and poll do not always time out on time. When they return 0, they can return slightly early, with a very small amount of real time still left. This is likely a kernel timer resolution issue, something that won’t be fixed easily, quickly or ever.

Now, for fully passing IPv4LL compliance AND being DHCP re-transmission compliant AND timing out to userland correctly, we introduced expiry timers in dhcpcd-4.0.0-rc1. And for good randomness they are sub-second timers. So by fixing all that, we introduced the occasional wedge. I didn’t notice it on my fast machines, but my main tester uses a slow embedded unit where it happened quite a lot! :(

So the fix is to change the timers to countdown timers instead of expiry ones. This is done by calculating the actual time elapsed between select calls and subtracting it from our timers. If select times out (returns 0), we then ensure that the lowest timer is negative so it really has timed out. This makes the code lighter as we don’t have to add the timeout to now all the time, just store the new timeout.

Also, I’ve had to remove the “waiting for N seconds” log as it’s now useless. Instead, each relevant message says when the next event will occur. I think this looks nicer ;)

Oddly enough, this is the same behaviour as dhcpcd prior to 4.0-rc1, just that we’re now using sub-second timers, and more of them … we have come full circle :P

Lastly, some buggy libc implementations (FreeBSD prior to 7.0, uClibc-0.9.29) have the headers for a monotonic clock but the libc doesn’t report it’s actually there! dhcpcd will warn about this, because if the clock changes whilst dhcpcd is running then our timer code won’t fire its events at the right time.

The net result is quite a large patch at this stage in the rc progress to stable, so there will be one last rc released over the next few days.
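For the curious, here is a rough sketch of the countdown idea in C. It is not the actual dhcpcd patch: the names now_usec, timers_update and wait_for_event are made up for illustration, timers are plain microsecond counts rather than whatever dhcpcd really stores, and it treats a timer as expired at zero or below rather than strictly negative. It also assumes a working CLOCK_MONOTONIC and that there is at least one timer.

```c
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() and CLOCK_MONOTONIC */

#include <stddef.h>
#include <stdint.h>
#include <sys/select.h>
#include <time.h>

/* Current monotonic time in microseconds, or -1 if the clock is missing. */
static int64_t
now_usec(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) == -1)
		return -1;
	return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/*
 * Subtract the elapsed time from every countdown timer (n must be >= 1).
 * If select() reported a timeout, force the lowest timer down to zero so
 * an early return cannot leave it a few microseconds short of expiring.
 * Returns the index of the lowest timer.
 */
static size_t
timers_update(int64_t *timers, size_t n, int64_t elapsed, int timed_out)
{
	size_t i, lowest = 0;

	for (i = 0; i < n; i++) {
		timers[i] -= elapsed;
		if (timers[i] < timers[lowest])
			lowest = i;
	}
	if (timed_out && timers[lowest] > 0)
		timers[lowest] = 0;
	return lowest;
}

/*
 * Wait for fd to become readable or for the lowest timer to run out,
 * then charge the real elapsed time to all timers. Pass fd < 0 to wait
 * on the timers alone. Returns what select() returned.
 */
static int
wait_for_event(int fd, int64_t *timers, size_t n)
{
	fd_set rfds;
	struct timeval tv;
	int64_t start, end, wait;
	size_t i, lowest = 0;
	int r;

	for (i = 1; i < n; i++)
		if (timers[i] < timers[lowest])
			lowest = i;
	wait = timers[lowest] > 0 ? timers[lowest] : 0;
	tv.tv_sec = (time_t)(wait / 1000000);
	tv.tv_usec = (suseconds_t)(wait % 1000000);

	FD_ZERO(&rfds);
	if (fd >= 0)
		FD_SET(fd, &rfds);

	start = now_usec();
	r = select(fd < 0 ? 0 : fd + 1, &rfds, NULL, NULL, &tv);
	end = now_usec();
	if (r == -1)
		return -1;

	timers_update(timers, n, end - start, r == 0);
	return r;
}
```

After each call, the caller would walk the timers and run the handler for any that are at or below zero, then store a fresh countdown for it. Note there is no "now + timeout" arithmetic anywhere, which is the bit that makes the real code lighter.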