Re: dhcpcd runs into timeout while probing offered ip via arp
Jean Wolter
Wed Nov 26 08:54:48 2008
Roy Marples <roy@xxxxxxxxxxxx> writes:
> On Mon, 24 Nov 2008 13:58:24 +0100, Jean Wolter
> <jw5-dhcpcd@xxxxxxxxxxxxxxxxxxxx> wrote:
>> dhcpcd starts up, gets an offer after discovery and tries to check via
>> ARP, whether the offered IP is free. dhcpcd times out during this
>> check and leaves the interface un-configured.
>> Somehow I have the impression that the additional offers confuse
>> the state machine of dhcpcd and it misses the timeouts for the ARP
>> probes and in turn forgets to conclude that the IP is not assigned
>> and can be acknowledged/leased.
>
> The state machine of dhcpcd-3 and dhcpcd-4 can get easily confused when the
> timing goes wrong.
> That is why you have this warning
>> Mon Nov 17 19:36:52 2008: eth2: host does not support a monotonic clock -
>> timing can skew
>
> I would guess from that you're running dhcpcd in an embedded environment?
> IIRC some ucLibc versions don't work well with clock_gettime.
> I've attached a patch that talks directly to the kernel for CLOCK_MONOTONIC
> if the libc call fails.
> Does this fix things?
It does help - dhcpcd now actually uses the montonic clock provided by
the underlying kernel. But it does not solve the problem we are
experiencing.
I instrumented dhcpcd a little bit and found the problem. There are
three timers, "timeout", "exit", and "stop". The arp-probing code uses
the "timeout"-timer. The following trace shows function names, states
and used timers (the line numbers are a little bit off due to the
additional code). The following happens:
- dhcpcd gets an offer
Tue Nov 25 20:16:57 2008: eth2: offered 192.168.1.33 from 192.168.1.1
Tue Nov 25 20:16:57 2008: eth2: handle_timeout:1162 - state STATE_PROBING
- dhcpcd starts probing the ip address via arp, it uses the timer
"timeout" for it
Tue Nov 25 20:16:58 2008: eth2: checking 192.168.1.33 is available on attached networks
Tue Nov 25 20:16:58 2008: eth2: sending ARP probe (1 of 3), next in 1.30 seconds
Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:792 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer:777 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer: lowest timer: timeout (1.30 seconds)
Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:798 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: polling for 1299 milli seconds
- while waiting for an arp reply or a timeout the netlink fd
becomes ready, handle_link() ist invoked
Tue Nov 25 20:16:58 2008: eth2: handle_link:1768 - state STATE_PROBING
- handle_link() resets the timer "timeout" unconditionally
Tue Nov 25 20:16:58 2008: eth2: resetting timeout
handle_link(struct if_state *state)
{
int retval;
retval = link_changed(state->interface);
if (retval == -1) {
logger(LOG_ERR, "link_changed: %s", strerror(errno));
return -1;
}
if (retval == 0)
return 0;
logger(LOG_WARNING, "resetting timeout");
timerclear(&state->timeout);
...
- handle_link() does not detect anything interesting, it enters the
default branch of the switch statement, state->carrier is 1
(LINK_UP), so it simply leaves the function
Tue Nov 25 20:16:58 2008: eth2: carrier_status default, state->carrier==1
- now the next timer is the "exit"-timer and dhcpcd times out after
40s without leaving the PROBING state and therefore terminates
Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:792 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer:777 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer: lowest timer: exit (40.00 seconds)
Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:798 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: polling for 39996 milli seconds
So the bug seems to be the unconditional timeout reset in
handle_link().
Some minor remarks:
- It might help to have a compile time warning about the missing
clock_monotonic support. It took me a while to actually realize, that
the code is not included at all. So something along these lines
maybe?
get_monotonic(struct timeval *tp)
{
static int posix_clock_set = 0;
#if defined(_POSIX_MONOTONIC_CLOCK) && defined(CLOCK_MONOTONIC)
...
#else
#warning "host does not support a monotonic clock - timing can skew"
#endif
- some plattforms supply clock_gettime(), but it returns an error if
invoked for CLOCK_MONOTONIC. The code does not catch this:
if (posix_clock_set == 0) {
if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
posix_clock = CLOCK_MONOTONIC;
clock_monotonic = 1;
}
posix_clock_set = 1;
}
...
/* Something above failed, so fall back to
gettimeofday */
if (!posix_clock_set) {
logger(LOG_WARNING, NO_MONOTONIC);
posix_clock_set = 1;
}
return gettimeofday(tp, NULL);
So even if clock_gettime() fails, the warning does not show up.
regards,
Jean
Archive administrator: postmaster@marples.name