Re: dhcpcd runs into timeout while probing offered ip via arp
Roy Marples
Wed Nov 26 10:40:03 2008
On Wed, 2008-11-26 at 09:54 +0100, Jean Wolter wrote:
> It does help - dhcpcd now actually uses the montonic clock provided by
> the underlying kernel. But it does not solve the problem we are
> experiencing.
http://www.uclibc.org/lists/busybox/2007-July/028051.html
uclibc-0.9.28 supports it but uclibc-0.9.27 doesn't.
So I guess I'd rather that you upgraded to uclibc-0.9.28
> I instrumented dhcpcd a little bit and found the problem. There are
> three timers, "timeout", "exit", and "stop". The arp-probing code uses
> the "timeout"-timer. The following trace shows function names, states
> and used timers (the line numbers are a little bit off due to the
> additional code). The following happens:
>
> - dhcpcd gets an offer
> Tue Nov 25 20:16:57 2008: eth2: offered 192.168.1.33 from 192.168.1.1
> Tue Nov 25 20:16:57 2008: eth2: handle_timeout:1162 - state STATE_PROBING
>
> - dhcpcd starts probing the ip address via arp, it uses the timer
> "timeout" for it
> Tue Nov 25 20:16:58 2008: eth2: checking 192.168.1.33 is available on attached networks
> Tue Nov 25 20:16:58 2008: eth2: sending ARP probe (1 of 3), next in 1.30 seconds
> Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:792 - state STATE_PROBING
> Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer:777 - state STATE_PROBING
> Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer: lowest timer: timeout (1.30 seconds)
> Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:798 - state STATE_PROBING
> Tue Nov 25 20:16:58 2008: eth2: polling for 1299 milli seconds
>
> - while waiting for an arp reply or a timeout the netlink fd
> becomes ready, handle_link() ist invoked
> Tue Nov 25 20:16:58 2008: eth2: handle_link:1768 - state STATE_PROBING
>
> - handle_link() resets the timer "timeout" unconditionally
> Tue Nov 25 20:16:58 2008: eth2: resetting timeout
>
> handle_link(struct if_state *state)
> {
> int retval;
>
> retval = link_changed(state->interface);
> if (retval == -1) {
> logger(LOG_ERR, "link_changed: %s", strerror(errno));
> return -1;
> }
> if (retval == 0)
> return 0;
>
> logger(LOG_WARNING, "resetting timeout");
> timerclear(&state->timeout);
> ...
>
> - handle_link() does not detect anything interesting, it enters the
> default branch of the switch statement, state->carrier is 1
> (LINK_UP), so it simply leaves the function
>
> Tue Nov 25 20:16:58 2008: eth2: carrier_status default, state->carrier==1
>
> - now the next timer is the "exit"-timer and dhcpcd times out after
> 40s without leaving the PROBING state and therefore terminates
>
> Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:792 - state STATE_PROBING
> Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer:777 - state STATE_PROBING
> Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer: lowest timer: exit (40.00 seconds)
> Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:798 - state STATE_PROBING
> Tue Nov 25 20:16:58 2008: eth2: polling for 39996 milli seconds
>
> So the bug seems to be the unconditional timeout reset in
> handle_link().
Could be.
I've attached a patch that may help you, but I need to regression test
it a little before applying.
> Some minor remarks:
> - It might help to have a compile time warning about the missing
> clock_monotonic support. It took me a while to actually realize, that
> the code is not included at all. So something along these lines
> maybe?
>
> get_monotonic(struct timeval *tp)
> {
> static int posix_clock_set = 0;
> #if defined(_POSIX_MONOTONIC_CLOCK) && defined(CLOCK_MONOTONIC)
> ...
> #else
> #warning "host does not support a monotonic clock - timing can skew"
> #endif
Could be done, but #warning is a GCC extensions iirc and I know for sure
that PCC compiles dhcpcd just fine. I can investigate a little.
> - some plattforms supply clock_gettime(), but it returns an error if
> invoked for CLOCK_MONOTONIC. The code does not catch this:
>
> if (posix_clock_set == 0) {
> if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
> posix_clock = CLOCK_MONOTONIC;
> clock_monotonic = 1;
> }
> posix_clock_set = 1;
> }
> ...
> /* Something above failed, so fall back to
> gettimeofday */
> if (!posix_clock_set) {
> logger(LOG_WARNING, NO_MONOTONIC);
> posix_clock_set = 1;
> }
> return gettimeofday(tp, NULL);
> So even if clock_gettime() fails, the warning does not show up.
Another attached patch should address this also.
Let me know how they work for you!
Thanks
Roy
Index: client.c
===================================================================
--- client.c (revision 1097)
+++ client.c (working copy)
@@ -1736,7 +1736,6 @@
if (retval == 0)
return 0;
- timerclear(&state->timeout);
switch (carrier_status(state->interface->name)) {
case -1:
logger(LOG_ERR, "carrier_status: %s", strerror(errno));
@@ -1746,6 +1745,7 @@
logger(LOG_INFO, "carrier lost");
state->carrier = LINK_DOWN;
do_socket(state, SOCKET_CLOSED);
+ timerclear(&state->timeout);
if (state->state != STATE_BOUND)
timerclear(&state->stop);
}
@@ -1755,6 +1755,7 @@
logger(LOG_INFO, "carrier acquired");
state->state = STATE_RENEW_REQUESTED;
state->carrier = LINK_UP;
+ timerclear(&state->timeout);
timerclear(&state->stop);
return 1;
}
Index: common.c
===================================================================
--- common.c (revision 1097)
+++ common.c (working copy)
@@ -201,9 +201,8 @@
if (posix_clock_set == 0) {
if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
posix_clock = CLOCK_MONOTONIC;
- clock_monotonic = 1;
+ clock_monotonic = posix_clock_set = 1;
}
- posix_clock_set = 1;
}
if (clock_monotonic) {
@@ -225,9 +224,8 @@
if (posix_clock_set == 0) {
if (mach_timebase_info(&info) == KERN_SUCCESS) {
factor = (double)info.numer / (double)info.denom;
- clock_monotonic = 1;
+ clock_monotonic = posix_clock_set = 1;
}
- posix_clock_set = 1;
}
if (clock_monotonic) {
nano = mach_absolute_time();
Index: client.c
===================================================================
--- client.c (revision 1097)
+++ client.c (working copy)
@@ -1736,7 +1736,6 @@
if (retval == 0)
return 0;
- timerclear(&state->timeout);
switch (carrier_status(state->interface->name)) {
case -1:
logger(LOG_ERR, "carrier_status: %s", strerror(errno));
@@ -1746,6 +1745,7 @@
logger(LOG_INFO, "carrier lost");
state->carrier = LINK_DOWN;
do_socket(state, SOCKET_CLOSED);
+ timerclear(&state->timeout);
if (state->state != STATE_BOUND)
timerclear(&state->stop);
}
@@ -1755,6 +1755,7 @@
logger(LOG_INFO, "carrier acquired");
state->state = STATE_RENEW_REQUESTED;
state->carrier = LINK_UP;
+ timerclear(&state->timeout);
timerclear(&state->stop);
return 1;
}
Attachment:
signature.asc
Description: This is a digitally signed message part
Archive administrator: postmaster@marples.name