dhcpcd-discuss

Re: dhcpcd runs into timeout while probing offered ip via arp

Jean Wolter

Wed Nov 26 08:54:48 2008

Roy Marples <roy@xxxxxxxxxxxx> writes:

> On Mon, 24 Nov 2008 13:58:24 +0100, Jean Wolter
> <jw5-dhcpcd@xxxxxxxxxxxxxxxxxxxx> wrote:
>> dhcpcd starts up, gets an offer after discovery and tries to check via
>> ARP, whether the offered IP is free. dhcpcd times out during this
>> check and leaves the interface un-configured.
>> Somehow I have the impression that the additional offers confuse
>> the state machine of dhcpcd and it misses the timeouts for the ARP
>> probes and in turn forgets to conclude that the IP is not assigned
>> and can be acknowledged/leased.
>
> The state machine of dhcpcd-3 and dhcpcd-4 can get easily confused when the
> timing goes wrong.
> That is why you have this warning
>> Mon Nov 17 19:36:52 2008: eth2: host does not support a monotonic clock -
>> timing can skew
>
> I would guess from that you're running dhcpcd in an embedded environment?
> IIRC some ucLibc versions don't work well with clock_gettime.
> I've attached a patch that talks directly to the kernel for CLOCK_MONOTONIC
> if the libc call fails.
> Does this fix things?

It does help - dhcpcd now actually uses the montonic clock provided by
the underlying kernel. But it does not solve the problem we are
experiencing.

I instrumented dhcpcd a little bit and found the problem. There are
three timers, "timeout", "exit", and "stop". The arp-probing code uses
the "timeout"-timer. The following trace shows function names, states
and used timers (the line numbers are a little bit off due to the
additional code). The following happens:

- dhcpcd gets an offer
Tue Nov 25 20:16:57 2008: eth2: offered 192.168.1.33 from 192.168.1.1
Tue Nov 25 20:16:57 2008: eth2: handle_timeout:1162 - state STATE_PROBING

- dhcpcd starts probing the ip address via arp, it uses the timer
  "timeout" for it
Tue Nov 25 20:16:58 2008: eth2: checking 192.168.1.33 is available on attached networks
Tue Nov 25 20:16:58 2008: eth2: sending ARP probe (1 of 3), next in 1.30 seconds
Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:792 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer:777 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer: lowest timer: timeout (1.30 seconds)
Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:798 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: polling for 1299 milli seconds

- while waiting for an arp reply or a timeout the netlink fd
  becomes ready, handle_link() ist invoked
Tue Nov 25 20:16:58 2008: eth2: handle_link:1768 - state STATE_PROBING

- handle_link() resets the timer "timeout" unconditionally
Tue Nov 25 20:16:58 2008: eth2: resetting timeout

handle_link(struct if_state *state)
{
        int retval;

        retval = link_changed(state->interface);
        if (retval == -1) {
                logger(LOG_ERR, "link_changed: %s", strerror(errno));
                return -1;
        }
        if (retval == 0)
                return 0;

        logger(LOG_WARNING, "resetting timeout");
        timerclear(&state->timeout);
        ...

- handle_link() does not detect anything interesting, it enters the
  default branch of the switch statement, state->carrier is 1
  (LINK_UP), so it simply leaves the function

Tue Nov 25 20:16:58 2008: eth2: carrier_status default, state->carrier==1

- now the next timer is the "exit"-timer and dhcpcd times out after
  40s without leaving the PROBING state and therefore terminates

Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:792 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer:777 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: get_lowest_timer: lowest timer: exit (40.00 seconds)
Tue Nov 25 20:16:58 2008: eth2: wait_for_fd:798 - state STATE_PROBING
Tue Nov 25 20:16:58 2008: eth2: polling for 39996 milli seconds

So the bug seems to be the unconditional timeout reset in
handle_link().

Some minor remarks: 
- It might help to have a compile time warning about the missing
  clock_monotonic support. It took me a while to actually realize, that
  the code is not included at all. So something along these lines
  maybe?

get_monotonic(struct timeval *tp)
{
        static int posix_clock_set = 0;
#if defined(_POSIX_MONOTONIC_CLOCK) && defined(CLOCK_MONOTONIC)
...
#else
#warning "host does not support a monotonic clock - timing can skew"
#endif

- some plattforms supply clock_gettime(), but it returns an error if
  invoked for CLOCK_MONOTONIC. The code does not catch this:

        if (posix_clock_set == 0) {
                if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
                        posix_clock = CLOCK_MONOTONIC;
                        clock_monotonic = 1;
                }
                posix_clock_set = 1;
        }
        ...
        /* Something above failed, so fall back to
           gettimeofday */
        if (!posix_clock_set) {
                logger(LOG_WARNING, NO_MONOTONIC);
                posix_clock_set = 1;
        }
        return gettimeofday(tp, NULL);
  So even if clock_gettime() fails, the warning does not show up.

regards,
Jean

Follow-Ups:
Re: dhcpcd runs into timeout while probing offered ip via arpRoy Marples
References:
dhcpcd runs into timeout while probing offered ip via arpJean Wolter
Re: dhcpcd runs into timeout while probing offered ip via arpRoy Marples
Archive administrator: postmaster@marples.name