Reply To: dhcp problem

#15583
Clabbo

This is what I found:

One of our sites has had numerous intermittent DHCP issues. The symptoms were varied and unpredictable. Normal Windows clients would sometimes fail to lease an IP address, preventing users from logging onto the domain. Usually the client would eventually get an address, but during times of high utilization this could take many minutes (if it worked at all). The issue came to a head during our summer deployments. Our imaging process consists of booting from a CD/floppy or PXE and joining a Ghost multicast session. Not getting an IP address from DHCP was completely halting work. Our technicians had to manually assign each client an IP address and remember which ones were already used. Needless to say, tensions and blood pressure were high.

Previously I had tried troubleshooting the problem by updating to the latest firmware on our switches, checking their configs and trying to rule out any problems on the DHCP server. These were all dead ends. I couldn’t see anything strange in the packet traces, and was running out of ideas. One of our technicians, however, noticed that if he booted a system to Windows and let it get an IP address first, the BootCD would then grab the same address and everything would work. I decided to latch onto this and dig deeper. I compared traces from “cold” booted machines and “warm” (boot to Windows first) booted machines. At first I couldn’t find anything, but that was because I was only looking at BootP messages.

To try and cut down on the amount of traffic I was capturing, I set my capture filter to only grab UDP. In doing this, I also saw ARP requests coming from the DHCP clients. The machines that booted fine followed a process like this:

1. (client) Discover
2. (server) Offer
3. (client) Request
4. (server) ACK
5. (client) ARP for offered IP
6. (client) ARP for offered IP
7. (client) No response to ARP – claim IP
They had no trouble getting an IP because Windows had already done all the hard work of collision detection. Unfortunately, I did not capture traffic from a Windows client in this environment; it would have been nice to see how Windows handles this. The failing (cold booted) machines would proceed like this:

1. (client) Discover
2. (server) Offer
3. (client) Request
4. (server) ACK
5. (client) ARP for offered IP
6. (other client) ARP Reply
7. (client) Broadcast ARP Reply
8. Repeat 1-7
9. (client) Blank DHCP Request
10. (server) NAK
11. Repeat 9-10 until client gives up (long time)
The difference occurs at step 6. In this case, a WYSE terminal (1200LE) replied to the gratuitous ARP request from the client. On seeing another device on the network, the client rebroadcast the ARP reply so others would see it, and then went on to request another IP address. The server tried to assign the same address to the client, seeing that it had already leased it to that client. The client then tried to request again and was sent a NAK each time. This process repeats until the client gives up.
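
The two traces above differ only in whether anything answers the client's ARP for the offered address. Here is a rough Python sketch of that decision. It works on made-up (sender, message) tuples rather than real packets, and the `outcome` function and the event labels are hypothetical names used only for illustration:

```python
from typing import List, Tuple

# (sender, message) pairs; hypothetical labels, not a real capture format
Event = Tuple[str, str]


def outcome(trace: List[Event]) -> str:
    """Return 'claimed' if the client's ARP for the offered IP goes unanswered
    after the server's ACK, 'conflict' if some other host replies to it, and
    'incomplete' if the trace never reaches the ARP step."""
    acked = False
    probing = False
    for sender, message in trace:
        if sender == "server" and message == "ACK":
            acked = True
        elif acked and sender == "client" and message == "ARP for offered IP":
            probing = True
        elif probing and sender != "client" and message == "ARP Reply":
            return "conflict"
    return "claimed" if probing else "incomplete"


# Warm boot: nobody answers the ARP, so the client keeps the address.
warm = [
    ("client", "Discover"), ("server", "Offer"),
    ("client", "Request"), ("server", "ACK"),
    ("client", "ARP for offered IP"), ("client", "ARP for offered IP"),
]
# Cold boot: another device answers the ARP for the offered address.
cold = [
    ("client", "Discover"), ("server", "Offer"),
    ("client", "Request"), ("server", "ACK"),
    ("client", "ARP for offered IP"), ("other client", "ARP Reply"),
]

print(outcome(warm))  # claimed
print(outcome(cold))  # conflict
```

This only models the fork at step 6; the retry loop and the NAKs that follow are not simulated.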

So – why would a DHCP server try to hand out an address still in use? Because the lease time was up and the device did not renew during the lease time. Normal server operation is to delete a lease when it expires. Why would a client not renew its lease? I’m not sure. I’ve contacted WYSE to find out why the device doesn’t just renew its address instead of requiring a restart when the lease is up. No response yet. There’s even an option in the WYSE config files to choose whether to restart or shut down the device when the lease expires. The restarting isn’t really the issue though. When the devices are left on, they seem to go into a standby mode until woken up by mouse, keyboard or pressing the power button. When the device wakes, it presents the prompt “The dhcp lease has expired. You must restart.” Unfortunately, when in the sleep state, the devices respond to ARP but not pings. Windows DHCP servers use ping to test for collisions.
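
To make the mismatch concrete, here is a small Python sketch of the two collision checks disagreeing about the same address. It is a toy model, not how any real DHCP server or client is implemented: the `Device` class, both function names, and the 192.0.2.50 address (from the documentation range) are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    """A host still holding an IP address; the flags are an illustrative model."""
    ip: str
    answers_arp: bool
    answers_ping: bool


def server_thinks_free(ip: str, devices: List[Device]) -> bool:
    """Ping-style check: the address looks free if nothing answers ping."""
    return not any(d.ip == ip and d.answers_ping for d in devices)


def client_thinks_free(ip: str, devices: List[Device]) -> bool:
    """ARP-style check: the address looks free if nothing answers ARP."""
    return not any(d.ip == ip and d.answers_arp for d in devices)


# A sleeping terminal with an expired lease: answers ARP, ignores ping.
sleeping_terminal = Device(ip="192.0.2.50", answers_arp=True, answers_ping=False)
network = [sleeping_terminal]

print(server_thinks_free("192.0.2.50", network))  # True: server offers the address
print(client_thinks_free("192.0.2.50", network))  # False: client sees a conflict
```

As long as the two sides test with different protocols and a device answers only one of them, the server keeps offering an address the client keeps rejecting.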

So who is at fault here? I’m not sure. I am going to read the DHCP spec and try to figure it out. Mainly because I want to know who to blame. If you have any ideas please share.

http://joelgibby.net/2007/07/31/wyse-terminal-causing-dhcp-issues