Tuesday, March 22, 2011

Cisco 5508 WLC 5 min timeout bug

OK, don't you love it when you've been struggling to troubleshoot something that just doesn't make any sense, then you start grasping at straws and it starts coming together? So, we have 4 devices: three existing Cisco 4404 wireless LAN controllers, plus a new and pretty 5508 that I came along to add to the bunch. I configured it with all the same VLANs/SSIDs/settings/mobility groups/etc. as the others and brought up some access points on it. Everything looked great.

So, I log into the old controller, take a building's worth of APs, add the new 5508 to the top of their "High Availability" list, and let them swing over on their own. I walk over to the building, log in for a minute, and everything looks great. Project successful, right?

An hour later I start hearing reports about people in that building on the 5508 getting kicked off every 5 minutes. I hurriedly swing all the APs back to the old 4404s and the problem goes away.

OK, not sure why the testing didn't bring this up, but let's bring up a test AP at my desk. I stream video on my laptop beside me for 4 hours before deciding there isn't a timeout issue there. I swing a single AP over to the problem building and walk over to test with my laptop, to make sure there actually is a problem. Sure enough, every 5 minutes I have issues. The laptop and the controller both still say I'm connected, but I cannot ping the gateway and cannot get anywhere. Reconnect/click repair on my wireless icon and I'm good to go (for another 5 minutes).

It doesn't even ask me to authenticate again, so I know the web authentication isn't getting timed out, BUT I go ahead and change the idle timeout to several hours (it was set to 300 seconds, and I wanted to get rid of anything that said 300), and change the user timeout to 8 hours. I peer through the config looking for any other 300 second timer that might be giving me issues. Go ahead and test: still nothing wrong at my desk with the test APs, but still timing out in the problem building...
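If you'd rather not stand in a building with a stopwatch, here is a rough sketch of how a ping log could be checked for a regular drop interval. It is not what I actually ran: the `samples` list is made-up data (one reading every 10 seconds, with a 20 second drop every 300 seconds), and `outage_starts` and `gaps_in_seconds` are hypothetical helper names. In real life you would fill `samples` from whatever your ping tool logs.

```python
from datetime import datetime, timedelta


def outage_starts(samples):
    """Return the timestamps where connectivity flipped from up to down.

    `samples` is a list of (timestamp, ok) tuples in time order.
    """
    starts = []
    previous_ok = True
    for when, ok in samples:
        if previous_ok and not ok:
            starts.append(when)
        previous_ok = ok
    return starts


def gaps_in_seconds(starts):
    """Seconds between consecutive outage start times."""
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


# Made-up sample data (assumption): a reading every 10 s for 1000 s,
# with the gateway unreachable for 20 s at the 300, 600 and 900 s marks.
t0 = datetime(2011, 3, 22, 9, 0, 0)
samples = []
for second in range(0, 1000, 10):
    ok = not any(start <= second < start + 20 for start in (300, 600, 900))
    samples.append((t0 + timedelta(seconds=second), ok))

starts = outage_starts(samples)
gaps = gaps_in_seconds(starts)
print(gaps)
```

If the gaps all land on the same number, you are looking at a timer or a periodic event somewhere rather than random RF trouble, which is what sent me hunting for anything set to 300.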

I verify there are no ACLs that might be blocking traffic, or that differ between the problem building and my test lab building.

Finally I start grasping at something: the APs that I am testing with boot straight to the 5508. The APs swung over using high availability do not reboot; they simply authenticate to the new controller.

The next time I swung the APs over in the production building, I chose to reset each AP after changing its primary high availability controller. To my delight, I am now typing this on a connection that isn't timing out. I also verified that all controllers are running identical software versions, so there shouldn't be problems moving between them... I should probably let TAC know about this little bug.

*Before anyone tells me not to test on a production network: #1, I did test in the lab first and didn't find anything; #2, it's spring break, so there are only a handful of students on campus anyway, far from the thousands of wireless clients I would have on a typical school day.*

PROBLEM CAME BACK TODAY- think I found the real problem now (and this time it makes sense). Two of my controllers were using the same AP multicast address!!! #facepalm APs aren't going to like getting updates from two different WLCs at the same time, and the periodic updates presumably come at a 5 minute interval. I think this will actually be the solution; I will update if for some reason it is not.
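For anyone who wants to sanity check their own controllers, here is a minimal sketch of the check I should have done up front: list each controller's AP multicast group address and flag any address used by more than one. The controller names and the 239.x addresses below are invented for illustration (they are not my real config), and `find_shared_multicast_groups` is a hypothetical helper, not anything from Cisco. You would have to copy the addresses out of each WLC by hand or however you normally pull config.

```python
from collections import defaultdict
import ipaddress


def find_shared_multicast_groups(controllers):
    """Return {group_address: [controller names]} for every AP multicast
    group address that is configured on more than one controller.

    `controllers` maps a controller name to its AP multicast group address.
    """
    by_group = defaultdict(list)
    for name, group in controllers.items():
        addr = ipaddress.ip_address(group)
        if not addr.is_multicast:
            raise ValueError(f"{name}: {group} is not a multicast address")
        by_group[str(addr)].append(name)
    return {
        group: sorted(names)
        for group, names in by_group.items()
        if len(names) > 1
    }


# Invented example inventory (assumption): names and addresses are made up.
controllers = {
    "wlc-4404-a": "239.1.1.10",
    "wlc-4404-b": "239.1.1.11",
    "wlc-4404-c": "239.1.1.12",
    "wlc-5508": "239.1.1.11",
}

shared = find_shared_multicast_groups(controllers)
for group, names in shared.items():
    print(f"{group} is shared by: {', '.join(names)}")
```

An empty result means every controller has its own group address, which is what you want before you start moving APs between them.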


Stevenjwilliams83 said...

Did you ever get this problem fixed? I am having a similar issue except I only have 1 controller using the AP multicast group IP address.

Wyatt said...

It seems to be working fine now, but honestly I have been so slammed with work that I haven't done too much testing. Other timers I changed that seemingly shouldn't matter, or didn't seem to fix it before: Expiration Timeout for Rogue Clients, user idle timeout, session timeout in WLAN... Also changed to an even number of uplinks :-) I don't know; if you have any timers or settings you are questioning, I could compare them to my config.