Interesting Troubleshooting Cases, Part 2 - The Zoom issues in just one building

Note: This article is part 2 of a 4-part troubleshooting series, with more in-depth information about a TEN talk at WLPC.
Part 1 - The RADIUS connection
Part 3 - Breaking other Wi-Fi
Part 4 - The suddenly weaker Wi-Fi
Video recording from WLPC Prague

Incoming Ticket: Terrible connection using Zoom on Wi-Fi. Works for a while, then unusable for a minute. Building is almost empty.

This is of course pretty broad, so here is an overview of what I checked:

  • The building was new and had 802.11ac Wave 2 Wi-Fi, planned and validated - so it should not be a coverage issue
  • Also happens when the building is almost empty, so capacity should not be an issue
  • Multiple client types were affected, so probably not a client driver issue
  • A log check revealed no unexpected roaming and no channel changes
  • It was validated that it worked fine on the wired network, so only wireless was affected
  • On the whole campus, only this building had this issue
  • It was hard to reproduce: sometimes it took hours, sometimes it happened multiple times in an hour

So, as this seemed like it would take a while to troubleshoot, and this building was essentially a co-working space, I just moved my workspace there for a while.

My first thought was that this could be a layer 1 issue. It only hit this building, and there is a pilot factory attached to it. Maybe some heavy machines interfering…?

So I sat there, working, doing video conferences, with my spectrum analyzer - waiting for the problem to hit. And it did.

SpecAn

Nothing that stood out to me - spectrum looks clean, just a bunch of traffic, but nothing to be concerned about. But when the problem hit me, I noticed that I still had high SNR, but my connection dropped to MCS 0 and I had no throughput.
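The symptom described here, good signal but a collapsed data rate, can be expressed as a simple check over link samples. The following is a minimal sketch, not the tool used in this case: the `LinkSample` record, the `suspicious_samples` helper, and the 25 dB threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    """One observation of the client's link (hypothetical record layout)."""
    time_s: float
    snr_db: float
    mcs_index: int


def suspicious_samples(samples, min_snr_db=25.0, max_mcs=0):
    """Return samples where SNR is good but the rate has dropped to the floor.

    min_snr_db and max_mcs are illustrative thresholds, not vendor guidance.
    """
    return [
        s for s in samples
        if s.snr_db >= min_snr_db and s.mcs_index <= max_mcs
    ]


# Synthetic data: a healthy sample, the odd case, and a plain weak-signal case.
samples = [
    LinkSample(time_s=0.0, snr_db=38.0, mcs_index=8),
    LinkSample(time_s=1.0, snr_db=37.0, mcs_index=0),
    LinkSample(time_s=2.0, snr_db=12.0, mcs_index=0),
]
flagged = suspicious_samples(samples)
```

Only the second sample is flagged: MCS 0 at low SNR is expected rate adaptation, while MCS 0 at high SNR suggests the cause is somewhere other than signal strength.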

So I decided to move up one layer and get more data - packet capture time. And I did capture some interesting stuff.

AP to Client - RTS

Here we see the AP communicating to a client - Request-To-Send. If you look at the packet number - this is not filtered. This is how it was in the air.

AP to Client - RTS

AP to different client - again Request-To-Send. If you look at the timestamps, most of the time there is only 1 microsecond between those.

AP to Client - BAR

And if you are getting tired of Request-To-Send - how about Block Ack Requests?
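Bursts like these can also be picked out of a capture programmatically. The sketch below is an illustration, not the method used in this case. It assumes the capture has already been reduced to `(timestamp_us, first_byte, transmitter)` tuples, where `first_byte` is the first octet of the 802.11 Frame Control field. The `find_bursts` helper, the 1 ms window, and the threshold of 20 frames are hypothetical choices. The type and subtype values are the standard ones: control frames are type 1, with subtype 11 for RTS and subtype 8 for Block Ack Request.

```python
from collections import defaultdict, deque


def frame_kind(first_byte):
    """Classify a frame from the first octet of its Frame Control field."""
    frame_type = (first_byte >> 2) & 0x3   # bits 2-3: frame type
    subtype = (first_byte >> 4) & 0xF      # bits 4-7: frame subtype
    if frame_type == 1 and subtype == 11:  # control frame, RTS (0xB4)
        return "RTS"
    if frame_type == 1 and subtype == 8:   # control frame, Block Ack Request (0x84)
        return "BAR"
    return None


def find_bursts(frames, window_us=1000, threshold=20):
    """Return {transmitter: peak count} for senders of RTS/BAR bursts.

    frames must be sorted by timestamp. window_us and threshold are
    illustrative values, not taken from the capture in this article.
    """
    recent = defaultdict(deque)  # per-transmitter timestamps inside the window
    bursty = {}
    for timestamp_us, first_byte, transmitter in frames:
        if frame_kind(first_byte) is None:
            continue
        window = recent[transmitter]
        window.append(timestamp_us)
        while timestamp_us - window[0] > window_us:
            window.popleft()
        if len(window) >= threshold:
            bursty[transmitter] = max(bursty.get(transmitter, 0), len(window))
    return bursty


# Synthetic capture: one AP sending 50 RTS frames 10 µs apart,
# plus a client sending a single RTS and a QoS data frame (0x88).
ap, client = "02:00:00:00:00:01", "02:00:00:00:00:02"
frames = [(i * 10, 0xB4, ap) for i in range(50)]
frames += [(600, 0xB4, client), (700, 0x88, client)]
frames.sort(key=lambda f: f[0])
bursts = find_bursts(frames)
```

With this synthetic input only the AP is reported, with a peak of 50 control frames inside one window. An occasional RTS from a client stays below the threshold.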

Seeing those, at first I was a bit confused. But the more I thought about it, the more it occurred to me that this had to be an AP issue. That would also explain why it hit only this one building: as the newest building, it was the only one with this type of AP. The others had 802.11ac Wave 1 APs.

So I searched Cisco's bug database - and hit the jackpot.

Cisco Bug CSCvu61194
Cisco 2800, 3800 APs sends burst of RTS and BAR randomly leading to low client data rates
Symptom:
During normal operation, randomly we see bunch of RTS/BAR packets being sent from the AP to the client.
This leads to lower data rates or packet drops in the network.
Customers might also see quality degradation issues with Webex/Skype/MicrosoftTeams audio/video calls.

So I waited for the fixed version to be posted, and after installing it, the problem was gone.

Even if the circumstances sometimes make it look like a PHY layer issue, it can still be something else. A different AP generation can also mean different bugs.