Interesting Troubleshooting Cases, Part 1 - The RADIUS Connection
Note: This article is part 1 of a 4-part troubleshooting series, with more in-depth information about a TEN talk at WLPC.
Part 2 - Zoom issues
Part 3 - Breaking other Wi-Fi
Part 4 - The suddenly weaker Wi-Fi
Video recording from WLPC Prague
Incoming Ticket: I’m a student from Institution X in the same town. I am visiting your library and I can’t connect to eduroam here. It works fine on the campus of Institution X.
If you are not familiar with eduroam: it is a worldwide network of universities that grant each other's users wireless access.
eduroam is built in a tree-like structure:
If you, as an institution, receive an authentication request from a foreign user, you forward it to your country root; you only ever talk directly to your own root. This root either knows where to forward it (if the user's institution is in the same country) or forwards it to its own root, which knows all countries. Once the request reaches the correct country root, it is passed on to the right institution.
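The forwarding decision described above can be sketched in a few lines. This is only an illustration: it assumes every realm ends in its country code (real eduroam realms do not always), and the function names, the realm "ours.at" and the return labels are made up for this example.

```python
from __future__ import annotations


def realm_of(user: str) -> str:
    """Return the realm part of 'user@realm', lower-cased."""
    return user.rsplit("@", 1)[-1].lower()


def institution_next_hop(user: str, local_realm: str) -> str:
    """An institution handles its own users and proxies everyone else
    to its country root (hypothetical labels, not real server names)."""
    return "local" if realm_of(user) == local_realm else "country-root"


def country_root_next_hop(user: str, country: str, members: set[str]) -> str:
    """A country root forwards to a member institution if the realm is in
    its own country, otherwise up to the root that knows all countries.
    Assumption: the last label of the realm is the country code."""
    realm = realm_of(user)
    if realm.rsplit(".", 1)[-1] != country:
        return "top-level-root"
    return realm if realm in members else "reject"


# A visitor from Institution X at an institution with the (made-up) realm ours.at
print(institution_next_hop("user-abc@x.at", "ours.at"))
print(country_root_next_hop("user-abc@x.at", "at", {"x.at", "y.at"}))
```

Every hop on this path, and every hop on the way back, has to carry the full RADIUS packet.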
So, the first instinct for a foreign user not being able to authenticate is to check the RADIUS logs, which I did.
Login incorrect (Home Server failed to respond): [user-abc@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-345@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-abc@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-123@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-789@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-qwe@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-123@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-abc@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-345@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-abc@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-123@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-789@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-qwe@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-123@x.at] (from client OUR-C9800-CONTROLLER)
We can see that all authentications from users of Institution X fail with “Home Server failed to respond”. This means that the request hit the timeout: it was forwarded to our country's root server, but no final answer ever came back. This could point to a problem at the root server, so let’s check how users from other institutions are doing on our campus, according to the RADIUS log:
Login OK: [student-a@y.at] (from client OUR-C9800-CONTROLLER)
Login OK: [staff-8@y.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-789@x.at] (from client OUR-C9800-CONTROLLER)
Login OK: [mp64353442@a.de] (from client OUR-C9800-CONTROLLER)
Login OK: [u_ddee@c.es] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home server says so): [asmith@d.es] (from client OUR-C9800-CONTROLLER)
Login OK: [8349572345@b.de] (from client OUR-C9800-CONTROLLER)
We can see that we have successful logins from all over Europe, and someone probably has the wrong password, but the “failed to respond” only happens with users from institution X.
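With more than a handful of lines, this comparison is easier to make by counting outcomes per realm. The following is a minimal sketch that assumes the log lines look exactly like the excerpts quoted in this article; a real log will have timestamps and other fields around them, and the function name is made up.

```python
import re
from collections import Counter

# Matches the line format quoted in this article (an assumption about the log).
LINE_RE = re.compile(
    r"Login (?P<status>OK|incorrect)"
    r"(?: \((?P<reason>[^)]*)\))?"
    r": \[(?P<user>[^\]]+)\]\s*"
    r"\(from client (?P<client>[^)]+)\)"
)


def tally_by_realm(lines):
    """Count (realm, outcome) pairs; outcome is the failure reason or 'OK'."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a login line
        realm = m["user"].rsplit("@", 1)[-1]
        outcome = m["reason"] or m["status"]
        counts[(realm, outcome)] += 1
    return counts


sample = [
    "Login OK: [student-a@y.at] (from client OUR-C9800-CONTROLLER)",
    "Login incorrect (Home Server failed to respond): [user-789@x.at] (from client OUR-C9800-CONTROLLER)",
    "Login incorrect (Home server says so): [asmith@d.es] (from client OUR-C9800-CONTROLLER)",
]
for (realm, outcome), n in sorted(tally_by_realm(sample).items()):
    print(f"{realm:8} {outcome:32} {n}")
```

A realm whose only outcome is “Home Server failed to respond” stands out immediately in such a table.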
So, this paints a pretty clear picture, right?
It works for others in the same country, it works for others abroad, and it only fails for X. It seems to me that X needs to check their RADIUS server! Case closed.
Or not?
When talking to the administrative staff of Institution X, they said that they see our requests in the RADIUS logs, and answer them, but then nothing more happens. So, we need to get more data.
At first, I combed through a lot more RADIUS logs to find any clues. And I did find some:
Login incorrect (Home Server failed to respond): [user-abc@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-345@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-abc@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-123@x.at] (from client OUR-C9800-CONTROLLER)
Login OK: [user-789@x.at] (from client OTHER-CISCO-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-qwe@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-123@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-abc@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-345@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-abc@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-123@x.at] (from client OUR-C9800-CONTROLLER)
Login OK: [user-789@x.at] (from client OTHER-CISCO-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-qwe@x.at] (from client OUR-C9800-CONTROLLER)
Login OK: [user-xyz@x.at] (from client OTHER-CISCO-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-123@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-345@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-abc@x.at] (from client OUR-C9800-CONTROLLER)
Login OK: [user-123@x.at] (from client OUR-C9800-CONTROLLER)
Login incorrect (Home Server failed to respond): [user-qwe@x.at] (from client OUR-C9800-CONTROLLER)
Two interesting things here:
- The “other Cisco controller” belongs to a hospital that is attached to us. It has its own IT and its own Wi-Fi, but uses our RADIUS server. It is a Cisco controller, but runs AireOS. It looks like authentications from this controller are OK all the time.
- It seems that sometimes, even an authentication from our controller gets through, even though it failed multiple times before.
So: if it works from a different controller, it is not their fault, and since authentications to other institutions work from our controller, it is not our fault either … so what is going on?
We need more data, so I did a packet capture on the RADIUS server.
Here we have a successful authentication from the AireOS controller, with our RADIUS server talking to the root:
We see the nice request/challenge ping-pong until we hit either Access-Accept or Access-Reject. Now let’s look at the same conversation between the servers, initiated by the 9800 controller:
The interesting thing here is not only that it fails, but that there is some communication: we do get an initial challenge back. The request answering that challenge then gets repeated every few seconds until we hit the timeout. The duplicates are our RADIUS server retrying to get an answer (RADIUS runs over UDP, so a packet could simply have been lost), but ultimately to no avail.
It is clear there has to be something different in these requests, so we look deeper into the initial request. Here they are, side by side:
This is a lot of information, so I copied it out as text and compared the two to find the relevant differences. The one that struck me is already highlighted: the “Framed-MTU” attribute is by default a lot higher on the C9800 controller.
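Such retransmissions can also be picked out of a capture programmatically: a retransmitted RADIUS request keeps the same Identifier and Request Authenticator and comes from the same source port. The sketch below assumes you have already extracted (timestamp, source port, UDP payload) tuples from the capture by some other means; that step and the function names are not part of the original investigation.

```python
import struct
from collections import defaultdict

ACCESS_REQUEST = 1  # RADIUS code for Access-Request


def parse_header(payload: bytes):
    """Split the 20-byte RADIUS header: code, identifier, length, authenticator."""
    code, ident, length = struct.unpack("!BBH", payload[:4])
    return code, ident, length, payload[4:20]


def find_retransmissions(packets):
    """packets: iterable of (timestamp, src_port, udp_payload) tuples.

    Returns {(src_port, identifier, authenticator): [timestamps]} for every
    Access-Request that was seen more than once.
    """
    seen = defaultdict(list)
    for ts, src_port, payload in packets:
        if len(payload) < 20:
            continue  # too short to be a RADIUS packet
        code, ident, _length, authenticator = parse_header(payload)
        if code == ACCESS_REQUEST:
            seen[(src_port, ident, authenticator)].append(ts)
    return {key: times for key, times in seen.items() if len(times) > 1}
```

A request that shows up several times a few seconds apart with no reply in between is the pattern seen in the failing capture.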
And it makes sense if you think about it: the Framed-MTU attribute tells the other side how large the payload inside the RADIUS packet is allowed to be. If the whole packet gets too large for some part of the path, it gets fragmented, with all the trouble that comes with that, for example fragments simply being dropped, or being dropped when they arrive out of order. This could explain why it works sometimes and sometimes not.
To verify that this is the problem, I used the very good Mac software EAPTest, which allows you to simulate an EAP supplicant & authenticator to check against a RADIUS server. Just enter the details of the simulated connection and the address of your RADIUS server (you have to be an allowed client, of course), and it will make the connection and show you all the packets exchanged. You can set any attribute you like, so I experimented with different Framed-MTU values.
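A rough back-of-the-envelope calculation shows how quickly a large Framed-MTU pushes a RADIUS packet past a typical Ethernet MTU. This sketch assumes the answering server fills its EAP fragment up to the full Framed-MTU value and that the path MTU is 1500 bytes; both are assumptions for illustration, not measurements from this case. The header sizes are the standard ones.

```python
import math

IP_HEADER = 20              # IPv4 header without options
UDP_HEADER = 8
RADIUS_HEADER = 20          # code, identifier, length, authenticator
MESSAGE_AUTHENTICATOR = 18  # 2-byte attribute header + 16-byte value
ATTR_MAX_VALUE = 253        # one RADIUS attribute carries at most 253 value bytes
PATH_MTU = 1500             # assumed; tunnels on the path would lower it


def radius_ip_packet_size(eap_len: int, other_attrs_len: int = 0) -> int:
    """Size of the IP packet carrying an EAP packet of eap_len bytes.

    The EAP data is split across EAP-Message attributes, each adding a
    2-byte attribute header. other_attrs_len covers State, Proxy-State, etc.
    """
    n_attrs = math.ceil(eap_len / ATTR_MAX_VALUE)
    eap_bytes = eap_len + 2 * n_attrs
    return (IP_HEADER + UDP_HEADER + RADIUS_HEADER
            + MESSAGE_AUTHENTICATOR + eap_bytes + other_attrs_len)


for framed_mtu in (1300, 1400, 1450):
    size = radius_ip_packet_size(framed_mtu)
    print(framed_mtu, size, "fragments" if size > PATH_MTU else "fits")
```

Real packets also carry State and Proxy-State attributes, and any tunnel on the path lowers the usable MTU, so the actual threshold will sit below what this simple model suggests.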
The requests started failing at Framed-MTU 1450 and above and were fine below, and we can see that it looks exactly like in the capture - we get an initial answer back, then fail and send duplicates.
So we have multiple issues here, where only the combination of all of those makes it a problem.
- Somewhere between the country root and the foreign RADIUS server there seems to be a lower MTU, and UDP fragments keep getting dropped.
This could be a firewall doing IPsec, or the whole thing could be hosted in Azure, for example. That can lead to a lot of problems, as Azure by default drops out-of-order fragments. Also note that such policy drops (fragments, or out-of-order fragments) usually do not show up in the normal “drop” log of a firewall, as it is not a firewall rule dropping these packets. As a side note, if you do put your RADIUS server in the cloud, you should pay special attention to the transport and maybe take a look at RadSec.
I checked whether the drops could be a problem on our side, but they were not: we did receive a few authentications from other institutions with fragments, which got through and were reassembled fine:
- This in itself would not be a problem if the packets weren’t so large.
Most RADIUS servers seem to ignore the Framed-MTU and choose a relatively low value themselves; with Microsoft NPS you might have to set a policy yourself.
This way we would have smaller packets, and there would be no fragmentation and nothing to drop.
But what about our C9800 controller?
- If the C9800 did not set the Framed-MTU so high, the packets would also be smaller.
So… can we do that?
Cisco says yes… from IOS XE 17.4 onward. Back then we could not move away from 17.3, so this was not an option.
So, the last idea I had, since FreeRADIUS is highly flexible, was to simply overwrite the attribute on the way out when proxying. FreeRADIUS has the “pre-proxy” section, which every request passes through before it is proxied. We can use it to manipulate attributes, as I am doing here:
pre-proxy {
    # Overwrite Framed-MTU in the proxied copy of the request
    update proxy-request {
        Framed-MTU := "1300"
    }
}
This way the RADIUS server on the other side no longer made its answer packets so large that they would get fragmented and ultimately dropped, and that fixed the issue.