Unusual RDS issue that has -everyone- I've talked to so far stumped

Good afternoon, Level1 community. Long-time watcher of Level1 News, but this is my first time posting on the community forum.

I'm reaching out here because I've got a thoroughly strange issue with an RDS deployment in Azure that has so far stumped absolutely everyone I've talked to, including multiple teams at MS themselves, from the bods at Concentrix to the Tek-Expert folks.

In fact, I’ve had support bods at Concentrix tell me my Azure deployment is wrong, and then link MS documentation that clearly indicates that my deployment is in fact correct; but I digress.

So I’m hoping that someone with more practical experience might have seen this before, or at least have a suggestion for resolving the fault.

Our Setup

Azure Traffic Manager -> 2x RDS Gateways -> Azure Load Balancer -> 2x RDS Connection Brokers (in HA) -> Session Hosts

The RDP sessions use SSL encryption. (Well, they should; see below.)

Our Issue

We started seeing an issue at the start of the year where the .rdp files handed to users were occasionally missing the SSL info (and a few other lines). Some investigation showed that if the user got their file from Connection Broker 2, it was missing the pertinent data; any connection that went through Connection Broker 1 produced an intact file.
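To illustrate, the quickest way to see the difference is to save an .rdp file fetched via each broker and diff them (file paths below are just placeholders for wherever you saved the downloads):

```powershell
# Contents of an .rdp file obtained via each connection broker
# (paths are placeholders).
$fromCb1 = Get-Content 'C:\temp\desktop-cb1.rdp'
$fromCb2 = Get-Content 'C:\temp\desktop-cb2.rdp'

# Lists the lines present in one file but missing from the other,
# e.g. the SSL/authentication settings that keep disappearing.
Compare-Object -ReferenceObject $fromCb1 -DifferenceObject $fromCb2
```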

Now, the two connection brokers use an Azure SQL database, and I've gone through the following so far:

1: Dropped and rebuilt Connection Broker 2, adding it back into the RDS farm
2: Verified connections from each connection broker to the Azure SQL DB via SSMS on each of the servers
3: Ran through every PowerShell function I could find pertaining to connection brokers in high availability, checking parameters, connection strings, permissions, roles and such (roughly along the lines of the sketch below)
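For reference, the HA checks were roughly along these lines - broker names here are placeholders, and this assumes you run it from a server in the deployment with the RemoteDesktop module available:

```powershell
Import-Module RemoteDesktop

# HA settings as seen from each broker - the database connection string,
# client access name, etc. should be identical on both.
Get-RDConnectionBrokerHighAvailability -ConnectionBroker 'cb1.contoso.local'
Get-RDConnectionBrokerHighAvailability -ConnectionBroker 'cb2.contoso.local'

# Roles registered in the deployment, plus the certificates each role is using.
Get-RDServer -ConnectionBroker 'cb1.contoso.local'
Get-RDCertificate -ConnectionBroker 'cb1.contoso.local'
```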

As mentioned above, I’ve had multiple MS bods go through it, I’ve had engineers from other firms go through it, and yet no-one seems to have any clue as to why we’re getting these abridged RDP files.

Thus I’m reaching out elsewhere, to see if the tech community has any input.

That is an odd one indeed.

Let me preface this with the following: I am a rando person on the internet, do not take anything I say as fact, I am not responsible for any loss of data, etc. etc.

This might be out there, but have you perhaps checked the registry settings on the servers themselves, along the lines of this post: https://syscenramblings.wordpress.com/2017/11/27/strange-case-of-the-failed-remote-desktop-gateway/

It mentions checking for the RDGClientTransport DWORD value in the registry. It's a long shot, but it might be the difference between the two machines.

If that isn't the cause, you could take a registry export from both machines (or, if you're using GPOs, run gpresult on each) and compare the output between the two.
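Something along these lines, perhaps - just a rough sketch, and the RDGClientTransport path is my best guess from memory of that article, so verify it against the post:

```powershell
# Check whether the RDGClientTransport value from the linked post is set,
# and whether it differs between the two machines (path from memory - verify).
Get-ItemProperty -Path 'HKCU:\Software\Microsoft\Terminal Server Client' `
    -Name RDGClientTransport -ErrorAction SilentlyContinue

# Broader sweep: export a hive on each broker, copy one file over, and diff.
reg export HKLM\SOFTWARE\Microsoft C:\temp\cb1-software.reg /y
# ...run the same on the other broker (cb2-software.reg), then:
Compare-Object (Get-Content C:\temp\cb1-software.reg) (Get-Content C:\temp\cb2-software.reg)
```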

Lastly, and perhaps most importantly: back up the correctly working machine. Multiple times, or at the very least to different locations if possible :wink:

Apologies for the delay in getting back to you.

I read the article; very interesting, but sadly not relevant to the issue at hand.

Thanks for looking though!

And yes, the fully working CB gets backed up daily to two regions :slight_smile:

No worries - life/work can get in the way of things at times :slight_smile:

Hmm, OK, well, if that didn't work - and this may sound odd - have you tried checking the Microsoft documentation "Troubleshooting an RDP general error to a VM in Azure"? It seems far-fetched, but it also lists the correct registry key for the Terminal Server (pardon, Remote Desktop Server).

Source: https://docs.microsoft.com/en-us/azure/virtual-machines/troubleshooting/troubleshoot-rdp-general-error

Registry-key location: HKLM\SYSTEM\CurrentControlSet\Control\Terminal Server (and below)

You could compare the registry-key settings under that path between the correctly working broker and the non-working one (perhaps there is a difference?). Again, there shouldn't be a difference, but just check to be absolutely certain.

Also, I'm not sure if I missed it, but did running GPResult give the same results on both servers? (Source: https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/gpresult) This in particular with regard to their GPOs, etc.
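A rough sketch of what I mean (file paths are just examples):

```powershell
# Dump the Terminal Server key (and everything below it) on this broker...
reg export "HKLM\SYSTEM\CurrentControlSet\Control\Terminal Server" C:\temp\ts-cb1.reg /y

# ...do the same on the other broker (ts-cb2.reg), copy it across, then diff:
Compare-Object (Get-Content C:\temp\ts-cb1.reg) (Get-Content C:\temp\ts-cb2.reg)

# And an HTML GPResult report from each server for a side-by-side comparison.
gpresult /h C:\temp\gpresult-cb1.html /f
```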

@Aprazeth - Yup. Went through that, and compiled RDS tracing logs for every option (some GWs on, some off, some CBs on, some off, etc.)

using this: https://github.com/CSS-Windows/WindowsDiag/tree/master/UEX/RDSTracing

And again, nothing has pointed to anything apparent/obvious.

Also built an entirely new CB and added it into the HA setup, the thinking being that a different DNS name would avoid any potential bad data/config left behind in the DB (roughly as sketched below).

No joy.
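For completeness, the broker swap went roughly like this - server names are placeholders, and your deployment may need extra steps around certificates and DNS:

```powershell
Import-Module RemoteDesktop

# Pull the misbehaving broker out of the deployment...
Remove-RDServer -Server 'cb2.contoso.local' -Role 'RDS-CONNECTION-BROKER' `
    -ConnectionBroker 'cb1.contoso.local'

# ...and join the freshly built broker (different DNS name) in its place.
Add-RDServer -Server 'cb3.contoso.local' -Role 'RDS-CONNECTION-BROKER' `
    -ConnectionBroker 'cb1.contoso.local'
```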

In that case, unless someone else comes up with something, I can only think of a 'wipe and rebuild' scenario - completely start from scratch (including the DB).

This is a really odd one indeed, and I think you’ve probably already spent way too much time on trying to figure this out and fix it. Perhaps it’s time to cut your losses? :frowning:

Yeah, a wipe and rebuild is one course of action, but we're hampered there by a few issues - not least of all that this is a live production environment that isn't technically "broken", it just doesn't work fully - so we'd need a full downtime weekend to do the work; and all of our clients have massively embraced a 'flexible working' mentality because of the lockdown…

There's a material cost as well, since it's not just replacing the servers in question, but also the SSL certificates and the extra run-time costs for storage, compute, IPs, etc. (not much in the grand scheme of things, but like all IT departments, I'm supposed to work magic and miracles with pocket lint, tatty string and a few mouldy beans found in the back of the fridge…)

And that's not getting into the fact that I'd want to get paid for the weekend's work!

Aaand the long, long list of RDS user CAL packs that would need to be migrated as well - and if you've ever done a licensing server > licensing server migration, you'll know how much of a frustration that can be :frowning:

… I can feel your pain. Good luck!

That said, you could start building the new environment while the other is still running (I forget the name for it, but you could isolate it network-wise to avoid naming duplicates). Once everything is built and ready, you could pre-emptively set the DNS TTL to a lower value, so when you're ready to flip the switch the roll-over should be relatively 'painless' (as if it ever is…). As for the SKU/licensing migrations - well, since MS themselves couldn't sort this, have them assist you there? :wink:
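Purely as an illustration of the TTL part - assuming the Az.TrafficManager module and made-up profile/resource-group names - it would be something like:

```powershell
# Lower the DNS TTL on the Traffic Manager profile in front of the gateways
# ahead of the cutover, so the switch to the new environment propagates quickly.
$tmProfile = Get-AzTrafficManagerProfile -Name 'rds-entrypoint' -ResourceGroupName 'rds-rg'
$tmProfile.Ttl = 60
Set-AzTrafficManagerProfile -TrafficManagerProfile $tmProfile
```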