Weird Network Problem with Windows Server 2012 R2

Hi there,

So we have a 600/40 Mbit cable connection at the company (sadly that is the fastest available).

We have a Lancom 1781EF+ router connected to a Hitron cable router from the ISP, which serves us the static public IPs.

So at the Hitron we get 600 Mbit, no problem.
Directly at the Lancom we get around 560 to 600 (probably some random load once in a while).

But then there is the server, which is the following:
20 cores
196 GB RAM
14x 2 TB SSDs in RAID 6 on an Adaptec controller (peaks at around 6.5 GB/s write speed)
1 Gbit Intel NIC with VMQ and SR-IOV (I tried both disabled and enabled)
It runs the Hyper-V host, which contains 4 VMs.

The VMs all connect to the outside and to each other over an external Hyper-V switch (which reports 10 Gbit).
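In case it helps to see the exact state, something like this PowerShell (a sketch; output will vary by driver and names) dumps the host-side NIC, VMQ/SR-IOV and virtual switch settings:

```powershell
# Physical NIC: what is actually enabled for VMQ / SR-IOV on the host side
Get-NetAdapter -Physical | Format-Table Name, InterfaceDescription, LinkSpeed, Status
Get-NetAdapterVmq
Get-NetAdapterSriov

# The virtual switch and each VM's vNIC (IovWeight > 0 means the vNIC asks for an SR-IOV VF)
Get-VMSwitch | Format-List Name, SwitchType, IovEnabled, NetAdapterInterfaceDescription
Get-VM | Get-VMNetworkAdapter | Format-Table VMName, SwitchName, IovWeight, VmqWeight, Status
```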

Copying something over the network tops out at roughly 300 MB/s.

And the internet is even more confusing:
it starts slow on the DSLReports speed test and gets faster after a while (sadly I cannot upload the picture for some reason).

Speedtest.net even just shows a 100 Mbit downstream.

So there seems to be something amiss with the server settings. I honestly have no idea what.
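One way to take the browser test sites out of the equation is a plain download from the host itself. A rough sketch (the URL is only a placeholder; point it at any large test file you trust):

```powershell
# Rough end-to-end download check from the host, independent of any browser test site.
# NOTE: the URL is only a placeholder - use any large test file you trust.
$url  = "http://example.com/100MB.bin"
$dest = "$env:TEMP\dl-test.bin"
$time = Measure-Command { Invoke-WebRequest -Uri $url -OutFile $dest }
$mbit = ((Get-Item $dest).Length * 8 / 1e6) / $time.TotalSeconds
"{0:N0} Mbit/s average" -f $mbit
Remove-Item $dest
```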

The internal VMs are:
- 101 Fileserver and Domain Controller (also DNS)
- 102 Remote Desktop Gateway
- 103 RDSH
- 104 Failover RDSH

The DNS server on the DC is configured with the provider IPs as forwarders, and the DC's own NIC points to loopback, with 1.1.1.1 as the second DNS.

All other domain devices naturally point to the DC as DNS and to the router as the gateway.
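For reference, roughly how to verify that picture from the DC itself (a sketch; the forwarder check needs the DnsServer module, which is present with the DNS role). As an aside, the usual guidance is to keep external resolvers like 1.1.1.1 as forwarders on the DNS server rather than as a client DNS entry on the DC's NIC, though that should not explain the throughput problem:

```powershell
# On the DC: what its own NIC uses as DNS (loopback / itself should come first)
Get-DnsClientServerAddress -AddressFamily IPv4 | Format-Table InterfaceAlias, ServerAddresses

# What the DNS server role forwards unresolved queries to (ideally the provider IPs)
Get-DnsServerForwarder

# Quick check that external resolution works through the local DNS service
Resolve-DnsName www.microsoft.com -Server 127.0.0.1
```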

As far as I know this should be the correct setup, but I might be wrong.

Any ideas?

What is the network configuration?

Are you using Hyper-V native link aggregation, or have you opted for one SR-IOV-assigned NIC port per VM?

If you are using aggregation, are you using Hyper-V Port load balancing, as you typically should?
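(For reference, if a team were in play, something like this would show the mode. Only relevant if LBFO teaming is actually configured:)

```powershell
# Lists any LBFO teams and the load-balancing mode they use (e.g. HyperVPort)
Get-NetLbfoTeam | Format-List Name, TeamingMode, LoadBalancingAlgorithm, Members
```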

So are you saying the speed varies all the time, or does the speed test reflect the change from the period when it's fast versus when it's slow?

Do you have the ability to see the CPU load of the Lancom when transfers are made at the slow speed, to see if the ASIC is running high?

It is one 1 Gbit link as an external Hyper-V switch for all VMs. They all have SR-IOV on.

The Lancom doesn't really care what's happening (normal CPU load), and anyway the Lancom has nothing to do with the network performance within Hyper-V.

Oh, and no LAG; it is just one 1 Gbit port.

Let me try to show our network a little better:
- Hitron cable modem
-- Lancom router
--- First Gbit switch (1 Gbit copper)
---- several clients, nothing of importance
--- Second Gbit switch, connected over SFP (1 Gbit)
---- several clients
---- Hyper-V host server (on a 1 Gbit connection, Intel I320)
----- Domain Controller VM (2012 R2)
----- RD Gateway Server VM (2012 R2)
----- RDSH VM (2012 R2)
----- RDSH Failover VM (2012 R2)
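A quick way to sanity-check the path and latency from the Hyper-V host towards the Lancom (the address below is just a placeholder for the router's LAN IP):

```powershell
# Trace the path and latency from the Hyper-V host to the router
# (192.168.1.1 is a placeholder - use the Lancom's actual LAN IP)
Test-NetConnection 192.168.1.1 -TraceRoute
```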

Interesting, so you're using a 1 Gbit link for 4 VMs?

I didn't know that configuration was possible. I've seen Hyper-V link aggregation used on, say, two 1 Gbit ports to give high-latency services like ADCS extra pipe for multiple requests, but never the inverse: using a 1 Gbit virtual switch as a trunk port for 4 servers on a single port.

As far as I knew, that config was not supported, but you say it is working for you, albeit performing poorly.

Let me know if that description sounds accurate and I’ll see what I can find based on my work on this in the past.

Exactly, it is one 1 Gbit connection, which offers up to 7 SR-IOV instances/queues.

4 of those are assigned to the VMs.

It has actually been working stably for several years now; we just noticed this weird behaviour after upgrading the internet connection to 600 Mbit.

Transfer from host to VM over an SMB share peaks at around 280 MiB/s (which is also kinda slow; the RAID is capable of much more).

But the internet really sucks at 130 Mbit, even on the host.

Oh, by the way, the Hyper-V virtual switch is at 10 Gbit.
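For the host-to-VM SMB numbers, it might also be worth checking whether SMB Multichannel and RSS are actually in play during a copy (a sketch, run on the machine doing the copy):

```powershell
# Which interfaces SMB is actually using for the running copy, and their capabilities
Get-SmbMultichannelConnection
Get-SmbClientNetworkInterface | Format-Table InterfaceIndex, RssCapable, RdmaCapable, Speed

# RSS state on the adapters (helps large single-stream copies scale across cores)
Get-NetAdapterRss
```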

So this issue started after the upgrade.

Is it possible there’s a problem with the adjacency between the two devices, versus the one that existed with the slower device that preceded it?

Is there a way within the interfaces of your devices to confirm what type of negotiation was used, and whether STP issues exist that could be slowing down communication with your server, given its unique configuration?
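From the host side, roughly this would show what the NIC negotiated and any driver-level speed/duplex overrides (a sketch; STP state itself has to be checked on the switches or the Lancom):

```powershell
# What the host NIC actually negotiated, plus any driver-level speed/duplex overrides
Get-NetAdapter -Physical | Format-Table Name, LinkSpeed, FullDuplex, MediaConnectionState
Get-NetAdapterAdvancedProperty | Where-Object { $_.DisplayName -like "*Speed*" } |
    Format-Table Name, DisplayName, DisplayValue
# (STP has to be checked on the switches / the Lancom, not from the host.)
```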

No, it was probably always there.

We do get 550 Mbit from the Lancom directly (if I connect a device directly to the Lancom or one of the switches without joining the domain), so the Lancom and the Hitron work together perfectly fine.

There seems to be something amiss within the Hyper-V host and its network. Sadly this server entered the company way before me, and I have no real idea where to start looking apart from the usual stuff (updating Windows/drivers, checking cables, checking IP configuration, disabling VMQ/SR-IOV); so far no luck.
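In case it saves anyone some typing, this is roughly how those toggles look in PowerShell (a sketch; the VM and adapter names are placeholders, and the vNIC may need a reconnect or the VM a restart for the change to take effect):

```powershell
# SR-IOV off for one VM's vNIC (0 = never use a VF) - the VM name is a placeholder
Set-VMNetworkAdapter -VMName "VM-101" -IovWeight 0

# Toggle VMQ / SR-IOV on the physical adapter for testing - adapter name is a placeholder
Disable-NetAdapterVmq -Name "Ethernet"
# Enable-NetAdapterVmq -Name "Ethernet"
Disable-NetAdapterSriov -Name "Ethernet"
# Enable-NetAdapterSriov -Name "Ethernet"
```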

Well, I'm assuming you are using an internal switch-based network segment for server communications/SMB on the domain?

Of the setups I've seen using MSFT services, those are almost always partitioned onto their own network segments, away from the internet.

If I were configuring a server like that, I wouldn't even use a VLAN/switchport-based internal network for this sort of thing. I would use a virtual Hyper-V network for inter-server communications, thus removing the bottleneck that real networking equipment inevitably has.

Then, users' machines would be on their own segment, or in a less strictly isolated network with internet access and routing to the internal network for communication with the servers. That's sort of the classic way. Also, since all your servers are already on one VM host, using Hyper-V virtual networks is like "free networking equipment" you don't have to buy.
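If anyone wanted to try that layout, the gist in PowerShell would be something like this (a sketch; the switch and VM names are made up, and each guest then needs an IP on the new segment):

```powershell
# A private switch lives entirely in software on the host, so VM-to-VM traffic
# never touches the 1 Gbit physical NIC
New-VMSwitch -Name "Servers-Internal" -SwitchType Private

# Give each server VM a second vNIC on that switch (VM names are placeholders)
"DC01", "RDGW01", "RDSH01", "RDSH02" | ForEach-Object {
    Add-VMNetworkAdapter -VMName $_ -SwitchName "Servers-Internal"
}
# Inside each guest, assign a static IP on a dedicated subnet, e.g. 10.10.10.0/24
```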

Those changes probably aren’t available in this situation.

I would also ask why you aren't using the 4 gigabit ports available on your network card, one for each of these servers, with the on-board networking for the host.

Given the config you already listed, I would think that would be the default option. Then, from there, you can assign discrete ports to each server and not use a single NIC as a trunk. I would be very surprised not to see speed and latency issues using a trunk in this way. But I'm coming from a place where that was explicitly not a supported option anyway; I'm curious how someone got it to work in the first place, and whether support for that was added sometime when I wasn't looking.

The Microsoft services I've used with this edition of Server were not running all that great even on very similar equipment to yours, using independent ports for each server (and groups of two ports in some cases).

Also, if it is possible to take a screenshot of the NIC config, it might make this easier to research.
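A plain-text dump works just as well as a screenshot for that, e.g. (a sketch):

```powershell
# Dumps every driver-level setting of the NICs to a text file that can be pasted here
Get-NetAdapterAdvancedProperty |
    Select-Object Name, DisplayName, DisplayValue |
    Out-File "$env:USERPROFILE\Desktop\nic-config.txt"
```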

It is an on-motherboard solution; it only offers one 1 Gbit port. Weird for a server board, I know, but that's sadly what I have to work with.

So I need to get exactly this configuration running, or somehow convince the boss to buy a new network card.

Not really an option at the moment (corona, you know).

So I somehow have to get the software side fixed up. I don't see why it should not be possible to get more than 300 MB/s internal communication and more than 130 Mbit internet.

Wow, that is really unusual. Especially from a motherboard that supports so much memory. Even consumer boards supporting less memory often have 2.

There might be a way. There isn’t a specific reason that it should not be possible, other than it potentially not being a priority to design into the software as a result of a lack of demand for that feature.

Is it possible that 130 Mbit is the highest the site you're using is capable of measuring? I can't say I've ever run this test on equipment with that transfer speed.

Also, a 1 Gbit NIC is, at most, going to give you about 125 MB/s on the wire (more like 115 to 118 MB/s of real payload). You say you're getting 300 MB/s?
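The back-of-the-envelope math, for what it's worth (the overhead figure is only a rough estimate):

```powershell
# Back-of-the-envelope ceiling for a single 1 Gbit/s link
$lineRateBits = 1e9                      # 1 Gbit/s
$rawMBps      = $lineRateBits / 8 / 1e6  # 125 MB/s (decimal megabytes), before any overhead
$usableMBps   = $rawMBps * 0.94          # ~6% Ethernet/IP/TCP overhead, rough estimate
"{0:N0} MB/s raw, roughly {1:N0} MB/s usable" -f $rawMBps, $usableMBps
```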

The 300 MB/s is on the internal Hyper-V switch; it offers 10 Gbit connections to all VMs, so the VMs can use 10 Gbit between each other.

The internet connection would be limited to the 1 Gbit (well, and everything that is not using RD).

The internet speed test site is fine: I get more when I am not using the server and connect directly to the router. I also get more when I take out the Hyper-V connections for a test. So something with the Hyper-V switch seems to be the problem.

It is really confusing.