Update:
The issue suddenly reappeared without the Optane installed, so the Optane has been ruled out as the cause. Performance is also vastly improved using a different NFS Server; see further below in the thread.
TL;DR:
Recently, I diagnosed performance problems on @Bumperdoo’s System when editing Video from an NFS-attached TrueNAS Scale System. Performance had been rock-solid, but suddenly became stuttery and there was general instability on the System.
After trying different Kernels and Drivers, RHEL Family Distributions, as well as much NFS troubleshooting with nfs(io)stat and the like, the Problem was solved by removing a Chipset-via-U.2-connected Optane P5800X and repurposing a secondary Samsung 990 Pro as the Boot Drive.
Now I’m trying to wrap my head around how the Optane could have been the cause for the issue.
This post is mainly out of curiosity about the odd solution.
Relevant Hardware:
Workstation:
CPU: AMD Threadripper Pro 5975WX
Motherboard: Supermicro M12SWA-TF
RAM: 8x16 GB Micron DDR4 ECC RDIMM (tested fine with memtest86+)
NIC in CPU PCIe Slot: Intel XXV710 Dual 25Gbit
GPUs, also in CPU PCIe Slots: Dual RTX 4090
I/O: Blackmagic Decklink Mini SDI Card to attach a reference monitor for Color work
Storage (originally): Optane P5800X, with a Samsung 990 Pro as a secondary drive
OS: Fedora 38 / Rocky Linux 9 (issue occurred under both)
Everything has plenty of cooling and Performance of the individual components has been verified.
Server:
AMD EPYC / Supermicro H12-based Custom-built Server with 256GB of DDR4 ECC RDIMMs and another Intel XXV710. Running TrueNAS Scale with a large RAIDZ-2 WD Gold Array and a smaller Flash-based Pool. NFS Set to Standard (Syncing in this use case), no atime for the media datasets.
I won’t go into too much detail here since the Server has been performing fine and continues to do so.
Problem Description and Troubleshooting:
After over half a year of great performance Editing and Rendering Video from the above NAS in Davinci Resolve on the above Workstation, Performance suddenly became stuttery: Video Playback and scrubbing were no longer smooth.
Sometimes Performance looked fine for a short amount of time when starting a fresh workload, but would then quickly degrade again.
A few of the troubleshooting steps attempted:
- Trying different Kernels and Nvidia Drivers
- Making sure Network Performance was as expected (24.7Gbit/s between Client and Server both ways) and that the NIC caused no problems under heavy load or with different packet sizes. All was OK: no packet drops, bad latency or retransmissions
- Changing NFS rsize and wsize, packet buffers on the OS Level, etc.
- Synthetic Sequential Loads were fine, but file transfers did show the same problem.
- Caching on both the Client and Server Side was behaving normally, as before and expected.
- Switching between Fedora 38 and Rocky Linux 9
- Using nfsstat and nfsiostat to closely watch all the great metrics they provide. All was fine, with the one exception of the reported throughput in nfsiostat
- Comparing Performance to another NFS Client - no issues on that one
- Checking whether GPUs or Decklink Card were reporting problems or creating the bottleneck. Neither was the case.
None of the above pointed at anything being wrong or showed signs of unusual behavior.
The one exception was nfsiostat reporting less-than-expected throughput in KB/s, lower than the amount of Video Davinci was actually going through. I could not find any further reason for that after investigating Cache and NIC Behavior / Metrics.
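To cross-check the throughput figure independently of nfsiostat, the raw counters in /proc/self/mountstats can be sampled twice and diffed. Below is a minimal sketch in Python, not the method used during the troubleshooting above. It assumes the usual Linux layout where each NFS mount has a "bytes:" line whose first six fields are normal read/write, direct read/write and server read/write bytes. The function names are made up for illustration.

```python
import re

def parse_mountstats(text):
    """Extract byte counters per NFS mount point from mountstats text."""
    stats = {}
    mount = None
    for line in text.splitlines():
        m = re.match(r"device \S+ mounted on (\S+) with fstype nfs", line)
        if m:
            mount = m.group(1)
            continue
        if line.startswith("device "):
            mount = None  # a non-NFS mount, ignore its lines
            continue
        fields = line.split()
        if mount and len(fields) >= 7 and fields[0] == "bytes:":
            n = [int(x) for x in fields[1:7]]
            stats[mount] = {
                # bytes the application asked for (cached + O_DIRECT)
                "app_read": n[0] + n[2],
                "app_write": n[1] + n[3],
                # bytes actually moved over the wire to/from the server
                "server_read": n[4],
                "server_write": n[5],
            }
    return stats

def rates_kb_s(before, after, seconds):
    """KB/s per counter between two parse_mountstats() samples."""
    rates = {}
    for mount, new in after.items():
        old = before.get(mount)
        if old is None:
            continue  # mount appeared between samples
        rates[mount] = {k: (new[k] - old[k]) / seconds / 1024 for k in new}
    return rates
```

To use it, read /proc/self/mountstats, wait a few seconds under load, read it again and pass both parsed samples to rates_kb_s. If app_read is much higher than server_read, reads are being served from the client's page cache, which would be one possible reason for nfsiostat showing less than what Davinci is consuming.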
Odd Solution / Open Question:
Removing the P5800X and using the 990 Pro as the Boot Drive instead solved the issue instantly. There was no change in configuration on either the Hardware or Software Side.
The only noticeable change in metrics was a correctly reported throughput in nfsiostat.
Now I am left wondering: How could the P5800X, which performs fine in Benchmarks and reports as healthy in its SMART Data, have been the cause of the problem?
It was connected via the Chipset, because the U.2 Connector on the Board is routed that way - maybe that created an underlying issue I wasn’t able to diagnose easily? The Samsung is on CPU PCIe Lanes instead.
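One cheap thing to rule out for a Chipset-attached drive is a downtrained PCIe link. The sketch below is a hedged example, not something that was run as part of this diagnosis. It reads the standard sysfs attributes current_link_speed, max_link_speed, current_link_width and max_link_width of a PCI device. It assumes the speed files start with a number such as "16.0 GT/s PCIe", and the function names are hypothetical.

```python
from pathlib import Path

def read_link(pci_dev_dir):
    """Read negotiated vs. maximum PCIe link parameters from sysfs."""
    d = Path(pci_dev_dir)

    def attr(name):
        return (d / name).read_text().strip()

    return {
        # speed files look like "16.0 GT/s PCIe"; take the leading number
        "cur_speed": float(attr("current_link_speed").split()[0]),
        "max_speed": float(attr("max_link_speed").split()[0]),
        "cur_width": int(attr("current_link_width")),
        "max_width": int(attr("max_link_width")),
    }

def is_downtrained(link):
    """True if the link negotiated a lower speed or width than it supports."""
    return (link["cur_speed"] < link["max_speed"]
            or link["cur_width"] < link["max_width"])
```

For an NVMe drive, the directory would typically be something like /sys/class/nvme/nvme0/device (the index depends on the system). Note that this only covers the drive's own link to its upstream port. It says nothing about contention on the shared uplink between Chipset and CPU, which is the part I could not easily observe.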
Happy to hear any ideas.