My first impressions of the HP ZBook Ultra G1a (Ryzen AI Max+ 395, Strix Halo, 128 GB)

I got my HP ZBook Ultra G1a (395, 128 GB version) a month ago for my research: manipulating big matrices (which needs a large memory capacity) and running FDTD simulations (which need high memory bandwidth). For those two primary workloads, I think Strix Halo fits quite well among the laptops currently on the market.

The following are my short impressions, focusing on performance numbers.

OS: Windows 11 Pro 24H2
Power plan: Best performance mode except as noted

0. Note on the power draw of AI Max+ 395 APU on Zbook Ultra G1a

In Best performance mode, a continuous all-core CPU load draws a peak of ~80 W and sustains ~70 W for a few minutes. The draw then gradually comes down, reaching ~45 W after about 30 minutes of running, with all-core clocks roughly 10% lower than at the start.

Under GPU load (such as running an LLM), the same pattern applies: it starts at ~80 W, stays at ~70 W for a while, then gradually drops to ~45 W.

1. CPU-Z, Cinebench R23, 7-Zip, 3DMark

(Score screenshots: CPU-Z bench, Cinebench R23, 7-Zip bench, Fire Strike, Time Spy)

2. Home-made FDTD calculation (comparison with CPU workstations)
FDTD (finite-difference time-domain) is a memory-bandwidth-bound algorithm for the numerical simulation of electrodynamics.
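To make the bandwidth argument concrete, here is a minimal 1D Yee-update sketch in NumPy (not my actual solver, which is home-made and 3D; the grid size and constants here are purely illustrative). Each time step streams the whole field arrays through memory while doing only a couple of flops per element, which is why memory bandwidth, not core count, sets the pace:

```python
# Minimal 1D FDTD sketch (illustrative, not the solver benchmarked below).
import time
import numpy as np

n = 50_000_000            # large enough to fall out of every cache level
ez = np.zeros(n)          # electric field
hy = np.zeros(n)          # magnetic field
ez[n // 2] = 1.0          # initial pulse
c = 0.5                   # Courant factor (normalized units)

def step(ez, hy):
    hy[:-1] += c * (ez[1:] - ez[:-1])   # H update from the curl of E
    ez[1:]  += c * (hy[1:] - hy[:-1])   # E update from the curl of H

steps = 10
t0 = time.perf_counter()
for _ in range(steps):
    step(ez, hy)
dt = time.perf_counter() - t0

# Crude traffic estimate: ~6 array passes of 8-byte floats per step
# (NumPy temporaries add even more in practice).
traffic_gb = steps * 6 * n * 8 / 1e9
print(f"{steps / dt:.2f} steps/s, ~{traffic_gb / dt:.0f} GB/s effective")
```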

Results (time steps per second)

  • AI Max+ 395 (256-bit LPDDR5X-8000): 10.4
  • 2× EPYC 9654 (24ch DDR5-4800): 54.31
  • TR 5995WX (8ch DDR4-3200): 12.1
  • i9-7920X (4ch DDR4-2933): 4.49

It is amazing to see that this small laptop gives about 85% of the performance of a TR 5995WX workstation (10.4 vs 12.1 steps/s) in a memory-bandwidth-bound workload.

3. Local LLM and memory bandwidth

I’m a newbie at running local LLMs. I used LM Studio and just followed the simple instructions, so please note that the results could be misleading in some details.

The following is a result from the Phi-4-reasoning-plus Q8 (15.5 GB) model, asked to evaluate an integral using complex analysis. The context window was set to 24k, and Vulkan was used to run on the GPU. (The integral is quite tricky, yet the answer is correct. Amazing. :astonished:)
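(For anyone who wants to script this kind of run instead of using the GUI: LM Studio exposes an OpenAI-compatible local server, by default at http://localhost:1234/v1. A minimal sketch; the model identifier and prompt below are placeholders, not the exact ones I used:)

```python
# Sketch: querying a model loaded in LM Studio via its local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="phi-4-reasoning-plus",  # placeholder; use the identifier LM Studio shows
    messages=[{"role": "user",
               "content": "Evaluate this integral using complex analysis: ..."}],
)
print(response.choices[0].message.content)
```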


I’ve heard that memory bandwidth matters for LLMs, and this laptop shows a 205 GB/s read bandwidth while running the LLM, which is more than 80% of the theoretical peak (256-bit × 8000 MT/s = 256 GB/s).
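A quick sanity check on that number: during decode, every generated token has to stream (at least) all the model weights from memory once, so read bandwidth divided by model size gives a rough ceiling on tokens per second (ignoring KV-cache traffic and compute):

```python
# Back-of-the-envelope decode-speed ceiling: tokens/s <= bandwidth / model size.
bandwidth_gb_s = 205   # observed read bandwidth
model_gb = 15.5        # Phi-4-reasoning-plus Q8 weights
print(f"upper bound: {bandwidth_gb_s / model_gb:.1f} tokens/s")  # ~13.2
```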

One interesting thing is that, in my experience, setting a large dedicated GPU memory size is not that important. The laptop was able to load Llama 3.3 70B Q8 (~75 GB) with just 32 GB of dedicated GPU memory; the rest of the data was loaded into the “shared” GPU memory. The same memory bandwidth (~200 GB/s) was observed in this case as well.

4. COMSOL Multiphysics

For benchmark details, you can refer to the following topic.

I have run the CFD-only model, and here are the results (the corresponding launch commands are sketched after the list).

  • 36m 48s (-np 16)
  • 35m 56s (-np 16 -blas aocl)
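(For reference, these correspond to launch commands along the lines of the sketch below; -np and -blas aocl are the flags from the runs above, while the file names are placeholders that depend on your installation:)

```python
# Sketch: launching the two COMSOL batch runs; requires comsolbatch on PATH.
import subprocess

for extra in ([], ["-blas", "aocl"]):   # second run selects AMD's AOCL BLAS
    subprocess.run(
        ["comsolbatch", "-inputfile", "cfd_benchmark.mph",   # placeholder file names
         "-outputfile", "out.mph", "-np", "16", *extra],
        check=True,
    )
```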

During the benchmark, the observed peak read bandwidth was ~72 GB/s.

5. Things that make squeezing out performance tricky on Windows

  • Regardless of the power plan, the second CCD remains parked by default, even when running on AC power, and it doesn’t wake up unless all 16 threads of the first CCD (8 cores + 8 SMT threads) are fully occupied. As a result, a 16-threaded program on its own won’t activate the second CCD. I’m not sure whether this behavior is controlled by AMD or HP, but I hope this policy will be changed later.

  • So, to make use of 16 threads across the two CCDs while running the COMSOL benchmark, I had to use Process Lasso to manually wake up the second CCD (a powercfg-based alternative is sketched right after this list).

  • It would be best if HP provided an option to disable SMT in the BIOS, but I could not find one. Considering this laptop is intended for workstation use, I find this rather disappointing.
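For those who’d rather avoid third-party tools, core parking can in principle also be relaxed with the built-in powercfg utility. A hedged sketch (I used Process Lasso, so this exact sequence is untested on this machine; run it from an elevated prompt):

```python
# Sketch: disabling core parking on AC via powercfg's CPMINCORES setting
# ("Processor performance core parking min cores"; 100 = keep all cores unparked).
import subprocess

cmds = [
    # unhide the setting in the Power Options UI (optional)
    ["powercfg", "-attributes", "SUB_PROCESSOR", "CPMINCORES", "-ATTRIB_HIDE"],
    # require 100% of cores to stay unparked on AC power
    ["powercfg", "/setacvalueindex", "SCHEME_CURRENT", "SUB_PROCESSOR", "CPMINCORES", "100"],
    # re-apply the active scheme so the change takes effect
    ["powercfg", "/setactive", "SCHEME_CURRENT"],
]
for cmd in cmds:
    subprocess.run(cmd, check=True)
```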


Nice!!
This is beating the Apple chips tested here in performance per watt.

That is really odd; this seems like an AGESA thing to be refined in the future.


I hope so.

(To be more specific, the second CCD doesn’t wake up until all 16 threads in the first CCD are fully occupied.) I’m curious whether this behavior can also be observed on Linux.

It seems that the current core parking policy is designed for better performance-per-watt, but it can actually hurt performance when SMT doesn’t provide much benefit, which is frequently the case in numerical calculations.

Also, setting CPU affinity doesn’t help in waking up the second CCD. So far, the only method that has worked for me is manually disabling core parking with third-party software.
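(For concreteness, this is the kind of affinity mask I tried, sketched here with psutil; the assumption that even logical-CPU IDs map to one thread per physical core matches how Windows usually enumerates SMT siblings, but verify it on your own machine:)

```python
# Sketch: pinning the current process to one thread per physical core
# across both CCDs on a 16C/32T chip (logical CPUs 0..31 assumed).
import psutil

p = psutil.Process()                     # or psutil.Process(pid) for another process
p.cpu_affinity(list(range(0, 32, 2)))    # even IDs = first SMT thread of each core
print(p.cpu_affinity())
```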


Impressive. That LPDDR5X bandwidth is awesome. Technically like 3 channels of 8000 MT/s? Seems low for 4 channels at 8000.

Yeah, I noticed first-hand lately that DDR4 really can’t compete against the doubled clocks of DDR5. If you’re memory-bound, CPU clocks don’t mean shit. And 128 GB in a stock laptop is getting scary indeed.

Good stuff, keep it up!


Yes. In FDTD, the result is comparable to 3ch 8000 MT/s. Presumably, this is due to the higher latency of LPDDR5x compared to normal DDR5. Thanks for the comment.

I have the 16-core variant and tested this with OpenBLAS on AC:

On Linux, CCD1 is preferred, but the scheduler has no problem waking up CCD2 if necessary. With 8 threads, it looks like (just eyeballing htop, no hard data) the OpenBLAS operations are all scheduled on CCD1 and everything else on CCD2. With 10 threads, CCD2 is used much more.
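(A minimal sketch of this kind of OpenBLAS load, in case anyone wants to reproduce the observation; the matrix sizes are illustrative, and OPENBLAS_NUM_THREADS must be set before NumPy is imported:)

```python
# Sketch: multi-threaded OpenBLAS load via NumPy (NumPy must be linked
# against OpenBLAS for the thread-count variable to apply).
import os
os.environ["OPENBLAS_NUM_THREADS"] = "10"   # try 8 vs 10 to watch CCD2 wake up

import numpy as np

a = np.random.rand(8000, 8000)
b = np.random.rand(8000, 8000)
for _ in range(20):                         # long enough to watch in htop
    a @ b
```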

I have the feeling that the scheduler always sends the heavier tasks to CCD1 and everything that “overflows” to CCD2. I also observed marginally higher clocks on CCD1. I guess AMD did some binning here and CCD1 is the better bin.

Under normal usage and on battery, CCD2 often sleeps most of the time.


That’s really great. CCD2 seems to be utilized much more efficiently on Linux than on Windows; that’s exactly how I hope it will eventually work on Windows as well. Thank you for the comment!

That is true, and I understand this is what AMD does in multi-CCD processors for desktops and laptops (not sure about EPYC and Threadripper). CCD0 on mine shows a peak frequency of 5.15 GHz, while the second CCD peaks at around 4.5 GHz.

Anyway, the following screenshots demonstrate how the CCDs behave in Windows while running a 16-threaded program, run on AC power in Best performance mode.

CPU affinity applied to allocate threads only to physical cores across the two CCDs. This shows the affinity setting doesn’t work for waking up CCD2.

Running without any configuration applied.

Forced not to use SMT, with core parking disabled using Process Lasso.