The Ultimate Arch + Secureboot guide for Ryzen AI Max (ft. HP G1A 128gb 8060S monster laptop)

wendell · May 14, 2025, 4:59pm

Introduction

Wow! What a ride. I’ve recently reviewed two very different portable computing devices built around 128GB of RAM and the 16-core Ryzen AI Max+ 395.

This guide is focused around the very capable HP G1A laptop. It looks like a thin-n-lite laptop but in reality it’s kind of an intimidating monster. It is qualified for Ubuntu 24.04 LTS by HP but this guide is mainly about getting Arch Linux working on it, with all the bells and whistles:

Working Fingerprint Logins
BTRFS Snapshots and Subvolumes
Suspend/Resume
Hibernate (!!!)
Mediatek Wifi Reasonably Stable
ISP Camera Working (TODO)

This HP laptop can be a first class Linux experience. I get why AMD’s codename for this is Strix Halo.

The keyboard backlight controls? They work.
The keyboard screen brightness controls? Yep.
Microphone and speaker control? Yes
TLP and power management? Flawless.
And Fingerprint Reader?!?!? Also flawless.

This is a first class Linux laptop. The weakest point is the Mediatek Wifi, but as of 5/10/2025 this guide will walk you through what you need for the Mediatek Wifi to work properly even in Wifi 7 scenarios.

I’ve noticed that this laptop is somewhat misunderstood. Machines with up to 16 cores and 128GB of RAM are usually bulky enough that “portable desktop” describes them better than “laptop”. Not so for this one!

I hope that doesn’t discourage product designers. This laptop is effectively a desktop replacement, even for people like me.

Note that even though this guide was essentially written for the HP G1A, it will mostly apply to other AMD Strix Halo devices out there, too.

Repairability

Even though the RAM is soldered, the SSD and battery are easily replaceable. HP has a good field-service manual. Props for that; it gives the ground truth on the repairability and serviceability of this machine.

Arch Install

This was done using the March 01 ArchInstall ISO. You should be familiar with the general Arch Install document and the Arch Wiki.

Networking

iwctl
# then inside iwctl:
device list
station wlan0 scan
station wlan0 get-networks
station wlan0 connect YOUR_SSID

Disk Partitioning

I want sleep-then-hibernate to work to preserve battery life. There are security implications of writing memory to disk, and this guide doesn’t cover whole disk encryption. If you explore that please comment, though.

Because of that I often opt for a separate swap partition. In this case we use 128gb and hibernating takes a while to write that out.

cgdisk /dev/nvme0n1

EFI: 512MB (type EF00)
Root: rest of the disk minus 132G, i.e. end sector -132G (type 8300)
Swap: the remaining space, ≥ your RAM, e.g. 128GB plus padding (type 8200)

You can specify a negative number for the ending sector with cgdisk. So -132G will give us 128G + padding. If your laptop is 32G or 64G you can size appropriately. I like to give it an extra 1-4GB of padding just because I’m paranoid, but exact sizing should be fine here.
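As a rough sketch of the sizing rule above (RAM rounded up to a whole GiB, plus a few GiB of padding), the arithmetic can be scripted. The helper name swap_size_gib is made up for this example and is not part of any tool; the padding value is whatever you are comfortable with.

```shell
# Hypothetical helper: suggest a swap size in GiB for hibernation.
# $1 = MemTotal in KiB (as reported by /proc/meminfo), $2 = padding in GiB.
swap_size_gib() (
    mem_kib=$1
    pad_gib=$2
    ram_gib=$(( (mem_kib + 1048575) / 1048576 ))   # round up to a whole GiB
    echo $(( ram_gib + pad_gib ))
)

# Example: suggestion for the machine you are running on, with 4 GiB of padding.
mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "suggested swap: $(swap_size_gib "$mem_kib" 4)G"
```

MemTotal is normally a little below the installed RAM because some memory is reserved, so the suggestion can come out slightly under the nominal size; rounding up to the nominal RAM size is the conservative choice.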

Formatting:

mkfs.fat -F32 /dev/nvme0n1p1         # EFI
mkfs.btrfs /dev/nvme0n1p2             # Root
mkswap /dev/nvme0n1p3

The other thing I like to do with BTRFS is use snapshots and subvolumes, rather than partitions. We’ll come back to that. I like this approach better than having a separate /home partition.

Subvolumes + Snapshots

Use subvolumes instead of directories for /home, /var/log, etc., even without separate partitions. This helps with snapshotting and system rollback.

mount /dev/nvme0n1p2 /mnt

btrfs subvolume create /mnt/@
btrfs subvolume create /mnt/@home
btrfs subvolume create /mnt/@log
btrfs subvolume create /mnt/@cache
btrfs subvolume create /mnt/@pkg

#  I used to do this last one manually, but snapper now
#  has to do it, so don't create @snapshots if you use snapper:
#  btrfs subvolume create /mnt/@snapshots

umount /mnt

Now we can mount our /mnt to prep for the pacstrap

mount -o compress=zstd,noatime,space_cache=v2,ssd,subvol=@ /dev/nvme0n1p2 /mnt

mkdir -p /mnt/{boot,home,var/log,var/cache,var/lib/pacman/pkg}

mount -o compress=zstd,noatime,space_cache=v2,ssd,subvol=@home       /dev/nvme0n1p2 /mnt/home
mount -o compress=zstd,noatime,space_cache=v2,ssd,subvol=@log        /dev/nvme0n1p2 /mnt/var/log
mount -o compress=zstd,noatime,space_cache=v2,ssd,subvol=@cache      /dev/nvme0n1p2 /mnt/var/cache
mount -o compress=zstd,noatime,space_cache=v2,ssd,subvol=@pkg        /dev/nvme0n1p2 /mnt/var/lib/pacman/pkg


mount /dev/nvme0n1p1 /mnt/boot

Once pacstrap has run (next section), genfstab -U /mnt >> /mnt/etc/fstab should give you an fstab along these lines. Verify that yours looks proper:

UUID=ROOT_UUID   /              btrfs  rw,noatime,compress=zstd,space_cache=v2,ssd,subvol=@           0 0
UUID=ROOT_UUID   /home          btrfs  rw,noatime,compress=zstd,space_cache=v2,ssd,subvol=@home       0 0
UUID=ROOT_UUID   /var/log       btrfs  rw,noatime,compress=zstd,space_cache=v2,ssd,subvol=@log        0 0
UUID=ROOT_UUID   /var/cache     btrfs  rw,noatime,compress=zstd,space_cache=v2,ssd,subvol=@cache      0 0
UUID=ROOT_UUID   /var/lib/pacman/pkg btrfs rw,noatime,compress=zstd,space_cache=v2,ssd,subvol=@pkg    0 0

UUID=EFI_UUID    /boot          vfat   defaults                                                       0 2
UUID=SWAP_UUID   swap           swap   defaults                    0 0

Use blkid to spot check UUIDs if anything says ‘none’. If swap says ‘none’ you forgot to mkswap. Do that now and put the UUID in the fstab.
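One more thing worth checking in the generated file is the last column. btrfs has no boot-time fsck pass, so btrfs lines normally end in 0 (this comes up later in the thread as well). A minimal sketch of that check follows; the function name btrfs_nonzero_pass is invented for this example.

```shell
# List btrfs mount points in an fstab-format file whose sixth field
# (the fsck pass number) is not 0. Comment lines are skipped.
# $1 = path to the fstab file.
btrfs_nonzero_pass() {
    awk '!/^[[:space:]]*#/ && NF >= 6 && $3 == "btrfs" && $6 != "0" { print $2 }' "$1"
}

# Example against the freshly generated file (prints nothing if all is well):
# btrfs_nonzero_pass /mnt/etc/fstab
```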

The Rest of the Arch Install

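# grub, efibootmgr, grub-btrfs and sudo are all used later in this guide
pacstrap -K /mnt base linux linux-firmware grub efibootmgr grub-btrfs networkmanager vim sudo snapper mokutil
genfstab -U /mnt >> /mnt/etc/fstab
# write the hostname into the new system, not the live ISO
echo "your_laptop_name" > /mnt/etc/hostname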


arch-chroot /mnt

ln -sf /usr/share/zoneinfo/Region/City /etc/localtime
hwclock --systohc

echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen
locale-gen
echo "LANG=en_US.UTF-8" > /etc/locale.conf


cat <<EOF > /etc/hosts
127.0.0.1   localhost
::1         localhost
127.0.1.1   archlaptopname.localdomain archlaptopname
EOF


systemctl enable NetworkManager

# add your user, set passwords

# set arch root password 
passwd

useradd -mG wheel yourusername
passwd yourusername

# run visudo and uncomment: %wheel ALL=(ALL:ALL) ALL
EDITOR=vim visudo

…these should be familiar to you from having read the arch install wiki

One last step is going ahead and enabling snapshots

sudo snapper -c root create-config /

then edit /etc/snapper/configs/root

SUBVOLUME="/"
SNAPSHOT_CREATE=yes
TIMELINE_CREATE=yes
TIMELINE_CLEANUP=yes

We’ll enable timed snapshots later after first boot.

Secure boot with GRUB

pacman -S sbctl

sbctl status
# output looks like
# Secure Boot: disabled
# Setup Mode: enabled
# ...
# Vendor Keys: none

sbctl create-keys

# -m is important here because it also enrolls Microsoft's signing keys
sbctl enroll-keys -m

Understand that on the next reboot after enroll-keys -m, the firmware will attempt to boot with Secure Boot enabled. If you did something wrong, you must clear the Secure Boot keys to get back into Setup Mode and try again. So don’t reboot until you’ve completed all the steps. We’re still in the arch chroot.
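Since key enrollment only works while the firmware is in Setup Mode, it can be handy to check for that in a script before going further. This is a small sketch that inspects the text printed by sbctl status; the function name in_setup_mode is made up here, and the case-insensitive matching is an assumption to cope with wording differences between sbctl versions.

```shell
# Succeeds only if the `sbctl status` text on stdin has a "Setup Mode"
# line that says "enabled" (matched case-insensitively).
in_setup_mode() {
    awk 'tolower($0) ~ /setup mode/ && tolower($0) ~ /enabled/ { found = 1 }
         END { exit !found }'
}

# Example with the output shown earlier in this guide:
sample='Secure Boot: disabled
Setup Mode: enabled
Vendor Keys: none'
if printf '%s\n' "$sample" | in_setup_mode; then
    echo "firmware is in Setup Mode; enrolling keys should be possible"
else
    echo "not in Setup Mode; clear the Secure Boot keys in firmware setup first"
fi

# On the real system: sbctl status | in_setup_mode && echo "ok to enroll"
```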

Sign Everything and Setup Grub

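# install GRUB first so /boot/grub exists, then generate the config
grub-install --target=x86_64-efi --efi-directory=/boot --bootloader-id=GRUB --modules="tpm" --disable-shim-lock
grub-mkconfig -o /boot/grub/grub.cfg

sbctl sign  /boot/EFI/GRUB/grubx64.efi
# BOOTX64.EFI only exists if a fallback loader was installed (e.g. grub-install --removable)
sbctl sign  /boot/EFI/BOOT/BOOTX64.EFI
sbctl sign  /boot/vmlinuz-linux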

# verify everything has a green check
sbctl verify

This will be needed for snapshots:

sudo systemctl enable grub-btrfs.path

… And now you should be ready for your first reboot. Reboot the system. It should work. If not, skip to the secure boot troubleshooting section and we’ll go from there.

Some Recommended Packages

I use KDE on Arch, so I’d recommend also installing these packages:

pacman -Syu \
  plasma-meta kde-applications-meta \
  sddm xorg xdg-desktop-portal xdg-desktop-portal-kde \
  networkmanager bluez bluez-utils \
  pipewire wireplumber \
  power-profiles-daemon \
  mesa vulkan-intel vulkan-radeon libva-mesa-driver mesa-vdpau \
  powerdevil \
  konsole dolphin ark spectacle okular \
  fprintd \
  amd-ucode \
  chromium

Note the amd-ucode package here. intel-ucode would be the equivalent for an Intel system if you’re adapting these instructions to something that isn’t Strix Halo.

fprintd is for the fingerprint sensor.

and then enable some services:

systemctl enable sddm
systemctl enable NetworkManager
systemctl enable bluetooth
systemctl enable fstrim.timer

After first boot

Snapshots

Enable timed snapshots:

sudo systemctl enable --now snapper-timeline.timer
sudo systemctl enable --now snapper-cleanup.timer

In /etc/fstab add/uncomment the snapshots line:

# Add a line such as the one below. Use blkid to confirm the UUID;
# it should be a copy/paste from the other btrfs lines above.
UUID=...  /.snapshots  btrfs  subvol=@snapshots,compress=zstd,noatime,space_cache=v2,ssd  0 0

sudo mkdir -p /.snapshots
sudo mount -a

If you’d like to test the snapshots:
sudo snapper -c root create --description "Test snapshot"
then, to view them:
snapper -c root list

Testing Suspend / Resume

systemctl suspend
systemctl hibernate

Does the system sleep and wake? Does it hibernate (after probably 45 seconds of writing to disk; 128GB RAM first world problems, huh)? Congratulations, this is a better experience than 95% of Linux users (and 50% of Windows users) enjoy.

Working suspend/sleep/wake is half the reason people buy Apple laptops it seems…

Troubleshooting Secure Boot

Did something go wrong after first reboot and the system won’t boot? Not to worry – reboot the laptop and spam Fn+F10 to get back into setup. Go to Secure Boot and check the “clear Secure boot” checkbox. Save and Exit.

Troubleshooting Lockup After Resume on the G1A

I noticed in my journey from kernel version 6.14 > 6.15 > 6.16 bleeding edge that, at some point, there was a good chance that resume from hibernate would lock the machine. I think this is a regression, and it may occur when the machine hardware clock passes the midnight boundary. I am still investigating. Stay tuned…

Hibernate, Suspend and Kernel Versions

Right now (2025-06-10) it still seems like kernel 6.14.9 is the best bet for a stable sleep/wake and hibernate function. The linux-cachyos kernel from the AUR was okay, but there seems to be some type of regression in 6.15.0 and 6.15.1 wrt sleep/wake and suspend/resume.
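If you want a reminder before suspending on a kernel this section calls out, a tiny check against uname -r is enough. The function name kernel_is_flagged is invented for this sketch, and the version list is copied from the paragraph above (6.15.0 and 6.15.1), so it will go stale as kernels move on.

```shell
# Returns success if the given kernel release string is one of the versions
# reported above as having sleep/hibernate trouble (6.15.0 and 6.15.1).
# $1 = release string, e.g. the output of `uname -r`.
kernel_is_flagged() {
    case "$1" in
        6.15.0|6.15.0-*|6.15.1|6.15.1-*) return 0 ;;
        *) return 1 ;;
    esac
}

if kernel_is_flagged "$(uname -r)"; then
    echo "kernel $(uname -r): hibernate/resume regressions were reported for this version"
fi
```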

For the webcam that means you’ll need the dkms driver I did for both the i2c bus (amd_isp4) and the webcam (ov5c.c) … The dkms driver should be uninstalled when using newer kernels, however. It’s a bit of a mess right now… stay tuned.

In the mean time

Here is a DKMS version of the module I built. Note that I recommend you pull the libcamera repo from above and build it with v4l2 emulation support.

Note: I have taken this down for the moment as I created a situation for myself when I upgraded 6.14.4 > 6.14.5 > 6.15 with this setup. Something is not right…

cat /sys/bus/acpi/devices/OMNI5C10\:00/status

The sensor on my G1A appears as OMNI5C10, and I get status code 15 from the above.
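For reference, that status file holds the ACPI _STA value, which is a bit field: bit 0 is "present", bit 1 "enabled", bit 2 "shown in UI", bit 3 "functioning", and bit 4 "battery present". A status of 15 therefore means the first four bits are all set, so the device itself looks healthy from ACPI's point of view. A small decoder sketch (the function name decode_acpi_sta is made up for this example):

```shell
# Decode an ACPI _STA value into the flag names from the ACPI specification.
# $1 = numeric status, e.g. the contents of /sys/bus/acpi/devices/<id>/status
decode_acpi_sta() (
    sta=$1
    out=""
    [ $(( sta & 1 ))  -ne 0 ] && out="${out}present "
    [ $(( sta & 2 ))  -ne 0 ] && out="${out}enabled "
    [ $(( sta & 4 ))  -ne 0 ] && out="${out}shown-in-ui "
    [ $(( sta & 8 ))  -ne 0 ] && out="${out}functioning "
    [ $(( sta & 16 )) -ne 0 ] && out="${out}battery-present "
    echo "${out% }"   # drop the trailing space
)

decode_acpi_sta 15   # prints: present enabled shown-in-ui functioning
```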

I tried to extract the ispkernel 4.0 stuff from the above kernel but it was a little more involved than I expected. The driver I wrote doesn’t bind/find the ISP bus, maybe because ispkernel4.0 is also missing…

cam -l doesn’t show any cameras even after I managed to get the module to load. I also sort of expected the Intel ov05c10 driver would be compatible with the OMNI5C10 and maybe I’d just have to add an ACPI name or something to the module, but that didn’t work either.

So, for now, kernel 6.16 is the path of least resistance on arch. OR building the older ubuntu 24.04 LTS kernel for your arch distro.

(Did I just become the maintainer of linux-g1a on aur? pls nooooo)

Update: 6/19: Ulgh, I got this working on 6.14 but now on 6.15 the amd_isp bus stuff is present in this kernel version, but the camera driver is now not loading, or is different. I think the intel driver was basically working as-is, but is now not. Not sure why. Investigating…

More Recommended Packages / Setup

TLP or auto-cpufreq
Timeshift
FirewallD/UFW
snap-pac

OpenVPN (and VPNs in General)

Some enterprise VPNs (looking at you Fortinet) advertise a default route (meaning that traffic is sent that way rather than via the internet) but your corporate firewall rules don’t always permit traffic. Sometimes the default config with OpenVPN is this way on pfsense. Consider the following output of ip route :

default via 10.39.1.1 dev tun0 proto static metric 50
default via 192.168.117.232 dev wlp193s0 proto dhcp src 192.168.117.201 metric 600
10.79.78.0/24 dev tun0 proto kernel scope link src 10.79.78.9 metric 50
10.200.0.0/24 via 10.79.78.1 dev tun0 proto static metric 50
69.254.111.42 via 192.168.117.232 dev wlp193s0 proto static metric 50
192.168.117.0/24 dev wlp193s0 proto kernel scope link src 192.168.117.201 metric 600
192.168.117.232 dev wlp193s0 proto static scope link metric 50

When the VPN is active one can reach VPN resources but not the internet. This is because the metric 50 is below metric 600 that Arch is using by default on mediatek wlp193s0. Dumb, but not for the reasons you think. (Sometimes the metric is 100).

We want to explicitly tell network manager to use a low value for the metric so it is the “first pick” for outbound traffic in this kind of a scenario. To change metric 600 to metric 25 just do this:

edit /etc/NetworkManager/conf.d/10-route-metric.conf and make it contain


[connection]
ipv4.route-metric=25
ipv6.route-metric=25

then systemctl restart NetworkManager and that is unlikely to be an issue again.
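To see which default route actually wins before and after the change, you can pick out the default route with the lowest metric from ip route output. This is only a sketch: the function name winning_default_dev is invented here, and the sample lines are the two default routes from the output above.

```shell
# Print the interface of the default route with the lowest metric,
# reading `ip route` style output on stdin. A default route with no
# explicit metric is treated as metric 0.
winning_default_dev() {
    awk '$1 == "default" {
        dev = ""; metric = 0
        for (i = 2; i < NF; i++) {
            if ($i == "dev")    dev = $(i + 1)
            if ($i == "metric") metric = $(i + 1)
        }
        if (best == "" || metric + 0 < bestmetric) { best = dev; bestmetric = metric + 0 }
    }
    END { if (best != "") print best }'
}

# The two default routes from the output above: the VPN wins.
printf '%s\n' \
    'default via 10.39.1.1 dev tun0 proto static metric 50' \
    'default via 192.168.117.232 dev wlp193s0 proto dhcp src 192.168.117.201 metric 600' \
    | winning_default_dev   # prints: tun0

# On the real system: ip route | winning_default_dev
```

With the wifi metric lowered to 25, the same check should report the wifi interface instead.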

Steam!?

vi /etc/pacman.conf and uncomment the [multilib] section (both the header and its Include line), then pacman -Syu steam and log in!
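If you would rather script the multilib step than edit by hand, something like the following should do it. It is a sketch that assumes the stock pacman.conf layout, where the section header and its Include line are each commented out with a single leading '#'; the function name enable_multilib is made up for this example, and sed -i here is the GNU sed form.

```shell
# Uncomment the [multilib] section (header plus its Include line) in a
# pacman.conf-style file. [multilib-testing] is left alone because the
# header match is exact. $1 = path to the file.
enable_multilib() {
    sed -i '/^#\[multilib\]$/,/^#Include/ s/^#//' "$1"
}

# Try it on a copy first and look at the diff:
# cp /etc/pacman.conf /tmp/pacman.conf
# enable_multilib /tmp/pacman.conf
# diff /etc/pacman.conf /tmp/pacman.conf
```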

Known Quirks

Wifi

With Kernel 6.14.3 and linux-firmware from 5/1 or newer, most of the Mediatek woes are gone. However there are some quirks if your access point is IPv6 and in scenarios where you are doing multi in/out. The Mediatek Wifi7 does not always have best performance on Linux and (for the most part) ASPM must remain disabled to have a good wifi experience. The HP laptop meets those criteria out of the box, but other Strix Halo devices may not. It is possible to disable ASPM via kernel parameter if it is not an accessible setting in bios, and the IPv6 bug is a known issue and will probably be fixed soon. If you trip over the ipv6 bug, it is possible to disable ipv6 with a kernel parameter and sidestep the bug for good wifi experiences.
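The kernel parameters are not spelled out above. My assumption is that pcie_aspm=off (disable ASPM) and ipv6.disable=1 (disable IPv6) are the ones meant, since both are standard kernel parameters. With GRUB they go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. A sketch of a helper that appends one idempotently follows; the name add_kernel_param is invented, it relies on GNU sed -i, and it assumes the parameter contains no '|', '&' or backslash.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a grub defaults
# file, unless it is already there. $1 = file, $2 = parameter.
add_kernel_param() (
    file=$1
    param=$2
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) exit 0 ;;   # already present, nothing to do
    esac
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" "$file"
)

# Example (as root), followed by regenerating the config:
# add_kernel_param /etc/default/grub pcie_aspm=off
# grub-mkconfig -o /boot/grub/grub.cfg
```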

I recommend the following tweaks to make the experience a bit smoother:

TODO

Benchmarks

I was able to benchmark the memory at about 215 gigabytes/sec. I was able to achieve about 3.1 tokens/s on Deepseek R1 (unsloth 70gb). On a laptop!

Update: Up to 5.5 tokens/sec WOOO!

Useful resources in no particular order:

specifically for 1151:

and

github.com/ROCm/ROCm
[Issue]: Is there a ROCm version that supports gfx1151? (opened 14 Mar 2025 by moonshadow-25, status: Under Investigation)
The report: building PyTorch against ROCm 6.3.4 for gfx1151 on Ubuntu 22.04.5 fails in composable_kernel with “use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'”.

has some useful background from march until the present.

this is a useful self-contained pytorch wheel for 1151:

which would give you a quick and dirty shortcut.

time llama.cpp-vulkan/build/bin/llama-bench -fa 1 -m ~/models/shisa-v2-llama3.3-70b.i1-Q4_K_M.gguf

model	size	params	backend	ngl	fa	test	t/s
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	Vulkan,RPC	99	1	pp512	79.98 ± 0.69
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	Vulkan,RPC	99	1	tg128	5.33 ± 0.00

The Ryzen AI Max+ 395’s Radeon 8060S has 40 RDNA3.5 CUs.
Max clock is 2.9GHz, so that’s a peak of 59.4 FP16/BF16 TFLOPS:

512 ops per clock per CU × 40 CUs × 2.9 GHz = 59.392 FP16 TFLOPS

But the catch is that you need WMMA or wave32 VOPD to get there; otherwise the max is halved.
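The peak figure is just a product of three numbers, so it is easy to redo for other clocks or CU counts. A quick sketch with awk; the function name peak_tflops is made up for this example.

```shell
# peak TFLOPS = ops per clock per CU x number of CUs x clock in GHz / 1000
# $1 = ops per clock per CU, $2 = number of CUs, $3 = clock in GHz
peak_tflops() {
    awk -v ops="$1" -v cus="$2" -v ghz="$3" \
        'BEGIN { printf "%.3f\n", ops * cus * ghz / 1000 }'
}

peak_tflops 512 40 2.9   # prints 59.392
peak_tflops 256 40 2.9   # the halved rate: prints 29.696
```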

In a lot of AMDs marketing I noticed that they were plugging rocm support but they meant the npu. Of course rocm cpu also works fine but that leaves a lot of perf on the table, too.

Theoretically the hip backend is faster? But in reality I found vulkan to be better. Ymmv.

TODO

Koop · May 29, 2025, 6:55pm

This is pretty awesome. Looking forward to seeing what other OEM options crop up and give us the best of the best.

lhl · May 29, 2025, 7:54pm

If you’re interested in AI/LLM benchmarking, I’ve been taking notes here while I’ve been poking here: Strix Halo

ciaduck · May 30, 2025, 12:33am

Well. Now I know what I did wrong when I tried Arch in December and only installed base KDE packages. Looks like I probably needed a few more things.

I switched to OpenSUSE Tumbleweed and had a great time after that.

lhl · May 30, 2025, 6:03am

oh btw, i’m curious about what you tested w/. I got about the same mbw #s (rocm_bandwidth_test got my 212-213GB/s) - 70B Q4 quants should be about 40GB, so theoretical max should be just over 5 tok/s.
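That estimate is memory bandwidth divided by model size: for a dense model, every generated token has to read all the weights once, so bandwidth sets a ceiling on token generation. As a quick sketch (the function name max_tok_per_s is invented here):

```shell
# Bandwidth-bound ceiling for token generation on a dense model:
# tok/s <= memory bandwidth / model size.
# $1 = bandwidth in GB/s, $2 = model size in GB
max_tok_per_s() {
    awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.2f\n", bw / size }'
}

max_tok_per_s 212 40   # prints 5.30
```

This only holds for dense models; MoE models read only part of their weights per token, so they can exceed the figure this gives for their full size.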

And in both my HIP and Vulkan tests, that’s what I get:

# Vulkan
❯ time llama.cpp-vulkan/build/bin/llama-bench -fa 1 -m ~/models/shisa-v2-llama3.3-70b.i1-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | Vulkan,RPC |  99 |  1 |           pp512 |         77.28 ± 0.69 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | Vulkan,RPC |  99 |  1 |           tg128 |          5.02 ± 0.00 |

build: 9a390c48 (5349)

real    3m0.783s
user    0m38.376s
sys     0m8.628s

# HIP
❯ time llama.cpp-rocwmma/build/bin/llama-bench -fa 1 -m ~/models/shisa-v2-llama3.3-70b.i1-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm,RPC   |  99 |  1 |           pp512 |         34.36 ± 0.02 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm,RPC   |  99 |  1 |           tg128 |          4.70 ± 0.00 |

build: 09232370 (5348)

real    3m53.133s
user    3m34.265s
sys     0m4.752s

Not that that’s so impressive IMO, but actually for MoE’s much better results:

# HIP (hipBLASLt)
❯ ROCBLAS_USE_HIPBLASLT=1 llama.cpp-hip/build/bin/llama-bench -m /home/lhl/models/Qwen3-235B-A22B-UD-Q3_K_XL-00001-of-00003.gguf
/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
rocBLAS error: No hipBLASLt solution found
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_HIPBLASLT_ERROR environment variable is set.

rocBLAS warning: hipBlasLT failed, falling back to tensile.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |           pp512 |        120.46 ± 0.39 |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |           tg128 |         10.63 ± 0.03 |

build: c753d7be (5392)

That’s a 97GB model that is GPT-4+ in basically every eval (and can reason to boot+) running at decent speeds.

WIP in progress, I’m writing a basic howto for hardware reviewers to do basic LLM benchmarking since most of them have no idea what they’re even testing/testing for: LLM Inference Benchmarking Cheat‑Sheet for Hardware Reviewers

wendell · May 30, 2025, 12:49pm

this is great, can I link to this/reference for our video? I have been working on the same but hacking on llama.cpp and applying some out of tree patches from their issue tracker to get slightly less bad perf. out of the box it was like 1 token/s but that was a month+ ago when I started.

I saw your issue on GitHub too, following that to see what AMD says. therock GitHub is the way to go I think, but I worried turning others loose with that may yield inconsistent results because it’s changing so much.

some of the work @ubergarm has done on our desktop systems also benefitted my apu setup here too. there’s some good stuff on the llama.cpp threads here for perf tweaks

ubergarm · May 30, 2025, 4:48pm

Nice! I enjoy guides where I can just blindly copy-paste and watch it do the thing!

Have you seen llama-sweep-bench which checks prompt processing and token generation speed across a wide range of prompt lengths (kv-cache context)? You can make comparison graphs between quants or hardware configs like this which show how performance falls off over longer context.

I maintain a version ported to mainline llama.cpp in my personal fork here and just updated today.

Okay, back to releasing some big new quants thanks to Wendell’s help: ubergarm/DeepSeek-R1-0528-GGUF. I haven’t tried squeezing one down to under ~1.5BPW to fit on 128GB V/RAM yet though, hrmm…

lhl · May 30, 2025, 7:16pm

Sure, feel free to link to anything on https://llm-tracker.info/ - that includes my AMD GPU guide: AMD GPUs (I believe its the most comprehensive guide online for reviewing RDNA3 ML/LLM stuff, I’ve been updating it about once every quarter or so, depending on how busy I am).

I am actively doing testing on the Strix Halo stuff (like literally fighting/compiling CK atm). You can check out the bugs I’ve filed to see that there’s very big pp performance that is being left on the table atm. The url may move at some point, but https://llm-tracker.info/ should be fine if you’re just pointing to a URL.

In the past I’ve run efficiency tests between backend/architectures vs theoretical memory and compute btw:
Here’s the latest version of my chart LLM Worksheet - Google Sheets

Strix Halo’s MBW efficiency is fine, but the tok/TFLOP … is still quite bad atm.

Yeah, the big problem as I see it:

6.4.1 has basic rocBLAS gfx1151 support, it’s super slow
Most of the builds that have hipBLASLt built for gfx1151 (and you really want rocWMMA as well for better FA for llama.cpp - for FA for PyTorch you need AOTriton) are built against ROCm 6.5 - this can lead to issues now b/c for example I believe that llama.cpp won’t build since it’s using deprecated HIP structures that actually disappear in 6.5 (wah wah).
There are literally two community members (jammm and scottt) working on the gfx1151 PyTorch builds - like… really? (Remember Strix Halo came out in products in February. The 128GB model only makes sense as a AI devtoy (let’s be honest, both the raw compute and the mbw are on the “neat” to poke around with but of limited utility level) but the support is still… grim). At a certain point, it’s a bit exasperating.

lhl · May 30, 2025, 7:24pm

Ah neat, I’ll have to take a look at your llama-sweep-bench repo thx! I’ve been working on a similar tool recently, also tracking highwater memory usage mainly to be able to compare how different backends and settings compare:

You can imagine how showing -fa 1 w/ VRAM would probably.
Some other people are always asking about power usage and token/w so you could add that too.

I also have a lot of other production-related inference stuff like where I compare ttft and throughput vs concurrency (avg/p50/p99)

Quanting big models is a bit crazy. Doing my 405B right now, and W8A8-INT8 just finished. Took a full H200 node just under 2 days to finish (and it almost OOM’d towards the end):

Grassyloki · June 3, 2025, 2:44pm

I would caution against using btrfs as the rootFS. It’s very slow in comparison to ext4 and xfs. I used to use it, but even after disabling COW and enabling the flash optimization settings it was still very slow.

grub2 still does not support argon2 so it cant be used to boot modern encrypted boot with LUKS unless you use a fork or use a different bootloader like systemd boot.

This guide is pretty good at telling you how to configure systemd boot:

If you’re going for real secure boot, skip the Microsoft keys: use sbctl enroll-keys instead of sbctl enroll-keys -m. This, along with setting a UEFI password, should guarantee that the system will be pretty secure at rest.

wendell · June 3, 2025, 5:52pm

for btrfs and the way Ive setup snapshots my plan here is to show people how to pick boot time snapshots. my thought was newbs could more easily recover their systems.

do you know of another way to accomplish snapshots and rollback that’s reasonably fast? doing something out of tree with zfs that was similar then switching to this I was pretty delighted but even more speed is good.

lvm route maybe?

Grassyloki · June 3, 2025, 9:15pm

I don’t think any other filesystem has that ease-of-use functionality. You kind of can with xfs and lvm, buuuuut it’s not something a noob should be doing. Sadly it seems like zfs and btrfs are the only ones who care about that functionality. With basic consumer SSDs it’s not the end of the world, but if you’ve got a high performer like the Crucial T705 it hurts it a lot. Maybe I’m overthinking it on a laptop.

riklaunim · June 5, 2025, 9:47pm

Locally, the 64GB Flow Z13 is available (well, “after 16.06”) and I’m tempted. I don’t need bleeding edge LLM options, but is 64GB still decent for running good models (text to image/video, text to text etc)?

wendell · June 5, 2025, 9:47pm

yep, 30b models would be pretty great

thehitchh1ker · June 9, 2025, 7:37pm

This is a very helpful article and thank you for writing it.
I am new linux user and sort of dove head first into arch linux and tiling window managers and now I am hooked. I am already so used to the functionality and ease that comes with linux and terminal workflow but I have a lot to learn still.
I would really appreciate it if you could write a similar guide for the Asus Zephyrus G16 2025 laptop (just bought the 5070ti with Intel 285H config).
Since I don’t have enough linux knowledge to adapt this guide for my own use-case, a dedicated article on the Zephyrus G16 would be really helpful, and it’s quite a popular laptop as well.

Alexander_Martinez · June 10, 2025, 3:11am

Shouldn’t the fstab entries for the btrfs subvolumes have 0 0 at the end because btrfs does not support fsck or something?

wendell · June 10, 2025, 2:55pm

oh yeah good catch.

Grassyloki · June 17, 2025, 4:14am

@wendell have you talked with any manufacturers about the lack of laptops for the Ryzen AI MAX 395+ and Ryzen AI Max 390? It seems like there are only a few laptops with these chips. I want to get something like a Zenbook S16 with one of these and ~64GB of RAM. Is there not much demand or are the chips rare?