NVIDIA Jetson Thor ramblings


root@flatbrick:~# nvidia-smi mig -lgip
+-------------------------------------------------------------------------------+
| GPU instance profiles:                                                        |
| GPU   Name               ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                                Free/Total   GiB              CE    JPEG  OFA  |
|===============================================================================|
|   0  MIG 3g.0gb           0     1/1        0.00       No     18     2     2   |
|                                                               1     2     1   |
+-------------------------------------------------------------------------------+
root@flatbrick:~# nvidia-smi mig -cgi 0
Successfully created GPU instance ID  0 on GPU  0 using profile MIG 3g.0gb (ID  0)
root@flatbrick:~# nvidia-smi mig -lcip
+--------------------------------------------------------------------------------------+
| Compute instance profiles:                                                           |
| GPU     GPU       Name             Profile  Instances   Exclusive       Shared       |
|       Instance                       ID     Free/Total     SM       DEC   ENC   OFA  |
|         ID                                                          CE    JPEG       |
|======================================================================================|
|   0      0       MIG 1c.3g.0gb        0      2/2            6        2     2     1   |
|                                                                      1     2         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 2c.3g.0gb        1      2/1            6        2     2     1   |
|                                                                      1     2         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 3g.0gb           2*     1/1           18        2     2     1   |
|                                                                      1     2         |
+--------------------------------------------------------------------------------------+
root@flatbrick:~# nvidia-smi mig -lgi
+---------------------------------------------------------+
| GPU instances:                                          |
| GPU   Name               Profile  Instance   Placement  |
|                            ID       ID       Start:Size |
|=========================================================|
|   0  MIG 3g.0gb             0        0          0:3     |
+---------------------------------------------------------+
root@flatbrick:~# nvidia-smi mig -cci 1 -gi 0
Unable to create a compute instance on GPU  0 GPU instance ID  0 using profile 1: Unknown Error
Failed to create compute instances: Unknown Error
root@flatbrick:~# nvidia-smi mig -cci 0 -gi 0
Unable to create a compute instance on GPU  0 GPU instance ID  0 using profile 0: Unknown Error
Failed to create compute instances: Unknown Error
root@flatbrick:~# nvidia-smi mig -cci 2 -gi 0
Unable to create a compute instance on GPU  0 GPU instance ID  0 using profile 2: Insufficient Resources
Failed to create compute instances: Insufficient Resources
root@flatbrick:~# nvidia-smi mig -cgi 0 -C
Successfully created GPU instance ID  0 on GPU  0 using profile MIG 3g.0gb (ID  0)
Unable to create a compute instance on GPU  0 GPU instance ID  0 using profile default: Insufficient Resources
Failed to create GPU instances: Insufficient Resources
root@flatbrick:~# nvidia-smi mig -lgi
+---------------------------------------------------------+
| GPU instances:                                          |
| GPU   Name               Profile  Instance   Placement  |
|                            ID       ID       Start:Size |
|=========================================================|
|   0  MIG 3g.0gb             0        0          0:3     |
+---------------------------------------------------------+
root@flatbrick:~# nvidia-smi mig -lci
No compute instances found: Not Found


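  • Going by a recent technical blog post, it sounds like NVML/nvidia-smi is still in active development and MIG isn't fully implemented yet. Creating the GPU instance works, but every compute-instance profile fails with either Unknown Error or Insufficient Resources.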
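To re-run the compute-instance attempts above after a driver or JetPack update, a small wrapper can record each profile's outcome. This is a sketch, not an NVIDIA tool: it only uses the `nvidia-smi mig -cci <profile> -gi <gi>` form shown in the log, it assumes the error text keeps the `Unable to create ...: <reason>` shape, and the names `probe_ci_profiles` and `parse_failure_reason` are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Iterable, Optional


def parse_failure_reason(output: str) -> Optional[str]:
    """Return the text after the last ': ' on the first 'Unable to create' line, else None."""
    for line in output.splitlines():
        if line.startswith("Unable to create"):
            return line.rsplit(": ", 1)[-1].strip()
    return None


def probe_ci_profiles(profile_ids: Iterable[int], gi_id: int = 0,
                      run: Callable = subprocess.run) -> Dict[int, str]:
    """Try `nvidia-smi mig -cci <id> -gi <gi_id>` for each profile and record the outcome.

    NOTE: a successful call leaves the compute instance in place, so destroy it
    (e.g. with `nvidia-smi mig -dci`) before probing the next profile.
    """
    results: Dict[int, str] = {}
    for pid in profile_ids:
        proc = run(["nvidia-smi", "mig", "-cci", str(pid), "-gi", str(gi_id)],
                   capture_output=True, text=True)
        # Check both streams, since it is not assumed which one carries the error text.
        output = (proc.stdout or "") + (proc.stderr or "")
        results[pid] = parse_failure_reason(output) or "created"
    return results

# Usage (as root, after `nvidia-smi mig -cgi 0`):
# print(probe_ci_profiles([0, 1, 2]))
```

Passing `run` as a parameter keeps the parsing testable without a Jetson attached.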

Copy performance (Jetson Thor)
Avg. time: 0.488719 ms / Copy throughput: 95.281815 GB/s
Copy performance (Blackwell 6000)
Avg. time: 0.023844 ms / Copy throughput: 1952.964983 GB/s
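The two copy results line up with each other: assuming the tool reports throughput as bytes moved divided by the average time (with GB = 10^9 bytes), both runs imply the same amount of data per copy, so the gap is a like-for-like copy-bandwidth difference of roughly 20x. The helper name below is hypothetical; the numbers are the ones reported above.

```python
# Reported results copied from the runs above.
runs = {
    "Jetson Thor":    {"avg_ms": 0.488719, "gb_per_s": 95.281815},
    "Blackwell 6000": {"avg_ms": 0.023844, "gb_per_s": 1952.964983},
}


def implied_bytes(avg_ms: float, gb_per_s: float) -> float:
    """Bytes per copy, ASSUMING throughput = bytes / avg_time with GB = 1e9 bytes."""
    return gb_per_s * 1e9 * (avg_ms / 1e3)


for name, r in runs.items():
    print(f"{name}: ~{implied_bytes(**r) / 1e6:.2f} MB per copy")

speedup = runs["Blackwell 6000"]["gb_per_s"] / runs["Jetson Thor"]["gb_per_s"]
print(f"Blackwell 6000 / Thor copy throughput: {speedup:.1f}x")
```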

Enable max performance profile and clocks
sudo nvpmodel -m 0 && sudo nvpmodel -q && sudo jetson_clocks

Install the packages you actually need to do anything useful

sudo apt update
sudo apt dist-upgrade
sudo apt install nvidia-jetpack
sudo apt install nvidia-cuda-dev

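Set up the environment

echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}' >> ~/.bashrc
source ~/.bashrc

Single quotes keep $PATH and $LD_LIBRARY_PATH from being expanded when the lines are written, so they are evaluated each time a shell starts.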
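A quick way to confirm the exports took effect in a new shell is to check that both CUDA directories show up as entries and that `nvcc` resolves on that PATH. This is a hypothetical helper, not part of JetPack; the directory names are the ones from the export lines, and it assumes Linux-style `:`-separated variables.

```python
import os
import shutil
from typing import Dict, Mapping

# Directories added by the two export lines above.
CUDA_BIN = "/usr/local/cuda/bin"
CUDA_LIB = "/usr/local/cuda/lib64"


def cuda_env_status(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Check an environment for the CUDA PATH/LD_LIBRARY_PATH entries and for nvcc."""
    path = env.get("PATH", "")
    return {
        "cuda_bin_on_PATH": CUDA_BIN in path.split(":"),
        "cuda_lib_on_LD_LIBRARY_PATH": CUDA_LIB in env.get("LD_LIBRARY_PATH", "").split(":"),
        # shutil.which searches only the given PATH string; an empty string finds nothing.
        "nvcc_found": shutil.which("nvcc", path=path) is not None,
    }

# Usage, in a fresh shell after `source ~/.bashrc`:
# print(cuda_env_status())
```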

Useful links
https://docs.nvidia.com/jetson/agx-thor-devkit/user-guide/0.1.0/setup_cuda.html

Install jetson-stats

sudo pip3 install --break-system-packages jetson-stats
sudo groupadd jtop
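sudo vigr      # edit /etc/group: append your user to the jtop line
sudo vigr -s   # edit /etc/gshadow: append your user to the jtop line
# (or equivalently: sudo usermod -aG jtop $USER; log out and back in afterwards)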
sudo ln -s /usr/local/jetson_stats/jtop.service /etc/systemd/system/jtop.service
sudo systemctl enable jtop.service
sudo systemctl restart jtop.service
sudo systemctl status jtop.service

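Fix jtop's JETPACK not installed message

Add a Thor entry to the L4T-to-JetPack version table in
/usr/local/lib/python3.12/dist-packages/jtop/core/jetson_variables.py
(diff of my patched copy against the stock file):

44,45d43
<     # -------- THOR -------
<     "38.2.0": "7.0",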
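For context, the failing lookup is keyed on the L4T release string, so without a "38.2.0" key the JetPack version falls through to "not installed". The sketch below illustrates that idea only; it is not jtop's actual code, the names are hypothetical, and it assumes /etc/nv_tegra_release starts with the usual `# R<major> (release), REVISION: <minor>.<patch>, ...` header.

```python
import re
from typing import Optional

# Hypothetical one-entry version of the L4T -> JetPack table, with the Thor entry above.
L4T_TO_JETPACK = {
    "38.2.0": "7.0",
}


def l4t_version(nv_tegra_release: str) -> Optional[str]:
    """Turn '# R38 (release), REVISION: 2.0, ...' into '38.2.0' (ASSUMED file format)."""
    m = re.search(r"R(\d+)\s*\(release\),\s*REVISION:\s*([\d.]+)", nv_tegra_release)
    if not m:
        return None
    return f"{m.group(1)}.{m.group(2)}"


def jetpack_for(release_text: str) -> str:
    """Map the release header to a JetPack version, mimicking the 'not installed' fallback."""
    l4t = l4t_version(release_text)
    if l4t is None:
        return "unknown L4T"
    return L4T_TO_JETPACK.get(l4t, "not installed")

# Usage on the device:
# with open("/etc/nv_tegra_release") as f:
#     print(jetpack_for(f.read()))
```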

OR use my fork, which has these variables plus the NVML fixes.

I added NVML support to jtop, so it can now read some metrics from the dGPU.
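This isn't the fork's code. It is a minimal sketch of the kind of NVML query involved, assuming the `nvidia-ml-py` (pynvml) bindings are installed. Which fields Thor's NVML actually reports is an assumption, so unsupported queries are skipped and the function returns None when NVML is unavailable; `read_gpu_metrics` is a hypothetical name.

```python
from typing import Optional


def read_gpu_metrics(index: int = 0) -> Optional[dict]:
    """Return a few NVML readings for GPU `index`, or None if NVML is unavailable."""
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        metrics = {"name": pynvml.nvmlDeviceGetName(handle), "gpu_util_pct": util.gpu}
        try:
            # Not every query is supported on every platform; skip the ones that fail.
            metrics["temp_c"] = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        except pynvml.NVMLError:
            pass
        return metrics
    except pynvml.NVMLError:
        return None
    finally:
        pynvml.nvmlShutdown()
```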


Float32 MAMF (maximum achievable matmul FLOPS)

Jetson Thor

** Command line:
/home/ptrck/pytorch/venv/bin/python mamf-finder.py --m_range 0 4096 512 --n_range 0 4096 512 --k_range 0 4096 512 --dtype float32

** Dtype: torch.float32

** Platform/Device info:
Linux flatbrick 6.8.12-tegra #1 SMP PREEMPT Thu Aug 21 17:27:43 PDT 2025 aarch64 aarch64
_CudaDeviceProperties(name='NVIDIA Thor', major=11, minor=0, total_memory=125772MB, multi_processor_count=20, uuid=, pci_bus_id=1, pci_device_id=0, pci_domain_id=0, L2_cache_size=32MB)

** Critical software versions:
torch=2.9.0a0+gitec2c137
cuda=13.0

** Additional notes:
benchmark version: 2


--------------------------------------------------------------------------------


Warming up the accelerator for 30 secs ... accelerator warmup finished

Tried  343 shapes => the best outcomes were:
mean:   5.3 TFLOPS @ 1536x2560x2048 (MxNxK)
median: 5.3 TFLOPS @ 1536x2560x2048 (MxNxK)
max:    5.3 TFLOPS @ 1536x2560x2048 (MxNxK)

geomean: 4.6 TFLOPS for 343 shapes in range: m=[0, 4096, 512] | n=[0, 4096, 512] | k=[0, 4096, 512]

Legend: TFLOPS = 10**12 FLOPS
Elapsed time: 0:05:08
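Two sanity checks on these numbers. The 343 shapes match 7 sizes per dimension (7^3 = 343), which is consistent with sizes 512 to 3584 (3584 is the largest size appearing in any result, so the stop value looks exclusive and the 0 point skipped; that reading is inferred from the output, not from mamf-finder's source). Using the usual 2·M·N·K FLOP count for a matmul, the best float32 shape on Thor works out to roughly 3 ms per matmul.

```python
# Sweep used above: --m_range/--n_range/--k_range 0 4096 512.
start, stop, step = 0, 4096, 512
# ASSUMPTION: stop is exclusive and the size-0 point is skipped, giving 7 sizes per dim.
sizes = [s for s in range(start, stop, step) if s > 0]  # 512, 1024, ..., 3584
n_shapes = len(sizes) ** 3


def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs of an (m x k) @ (k x n) matmul: one multiply plus one add per MAC."""
    return 2 * m * n * k


# Best Thor float32 shape from the run above, at the reported 5.3 TFLOPS.
flops = matmul_flops(1536, 2560, 2048)
seconds_per_matmul = flops / 5.3e12

print(n_shapes, flops, f"{seconds_per_matmul * 1e3:.2f} ms")
```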

Blackwell 6000

Tried  343 shapes => the best outcomes were:
mean:   65.9 TFLOPS @ 2560x3584x2048 (MxNxK)
median: 66.0 TFLOPS @ 2560x3584x2048 (MxNxK)
max:    66.3 TFLOPS @ 2560x3584x2048 (MxNxK)

geomean: 46.9 TFLOPS for 343 shapes in range: m=[0, 4096, 512] | n=[0, 4096, 512] | k=[0, 4096, 512]

Legend: TFLOPS = 10**12 FLOPS
Elapsed time: 0:00:55

Float16

Jetson Thor

Tried  343 shapes => the best outcomes were:
mean:   155.5 TFLOPS @ 3584x3072x3584 (MxNxK)
median: 156.0 TFLOPS @ 3584x3072x3584 (MxNxK)
max:    163.3 TFLOPS @ 3584x3072x3584 (MxNxK)

geomean: 66.5 TFLOPS for 343 shapes in range: m=[0, 4096, 512] | n=[0, 4096, 512] | k=[0, 4096, 512]

Legend: TFLOPS = 10**12 FLOPS
Elapsed time: 0:01:48

Blackwell 6000

With a wider search range this goes over 400 TFLOPS, but I kept the same range as on Thor to keep the comparison fair.

Tried  343 shapes => the best outcomes were:
mean:   357.7 TFLOPS @ 1536x3584x1536 (MxNxK)
median: 359.0 TFLOPS @ 1536x3584x1536 (MxNxK)
max:    363.2 TFLOPS @ 1536x3584x1536 (MxNxK)

geomean: 215.0 TFLOPS for 343 shapes in range: m=[0, 4096, 512] | n=[0, 4096, 512] | k=[0, 4096, 512]

Legend: TFLOPS = 10**12 FLOPS
Elapsed time: 0:00:41
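For a single spot check outside mamf-finder, a matmul can be timed directly with CUDA events and converted to TFLOPS with the same 2·M·N·K count. This is a minimal sketch assuming a CUDA-enabled PyTorch build, not mamf-finder's code (which sweeps shapes and reports mean/median/max); `matmul_tflops` is a hypothetical name. float8 is left out because it needs scaled-matmul APIs rather than plain `@`.

```python
from typing import Optional


def matmul_tflops(m: int, n: int, k: int, dtype_name: str = "float16",
                  iters: int = 50, warmup: int = 10) -> Optional[float]:
    """Time (m x k) @ (k x n) on the GPU and return TFLOPS, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    dtype = getattr(torch, dtype_name)  # e.g. torch.float16 or torch.float32
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(warmup):  # let clocks and cuBLAS heuristics settle
        a @ b
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()  # wait for the end event before reading the timer
    ms_per_iter = start.elapsed_time(end) / iters
    return 2 * m * n * k / (ms_per_iter / 1e3) / 1e12

# Usage: print(matmul_tflops(3584, 3072, 3584))  # best Thor float16 shape above
```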

Float8_e4m3fn

Jetson Thor

Tried  343 shapes => the best outcomes were:
mean:   276.5 TFLOPS @ 3584x3072x2048 (MxNxK)
median: 278.8 TFLOPS @ 3072x3584x2048 (MxNxK)
max:    288.1 TFLOPS @ 3072x3584x2048 (MxNxK)

geomean: 88.6 TFLOPS for 343 shapes in range: m=[0, 4096, 512] | n=[0, 4096, 512] | k=[0, 4096, 512]

Legend: TFLOPS = 10**12 FLOPS
Elapsed time: 0:02:03

Blackwell 6000

Tried  343 shapes => the best outcomes were:
mean:   658.4 TFLOPS @ 2048x2560x3072 (MxNxK)
median: 659.4 TFLOPS @ 3584x1536x3072 (MxNxK)
max:    690.4 TFLOPS @ 2048x2560x3072 (MxNxK)

geomean: 353.8 TFLOPS for 343 shapes in range: m=[0, 4096, 512] | n=[0, 4096, 512] | k=[0, 4096, 512]

Legend: TFLOPS = 10**12 FLOPS
Elapsed time: 0:00:40

PyTorch's float4 support is currently too broken to benchmark this way on either platform.
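Putting the three dtypes side by side: at peak, the Blackwell 6000 is about 12.5x faster in float32 but only about 2.2x (float16) and 2.4x (float8) faster in the tensor-core dtypes, while the geomean gaps for float16/float8 are wider (about 3.2x and 4x). That suggests Thor gets closer at its best shapes than across the sweep as a whole; it is a reading of the numbers above, not a profiling result. The dictionaries simply restate the reported figures.

```python
# Max and geomean TFLOPS reported above (same m/n/k search range on both devices).
max_tflops = {
    "float32":       {"thor": 5.3,   "blackwell_6000": 66.3},
    "float16":       {"thor": 163.3, "blackwell_6000": 363.2},
    "float8_e4m3fn": {"thor": 288.1, "blackwell_6000": 690.4},
}
geomean_tflops = {
    "float32":       {"thor": 4.6,  "blackwell_6000": 46.9},
    "float16":       {"thor": 66.5, "blackwell_6000": 215.0},
    "float8_e4m3fn": {"thor": 88.6, "blackwell_6000": 353.8},
}


def speedup(table: dict, dtype: str) -> float:
    """How many times faster the Blackwell 6000 is than Thor for one dtype."""
    return table[dtype]["blackwell_6000"] / table[dtype]["thor"]


for dtype in max_tflops:
    print(f"{dtype:>14}: max {speedup(max_tflops, dtype):.1f}x, "
          f"geomean {speedup(geomean_tflops, dtype):.1f}x")
```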