4x R9700 - Getting Started Ubuntu 24.04 Setup Guide

Background

We’re Testing 4 Radeon AI Pro R9700s… What do you want to see? Community Event and Giveaway

Click Here to throw your hat into the Giveaway Thread

AMD has really been stepping up their own write-ups, but it can be tough to keep them current without a cohesive community to constantly update them. This is a good writeup:

radeon-ai-pro-rocm-pytorch-guide.pdf

This PDF references resources from here:
JoergR75 (Jörg Roskowetz)

It is better to watch this git repository directly than rely on AMD.com – more on that in a bit.

Some context you might not have for this: if you read it, you’ll immediately notice it targets ROCm 6.4. ROCm releases are often aimed at CDNA hardware rather than RDNA hardware, so there can “kind of” be parallel versions of ROCm, and depending on your hardware you can’t always run the newest release.

This is especially a minefield if you have an older graphics card like a 7900XTX, and it is a source of a lot of friction and confusion, and perhaps of the “AMD reputation” around card setup.
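One quick way to see which side of that line you’re on: once ROCm is installed, rocminfo reports the gfx target for each GPU. The R9700s in this build show up as gfx1201 (RDNA 4); a 7900XTX reports gfx1100.

  rocminfo | grep -i gfx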

Going into 2026, things have never been better with AMD, and a lot of work has gone into helping newcomers into the ecosystem. Even so, keep an open mind if you have experience with other platforms: AMD has, in a lot of cases, chosen a different architectural philosophy than its competitors… so don’t assume anything.

In addition to the PDF, the AMD blogs can be an excellent resource to just go read. One caution, though: you’ll get more out of reading them for subject-matter interest than out of using them to solve a specific problem. What do I mean by that? Well, assume you have shiny new R9700s and you search for that:

…“Accelerating llama.cpp on AMD Instinct”? Well, I’m not on Instinct, but this search result seems responsive to my search, right? No. Red herring. In fact, if you search that specific blog post, there is no mention of RDNA or the R9700 at all.

…and if you try those steps on RDNA anyway, they don’t work.

Getting Started

This is a vanilla install of Ubuntu. Before diving in and installing ROCm and Docker dependencies, notice the lay of the land:

# dmesg 
[   19.629810] audit: type=1400 audit(1766333665.398:195): apparmor="DENIED" operation="open" class="file" profile="snap.thunderbird.hook.install" name="/var/lib/snapd/hostfs/usr/share/hunspell/" pid=4750 comm="find" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   22.535982] TCP: enp97s0: Driver has suspect GRO implementation, TCP performance may be compromised.
[   36.537298] rfkill: input handler enabled
[   36.826119] amdgpu 0000:c3:00.0: amdgpu: PCIE GART of 512M enabled (table at 0x00000087D6B00000).
[   36.826143] amdgpu 0000:c3:00.0: amdgpu: PSP is resuming...
[   37.146895] amdgpu 0000:03:00.0: amdgpu: PCIE GART of 512M enabled (table at 0x00000087D6B00000).
[   37.146915] amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
[   38.842624] amdgpu 0000:c3:00.0: amdgpu: PSP load kdb failed!
[   38.842628] amdgpu 0000:c3:00.0: amdgpu: PSP resume failed
[   38.842629] amdgpu 0000:c3:00.0: amdgpu: resume of IP block <psp> failed -62
[   38.842631] amdgpu 0000:c3:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
[   39.131515] amdgpu 0000:03:00.0: amdgpu: PSP load kdb failed!
[   39.131521] amdgpu 0000:03:00.0: amdgpu: PSP resume failed
[   39.131523] amdgpu 0000:03:00.0: amdgpu: resume of IP block <psp> failed -62
[   39.131525] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
[   39.268313] amdgpu 0000:83:00.0: amdgpu: PCIE GART of 512M enabled (table at 0x00000087D6B00000).
[   39.268338] amdgpu 0000:83:00.0: amdgpu: PSP is resuming...
[   41.392373] amdgpu 0000:83:00.0: amdgpu: PSP load kdb failed!
[   41.392377] amdgpu 0000:83:00.0: amdgpu: PSP resume failed
[   41.392378] amdgpu 0000:83:00.0: amdgpu: resume of IP block <psp> failed -62
[   41.392379] amdgpu 0000:83:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
[   41.751126] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[   41.791527] rfkill: input handler disabled

# uname -a
Linux aiwks 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

You’ll want to make sure Resizable BAR is enabled in the BIOS, and that your BIOS is fully up to date.
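One way to double-check Resizable BAR from inside Linux, for what it’s worth (1002 is the AMD vendor ID; the exact sizes reported depend on the card and BIOS):

  sudo lspci -vv -d 1002: | grep -i -A4 "resizable bar"

With ReBAR working you should see a BAR covering (roughly) the card’s full VRAM rather than the legacy 256MB window.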

In the video we’re using the older version of the Asus TRX50 motherboard. The PCIe layout is:

  • x16 Gen 5
  • x16 Gen 5
  • x8 Gen 5 (9000-series CPU only, otherwise Gen 4)
  • x16 Gen 4

A WRX90 layout would give a full x16 to all 4 cards. I wondered whether the TRX50 layout would negatively impact our testing; we can do more testing on that later, before I give away two of these cards.

The kernel is out of date for our purposes. We can update it:

w@aiwks:~$ sudo apt install linux-oem-24.04d linux-firmware
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
linux-firmware is already the newest version (20240318.git3b128b60-0ubuntu2.21).
linux-firmware set to manually installed.
The following package was automatically installed and is no longer required:
  libllvm19
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  linux-headers-6.17.0-1008-oem linux-headers-oem-24.04d linux-image-6.17.0-1008-oem linux-image-oem-24.04d linux-modules-6.17.0-1008-oem linux-oem-6.17-headers-6.17.0-1008
  linux-oem-6.17-tools-6.17.0-1008 linux-tools-6.17.0-1008-oem
Suggested packages:
  linux-perf linux-oem-6.17-tools
The following NEW packages will be installed:
  linux-headers-6.17.0-1008-oem linux-headers-oem-24.04d linux-image-6.17.0-1008-oem linux-image-oem-24.04d linux-modules-6.17.0-1008-oem linux-oem-24.04d linux-oem-6.17-headers-6.17.0-1008
  linux-oem-6.17-tools-6.17.0-1008 linux-tools-6.17.0-1008-oem
0 upgraded, 9 newly installed, 0 to remove and 143 not upgraded.
Need to get 200 MB of archives.
After this operation, 304 MB of additional disk space will be used.
Do you want to continue? [Y/n]

…after reboot…

$ uname -a
Linux aiwks 6.17.0-1008-oem #8-Ubuntu SMP PREEMPT_DYNAMIC Wed Dec 10 05:19:21 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

If you check the dmesg output, the amdgpu situation looks a lot better, with all four GPUs properly bound to the amdgpu driver.
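A quick way to eyeball that after the reboot:

  sudo dmesg | grep -i amdgpu | grep -iE "initialized|failed|error"

You want to see initialization messages for all four devices and none of the PSP failures from before.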

These are other packages I’d recommend for good quality of life:

apt install rasdaemon btop tmux python3.12-venv
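rasdaemon in particular is worth enabling so that memory/PCIe errors end up logged somewhere you can find them; a minimal setup with the package above is:

  sudo systemctl enable --now rasdaemon
  sudo ras-mc-ctl --summary

The summary should come back clean on a healthy system.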

Installing AMD Stuff

The AMD PDF above comes in here, at least philosophically, and so (once again) does this git repository:

JoergR75 (Jörg Roskowetz)

At the time I’m writing this how-to, this seems to be the most recent and coherent guide:

JoergR75/rocm-7.1.1-pytorch-2.10.0-docker-cdna3-rdna4-automated-deployment

Note: this guide recommends disabling secure boot because it customizes the kernel.

… installing with secure boot can certainly create more trouble if you aren’t familiar with the steps required to add your own key and do kernel signing yourself.
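If you do keep secure boot enabled (as I ended up doing), the usual Ubuntu flow is roughly this; the MOK path can differ depending on how your machine was set up:

  sudo mokutil --import /var/lib/shim-signed/mok/MOK.der
  # set a one-time password, reboot, and choose "Enroll MOK" in the MokManager screen

DKMS-built modules then need to be signed with that key, which Ubuntu’s dkms packaging generally handles for you.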

The output of the script is also super helpful:


ROCm 7.1.1 + OCL 2.x + PyTorch 2.10.0 (nightly for ROCm7.1) + Transformers environment installation and setup.

After the reboot, test your installation with:
  • rocminfo
  • clinfo
  • rocm-smi
  • amd-smi
  • rocm-bandwidth-tool

Verify the active PyTorch device:
  python3 test.py

Install the latest vLLM Docker images:
  RDNA4 → sudo docker pull rocm/vllm-dev:rocm7.1.1_navi_ubuntu24.04_py3.12_pytorch_2.8_vllm_0.10.2rc1
  CDNA3 → sudo docker pull rocm/vllm:latest

Start the vLLM Docker container:
  sudo docker run -it --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined --group-add video rocm/vllm-dev:rocm7.1.1_navi_ubuntu24.04_py3.12_pytorch_2.8_vllm_0.10.2rc1

The container will run using the image 'rocm/vllm-dev:rocm7.1.1_navi_ubuntu24.04_py3.12_pytorch_2.8_vllm_0.10.2rc1', with flags enabling AMD GPU access via ROCm.
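The test.py the script points at comes from the repo; if you just want a quick one-liner to confirm PyTorch sees all four cards (ROCm builds of PyTorch expose them through the torch.cuda API), something like this works:

  python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count(), torch.cuda.get_device_name(0))"

Expect True, a count of 4, and the name of GPU 0.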

And after another reboot (and, in my case, enrolling our secure boot key):

w@aiwks:~$ rocm-smi


WARNING: AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status

========================================== ROCm System Management Interface ==========================================
==================================================== Concise Info ====================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK  MCLK     Fan    Perf     PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)
======================================================================================================================
0       1     0x7551,   15371  N/A     N/A    N/A, N/A, 0         N/A   N/A      0%     unknown  N/A     0%     0%
1       4     0x7551,   58386  52.0°C  38.0W  N/A, N/A, 0         0Mhz  1258Mhz  20.0%  auto     300.0W  0%     0%
2       2     0x7551,   8916   N/A     N/A    N/A, N/A, 0         N/A   N/A      0%     unknown  N/A     0%     0%
3       3     0x7551,   53989  N/A     N/A    N/A, N/A, 0         N/A   N/A      0%     unknown  N/A     0%     0%
======================================================================================================================
================================================ End of ROCm SMI Log =================================================
w@aiwks:~$ uname -a
Linux aiwks 6.17.0-1008-oem #8-Ubuntu SMP PREEMPT_DYNAMIC Wed Dec 10 05:19:21 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Kernel version and GPU detection are looking… pretty good.
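That “low-power state” warning from rocm-smi is about runtime power management; you can peek at the state of each card through sysfs if you’re curious (substitute your own PCI addresses):

  cat /sys/bus/pci/devices/0000:c3:00.0/power/runtime_status
  cat /sys/bus/pci/devices/0000:c3:00.0/power/control

This comes up again later in the thread with amdgpu.runpm=0.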

Setup Docker

Checks from the GitHub guide

P2P Bandwidth with 4x R9700s on TRX50

 sudo /opt/rocm/bin/rocm_bandwidth_test plugin --run tb p2p
[sudo] password for w:
TransferBench v1.64.00
===============================================================
[Common]                              (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE      =            0 : Validating after all iterations
BLOCK_BYTES          =          256 : Each CU gets a mulitple of 256 bytes to copy
BYTE_OFFSET          =            0 : Using byte offset of 0
CU_MASK              =            0 : All
FILL_COMPRESS        =            0 : Not specified
FILL_PATTERN         =            0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_ORDER      =            0 : Thread block ordering: Sequential
GFX_BLOCK_SIZE       =          256 : Threadblock size of 256
GFX_SINGLE_TEAM      =            1 : Combining CUs to work across entire data array
GFX_TEMPORAL         =            0 : Not using non-temporal loads/stores
GFX_UNROLL           =            4 : Using GFX unroll factor of 4
GFX_WAVE_ORDER       =            0 : Using GFX wave ordering of Unroll,Wavefront,CU
GFX_WORD_SIZE        =            4 : Using GFX word size of 4 (DWORDx4)
MIN_VAR_SUBEXEC      =            1 : Using at least 1 subexecutor(s) for variable subExec tranfers
MAX_VAR_SUBEXEC      =            0 : Using up to all available subexecutors for variable subExec transfers
NUM_ITERATIONS       =           10 : Running 10  timed iteration(s)
NUM_SUBITERATIONS    =            1 : Running 1 subiterations
NUM_WARMUPS          =            3 : Running 3 warmup iteration(s) per Test
SHOW_ITERATIONS      =            0 : Hiding per-iteration timing
USE_HIP_EVENTS       =            1 : Using HIP events for GFX/DMA Executor timing
USE_HSA_DMA          =            0 : Using hipMemcpyAsync for DMA execution
USE_INTERACTIVE      =            0 : Running in non-interactive mode
USE_SINGLE_STREAM    =            1 : Using single stream per GFX device
VALIDATE_DIRECT      =            0 : Validate GPU destination memory via CPU staging buffer
VALIDATE_SOURCE      =            0 : Do not perform source validation after prep

[P2P Related]
NUM_CPU_DEVICES      =            1 : Using 1 CPUs
NUM_CPU_SE           =            4 : Using 4 CPU threads per Transfer
NUM_GPU_DEVICES      =            4 : Using 4 GPUs
NUM_GPU_SE           =           32 : Using 32 GPU subexecutors/CUs per Transfer
P2P_MODE             =            0 : Running Uni + Bi transfers
USE_FINE_GRAIN       =            0 : Using coarse-grained memory
USE_GPU_DMA          =            0 : Using GPU-GFX as GPU executor
USE_REMOTE_READ      =            0 : Using SRC as executor

Bytes Per Direction 268435456
Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
 SRC+EXE\DST    CPU 00       GPU 00    GPU 01    GPU 02    GPU 03
  CPU 00  ->     55.24        44.31     22.52     22.54     41.73

  GPU 00  ->     56.90       256.42     28.50     28.54     57.04
  GPU 01  ->     28.52        28.56    257.24     28.56     28.56
  GPU 02  ->     28.53        28.57     28.56    257.66     28.57
  GPU 03  ->     56.96        57.06     28.57     28.61    255.72
                           CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
Averages (During UniDir):       N/A     32.78     42.73     33.31

Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
     SRC\DST    CPU 00       GPU 00    GPU 01    GPU 02    GPU 03
  CPU 00  ->       N/A        43.15     22.36     22.31     40.39
  CPU 00 <-        N/A        56.72     28.35     28.18     56.73
  CPU 00 <->       N/A        99.87     50.70     50.49     97.12


  GPU 00  ->     56.44          N/A     28.34     28.30     56.64
  GPU 00 <-      43.53          N/A     28.39     28.29     56.67
  GPU 00 <->     99.97          N/A     56.73     56.59    113.31

  GPU 01  ->     28.36        28.39       N/A     28.30     28.39
  GPU 01 <-      22.36        28.34       N/A     28.29     28.35
  GPU 01 <->     50.73        56.73       N/A     56.58     56.74

  GPU 02  ->     28.22        28.29     28.29       N/A     28.29
  GPU 02 <-      22.33        28.30     28.30       N/A     28.30
  GPU 02 <->     50.55        56.59     56.59       N/A     56.59

  GPU 03  ->     56.66        56.66     28.35     28.30       N/A
  GPU 03 <-      41.28        56.64     28.39     28.29       N/A
  GPU 03 <->     97.94       113.30     56.73     56.59       N/A
                           CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
Averages (During  BiDir):       N/A     37.27     37.40     33.04

(We can expect “Full” GPU to GPU bandwidth on WRX90 but this isn’t bad!)

From here we have full vLLM! And that’s what we’ll be using in our follow-up video.

TODO: Talk about TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
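(Short version for now: it’s just an environment variable you export before launching your PyTorch workload; train.py below is a placeholder for whatever you’re actually running.)

  export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
  python3 train.py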

AI Toolkit added ROCm support

Training a Z Image Turbo LoRA on 1024x pictures at 4-bit is FAST on these cards.


Hey Wendell,

Got any benchmarks to share with us? Or is it still in progress?

Cheers and merry xmas

I think I went off the beaten path with my Ubuntu 24.04 install. The stock AMD installer for Ubuntu was installing old ROCm versions, so Copilot had to install ROCm 7.1 and prioritize it to get the installer to play nice. I probably took the long way to get there. :rofl:

Got my WRX90 with 4x R9700 (no room for a 5th) working with this guide. Well, at least the drivers installed and the GPUs initialized.

Some weird networking and freezing issues, though. I saw the other post about enabling SR-IOV, but that didn’t seem to help. Anyway, getting closer. Thanks.

Making progress, but it’s early. I believe my issue is solved, but to get there I took several steps that are not outlined in the PDF. First, I installed ROCm with --no-dkms. I also set several kernel parameters that may or may not have been required.

dev@dev-WRX90-WS-EVO:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.17.0-1008-oem root=UUID=d8445b5a-4f5f-4794-b2a9-cc2acdf2e36a ro quiet splash iommu=pt amd_iommu=on amdgpu.runpm=0 amdgpu.aspm=0 vt.handoff=7

Power-state management and not forcing the IOMMU were, I think, preventing all of my GPUs from initializing. Unfortunately, I don’t have the dmesg output for that handy.
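For anyone wanting to replicate those parameters, the standard Ubuntu route is to add them to GRUB and regenerate the config; a sketch, and keep or drop the runpm/aspm entries depending on what your own setup needs:

  # in /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt amd_iommu=on amdgpu.runpm=0 amdgpu.aspm=0"

  sudo update-grub
  sudo reboot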

I didn’t follow the PDF verbatim, just my steps above, and the kernel version especially helped a ton.

I did secure boot with mine, too.

iommu=pt and amd_iommu=on do seem to be critical, and amdgpu.runpm=0 is important.

I am not disabling ASPM, because I am monitoring for problem situations (which do occur with e.g. Flex.1/Flux training: GPU crashes, etc.).

How are your temps? I’m between 90°C and spiking to 110°C, which seems a little worrying, but things are holding up.


I am going slow, checking that every step is working (as far as I can tell), so I haven’t run anything yet. My setup is in my basement, which is currently very cold, so everything is idling at about 20°C.

I will report back when I have more details. Thanks for the help.

Running the bandwidth test and then the scaling test a couple of times in a row got me these temps. I’ll have to do a more thorough torture test later, as I’m out of time.

dev@dev-WRX90-WS-EVO:~$ amd-smi
+------------------------------------------------------------------------------+
| AMD-SMI 26.2.0+021c61fc    amdgpu version: Linuxver    ROCm version: 7.1.1    |
| VBIOS version: 023.008.000.068.000001                                         |
| Platform: Linux Baremetal                                                     |
|-------------------------------------+------------------------------------------|
| BDF           GPU-Name              | Mem-Uti  Temp    UEC  Power-Usage        |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti  Fan          Mem-Usage          |
|=====================================+==========================================|
| 0000:03:00.0  …Radeon AI PRO R9700  |  0 %     26 °C   0    29/300 W           |
| 0    2       N/A     N/A            | 36 %     20.0 %       269/30576 MB       |
|-------------------------------------+------------------------------------------|
| 0000:23:00.0  …Radeon AI PRO R9700  |  0 %     20 °C   0    18/300 W           |
| 1    3       N/A     N/A            |  1 %     20.0 %       269/30576 MB       |
|-------------------------------------+------------------------------------------|
| 0000:d3:00.0  …Radeon AI PRO R9700  |  0 %     24 °C   0    17/300 W           |
| 2    1       N/A     N/A            |  3 %     20.0 %       269/30576 MB       |
|-------------------------------------+------------------------------------------|
| 0000:f3:00.0  …Radeon AI PRO R9700  |  2 %     30 °C   0    49/300 W           |
| 3    0       N/A     N/A            | 17 %     20.0 %       817/30576 MB       |
+-------------------------------------+------------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                    |
| GPU  PID   Process Name         GTT_MEM  VRAM_MEM  MEM_USAGE  CU %            |
|==============================================================================|
| 0    9734  rocm_bandwidth_test  2.0 MB   139.4 MB  200.7 MB   N/A             |
| 1    9734  rocm_bandwidth_test  2.0 MB   139.4 MB  200.7 MB   N/A             |
| 2    9734  rocm_bandwidth_test  2.0 MB   139.4 MB  200.7 MB   N/A             |
| 3    9734  rocm_bandwidth_test  2.0 MB   646.2 MB  748.4 MB   N/A             |
+------------------------------------------------------------------------------+

(rocm_env) dev@dev-WRX90-WS-EVO:~$ /opt/rocm-7.1.1/bin/rocm-bandwidth-test run tb
TransferBench (Plugin) v1.64.00

Usage: tb config
config: Either:
- Filename of configFile containing Transfers to execute (see example.cfg for format)
- Name of preset config:
N : (Optional) Number of bytes to copy per Transfer.
If not specified, defaults to 268435456 bytes. Must be a multiple of 4 bytes
If 0 is specified, a range of Ns will be benchmarked
May append a suffix (‘K’, ‘M’, ‘G’) for kilobytes / megabytes / gigabytes

Environment variables:

ALWAYS_VALIDATE - Validate after each iteration instead of once after all iterations
BLOCK_BYTES - Controls granularity of how work is divided across subExecutors
BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4
CU_MASK - CU mask for streams. Can specify ranges e.g ‘5,10-12,14’
FILL_COMPRESS - Percentages of 64B lines to be filled by random/1B0/2B0/4B0/32B0
FILL_PATTERN - Big-endian pattern for source data, specified in hex digits. Must be even # of digits
GFX_BLOCK_ORDER - How blocks for transfers are ordered. 0=sequential, 1=interleaved
GFX_BLOCK_SIZE - # of threads per threadblock (Must be multiple of 64)
GFX_TEMPORAL - Use of non-temporal loads or stores (0=none 1=loads 2=stores 3=both)
GFX_UNROLL - Unroll factor for GFX kernel (0=auto), must be less than 8
GFX_SINGLE_TEAM - Have subexecutors work together on full array instead of working on disjoint subarrays
GFX_WAVE_ORDER - Stride pattern for GFX kernel (0=UWC,1=UCW,2=WUC,3=WCU,4=CUW,5=CWU)
GFX_WORD_SIZE - GFX kernel packed data size (4=DWORDx4, 2=DWORDx2, 1=DWORDx1)
HIDE_ENV - Hide environment variable value listing
MIN_VAR_SUBEXEC - Minumum # of subexecutors to use for variable subExec Transfers
MAX_VAR_SUBEXEC - Maximum # of subexecutors to use for variable subExec Transfers (0 for device limits)
NUM_ITERATIONS - # of timed iterations per test. If negative, run for this many seconds instead
NUM_SUBITERATIONS - # of sub-iterations to run per iteration. Must be non-negative
NUM_WARMUPS - # of untimed warmup iterations per test
OUTPUT_TO_CSV - Outputs to CSV format if set
SAMPLING_FACTOR - Add this many samples (when possible) between powers of 2 when auto-generating data sizes
SHOW_ITERATIONS - Show per-iteration timing info
USE_HIP_EVENTS - Use HIP events for GFX executor timing
USE_HSA_DMA - Use hsa_amd_async_copy instead of hipMemcpy for non-targeted DMA execution
USE_INTERACTIVE - Pause for user-input before starting transfer loop
USE_SINGLE_STREAM - Use a single stream per GPU GFX executor instead of stream per Transfer
VALIDATE_DIRECT - Validate GPU destination memory directly instead of staging GPU memory on host
VALIDATE_SOURCE - Validate GPU src memory immediately after preparation

Available Preset Benchmarks:

           a2a - Tests parallel transfers between all pairs of GPU devices
         a2a_n - Tests parallel transfers between all pairs of GPU devices using Nearest NIC RDMA transfers
      a2asweep - Test GFX-based all-to-all transfers swept across different CU and GFX unroll counts
   healthcheck - Simple bandwidth health check (MI300X series only)
       one2all - Test all subsets of parallel transfers from one GPU to all others
           p2p -  Peer-to-peer device memory bandwidth test
        rsweep - Randomly sweep through sets of Transfers
       scaling - Run scaling test from one GPU to other devices
        schmoo - Scaling tests for local/remote read/write/copy
         sweep - Ordered sweep through sets of Transfers

Detected Topology:

1 configured CPU NUMA node(s) [1 total]
4 GPU device(s)
0 Supported NIC device(s)

            |NUMA 00| #Cpus | Closest GPU(s)
------------+-------+-------+---------------
NUMA 00 (00)| 10    | 32    | 0 1 2 3

        | gfx1201 | gfx1201 | gfx1201 | gfx1201 |
        | GPU 00  | GPU 01  | GPU 02  | GPU 03  | PCIe Bus ID  | #CUs | NUMA | #DMA | #XCC | NIC
--------+---------+---------+---------+---------+--------------+------+------+------+------+-----
GPU 00  | N/A     | PCIE-2  | PCIE-2  | PCIE-2  | 0000:f3:00.0 | 32   | 0    | 2    | 1    | -1
GPU 01  | PCIE-2  | N/A     | PCIE-2  | PCIE-2  | 0000:d3:00.0 | 32   | 0    | 2    | 1    | -1
GPU 02  | PCIE-2  | PCIE-2  | N/A     | PCIE-2  | 0000:03:00.0 | 32   | 0    | 2    | 1    | -1
GPU 03  | PCIE-2  | PCIE-2  | PCIE-2  | N/A     | 0000:23:00.0 | 32   | 0    | 2    | 1    | -1
(rocm_env) dev@dev-WRX90-WS-EVO:~$ /opt/rocm-7.1.1/bin/rocm-bandwidth-test run tb a2a
TransferBench v1.64.00

[Common] (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE = 0 : Validating after all iterations
BLOCK_BYTES = 256 : Each CU gets a mulitple of 256 bytes to copy
BYTE_OFFSET = 0 : Using byte offset of 0
CU_MASK = 0 : All
FILL_COMPRESS = 0 : Not specified
FILL_PATTERN = 0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_ORDER = 0 : Thread block ordering: Sequential
GFX_BLOCK_SIZE = 256 : Threadblock size of 256
GFX_SINGLE_TEAM = 1 : Combining CUs to work across entire data array
GFX_TEMPORAL = 0 : Not using non-temporal loads/stores
GFX_UNROLL = 2 : Using GFX unroll factor of 2
GFX_WAVE_ORDER = 0 : Using GFX wave ordering of Unroll,Wavefront,CU
GFX_WORD_SIZE = 4 : Using GFX word size of 4 (DWORDx4)
MIN_VAR_SUBEXEC = 1 : Using at least 1 subexecutor(s) for variable subExec tranfers
MAX_VAR_SUBEXEC = 0 : Using up to all available subexecutors for variable subExec transfers
NUM_ITERATIONS = 10 : Running 10 timed iteration(s)
NUM_SUBITERATIONS = 1 : Running 1 subiterations
NUM_WARMUPS = 3 : Running 3 warmup iteration(s) per Test
SHOW_ITERATIONS = 0 : Hiding per-iteration timing
USE_HIP_EVENTS = 1 : Using HIP events for GFX/DMA Executor timing
USE_HSA_DMA = 0 : Using hipMemcpyAsync for DMA execution
USE_INTERACTIVE = 0 : Running in non-interactive mode
USE_SINGLE_STREAM = 1 : Using single stream per GFX device
VALIDATE_DIRECT = 0 : Validate GPU destination memory via CPU staging buffer
VALIDATE_SOURCE = 0 : Do not perform source validation after prep

[AllToAll Related]
A2A_DIRECT = 1 : Only using direct links
A2A_LOCAL = 0 : Exclude local transfers
A2A_MODE = 0 : Copy
NUM_GPU_DEVICES = 4 : Using 4 GPUs
NUM_QUEUE_PAIRS = 0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC = 8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC = 0 : Using GFX executor
USE_FINE_GRAIN = 1 : Using fine-grained memory
USE_REMOTE_READ = 0 : Using SRC as executor

GPU-GFX All-To-All benchmark:

  • Copying 268435456 bytes between directly connected pairs of GPUs using 8 CUs (0 Transfers)

(rocm_env) dev@dev-WRX90-WS-EVO:~$ /opt/rocm-7.1.1/bin/rocm-bandwidth-test run tb p2p
TransferBench v1.64.00
===============================================================
[Common] (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE = 0 : Validating after all iterations
BLOCK_BYTES = 256 : Each CU gets a mulitple of 256 bytes to copy
BYTE_OFFSET = 0 : Using byte offset of 0
CU_MASK = 0 : All
FILL_COMPRESS = 0 : Not specified
FILL_PATTERN = 0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_ORDER = 0 : Thread block ordering: Sequential
GFX_BLOCK_SIZE = 256 : Threadblock size of 256
GFX_SINGLE_TEAM = 1 : Combining CUs to work across entire data array
GFX_TEMPORAL = 0 : Not using non-temporal loads/stores
GFX_UNROLL = 4 : Using GFX unroll factor of 4
GFX_WAVE_ORDER = 0 : Using GFX wave ordering of Unroll,Wavefront,CU
GFX_WORD_SIZE = 4 : Using GFX word size of 4 (DWORDx4)
MIN_VAR_SUBEXEC = 1 : Using at least 1 subexecutor(s) for variable subExec tranfers
MAX_VAR_SUBEXEC = 0 : Using up to all available subexecutors for variable subExec transfers
NUM_ITERATIONS = 10 : Running 10 timed iteration(s)
NUM_SUBITERATIONS = 1 : Running 1 subiterations
NUM_WARMUPS = 3 : Running 3 warmup iteration(s) per Test
SHOW_ITERATIONS = 0 : Hiding per-iteration timing
USE_HIP_EVENTS = 1 : Using HIP events for GFX/DMA Executor timing
USE_HSA_DMA = 0 : Using hipMemcpyAsync for DMA execution
USE_INTERACTIVE = 0 : Running in non-interactive mode
USE_SINGLE_STREAM = 1 : Using single stream per GFX device
VALIDATE_DIRECT = 0 : Validate GPU destination memory via CPU staging buffer
VALIDATE_SOURCE = 0 : Do not perform source validation after prep

[P2P Related]
NUM_CPU_DEVICES = 1 : Using 1 CPUs
NUM_CPU_SE = 4 : Using 4 CPU threads per Transfer
NUM_GPU_DEVICES = 4 : Using 4 GPUs
NUM_GPU_SE = 32 : Using 32 GPU subexecutors/CUs per Transfer
P2P_MODE = 0 : Running Uni + Bi transfers
USE_FINE_GRAIN = 0 : Using coarse-grained memory
USE_GPU_DMA = 0 : Using GPU-GFX as GPU executor
USE_REMOTE_READ = 0 : Using SRC as executor

Bytes Per Direction 268435456
Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
SRC+EXE\DST CPU 00 GPU 00 GPU 01 GPU 02 GPU 03
CPU 00 → 42.43 41.04 35.29 39.21 39.48

GPU 00 → 57.10 257.87 57.12 57.14 57.13
GPU 01 → 57.00 57.13 257.34 57.13 57.13
GPU 02 → 56.91 57.13 57.13 257.29 57.13
GPU 03 → 56.93 57.13 57.13 57.13 257.51
CPU->CPU CPU->GPU GPU->CPU GPU->GPU
Averages (During UniDir): N/A 38.75 56.98 57.13

Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
SRC\DST CPU 00 GPU 00 GPU 01 GPU 02 GPU 03
CPU 00 → N/A 40.38 38.46 40.50 37.45
CPU 00 ← N/A 56.65 56.38 56.40 56.70
CPU 00 ↔ N/A 97.03 94.84 96.90 94.15

GPU 00 → 56.58 N/A 56.74 56.75 56.76
GPU 00 ← 39.54 N/A 56.76 56.76 56.75
GPU 00 ↔ 96.13 N/A 113.49 113.51 113.51

GPU 01 → 56.71 56.77 N/A 56.73 56.75
GPU 01 ← 36.98 56.76 N/A 56.74 56.74
GPU 01 ↔ 93.69 113.53 N/A 113.47 113.49

GPU 02 → 56.58 56.76 56.76 N/A 56.73
GPU 02 ← 38.88 56.75 56.73 N/A 56.73
GPU 02 ↔ 95.45 113.52 113.49 N/A 113.46

GPU 03 → 56.46 56.75 56.77 56.75 N/A
GPU 03 ← 41.21 56.72 56.74 56.71 N/A
GPU 03 ↔ 97.67 113.47 113.51 113.46 N/A
CPU->CPU CPU->GPU GPU->CPU GPU->GPU
Averages (During BiDir): N/A 47.86 47.87 56.75

(rocm_env) dev@dev-WRX90-WS-EVO:~$ /opt/rocm-7.1.1/bin/rocm-bandwidth-test run tb scaling
TransferBench v1.64.00

[Common] (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE = 0 : Validating after all iterations
BLOCK_BYTES = 256 : Each CU gets a mulitple of 256 bytes to copy
BYTE_OFFSET = 0 : Using byte offset of 0
CU_MASK = 0 : All
FILL_COMPRESS = 0 : Not specified
FILL_PATTERN = 0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_ORDER = 0 : Thread block ordering: Sequential
GFX_BLOCK_SIZE = 256 : Threadblock size of 256
GFX_SINGLE_TEAM = 1 : Combining CUs to work across entire data array
GFX_TEMPORAL = 0 : Not using non-temporal loads/stores
GFX_UNROLL = 4 : Using GFX unroll factor of 4
GFX_WAVE_ORDER = 0 : Using GFX wave ordering of Unroll,Wavefront,CU
GFX_WORD_SIZE = 4 : Using GFX word size of 4 (DWORDx4)
MIN_VAR_SUBEXEC = 1 : Using at least 1 subexecutor(s) for variable subExec tranfers
MAX_VAR_SUBEXEC = 0 : Using up to all available subexecutors for variable subExec transfers
NUM_ITERATIONS = 10 : Running 10 timed iteration(s)
NUM_SUBITERATIONS = 1 : Running 1 subiterations
NUM_WARMUPS = 3 : Running 3 warmup iteration(s) per Test
SHOW_ITERATIONS = 0 : Hiding per-iteration timing
USE_HIP_EVENTS = 1 : Using HIP events for GFX/DMA Executor timing
USE_HSA_DMA = 0 : Using hipMemcpyAsync for DMA execution
USE_INTERACTIVE = 0 : Running in non-interactive mode
USE_SINGLE_STREAM = 1 : Using single stream per GFX device
VALIDATE_DIRECT = 0 : Validate GPU destination memory via CPU staging buffer
VALIDATE_SOURCE = 0 : Do not perform source validation after prep

[Scaling Related]
LOCAL_IDX = 0 : Local GPU index
SWEEP_MAX = 32 : Max number of subExecutors to use
SWEEP_MIN = 1 : Min number of subExecutors to use

GPU-GFX Scaling benchmark:

  • Copying 268435456 bytes from GPU 0 to other devices
  • All numbers reported as GB/sec

NumCUs CPU00 GPU00 GPU01 GPU02 GPU03
1 45.36 54.09 55.16 55.01 54.74
2 53.57 98.37 56.83 56.73 56.75
3 56.63 131.03 56.91 56.91 56.89
4 55.10 164.87 56.96 56.97 56.98
5 57.05 191.96 57.02 56.98 57.01
6 57.10 212.33 57.04 57.03 57.04
7 57.11 224.46 57.07 57.06 57.07
8 56.80 237.17 57.10 57.11 57.11
9 57.11 241.47 57.10 57.11 57.11
10 57.11 246.12 57.09 57.10 57.10
11 57.12 247.48 57.11 57.11 57.11
12 56.78 249.92 57.13 57.12 57.12
13 57.11 250.95 57.11 57.11 57.11
14 56.65 252.28 57.06 57.06 57.06
15 56.91 250.80 57.11 57.12 57.12
16 56.87 254.97 57.14 57.14 57.14
17 56.84 251.79 57.03 57.01 57.03
18 56.76 254.53 57.03 57.03 57.03
19 56.94 253.75 56.95 56.99 56.97
20 56.69 253.60 57.14 57.14 57.14
21 56.98 254.03 56.98 56.97 56.98
22 56.79 255.83 56.98 56.94 56.94
23 56.97 254.90 56.99 56.98 56.99
24 56.96 256.34 57.13 57.13 57.13
25 56.96 254.56 56.99 56.98 56.99
26 56.95 256.47 57.00 56.99 56.98
27 57.00 255.25 56.99 56.99 57.00
28 56.89 256.68 57.14 57.14 57.13
29 57.04 255.55 57.04 57.05 57.06
30 57.04 256.84 57.07 57.07 57.06
31 57.11 255.83 57.08 57.08 57.08
32 57.08 257.76 57.13 57.13 57.13
Best 57.12( 11) 257.76( 32) 57.14( 28) 57.14( 28) 57.14( 16)
(rocm_env) dev@dev-WRX90-WS-EVO:~$ /opt/rocm-7.1.1/bin/rocm-bandwidth-test run tb scaling
TransferBench v1.64.00

[Common] (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE = 0 : Validating after all iterations
BLOCK_BYTES = 256 : Each CU gets a mulitple of 256 bytes to copy
BYTE_OFFSET = 0 : Using byte offset of 0
CU_MASK = 0 : All
FILL_COMPRESS = 0 : Not specified
FILL_PATTERN = 0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_ORDER = 0 : Thread block ordering: Sequential
GFX_BLOCK_SIZE = 256 : Threadblock size of 256
GFX_SINGLE_TEAM = 1 : Combining CUs to work across entire data array
GFX_TEMPORAL = 0 : Not using non-temporal loads/stores
GFX_UNROLL = 4 : Using GFX unroll factor of 4
GFX_WAVE_ORDER = 0 : Using GFX wave ordering of Unroll,Wavefront,CU
GFX_WORD_SIZE = 4 : Using GFX word size of 4 (DWORDx4)
MIN_VAR_SUBEXEC = 1 : Using at least 1 subexecutor(s) for variable subExec tranfers
MAX_VAR_SUBEXEC = 0 : Using up to all available subexecutors for variable subExec transfers
NUM_ITERATIONS = 10 : Running 10 timed iteration(s)
NUM_SUBITERATIONS = 1 : Running 1 subiterations
NUM_WARMUPS = 3 : Running 3 warmup iteration(s) per Test
SHOW_ITERATIONS = 0 : Hiding per-iteration timing
USE_HIP_EVENTS = 1 : Using HIP events for GFX/DMA Executor timing
USE_HSA_DMA = 0 : Using hipMemcpyAsync for DMA execution
USE_INTERACTIVE = 0 : Running in non-interactive mode
USE_SINGLE_STREAM = 1 : Using single stream per GFX device
VALIDATE_DIRECT = 0 : Validate GPU destination memory via CPU staging buffer
VALIDATE_SOURCE = 0 : Do not perform source validation after prep

[Scaling Related]
LOCAL_IDX = 0 : Local GPU index
SWEEP_MAX = 32 : Max number of subExecutors to use
SWEEP_MIN = 1 : Min number of subExecutors to use

GPU-GFX Scaling benchmark:

  • Copying 268435456 bytes from GPU 0 to other devices
  • All numbers reported as GB/sec

NumCUs CPU00 GPU00 GPU01 GPU02 GPU03
1 43.47 54.24 55.16 55.00 54.72
2 51.82 99.34 56.76 56.77 56.76
3 56.87 134.27 56.90 56.91 56.90
4 53.50 166.40 56.95 56.97 56.98
5 56.99 191.50 57.01 57.00 57.00
6 56.99 213.76 57.01 56.99 56.98
7 57.12 225.28 57.09 57.08 57.07
8 56.78 237.29 57.10 57.11 57.10
9 57.07 242.02 57.11 57.10 57.10
10 57.06 246.57 57.10 57.10 57.11
11 56.80 247.66 57.10 57.11 57.10
12 56.58 249.88 57.12 57.13 57.12
13 56.80 251.37 57.12 57.11 57.11
14 56.68 252.05 57.07 57.06 57.06
15 56.78 251.12 57.11 57.12 57.12
16 56.76 254.81 57.14 57.14 57.15
17 56.77 251.85 57.02 57.01 57.03
18 56.70 254.47 57.03 57.02 57.06
19 56.93 253.49 56.96 56.98 56.96
20 56.54 253.46 57.14 57.14 57.14
21 57.01 254.40 56.98 56.97 56.99
22 56.75 255.91 56.95 56.93 56.93
23 56.97 254.58 56.99 56.98 57.00
24 56.91 255.95 57.12 57.12 57.13
25 57.05 255.04 56.99 56.98 56.97
26 56.91 256.18 56.99 57.00 57.00
27 56.97 255.14 57.01 57.00 57.00
28 56.65 256.92 57.14 57.14 57.14
29 57.08 255.32 57.04 57.04 57.06
30 57.05 256.08 57.06 57.07 57.07
31 57.09 256.00 57.08 57.08 57.08
32 56.99 257.97 57.12 57.12 57.13
Best 57.12( 7) 257.97( 32) 57.14( 16) 57.14( 16) 57.15( 16)
(rocm_env) dev@dev-WRX90-WS-EVO:~$


root@10506e8af016:/app# vllm bench throughput \
  --model mistralai/Mistral-Small-24B-Instruct-2501 \
  --num-prompts 32 \
  --input-len 128 \
  --output-len 128 \
  --tensor-parallel-size 4
INFO 12-25 22:08:16 [init.py:241] Automatically detected platform rocm.
When dataset path is not set, it will default to random dataset
The tokenizer you are loading from ‘mistralai/Mistral-Small-24B-Instruct-2501’ with an incorrect regex pattern: mistralai/Mistral-Small-3.1-24B-Instruct-2503 · regex pattern. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
INFO 12-25 22:08:18 [datasets.py:509] Sampling input_len from [127, 127] and output_len from [128, 128]
INFO 12-25 22:08:18 [utils.py:328] non-default args: {‘tokenizer’: ‘mistralai/Mistral-Small-24B-Instruct-2501’, ‘seed’: 0, ‘tensor_parallel_size’: 4, ‘num_redundant_experts’: None, ‘eplb_window_size’: None, ‘eplb_step_interval’: None, ‘eplb_log_balancedness’: None, ‘enable_lora’: None, ‘model’: ‘mistralai/Mistral-Small-24B-Instruct-2501’}
INFO 12-25 22:08:22 [init.py:744] Resolved architecture: MistralForCausalLM
torch_dtype is deprecated! Use dtype instead!
INFO 12-25 22:08:22 [init.py:1773] Using max model len 32768
INFO 12-25 22:08:22 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
/opt/venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:27: FutureWarning: It is strongly recommended to run mistral models with --tokenizer-mode "mistral" to ensure correct encoding and decoding.
self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
The tokenizer you are loading from ‘mistralai/Mistral-Small-24B-Instruct-2501’ with an incorrect regex pattern: mistralai/Mistral-Small-3.1-24B-Instruct-2503 · regex pattern. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
INFO 12-25 22:08:25 [init.py:241] Automatically detected platform rocm.
(EngineCore_0 pid=5310) INFO 12-25 22:08:26 [core.py:648] Waiting for init message from front-end.
(EngineCore_0 pid=5310) INFO 12-25 22:08:26 [core.py:75] Initializing a V1 LLM engine (v0.10.2rc2.dev3+gdff7f6f03) with config: model=‘mistralai/Mistral-Small-24B-Instruct-2501’, speculative_config=None, tokenizer=‘mistralai/Mistral-Small-24B-Instruct-2501’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=mistralai/Mistral-Small-24B-Instruct-2501, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={“level”:3,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:,“splitting_ops”:[“vllm.unified_attention”,“vllm.unified_attention_with_output”,“vllm.mamba_mixer2”,“vllm.mamba_mixer”,“vllm.short_conv”,“vllm.linear_attention”],“use_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“cudagraph_mode”:1,“use_cudagraph”:true,“cudagraph_num_of_warmups”:1,“cudagraph_capture_sizes”:[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“pass_config”:{},“max_capture_size”:512,“local_cache_dir”:null}
(EngineCore_0 pid=5310) WARNING 12-25 22:08:26 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_0 pid=5310) INFO 12-25 22:08:26 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, ‘psm_ac49546c’), local_subscribe_addr=‘ipc:///tmp/7109014f-8e5c-4968-b4d7-06e41d7c4861’, remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 12-25 22:08:29 [init.py:241] Automatically detected platform rocm.
INFO 12-25 22:08:29 [init.py:241] Automatically detected platform rocm.
INFO 12-25 22:08:29 [init.py:241] Automatically detected platform rocm.
INFO 12-25 22:08:29 [init.py:241] Automatically detected platform rocm.
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:30 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_fefb6af4’), local_subscribe_addr=‘ipc:///tmp/8cdde45d-79d6-4ca5-80fe-35b7b5cd4c3c’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:30 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_980ee535’), local_subscribe_addr=‘ipc:///tmp/40b1831f-41eb-49a3-8e81-6b2e3716554f’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:30 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_a132d6cf’), local_subscribe_addr=‘ipc:///tmp/aa142362-48ed-47fa-b57a-930e282c3ac2’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:30 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_ea645c27’), local_subscribe_addr=‘ipc:///tmp/dfa6bc38-f2b1-4254-b39e-b1d7af645d43’, remote_subscribe_addr=None, remote_addr_ipv6=False)
[W1225 22:08:31.865554292 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1225 22:08:31.865776114 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1225 22:08:31.865953016 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1225 22:08:31.875680923 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:31 [init.py:1432] Found nccl from library librccl.so.1
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:31 [init.py:1432] Found nccl from library librccl.so.1
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:31 [init.py:1432] Found nccl from library librccl.so.1
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:31 [init.py:1432] Found nccl from library librccl.so.1
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:31 [pynccl.py:70] vLLM is using nccl==2.27.7
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:31 [pynccl.py:70] vLLM is using nccl==2.27.7
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:31 [pynccl.py:70] vLLM is using nccl==2.27.7
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:31 [pynccl.py:70] vLLM is using nccl==2.27.7
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:34 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, ‘psm_36af536e’), local_subscribe_addr=‘ipc:///tmp/483988ed-519c-41b0-ad79-e46e83a8b6e1’, remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:34 [parallel_state.py:1134] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:34 [parallel_state.py:1134] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:34 [parallel_state.py:1134] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:34 [parallel_state.py:1134] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:34 [gpu_model_runner.py:1932] Starting to load model mistralai/Mistral-Small-24B-Instruct-2501…
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:34 [gpu_model_runner.py:1932] Starting to load model mistralai/Mistral-Small-24B-Instruct-2501…
(VllmWorker TP0 pid=5382) WARNING 12-25 22:08:34 [rocm.py:354] Model architecture ‘MistralForCausalLM’ is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting VLLM_USE_TRITON_FLASH_ATTN=0
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:34 [gpu_model_runner.py:1932] Starting to load model mistralai/Mistral-Small-24B-Instruct-2501…
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:34 [gpu_model_runner.py:1932] Starting to load model mistralai/Mistral-Small-24B-Instruct-2501…
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:34 [gpu_model_runner.py:1964] Loading model from scratch…
(VllmWorker TP1 pid=5383) WARNING 12-25 22:08:34 [rocm.py:354] Model architecture ‘MistralForCausalLM’ is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting VLLM_USE_TRITON_FLASH_ATTN=0
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:34 [gpu_model_runner.py:1964] Loading model from scratch…
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:34 [gpu_model_runner.py:1964] Loading model from scratch…
(VllmWorker TP3 pid=5385) WARNING 12-25 22:08:34 [rocm.py:354] Model architecture ‘MistralForCausalLM’ is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting VLLM_USE_TRITON_FLASH_ATTN=0
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:34 [gpu_model_runner.py:1964] Loading model from scratch…
(VllmWorker TP2 pid=5384) WARNING 12-25 22:08:34 [rocm.py:354] Model architecture ‘MistralForCausalLM’ is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting VLLM_USE_TRITON_FLASH_ATTN=0
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:35 [rocm.py:248] Using Triton Attention backend on V1 engine.
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:35 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:35 [rocm.py:248] Using Triton Attention backend on V1 engine.
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:35 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:35 [rocm.py:248] Using Triton Attention backend on V1 engine.
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:35 [rocm.py:248] Using Triton Attention backend on V1 engine.
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:35 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:35 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:35 [weight_utils.py:304] Using model weights format ['*.safetensors']
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:35 [weight_utils.py:304] Using model weights format ['*.safetensors']
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:35 [weight_utils.py:304] Using model weights format ['*.safetensors']
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:35 [weight_utils.py:304] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 10% Completed | 1/10 [00:00<00:01, 5.11it/s]
Loading safetensors checkpoint shards: 20% Completed | 2/10 [00:00<00:01, 4.45it/s]
Loading safetensors checkpoint shards: 30% Completed | 3/10 [00:00<00:01, 4.10it/s]
Loading safetensors checkpoint shards: 40% Completed | 4/10 [00:00<00:01, 3.99it/s]
Loading safetensors checkpoint shards: 50% Completed | 5/10 [00:01<00:01, 3.98it/s]
Loading safetensors checkpoint shards: 60% Completed | 6/10 [00:01<00:01, 3.92it/s]
Loading safetensors checkpoint shards: 70% Completed | 7/10 [00:01<00:00, 3.86it/s]
Loading safetensors checkpoint shards: 80% Completed | 8/10 [00:01<00:00, 3.93it/s]
Loading safetensors checkpoint shards: 90% Completed | 9/10 [00:02<00:00, 4.17it/s]
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:37 [default_loader.py:267] Loading weights took 2.44 seconds
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:02<00:00, 4.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:02<00:00, 4.07it/s]
(VllmWorker TP0 pid=5382)
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:37 [default_loader.py:267] Loading weights took 2.48 seconds
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:38 [default_loader.py:267] Loading weights took 2.45 seconds
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:38 [default_loader.py:267] Loading weights took 2.46 seconds
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:38 [gpu_model_runner.py:1986] Model loading took 11.0488 GiB and 2.866677 seconds
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:38 [gpu_model_runner.py:1986] Model loading took 11.0488 GiB and 3.002226 seconds
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:38 [gpu_model_runner.py:1986] Model loading took 11.0488 GiB and 3.086888 seconds
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:38 [gpu_model_runner.py:1986] Model loading took 11.0488 GiB and 3.250037 seconds
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:41 [backends.py:538] Using cache directory: /root/.cache/vllm/torch_compile_cache/8c662d49f0/rank_1_0/backbone for vLLM’s torch.compile
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:41 [backends.py:549] Dynamo bytecode transform time: 2.42 s
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:41 [backends.py:538] Using cache directory: /root/.cache/vllm/torch_compile_cache/8c662d49f0/rank_0_0/backbone for vLLM’s torch.compile
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:41 [backends.py:538] Using cache directory: /root/.cache/vllm/torch_compile_cache/8c662d49f0/rank_2_0/backbone for vLLM’s torch.compile
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:41 [backends.py:549] Dynamo bytecode transform time: 2.43 s
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:41 [backends.py:549] Dynamo bytecode transform time: 2.43 s
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:41 [backends.py:538] Using cache directory: /root/.cache/vllm/torch_compile_cache/8c662d49f0/rank_3_0/backbone for vLLM’s torch.compile
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:41 [backends.py:549] Dynamo bytecode transform time: 2.44 s
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:41 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:41 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:41 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:41 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:45 [backends.py:215] Compiling a graph for dynamic shape takes 3.26 s
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:45 [backends.py:215] Compiling a graph for dynamic shape takes 3.27 s
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:45 [backends.py:215] Compiling a graph for dynamic shape takes 3.26 s
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:45 [backends.py:215] Compiling a graph for dynamic shape takes 3.28 s
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:45 [monitor.py:34] torch.compile takes 5.70 s in total
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:45 [monitor.py:34] torch.compile takes 5.69 s in total
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:45 [monitor.py:34] torch.compile takes 5.69 s in total
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:45 [monitor.py:34] torch.compile takes 5.71 s in total
(VllmWorker TP1 pid=5383) INFO 12-25 22:08:48 [gpu_worker.py:276] Available KV cache memory: 13.43 GiB
(VllmWorker TP3 pid=5385) INFO 12-25 22:08:48 [gpu_worker.py:276] Available KV cache memory: 13.43 GiB
(VllmWorker TP2 pid=5384) INFO 12-25 22:08:48 [gpu_worker.py:276] Available KV cache memory: 13.43 GiB
(VllmWorker TP0 pid=5382) INFO 12-25 22:08:48 [gpu_worker.py:276] Available KV cache memory: 13.43 GiB
(EngineCore_0 pid=5310) INFO 12-25 22:08:49 [kv_cache_utils.py:850] GPU KV cache size: 352,064 tokens
(EngineCore_0 pid=5310) INFO 12-25 22:08:49 [kv_cache_utils.py:854] Maximum concurrency for 32,768 tokens per request: 10.74x
(EngineCore_0 pid=5310) INFO 12-25 22:08:49 [kv_cache_utils.py:850] GPU KV cache size: 352,064 tokens
(EngineCore_0 pid=5310) INFO 12-25 22:08:49 [kv_cache_utils.py:854] Maximum concurrency for 32,768 tokens per request: 10.74x
(EngineCore_0 pid=5310) INFO 12-25 22:08:49 [kv_cache_utils.py:850] GPU KV cache size: 352,064 tokens
(EngineCore_0 pid=5310) INFO 12-25 22:08:49 [kv_cache_utils.py:854] Maximum concurrency for 32,768 tokens per request: 10.74x
(EngineCore_0 pid=5310) INFO 12-25 22:08:49 [kv_cache_utils.py:850] GPU KV cache size: 352,064 tokens
(EngineCore_0 pid=5310) INFO 12-25 22:08:49 [kv_cache_utils.py:854] Maximum concurrency for 32,768 tokens per request: 10.74x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 66/67 [00:12<00:00, 5.94it/s](VllmWorker TP2 pid=5384) INFO 12-25 22:09:02 [gpu_model_runner.py:2687] Graph capturing finished in 13 secs, took 4.08 GiB
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:12<00:00, 5.20it/s]
(VllmWorker TP0 pid=5382) INFO 12-25 22:09:02 [gpu_model_runner.py:2687] Graph capturing finished in 13 secs, took 4.08 GiB
(VllmWorker TP3 pid=5385) INFO 12-25 22:09:02 [gpu_model_runner.py:2687] Graph capturing finished in 13 secs, took 4.08 GiB
(VllmWorker TP1 pid=5383) INFO 12-25 22:09:02 [gpu_model_runner.py:2687] Graph capturing finished in 13 secs, took 4.08 GiB
(EngineCore_0 pid=5310) INFO 12-25 22:09:02 [core.py:217] init engine (profile, create kv cache, warmup model) took 23.71 seconds
(EngineCore_0 pid=5310) /opt/venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:27: FutureWarning: It is strongly recommended to run mistral models with --tokenizer-mode "mistral" to ensure correct encoding and decoding.
(EngineCore_0 pid=5310) self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
(EngineCore_0 pid=5310) The tokenizer you are loading from ‘mistralai/Mistral-Small-24B-Instruct-2501’ with an incorrect regex pattern: mistralai/Mistral-Small-3.1-24B-Instruct-2503 · regex pattern. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
INFO 12-25 22:09:03 [llm.py:285] Supported_tasks: [‘generate’]
INFO 12-25 22:09:03 [init.py:36] No IOProcessor plugins requested by the model
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 3218.80it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:06<00:00, 4.70it/s, est. speed input: 601.45 toks/s, output: 601.45 toks/s]
(VllmWorker TP0 pid=5382) INFO 12-25 22:09:09 [multiproc_executor.py:535] Parent process exited, terminating worker
(VllmWorker TP1 pid=5383) INFO 12-25 22:09:09 [multiproc_executor.py:535] Parent process exited, terminating worker
(VllmWorker TP2 pid=5384) INFO 12-25 22:09:09 [multiproc_executor.py:535] Parent process exited, terminating worker
(VllmWorker TP3 pid=5385) INFO 12-25 22:09:09 [multiproc_executor.py:535] Parent process exited, terminating worker
Throughput: 4.69 requests/s, 1201.01 total tokens/s, 600.50 output tokens/s
Total num prompt tokens: 4096
Total num output tokens: 4096

Keep in mind, my system is essentially in a refrigerator (close to 40 °F / about 4 °C ambient), so the temperatures below run cooler than you should expect…

dev@dev-WRX90-WS-EVO:~$ amd-smi monitor
GPU  XCP  POWER  GPU_T  MEM_T  GFX_CLK   GFX%  MEM%  ENC%  DEC%  VRAM_USAGE
  0    0    4 W  57 °C  58 °C     8 MHz   1 %   0 %   N/A   0 %  0.2/29.9 GB
  1    0    4 W  45 °C  46 °C     6 MHz   1 %   0 %   N/A   0 %  0.1/29.9 GB
  2    0   24 W  56 °C  54 °C   468 MHz   1 %   0 %   N/A   0 %  0.1/29.9 GB
  3    0   18 W  47 °C  46 °C  1784 MHz   6 %   0 %   N/A   0 %  0.1/29.9 GB

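If you want to keep those numbers on screen while a job is running, wrapping the same command in plain watch(1) does the job. amd-smi monitor may also have its own refresh option, but I haven't dug into it:

# refresh the amd-smi monitor table every 2 seconds (Ctrl-C to stop)
watch -n 2 amd-smi monitor
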
I will stop spamming your thread now. Last one.

Oh, and here is how I started the container:

(rocm_env) dev@dev-WRX90-WS-EVO:~$ docker run -it --rm -p 8000:8000 --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host -v ~/.cache/huggingface:/root/.cache/huggingface rocm/vllm-dev:rocm7.1.1_navi_ubuntu24.04_py3.12_pytorch_2.8_vllm_0.10.2rc1

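Since that one-liner is dense, here is the same invocation spelled out as a bash array so each flag can carry a comment. Same image tag and cache path as above; the comments are my understanding of why each flag is there, not anything official:

# Same container invocation as the one-liner above, with per-flag comments.
args=(
  -it --rm
  -p 8000:8000                          # publish the vLLM API port to the host
  --device=/dev/kfd                     # ROCm compute interface (KFD)
  --device=/dev/dri                     # GPU render nodes
  --group-add video                     # group that owns the GPU device nodes
  --cap-add=SYS_PTRACE                  # allow debuggers/profilers to attach inside the container
  --security-opt seccomp=unconfined     # keep the default seccomp profile from blocking ROCm ioctls
  --ipc=host                            # host IPC namespace / big shared memory for the TP workers
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface"   # reuse the host's Hugging Face cache
)
docker run "${args[@]}" rocm/vllm-dev:rocm7.1.1_navi_ubuntu24.04_py3.12_pytorch_2.8_vllm_0.10.2rc1
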
Single requests are much slower. I think it's because my prompts are fairly small and the model is split across 4 GPUs, so the inter-GPU communication overhead ends up dominating rather than the compute. Just a guess based on what Gemini said, but it makes sense.
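
If you want to see that single-request behavior for yourself, a timed request against the OpenAI-compatible endpoint is the easiest check once the server started below is up. The prompt and max_tokens here are just placeholders:

# single chat completion against the server below (model name must match what was served)
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-Small-24B-Instruct-2501",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 128
      }'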

root@10506e8af016:/app# vllm serve mistralai/Mistral-Small-24B-Instruct-2501 --tensor-parallel-size 4 --max-model-len 8192 --host 0.0.0.0 --port 8000
INFO 12-25 21:40:04 [init.py:241] Automatically detected platform rocm.
(APIServer pid=734) INFO 12-25 21:40:06 [api_server.py:1882] vLLM API server version 0.10.2rc2.dev3+gdff7f6f03
(APIServer pid=734) INFO 12-25 21:40:06 [utils.py:328] non-default args: {‘model_tag’: ‘mistralai/Mistral-Small-24B-Instruct-2501’, ‘host’: ‘0.0.0.0’, ‘model’: ‘mistralai/Mistral-Small-24B-Instruct-2501’, ‘max_model_len’: 8192, ‘tensor_parallel_size’: 4}
(APIServer pid=734) INFO 12-25 21:40:10 [init.py:744] Resolved architecture: MistralForCausalLM
(APIServer pid=734) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=734) INFO 12-25 21:40:10 [init.py:1773] Using max model len 8192
(APIServer pid=734) INFO 12-25 21:40:10 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=734) /opt/venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:27: FutureWarning: It is strongly recommended to run mistral models with --tokenizer-mode "mistral" to ensure correct encoding and decoding.
(APIServer pid=734) self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
(APIServer pid=734) The tokenizer you are loading from ‘mistralai/Mistral-Small-24B-Instruct-2501’ with an incorrect regex pattern: mistralai/Mistral-Small-3.1-24B-Instruct-2503 · regex pattern. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
INFO 12-25 21:40:13 [init.py:241] Automatically detected platform rocm.
(EngineCore_0 pid=876) INFO 12-25 21:40:14 [core.py:648] Waiting for init message from front-end.
(EngineCore_0 pid=876) INFO 12-25 21:40:14 [core.py:75] Initializing a V1 LLM engine (v0.10.2rc2.dev3+gdff7f6f03) with config: model=‘mistralai/Mistral-Small-24B-Instruct-2501’, speculative_config=None, tokenizer=‘mistralai/Mistral-Small-24B-Instruct-2501’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=mistralai/Mistral-Small-24B-Instruct-2501, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={“level”:3,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:,“splitting_ops”:[“vllm.unified_attention”,“vllm.unified_attention_with_output”,“vllm.mamba_mixer2”,“vllm.mamba_mixer”,“vllm.short_conv”,“vllm.linear_attention”],“use_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“cudagraph_mode”:1,“use_cudagraph”:true,“cudagraph_num_of_warmups”:1,“cudagraph_capture_sizes”:[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“pass_config”:{},“max_capture_size”:512,“local_cache_dir”:null}
(EngineCore_0 pid=876) WARNING 12-25 21:40:14 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_0 pid=876) INFO 12-25 21:40:14 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, ‘psm_9d939a9f’), local_subscribe_addr=‘ipc:///tmp/42c5b777-729d-4bc4-893e-cdef5539d298’, remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 12-25 21:40:17 [init.py:241] Automatically detected platform rocm.
INFO 12-25 21:40:17 [init.py:241] Automatically detected platform rocm.
INFO 12-25 21:40:17 [init.py:241] Automatically detected platform rocm.
INFO 12-25 21:40:17 [init.py:241] Automatically detected platform rocm.
(VllmWorker TP1 pid=949) INFO 12-25 21:40:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_022650eb’), local_subscribe_addr=‘ipc:///tmp/89e7fcbe-2bbd-4c64-86b7-4d53f023bcd4’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP2 pid=950) INFO 12-25 21:40:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_e95f6a81’), local_subscribe_addr=‘ipc:///tmp/ccc715e1-7117-4ce6-adb4-373048ddf6a5’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP0 pid=948) INFO 12-25 21:40:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_0a2f3e78’), local_subscribe_addr=‘ipc:///tmp/3a7f499a-83ef-4aba-9c9f-003f8e0975ae’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP3 pid=951) INFO 12-25 21:40:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_e7af2da4’), local_subscribe_addr=‘ipc:///tmp/8ca7025c-1abb-4182-93de-a262e079d811’, remote_subscribe_addr=None, remote_addr_ipv6=False)
[W1225 21:40:19.351781877 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1225 21:40:19.493788193 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1225 21:40:19.711231544 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1225 21:40:19.720086647 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
(VllmWorker TP1 pid=949) INFO 12-25 21:40:19 [init.py:1432] Found nccl from library librccl.so.1
(VllmWorker TP0 pid=948) INFO 12-25 21:40:19 [init.py:1432] Found nccl from library librccl.so.1
(VllmWorker TP3 pid=951) INFO 12-25 21:40:19 [init.py:1432] Found nccl from library librccl.so.1
(VllmWorker TP1 pid=949) INFO 12-25 21:40:19 [pynccl.py:70] vLLM is using nccl==2.27.7
(VllmWorker TP2 pid=950) INFO 12-25 21:40:19 [init.py:1432] Found nccl from library librccl.so.1
(VllmWorker TP0 pid=948) INFO 12-25 21:40:19 [pynccl.py:70] vLLM is using nccl==2.27.7
(VllmWorker TP3 pid=951) INFO 12-25 21:40:19 [pynccl.py:70] vLLM is using nccl==2.27.7
(VllmWorker TP2 pid=950) INFO 12-25 21:40:19 [pynccl.py:70] vLLM is using nccl==2.27.7
(VllmWorker TP0 pid=948) INFO 12-25 21:40:23 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, ‘psm_fcb6458b’), local_subscribe_addr=‘ipc:///tmp/126e5418-32f4-4119-aab8-c550c5d590f6’, remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
(VllmWorker TP2 pid=950) INFO 12-25 21:40:23 [parallel_state.py:1134] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker TP1 pid=949) INFO 12-25 21:40:23 [parallel_state.py:1134] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker TP0 pid=948) INFO 12-25 21:40:23 [parallel_state.py:1134] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker TP3 pid=951) INFO 12-25 21:40:23 [parallel_state.py:1134] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker TP3 pid=951) INFO 12-25 21:40:23 [gpu_model_runner.py:1932] Starting to load model mistralai/Mistral-Small-24B-Instruct-2501…
(VllmWorker TP0 pid=948) WARNING 12-25 21:40:23 [rocm.py:354] Model architecture ‘MistralForCausalLM’ is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting VLLM_USE_TRITON_FLASH_ATTN=0
(VllmWorker TP2 pid=950) INFO 12-25 21:40:23 [gpu_model_runner.py:1932] Starting to load model mistralai/Mistral-Small-24B-Instruct-2501…
(VllmWorker TP1 pid=949) INFO 12-25 21:40:23 [gpu_model_runner.py:1932] Starting to load model mistralai/Mistral-Small-24B-Instruct-2501…
(VllmWorker TP0 pid=948) INFO 12-25 21:40:23 [gpu_model_runner.py:1932] Starting to load model mistralai/Mistral-Small-24B-Instruct-2501…
(VllmWorker TP1 pid=949) INFO 12-25 21:40:23 [gpu_model_runner.py:1964] Loading model from scratch…
(VllmWorker TP1 pid=949) WARNING 12-25 21:40:23 [rocm.py:354] Model architecture ‘MistralForCausalLM’ is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting VLLM_USE_TRITON_FLASH_ATTN=0
(VllmWorker TP2 pid=950) INFO 12-25 21:40:23 [gpu_model_runner.py:1964] Loading model from scratch…
(VllmWorker TP2 pid=950) WARNING 12-25 21:40:23 [rocm.py:354] Model architecture ‘MistralForCausalLM’ is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting VLLM_USE_TRITON_FLASH_ATTN=0
(VllmWorker TP3 pid=951) INFO 12-25 21:40:23 [gpu_model_runner.py:1964] Loading model from scratch…
(VllmWorker TP3 pid=951) WARNING 12-25 21:40:23 [rocm.py:354] Model architecture ‘MistralForCausalLM’ is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting VLLM_USE_TRITON_FLASH_ATTN=0
(VllmWorker TP0 pid=948) INFO 12-25 21:40:23 [gpu_model_runner.py:1964] Loading model from scratch…
(VllmWorker TP2 pid=950) INFO 12-25 21:40:24 [rocm.py:248] Using Triton Attention backend on V1 engine.
(VllmWorker TP2 pid=950) INFO 12-25 21:40:24 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(VllmWorker TP3 pid=951) INFO 12-25 21:40:24 [rocm.py:248] Using Triton Attention backend on V1 engine.
(VllmWorker TP3 pid=951) INFO 12-25 21:40:24 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(VllmWorker TP1 pid=949) INFO 12-25 21:40:24 [rocm.py:248] Using Triton Attention backend on V1 engine.
(VllmWorker TP1 pid=949) INFO 12-25 21:40:24 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(VllmWorker TP0 pid=948) INFO 12-25 21:40:24 [rocm.py:248] Using Triton Attention backend on V1 engine.
(VllmWorker TP0 pid=948) INFO 12-25 21:40:24 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(VllmWorker TP2 pid=950) INFO 12-25 21:40:24 [weight_utils.py:304] Using model weights format ['*.safetensors']
(VllmWorker TP1 pid=949) INFO 12-25 21:40:24 [weight_utils.py:304] Using model weights format ['*.safetensors']
(VllmWorker TP0 pid=948) INFO 12-25 21:40:24 [weight_utils.py:304] Using model weights format ['*.safetensors']
(VllmWorker TP3 pid=951) INFO 12-25 21:40:24 [weight_utils.py:304] Using model weights format ['*.safetensors']
model-00004-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.89G/4.89G [04:09<00:00, 19.6MB/s]
model-00001-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.78G/4.78G [04:44<00:00, 16.8MB/s]
model-00003-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.78G/4.78G [04:53<00:00, 16.3MB/s]
model-00002-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.78G/4.78G [04:58<00:00, 16.0MB/s]
model-00007-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.89G/4.89G [05:00<00:00, 16.3MB/s]
model-00005-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.78G/4.78G [05:24<00:00, 14.7MB/s]
model-00006-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.78G/4.78G [05:28<00:00, 14.5MB/s]
model-00010-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.90G/3.90G [02:12<00:00, 29.4MB/s]
model-00008-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.78G/4.78G [03:20<00:00, 23.9MB/s]
model-00009-of-00010.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.78G/4.78G [02:47<00:00, 28.5MB/s]
consolidated.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47.1G/47.1G [12:40<00:00, 62.0MB/s]
(VllmWorker TP2 pid=950) INFO 12-25 21:53:04 [weight_utils.py:325] Time spent downloading weights for mistralai/Mistral-Small-24B-Instruct-2501: 760.692362 seconds
model.safetensors.index.json: 29.9kB [00:00, 242MB/s]
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 10% Completed | 1/10 [00:00<00:02, 4.41it/s]
Loading safetensors checkpoint shards: 20% Completed | 2/10 [00:00<00:01, 4.10it/s]
Loading safetensors checkpoint shards: 30% Completed | 3/10 [00:00<00:01, 3.81it/s]
Loading safetensors checkpoint shards: 40% Completed | 4/10 [00:01<00:01, 3.81it/s]
Loading safetensors checkpoint shards: 50% Completed | 5/10 [00:01<00:01, 3.79it/s]
Loading safetensors checkpoint shards: 60% Completed | 6/10 [00:01<00:01, 3.73it/s]
Loading safetensors checkpoint shards: 70% Completed | 7/10 [00:01<00:00, 3.69it/s]
Loading safetensors checkpoint shards: 80% Completed | 8/10 [00:02<00:00, 3.69it/s]
(VllmWorker TP2 pid=950) INFO 12-25 21:53:07 [default_loader.py:267] Loading weights took 2.61 seconds
Loading safetensors checkpoint shards: 90% Completed | 9/10 [00:02<00:00, 3.93it/s]
(VllmWorker TP1 pid=949) INFO 12-25 21:53:07 [default_loader.py:267] Loading weights took 2.64 seconds
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:02<00:00, 4.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:02<00:00, 3.88it/s]
(VllmWorker TP0 pid=948)
(VllmWorker TP0 pid=948) INFO 12-25 21:53:08 [default_loader.py:267] Loading weights took 2.60 seconds
(VllmWorker TP3 pid=951) INFO 12-25 21:53:08 [default_loader.py:267] Loading weights took 2.56 seconds
(VllmWorker TP2 pid=950) INFO 12-25 21:53:08 [gpu_model_runner.py:1986] Model loading took 11.0645 GiB and 763.990910 seconds
(VllmWorker TP1 pid=949) INFO 12-25 21:53:08 [gpu_model_runner.py:1986] Model loading took 11.0645 GiB and 764.199423 seconds
(VllmWorker TP0 pid=948) INFO 12-25 21:53:08 [gpu_model_runner.py:1986] Model loading took 11.0645 GiB and 764.318582 seconds
(VllmWorker TP3 pid=951) INFO 12-25 21:53:08 [gpu_model_runner.py:1986] Model loading took 11.0645 GiB and 764.454001 seconds
(VllmWorker TP3 pid=951) INFO 12-25 21:53:11 [backends.py:538] Using cache directory: /root/.cache/vllm/torch_compile_cache/1d1592f64b/rank_3_0/backbone for vLLM’s torch.compile
(VllmWorker TP1 pid=949) INFO 12-25 21:53:11 [backends.py:538] Using cache directory: /root/.cache/vllm/torch_compile_cache/1d1592f64b/rank_1_0/backbone for vLLM’s torch.compile
(VllmWorker TP0 pid=948) INFO 12-25 21:53:11 [backends.py:538] Using cache directory: /root/.cache/vllm/torch_compile_cache/1d1592f64b/rank_0_0/backbone for vLLM’s torch.compile
(VllmWorker TP3 pid=951) INFO 12-25 21:53:11 [backends.py:549] Dynamo bytecode transform time: 2.58 s
(VllmWorker TP1 pid=949) INFO 12-25 21:53:11 [backends.py:549] Dynamo bytecode transform time: 2.58 s
(VllmWorker TP2 pid=950) INFO 12-25 21:53:11 [backends.py:538] Using cache directory: /root/.cache/vllm/torch_compile_cache/1d1592f64b/rank_2_0/backbone for vLLM’s torch.compile
(VllmWorker TP0 pid=948) INFO 12-25 21:53:11 [backends.py:549] Dynamo bytecode transform time: 2.58 s
(VllmWorker TP2 pid=950) INFO 12-25 21:53:11 [backends.py:549] Dynamo bytecode transform time: 2.58 s
(VllmWorker TP0 pid=948) INFO 12-25 21:53:13 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker TP3 pid=951) INFO 12-25 21:53:13 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker TP2 pid=950) INFO 12-25 21:53:13 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker TP1 pid=949) INFO 12-25 21:53:13 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker TP3 pid=951) INFO 12-25 21:53:24 [backends.py:215] Compiling a graph for dynamic shape takes 11.97 s
(VllmWorker TP1 pid=949) INFO 12-25 21:53:24 [backends.py:215] Compiling a graph for dynamic shape takes 11.97 s
(VllmWorker TP0 pid=948) INFO 12-25 21:53:24 [backends.py:215] Compiling a graph for dynamic shape takes 12.04 s
(VllmWorker TP2 pid=950) INFO 12-25 21:53:24 [backends.py:215] Compiling a graph for dynamic shape takes 12.04 s
(VllmWorker TP0 pid=948) INFO 12-25 21:53:25 [monitor.py:34] torch.compile takes 14.62 s in total
(VllmWorker TP2 pid=950) INFO 12-25 21:53:25 [monitor.py:34] torch.compile takes 14.62 s in total
(VllmWorker TP3 pid=951) INFO 12-25 21:53:25 [monitor.py:34] torch.compile takes 14.54 s in total
(VllmWorker TP1 pid=949) INFO 12-25 21:53:25 [monitor.py:34] torch.compile takes 14.56 s in total
(VllmWorker TP2 pid=950) INFO 12-25 21:53:27 [gpu_worker.py:276] Available KV cache memory: 13.46 GiB
(VllmWorker TP1 pid=949) INFO 12-25 21:53:27 [gpu_worker.py:276] Available KV cache memory: 13.46 GiB
(VllmWorker TP3 pid=951) INFO 12-25 21:53:27 [gpu_worker.py:276] Available KV cache memory: 13.46 GiB
(VllmWorker TP0 pid=948) INFO 12-25 21:53:27 [gpu_worker.py:276] Available KV cache memory: 13.46 GiB
(EngineCore_0 pid=876) INFO 12-25 21:53:27 [kv_cache_utils.py:850] GPU KV cache size: 352,752 tokens
(EngineCore_0 pid=876) INFO 12-25 21:53:27 [kv_cache_utils.py:854] Maximum concurrency for 8,192 tokens per request: 43.06x
(EngineCore_0 pid=876) INFO 12-25 21:53:27 [kv_cache_utils.py:850] GPU KV cache size: 352,752 tokens
(EngineCore_0 pid=876) INFO 12-25 21:53:27 [kv_cache_utils.py:854] Maximum concurrency for 8,192 tokens per request: 43.06x
(EngineCore_0 pid=876) INFO 12-25 21:53:27 [kv_cache_utils.py:850] GPU KV cache size: 352,752 tokens
(EngineCore_0 pid=876) INFO 12-25 21:53:27 [kv_cache_utils.py:854] Maximum concurrency for 8,192 tokens per request: 43.06x
(EngineCore_0 pid=876) INFO 12-25 21:53:27 [kv_cache_utils.py:850] GPU KV cache size: 352,752 tokens
(EngineCore_0 pid=876) INFO 12-25 21:53:27 [kv_cache_utils.py:854] Maximum concurrency for 8,192 tokens per request: 43.06x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:13<00:00, 5.04it/s]
(VllmWorker TP0 pid=948) INFO 12-25 21:53:41 [gpu_model_runner.py:2687] Graph capturing finished in 14 secs, took 3.81 GiB
(VllmWorker TP1 pid=949) INFO 12-25 21:53:41 [gpu_model_runner.py:2687] Graph capturing finished in 14 secs, took 3.81 GiB
(VllmWorker TP2 pid=950) INFO 12-25 21:53:41 [gpu_model_runner.py:2687] Graph capturing finished in 14 secs, took 3.81 GiB
(VllmWorker TP3 pid=951) INFO 12-25 21:53:41 [gpu_model_runner.py:2687] Graph capturing finished in 14 secs, took 3.81 GiB
(EngineCore_0 pid=876) INFO 12-25 21:53:41 [core.py:217] init engine (profile, create kv cache, warmup model) took 32.93 seconds
(EngineCore_0 pid=876) /opt/venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:27: FutureWarning: It is strongly recommended to run mistral models with --tokenizer-mode "mistral" to ensure correct encoding and decoding.
(EngineCore_0 pid=876) self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
(EngineCore_0 pid=876) The tokenizer you are loading from ‘mistralai/Mistral-Small-24B-Instruct-2501’ with an incorrect regex pattern: mistralai/Mistral-Small-3.1-24B-Instruct-2503 · regex pattern. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(APIServer pid=734) INFO 12-25 21:53:42 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 22047
(APIServer pid=734) INFO 12-25 21:53:42 [async_llm.py:166] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=734) INFO 12-25 21:53:42 [api_server.py:1680] Supported_tasks: [‘generate’]
(APIServer pid=734) WARNING 12-25 21:53:42 [init.py:1673] Default sampling parameters have been overridden by the model’s Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=734) INFO 12-25 21:53:42 [serving_responses.py:126] Using default chat sampling params from model: {‘temperature’: 0.15}
(APIServer pid=734) INFO 12-25 21:53:42 [serving_chat.py:137] Using default chat sampling params from model: {‘temperature’: 0.15}
(APIServer pid=734) INFO 12-25 21:53:42 [serving_completion.py:79] Using default completion sampling params from model: {‘temperature’: 0.15}
(APIServer pid=734) INFO 12-25 21:53:42 [api_server.py:1957] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:36] Available routes are:
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /docs, Methods: GET, HEAD
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /redoc, Methods: GET, HEAD
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=734) INFO 12-25 21:53:42 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=734) INFO: Started server process [734]
(APIServer pid=734) INFO: Waiting for application startup.
(APIServer pid=734) INFO: Application startup complete.
(APIServer pid=734) The tokenizer you are loading from ‘mistralai/Mistral-Small-24B-Instruct-2501’ with an incorrect regex pattern: mistralai/Mistral-Small-3.1-24B-Instruct-2503 · regex pattern. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(APIServer pid=734) INFO 12-25 21:56:21 [chat_utils.py:470] Detected the chat template content format to be ‘string’. You can set --chat-template-content-format to override this.
(APIServer pid=734) INFO 12-25 21:56:22 [loggers.py:123] Engine 000: Avg prompt throughput: 19.3 tokens/s, Avg generation throughput: 2.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=734) INFO 12-25 21:56:32 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=734) INFO: 172.17.0.1:49464 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=734) INFO 12-25 21:56:42 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=734) INFO 12-25 21:56:52 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
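
Two of the warnings in that log suggest tweaks for a future run: the SWA warning says to set VLLM_USE_TRITON_FLASH_ATTN=0 to fall back to CK flash attention, and the tokenizer FutureWarning recommends --tokenizer-mode mistral. I haven't benchmarked either on the R9700s yet, so treat this as a note-to-self rather than a recommendation; it's just the same serve command with both applied:

# untested variant: CK flash attention + mistral tokenizer mode, per the warnings above
VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve mistralai/Mistral-Small-24B-Instruct-2501 \
  --tokenizer-mode mistral \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --host 0.0.0.0 --port 8000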