NUMA isolation: CPUs / RAM

Hi,

I am running kernel 5.9.15, if that matters. I have an OCD issue with memory isolation in the kernel: I have two NUMA nodes in my server and I want to isolate a whole node for VMs. The cores are pretty easy to isolate, but the memory is not, or I am missing some kernel parameters for the memory spaces. I have a dirty workaround for now.

Kernel Parameters:

isolcpus=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
nohz_full=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
rcu_nocbs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
default_hugepagesz=1G
hugepagesz=1G
hugepages=128
hugepagesz=2M
hugepages=1835
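
A quick way to see how that reservation actually splits across the nodes after boot (a small sketch, assuming the usual sysfs layout):

# per-node hugepage counts
for node in /sys/devices/system/node/node*; do
    echo "$node:"
    grep . "$node"/hugepages/hugepages-*/nr_hugepages
done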

Boot time service:

# Assumes the standard sysfs layout for per-node hugepage pools.
nodes_path=/sys/devices/system/node

reserve_pages()
{
    # $1 = number of pages, $2 = node directory (e.g. node0)
    echo $1 > $nodes_path/$2/hugepages/hugepages-2048kB/nr_hugepages
    echo $1 > $nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages
}

# Release the hugepages that the boot parameters reserved on node0.
reserve_pages 0 node0
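
For reference, a minimal sketch of how that script could be wired in as the boot-time service (the unit name and script path are hypothetical):

# /etc/systemd/system/node0-hugepages.service  (hypothetical)
[Unit]
Description=Release hugepage reservation on NUMA node 0
After=sysinit.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/reserve-pages.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target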

Achievement:

node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 193365 MB
node 0 free: 191771 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64505 MB
node 1 free: 8 MB

Why / Problems:

  1. If there is still free memory available on node1, there is a huge performance penalty whenever a process hits the node1 memory pool. I have now shrunk that pool down to only 8 MB, but that is not perfect. The hugepage allocation actually varies between boots; sometimes only 61 or 62 of the 1 GB huge pages get allocated. Mixing 1 GB and 2 MB pages fills the space.

  2. I have to request a double allocation in the kernel parameters, because it is divided between both nodes, and then run an early boot service to deallocate the huge pages from node0.

  3. Why is there no default memory policy that isolates the node1 memory space from the kernel and applies to new processes by default? The biggest problem is processes started by the kernel itself, for example mdadm.

  4. There is no way to pin ramfs to a NUMA node. Ramfs is needed when you want to use a RAM disk as a normal block device. Tmpfs can be bound to a NUMA node, but only per mount point (see the sketch after this list).
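
For item 4: tmpfs accepts a NUMA memory policy as a mount option (mpol=, documented in the kernel's tmpfs docs); for ramfs, a partial workaround is to bind the writing process with numactl, since its pages are allocated according to the allocating task's policy. A minimal sketch, assuming node1 is the node dedicated to the VMs:

# tmpfs bound to node1 only; allocations come from node1 or fail
mount -t tmpfs -o size=32G,mpol=bind:1 tmpfs /mnt/vmram

# ramfs has no mpol option; best effort is to allocate through a bound process
mount -t ramfs ramfs /mnt/ramfs
numactl --membind=1 dd if=/dev/zero of=/mnt/ramfs/disk.img bs=1M count=1024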


Check out vfio-isolate

Edit: vfio-isolate is also available on Arch from the AUR, as well as a Python package.


Do you know if it is a wrapper around system memory utilities or an actual program?

These kernel parameters solved the problem:

memmap=0x0000002f7fffffff@0x0000000100000000
memmap=0x0000001000000000$0x0000003080000000

isolcpus=1-32:1/2
nohz_full=1-32:1/2
rcu_nocbs=1-32:1/2
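
Rough arithmetic behind those values (my reading of the memory map, so treat the exact node boundaries as assumptions):

# memmap=nn@ss forces a usable region, memmap=nn$ss reserves one.
# Usable: ~190 GiB starting at 4 GiB (node 0's RAM above the low hole).
printf '%d GiB usable\n'      $(( (0x0000002f7fffffff + 1) / 1024**3 ))   # 190
# Reserved: 64 GiB starting at ~194 GiB, i.e. node 1's RAM hidden from the kernel.
printf '%d GiB reserved\n'    $(( 0x0000001000000000 / 1024**3 ))         # 64
printf 'starting at %d GiB\n' $(( 0x0000003080000000 / 1024**3 ))         # 194
# isolcpus=1-32:1/2 uses the group syntax: every 2nd CPU starting at CPU 1,
# i.e. 1,3,5,...,31 -- node 1's CPUs.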

Achievement:

  ~ $ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 193167 MB
node 0 free: 192046 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10

 ~ $ numastat
                           node0           node1
numa_hit                  401234               0
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             15089               0
local_node                400938               0
other_node                   296               0

From the kernel parameters documentation (kernel-parameters.txt):

    mem=nn[KMG]     [KNL,BOOT] Force usage of a specific amount of memory
                    Amount of memory to be used in cases as follows:

                    1 for test;
                    2 when the kernel is not able to see the whole system memory;
                    3 memory that lies after 'mem=' boundary is excluded from
                     the hypervisor, then assigned to KVM guests.

                    [X86] Work as limiting max address. Use together
                    with memmap= to avoid physical address space collisions.
                    Without memmap= PCI devices could be placed at addresses
                    belonging to unused RAM.

                    Note that this only takes effects during boot time since
                    in above case 3, memory may need be hot added after boot
                    if system memory of hypervisor is not sufficient.

    memmap=nn[KMG]@ss[KMG]
                    [KNL] Force usage of a specific region of memory.
                    Region of memory to be used is from ss to ss+nn.
                    If @ss[KMG] is omitted, it is equivalent to mem=nn[KMG],
                    which limits max address to nn[KMG].
                    Multiple different regions can be specified,
                    comma delimited.
                    Example:
                            memmap=100M@2G,100M#3G,1G!1024G


    memmap=nn[KMG]$ss[KMG]
                    [KNL,ACPI] Mark specific memory as reserved.
                    Region of memory to be reserved is from ss to ss+nn.
                    Example: Exclude memory from 0x18690000-0x1869ffff
                             memmap=64K$0x18690000
                             or
                             memmap=0x10000$0x18690000
                    Some bootloaders may need an escape character before '$',
                    like Grub2, otherwise '$' and the following number
                    will be eaten.
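
A note on that escaping: with GRUB2 the reserve line usually has to be written with extra backslashes in /etc/default/grub so that a literal '$' survives both the shell sourcing and GRUB's own parsing; verify the final line in /proc/cmdline after regenerating grub.cfg. A sketch (treat the exact escaping as an assumption to test):

# /etc/default/grub -- '\\\$' should collapse to a plain '$' on the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="... memmap=0x0000001000000000\\\$0x0000003080000000"

# after grub-mkconfig -o /boot/grub/grub.cfg and a reboot:
cat /proc/cmdline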

I moved this back to unsolved for many reasons.

    mem=nn[KMG]     [KNL,BOOT] Force usage of a specific amount of memory
                    Amount of memory to be used in cases as follows:

                    1 for test;
                    2 when the kernel is not able to see the whole system memory;
                    3 memory that lies after 'mem=' boundary is excluded from
                     the hypervisor, then assigned to KVM guests.

I never figured out how that memory is intended to be assigned to KVM. I cannot find any reference on that topic.

Now I have dug into the systemd configuration, with satisfactory results.

/etc/systemd/system.conf

[Manager]
CPUAffinity=0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
NUMAPolicy=bind
NUMAMask=0
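
A quick way to check that the policy took effect (a sketch, assuming numactl is installed):

# PID 1's mappings should show the bound policy, e.g. "bind:0"
head -n 5 /proc/1/numa_maps
# Per-node memory usage of systemd and its children
numastat -p 1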

That actually worked for everything except the kernel itself. I still keep the boot parameters that isolate the CPUs, because without them the memory gets more fragmented and the hugepage allocation falls short, with only 61 pages.

isolcpus=1-32:1/2
nohz_full=1-32:1/2
rcu_nocbs=1-32:1/2