[SOLVED] LXC CPU pinning/resource limits not working (CentOS7)

OS: CentOS7
Kernel: 4.19-rc2
CPU: 2990WX
LXC: lxc-1.0.11-1.el7.x86_64

A CentOS7 container created from the template provided by the lxc package that’s part of epel-release works normally without special configuration (i.e. it will run as a container and, under sufficient load, consume 100% of the cores/threads/memory on the host machine).

Now I configure it to limit the container to 16 “near” cores - physical cores only, no SMT threads.

The objective of the following setting is to pin the container to actual cores only in the Threadripper (no SMT siblings) and, within those, only the cores that have memory directly attached.

/var/lib/lxc/mycentos/config:
lxc.cgroup.cpuset.cpus = 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
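
For reference, once the container is running, the mask that was actually applied can be read back on the host side. The path below assumes the usual cgroup v1 layout used by lxc 1.0 (the exact cgroup path can vary, and “host” is just a placeholder prompt); if the setting took effect it should echo the list from the config:

[root@host ~]# cat /sys/fs/cgroup/cpuset/lxc/mycentos/cpuset.cpus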

Yet inside the container, /proc/cpuinfo still reports all 64 threads:

[root@mycentos ~]# grep processor /proc/cpuinfo | wc -l
64

What am I missing here?

UPDATE/EDIT:

[SOLUTION] lxc is working as advertised: resource scheduling, if left to the kernel, obeys the lxc.cgroup.cpuset.cpus value in the config file. It does NOT modify the container’s view of /proc/cpuinfo or other topology information, though, so if the application in question makes affinity/numactl calls directly, using information it gathers from /proc, it will make poor choices.
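
A quick way to see both halves of that at once from a shell inside the container: /proc/cpuinfo is not virtualized, but nproc respects the affinity mask and Cpus_allowed_list in /proc/self/status shows it directly, so the first command should still report 64 while the other two should reflect only the 16 pinned CPUs:

[root@mycentos ~]# grep -c processor /proc/cpuinfo
[root@mycentos ~]# nproc
[root@mycentos ~]# grep Cpus_allowed_list /proc/self/status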

This is a “feature” of lxc it seems…

This is addressed in my case by intercepting numactl calls in the container: /usr/bin/numactl is wrapped with a bash script that bypasses numactl and executes the requested application directly (the arguments passed to numactl end with application [application args], so the NUMA options are stripped and the application is executed as-is).

This bypass allows the kernel to schedule “natively” among the CPU resources assigned by the lxc configuration.
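
For anyone curious, a minimal sketch of the wrapper installed in place of /usr/bin/numactl (the original binary can be moved aside first, e.g. to a hypothetical /usr/bin/numactl.real, in case it’s still wanted). It’s naive: it assumes callers pass options in --opt=value form, so options that take a separate value argument would need extra handling:

#!/bin/bash
# Installed as /usr/bin/numactl: drop all numactl options and exec the
# requested command directly so the kernel places it within the
# container's cpuset instead of whatever the caller asked for.
args=("$@")
i=0
# Skip leading options (assumes --opt=value style; "-N 0" style would
# also need the value skipped).
while [ "$i" -lt "${#args[@]}" ] && [[ "${args[$i]}" == -* ]]; do
    i=$((i + 1))
done
# Everything from the first non-option argument onward is the command.
exec "${args[@]:$i}"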

procfs cpuinfo != what you get scheduled on.

Try running 100 busy-loop threads, or some other workload.
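
For example, a quick-and-dirty version from a shell inside the container (watch per-core load on the host with top/htop while it runs):

for i in $(seq 1 100); do yes > /dev/null & done
# only the pinned cores should load up on the host; then clean up:
pkill yes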

This does work with applications that rely on the OS to schedule them; stressapp runs on the intended cores.

However, anything that sets affinity manually will get bad information and either malfunction (e.g. calling numactl with arguments that aren’t valid inside the cpuset) or make very poor decisions and put things on cores that don’t make sense.
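
Both failure modes are easy to reproduce by hand inside the container. numactl --hardware still reports the whole machine’s topology (it reads /sys, which isn’t namespaced), and explicitly binding to a CPU that /proc advertises but the cpuset forbids (CPU 1 here, as an example) should fail outright when sched_setaffinity rejects it:

[root@mycentos ~]# numactl --hardware
[root@mycentos ~]# numactl --physcpubind=1 /bin/true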

This is why I’m going down this route - to constrain applications that are beyond my control, and that were designed for Intel systems, to resources that make sense on TR/Epyc.

I’d love to just fix the underlying application, but I don’t have access to do so at this point.

I don’t think you can create a scheduler namespace.
It’s technically possible to wrap syscalls that affect scheduling using a ptrace sandbox approach, but I don’t know of a utility that does it.
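
For figuring out which placement calls a given app actually makes, though, plain strace is usually enough. The syscall list below is only a starting point and “theapp” is a placeholder:

strace -f -e trace=sched_setaffinity,set_mempolicy,mbind,get_mempolicy ./theapp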

I suspect that, because of the sheer number of APIs that might affect application scheduling decisions (and thus be performance-impacting), such a utility just does not exist.

I’d still like to know if it does, but I’m not all that hopeful.

What’s the app you’re trying to wrangle this way?

Afraid I can’t say too much specifically.

“They” are aware of the scheduling issue on this “new” architecture. Just looks like it might be a while before they can fix it. I’d been playing with containers for a while anyway so I was wondering if perhaps this was a viable work-around.

About a year ago I watched a conference talk where someone implemented a ptrace sandbox in Go live on stage over the course of the talk.

I can try looking for it later today once I’m at a laptop or desktop. If it’s just one app you need to worry about and you have light gdb and/or golang skills, maybe that can help?

Thx.

I’m also looking into whether I can fool it into making better choices in other ways by intercepting numactl and other calls…

It’s a hack, but see above - it may have to do for a while… For now, I just want to understand what the app can do when it isn’t crippled by its own scheduling decisions.

“Strace in 60 lines of Go” @lizrice https://hackernoon.com/strace-in-60-lines-of-go-b4b76e3ecd64


I have a functional work-around that might get me the information I need…

I wrapped /usr/bin/numactl, stripped the NUMA options from its arguments, and called the application directly, letting the kernel schedule it natively within the container’s cpuset.

This now allows the container constraints to override core/thread selection correctly for this app.
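
A simple way to confirm that, for what it’s worth: with the app running, list the processor each of its threads is currently on (“theapp” is a placeholder); the psr column should only ever show CPUs from the pinned list:

[root@mycentos ~]# ps -eLo pid,tid,psr,comm | grep theapp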

I’ll look further into your link. I had seen others that seemed to be addressing these sorts of issues, but I could not tell what the “state of the art” was on modifying the container’s view of /proc. Most of the conversations I found were 2014-15 vintage, so I assumed it had been “fixed”. I guess the problem just got dropped for lack of a pressing need in most cases.