Issue trying to understand NLP Solver performance

At this point I don’t know who else to turn to, so I’m just trying to find anyone who knows about this. Please do refer me to another community on the forums if this is not the right place to ask!

I’m using IPOPT and CasADi to solve a nonlinear optimization problem in real time (model predictive control), and for months now I have been confused as to why the performance is so different between Intel and AMD. I’m currently using a ThinkPad P53 with an i7-9750H, and I wrote a benchmark that solves the NLP about 10k times and computes the average time to solve it. I ran it on my laptop, compiled with gcc-8.4, and on a 3900x with gcc-9, and the i7-9750H is still slower in the solver part. Since BLIS and libflame are supposed to be a better fit for AMD, I compiled IPOPT with those instead of the stock BLAS and LAPACK, but the 3900x is still marginally slower overall than my laptop CPU, which, according to UserBenchmark, should not be the case. What am I doing wrong? If anyone knows, or would be willing to help me in any way, I would really appreciate it.

My Benchmark results:

3900x:

Average average full iteration time over 10 passes in µs: 34488.3
Average average solver preparation time over 10 passes in µs: 362.139

i7-9750H:

Average average full iteration time over 10 passes in µs: 31113.9
Average average solver preparation time over 10 passes in µs: 388.395

The benchmark code (I’m not allowed to post proper links or attachments): [gistdotgithubdotcom]/LouKordos/b72facdc4894a7de82715b14d44e4d20

I was hoping there’d be someone with hands-on experience with that stuff around here. In true first-level-support fashion I can only offer generic “have you tried turning it off and on again” style advice, and hope it helps you.

First thing I noticed is that you have too many variables in play to even attempt to answer the question yourself - you’d need to experiment a little to figure out what’s going on and try to reduce the degrees of freedom in your setup. Eliminating differences one by one will help you see where they come from and how they affect performance.

  • First things first, use a newer gcc on both machines (e.g. 10.2 would be reasonable these days) - sometimes an older compiler is faster for a specific workload, but that’s usually an anomaly.
  • Pass in -O3 and the other recommended performance flags, plus a correct and identical set of architecture flags on both machines. -march=native can be finicky: containers/VMs/sandboxes, the choice of distro and OS, kernel flags and libc can all affect it, so avoid it when producing benchmark binaries. Instead, run g++ -march=native -Q --help=target …, see which flags are set, and pass those explicitly, or pick a fixed -march value.
  • Do not use precompiled, dynamically linked libraries - other than maybe things like libc that are tightly coupled to your system and called into rarely enough not to affect performance (but do pick your malloc deliberately). Building everything yourself lets the compiler optimize further, inline more, and apply link-time optimizations that are harder to do otherwise.
  • Once you have the same binary, or at least code produced with the same compiler and the same flags on both systems, compare it between machines, followed perhaps by one more compilation step (PGO) if you have a golden set of workload samples. With PGO, the first compilation produces an instrumented binary with a built-in profiler that dumps a profile in a file format the compiler understands; you run that binary on samples of your production workload to generate the profile, then feed it back to the compiler and recompile to further optimize the binary (a minimal gcc sketch follows this list). This mainly affects branch layout, some prefetching and some register use - it could gain you nothing, or it could gain you 25% performance, it depends.
  • When you finally run the binary, there’s a tool called “linux-perf” that exposes various hardware performance counters, so you can compare cache hits/misses, branch prediction stats, various latencies and instructions per clock between the two machines.
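
As a rough illustration of the PGO round trip with gcc (the file names here are placeholders for your actual sources/objects; with CMake you’d put these flags into CMAKE_CXX_FLAGS for your project and for the from-source dependencies):

    # 1) build an instrumented binary that dumps a profile when run
    g++ -O3 -fprofile-generate -o controller_instrumented main.cpp
    # 2) run it on representative workload samples; this writes *.gcda profile files
    ./controller_instrumented
    # 3) rebuild using the collected profile
    g++ -O3 -fprofile-use -fprofile-correction -o controller main.cpp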

This is pretty much everything in my generic bag of tricks that I can think of off the top of my head.

You may want to consider using a more capable build system than make or cmake at this point, if you aren’t already. Something that’ll help you better control the versions and compiler parameters of libraries you depend on.

Thanks for the reply!
I will try to compile it with gcc-10.2 but I’m not sure that will help.

As for optimization flags, I’m compiling with -DCMAKE_BUILD_TYPE="Release", which appends -O3 and -DNDEBUG to the gcc command. Here is my CMakeLists.txt:
[pastebindotcom]/qauwK8TR
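
To double-check what actually ends up on the compile line, I suppose I can make the build verbose and grep for the flags (run from my build directory):

    # regenerate with verbose makefiles, then inspect the actual g++ invocations
    cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=ON ..
    make VERBOSE=1 2>&1 | grep -E -- '-O3|-DNDEBUG|-march'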

I have tried to install everything from source, and I think I have. Here’s a quick overview:
Eigen: just cloned from GitLab and installed the headers using cmake and make install.
casadi: depends on IPOPT, which depends on MUMPS (the linear solver), BLAS and LAPACK. I installed libflame instead of LAPACK with --lapack2flame, installed BLIS instead of BLAS, and specifically tried to avoid installing the liblapack-dev packages, contrary to IPOPT’s recommendations. Then I downloaded the IPOPT source and installed it from source as well, and did the same with the casadi source. These are the main dependencies I need for my program.

I will definitely have to compare the two compiled binaries and will get back to you later with that, as I currently don’t have access to the 3900x. I wanted to try that anyway, and maybe even run the same binary on both machines.

PGO is something I didn’t know about, I will read into that, thanks! I think this might help me with further optimizing my software in general.

linux-perf is something I’ve heard of but will have to take a closer look at and use more, thanks for that as well :)

As for using a more capable build system, I’ve thought about that but wasn’t sure what to use or whether it’s worth it. Do you mean clang + Bazel, for example? I’d love to hear more about that; most posts I read said it doesn’t matter.

Another theory: I’m not sure if this is relevant at all, but Intel MKL vs. LAPACK + BLAS might be a difference between the Intel and AMD setups that could be a problem. What I mean is that BLIS and libflame will probably not use the Intel MKL optimizations, but maybe my laptop with an Intel CPU will. I’m simply not sure how to check whether it is using Intel MKL. Do you have any idea how to do that? Because if that is the case, I might actually be better off simply installing LAPACK and BLAS + Intel MKL and then patching it so that it enables the optimizations even on an AMD CPU, as described in some articles. As soon as I have access to the 3900x I will apply the patch either way, but I somehow have a feeling that it won’t help, since it shouldn’t even be using anything Intel-related.
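
I guess one way to check is to look at which shared libraries the binary and IPOPT actually pull in (the library path below is just where my install happens to put it):

    # list the BLAS/LAPACK-related libraries that get loaded
    ldd ./controller | grep -iE 'blas|lapack|flame|mkl'
    ldd /usr/local/lib/libipopt.so | grep -iE 'blas|lapack|flame|mkl'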

Edit/Update: Wow linux-perf is awesome!
So it’s obvious that MUMPS (the linear solver) accounts for most of the overhead and is worth looking into. This is my perf.data, generated with sudo perf record -F 1000 ./controller:
1drv.ms/u/s!Airv5jc71hN8hashn88bjfS2po_KwA?e=xt2z6U

and perf stat:

          5,865.86 msec task-clock                #    0.722 CPUs utilized          
            26,863      context-switches          #    0.005 M/sec                  
             6,169      cpu-migrations            #    0.001 M/sec                  
             6,121      page-faults               #    0.001 M/sec                  
    22,828,939,469      cycles                    #    3.892 GHz                    
    27,098,130,941      instructions              #    1.19  insn per cycle         
     4,086,712,289      branches                  #  696.695 M/sec                  
       103,165,465      branch-misses             #    2.52% of all branches        

       8.128264236 seconds time elapsed

       4.733205000 seconds user
       1.261645000 seconds sys

Could you tell me how to get permission to post files and links, please? That would make everything a lot easier. Also, to figure out how MUMPS behaves internally, I think it would be helpful to profile what MUMPS is calling / using - is that possible with perf?
I’m also reading about kallsyms right now, since that seems to show up with a lot of overhead as well.

Another update: I ran the benchmark with the same perf command, here is the file: [drivedotgoogledotcom]/file/d/1ULVe1kcwFS-EFDEzwDS1GFZPzvnsqipA/view?usp=sharing

Again, not an expert on your particular issue.

Links/file posting requires you to be around for longer.

Based on what you described… try the same binary first. If you still get differences, try profiling it at runtime. Intel has been known to ship software that enables certain features on Intel CPUs only, even though AMD supports the same instruction set.
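
For the “what is MUMPS calling” question from earlier: recording with call graphs should break the time inside MUMPS/BLAS down by caller, assuming your binary and the libraries were built with symbols (frame pointers or DWARF unwind info help):

    # sample with call stacks, then browse the report grouped by library and symbol
    perf record -g --call-graph dwarf -F 500 ./controller
    perf report --sort dso,symbol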

A build system like Bazel works well if you’re a control freak and care to squeeze every tiny bit of performance out of your binaries. Its “natural habitat” is places using monorepos with all dependencies and the toolchain checked in. It’s not easy to use, it’s not easy to adopt, but it’s really nice once you have it and get used to it (repeatable builds with Bazel caching are wonderful… once you get it running well). It’s just something to consider.

In terms of performance, you have a context switch and a CPU migration roughly every millisecond… It’s probably not what accounts for the slowdown on its own, but it’s still a lot. If you have a queue to gather results, try increasing its size.
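
One cheap experiment (assuming the solver loop is effectively single-threaded) would be to pin the benchmark to one core and see whether the migrations disappear and the timings move:

    # pin to a single core and watch the scheduler-related counters
    taskset -c 2 perf stat -e context-switches,cpu-migrations,cycles,instructions ./controller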

It looks like branch prediction is mostly working, and IPC is around 1.2… Is the code doing mostly floating point? Or is cache/RAM the bottleneck here? What’s preventing it from going higher?