Thanks for the reply!
I will try to compile it with gcc-10.2 but I’m not sure that will help.
As for optimization flags, I’m compiling with -DCMAKE_BUILD_TYPE="Release", which appends -O3 and -DNDEBUG to the gcc command. Here is my CMakeLists.txt:
[pastebindotcom]/qauwK8TR
I have tried to install everything from source, and I think I have. Here’s a quick overview:
Eigen: Just clone from gitlab and install headers using cmake and make install
casadi: depends on Ipopt, which depends on MUMPS (solver), BLAS, and LAPACK. I installed libflame instead of LAPACK with --lapack2flame, installed BLIS instead of BLAS, and specifically avoided installing the liblapack-dev packages, contrary to the recommendations of IPOPT. Then I downloaded the Ipopt source and installed it from source as well. Then I downloaded the casadi source and installed it from source. These are the main deps I need for my program.
I will definitely have to compare the two compiled binaries. I will get back to you later with that, as I currently don’t have access to the 3900x. I also wanted to try that, and maybe even run the same binary on both machines.
PGO is something I didn’t know about, I will read into that, thanks! I think this might help me with further optimizing my software in general.
linux-perf is something I’ve heard of, but I will have to take a closer look and use it more. Thanks for that as well.
As for using a more capable build system, I’ve thought about that but wasn’t sure what to use and if it’s worth it. Do you mean clang + bazel for example? Would love to hear about that, most posts I read said it doesn’t matter.
Another theory: I’m not sure if this is relevant at all, but Intel MKL versus LAPACK + BLAS might be a difference from the AMD setup that could be a problem. What I mean is that BLIS and libflame will probably not use the Intel MKL optimizations, but maybe my laptop with an Intel CPU will. I’m simply not sure how to check whether it is using Intel MKL. Do you have any idea how to do that? Because if that is the case, I might actually be better off simply installing LAPACK and BLAS + Intel MKL and then patching it so that it enables optimizations even on an AMD CPU, as described in some articles? As soon as I have access to the 3900x I will apply the patch either way, but I somehow have a feeling that it won’t help, since it shouldn’t even be using anything Intel-related.
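One way to check which BLAS/LAPACK implementation a binary actually pulls in might be to scan the output of ldd (or the contents of /proc/<pid>/maps for the running process) for the library file names. The following is a minimal sketch, not a definitive method: the file name prefixes in KNOWN_LIBS are assumptions about how these libraries are typically named, and the function name is hypothetical. Note that ldd only lists libraries linked at load time, so anything loaded later at runtime would only show up in /proc/<pid>/maps.

```python
import os
import subprocess
import sys

# File name prefixes that typically identify each library.
# These are assumptions; adjust them to the names used on your system.
KNOWN_LIBS = {
    "Intel MKL": ("libmkl",),
    "BLIS": ("libblis",),
    "libflame": ("libflame",),
    "OpenBLAS": ("libopenblas",),
    "reference BLAS/LAPACK": ("libblas", "liblapack"),
}


def classify_linked_libs(text):
    """Map library family -> sorted shared-object names found in `text`.

    `text` can be the output of `ldd ./binary` or the contents of
    /proc/<pid>/maps; every whitespace-separated token is checked.
    """
    found = {}
    for line in text.splitlines():
        for token in line.split():
            name = os.path.basename(token)
            for label, prefixes in KNOWN_LIBS.items():
                if name.startswith(prefixes):
                    found.setdefault(label, set()).add(name)
    return {label: sorted(names) for label, names in found.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_blas.py ./controller
    ldd = subprocess.run(["ldd", sys.argv[1]], capture_output=True, text=True)
    for label, names in classify_linked_libs(ldd.stdout).items():
        print(f"{label}: {', '.join(names)}")
```

If "Intel MKL" never appears for the binary or for the libraries it loads, the MKL-specific patch is unlikely to change anything.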
Edit/Update: Wow linux-perf is awesome!
So it’s obvious MUMPS (the solver) has the most overhead and is worth looking into. This is my perf.data generated with sudo perf record -F 1000 ./controller:
1drv.ms/u/s!Airv5jc71hN8hashn88bjfS2po_KwA?e=xt2z6U
and perf stat:
          5,865.86 msec task-clock          #    0.722 CPUs utilized
            26,863      context-switches    #    0.005 M/sec
             6,169      cpu-migrations      #    0.001 M/sec
             6,121      page-faults         #    0.001 M/sec
    22,828,939,469      cycles              #    3.892 GHz
    27,098,130,941      instructions        #    1.19  insn per cycle
     4,086,712,289      branches            #  696.695 M/sec
       103,165,465      branch-misses       #    2.52% of all branches

       8.128264236 seconds time elapsed
       4.733205000 seconds user
       1.261645000 seconds sys
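The derived values in the right-hand column can be reproduced from the raw counters, which makes it easier to compare against a run on the other machine. A small sketch using only the numbers above (the variable names are my own):

```python
# Raw counters copied from the perf stat output above.
task_clock_ms = 5865.86
elapsed_s = 8.128264236
cycles = 22_828_939_469
instructions = 27_098_130_941
branches = 4_086_712_289
branch_misses = 103_165_465
context_switches = 26_863

task_clock_s = task_clock_ms / 1000

# Instructions retired per CPU cycle.
ipc = instructions / cycles
# Fraction of wall-clock time the process was actually running on a CPU.
cpus_utilized = task_clock_s / elapsed_s
# Percentage of branches that were mispredicted.
branch_miss_pct = 100 * branch_misses / branches
# Average clock frequency while the process was running.
avg_ghz = cycles / task_clock_s / 1e9
# Context switches per second of CPU time.
ctx_switches_per_s = context_switches / task_clock_s

print(f"IPC:                 {ipc:.2f}")
print(f"CPUs utilized:       {cpus_utilized:.3f}")
print(f"Branch miss rate:    {branch_miss_pct:.2f}%")
print(f"Average frequency:   {avg_ghz:.3f} GHz")
print(f"Context switches/s:  {ctx_switches_per_s:.0f}")
```

The CPU utilization below 1.0 together with the number of context switches suggests the process spends part of its wall-clock time waiting rather than computing, which may be worth comparing between the two machines.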
Could you tell me how to get permission to post files and links, please? That would make everything a lot easier. Also, to figure out how MUMPS is running internally, I think it would be helpful to profile what MUMPS is calling / using, is that possible with perf?
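For seeing what MUMPS calls internally, perf can record call graphs and then group samples by shared object and symbol. The following is a sketch of how the two commands could be assembled and run; the helper function names are hypothetical, ./controller is the binary from the command above, and choosing DWARF-based unwinding is an assumption on my part (it tends to help when libraries were built without frame pointers, at the cost of larger perf.data files).

```python
import subprocess


def perf_record_cmd(binary, freq=1000, call_graph="dwarf", output="perf.data"):
    """Build a `perf record` command that also captures call graphs."""
    return [
        "perf", "record",
        "-F", str(freq),             # sampling frequency in Hz
        "--call-graph", call_graph,  # stack unwinding method
        "-o", output,                # where to write the samples
        "--", binary,
    ]


def perf_report_cmd(input_file="perf.data"):
    """Build a `perf report` command grouped by shared object and symbol."""
    return [
        "perf", "report",
        "-i", input_file,
        "--no-children",             # show self time instead of cumulative time
        "--sort", "dso,symbol",      # group by library, then by function
        "--stdio",                   # plain-text output instead of the TUI
    ]


def profile(binary):
    """Record a profile of `binary`, then print the report."""
    subprocess.run(perf_record_cmd(binary), check=True)
    subprocess.run(perf_report_cmd(), check=True)
```

Calling profile("./controller") would then show which functions inside the MUMPS, BLIS, and libflame shared objects take the most samples, provided those libraries still contain their symbols.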
I’m also reading about kallsyms right now, since that seems to have a lot of overhead as well.
Another update: I ran the benchmark with the same perf command, here is the file: [drivedotgoogledotcom]/file/d/1ULVe1kcwFS-EFDEzwDS1GFZPzvnsqipA/view?usp=sharing