Catsay plays with Xeon Phi

Holy cow! That is great!

I don't know, and won't pretend to know, what a huge chunk of that really means, but it is all still super cool.

I did hear it was a pain getting MPSS running, but I never tried myself. The forum posts on what to do were (are) beyond my current understanding.

Just a quick question. What are you (Catsay) planning on using your phi for?

For now I’m not planning to use it for anything critical/work project related.
I’ve already got a liberated ex-mining rig GPU farm running AMD ROCm / HIP code for some of my actual data processing (lots of OCR).

The Xeon Phi I'm first going to experiment on and explore the hardware: test the ISA and run some microarchitecture benchmarks. I'll probably run some Blender OpenCL renders (the Xeon Phi supports OpenCL, so all sorts of OpenCL stuff can actually run on it). I also plan on writing some OpenCL and OpenMP comparison code, pitting both old (TeraScale) and new GPUs against the Phi, to explore the strengths and weaknesses of both.

Possibly do some Java & Python offload experiments someone asked me about.

But mainly I’m going to just explore and document things along the way, so if you’ve got things you’d like to see run on it feel free to make suggestions.

Lastly, just as a precaution for everyone else who might not quite understand what Xeon Phi is: it's not a graphics card. It's a big fat CPU with an extended x86 instruction set and big vector registers, which make up the 'GPU-like' elements.

So no we can’t run Crysis :smiley_cat:
Or minecraft… :smirk:


You might’ve found something to try and do with it hahahaha

Maybe Quake? Or something that runs on OpenGL? (Is that what it's called?)

Just for funsies? hahaha

It has a whole bunch of tiny cores, right? So in theory, if each core were a node in a neural net, it should crunch those rather quickly.

I hope I don't find one for less than 150€, or else I will have 150€ less.


Random update:

After a long period of screwing around with the Linux drivers and the MPSS stack, I decided to actually see if it's even working and loaded it up in Windows.

And it is indeed working.

And oh goody, does it drink the power: 100 W at idle and already running at 64 °C. I've got a Delta 16K RPM fan (PFM0812HE-01BFY) strapped to it, running at 4500 RPM, just to maintain the idle temps.

And by the way, the Intel MPSS SDK is horrible.
And in general, the software ecosystem around this is totally not as advertised in all of the marketing and training materials.
I attribute 90% of Xeon Phi's failure to catch on to the mess that is the MPSS stack.

I wanted to run a quick, simple OpenCL test.

It is definitely not like programming for a desktop CPU; nothing about this thing is straightforward in any sense. You can't just install the OpenCL runtime. No:

1. First you need to compile the runtime for your machine.
2. To compile it, you need to install 11 GB of Parallel Studio and 8 GB of Visual Studio 2012.
3. Then you need to build a new bootable base cpio image for the card, because the last released one is missing a bunch of useful tools and APIs.
4. Then you boot the card with that and set up an NFS share with your libraries and so on.
5. Then you can run your code.
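
For reference, the "quick simple OpenCL test" I had in mind is nothing fancier than the device-enumeration sketch below (standard OpenCL 1.2 host API; nothing here is Phi-specific, and the build line is just one plausible setup). If the runtime and MPSS stack are healthy, the Phi should show up as an accelerator device:

```c
/* Minimal OpenCL device-enumeration test.
 * Build (one plausible setup): gcc ocl_probe.c -lOpenCL -o ocl_probe */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, platforms, &nplat);

    for (cl_uint p = 0; p < nplat; p++) {
        char pname[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                          sizeof pname, pname, NULL);
        printf("Platform %u: %s\n", p, pname);

        cl_device_id devs[8];
        cl_uint ndev = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                           8, devs, &ndev) != CL_SUCCESS)
            continue;

        for (cl_uint d = 0; d < ndev; d++) {
            char dname[256];
            cl_device_type type;
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME,
                            sizeof dname, dname, NULL);
            clGetDeviceInfo(devs[d], CL_DEVICE_TYPE,
                            sizeof type, &type, NULL);
            /* A Xeon Phi reports itself as CL_DEVICE_TYPE_ACCELERATOR */
            printf("  Device %u: %s (%s)\n", d, dname,
                   type == CL_DEVICE_TYPE_ACCELERATOR ? "accelerator"
                   : type == CL_DEVICE_TYPE_GPU ? "GPU" : "CPU/other");
        }
    }
    return 0;
}
```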

To get it working on Windows, I'm going to have to load Intel Parallel Studio XE, which is usually a pain to get and costs a stupid amount of money.
No wonder people were hesitant to develop for Xeon Phi.

At some stage Intel started shoving the 31S1P into developers' faces at $195, with Parallel Studio XE included, in hopes people would use them.

More at 12 or whenever I log in again.


What I have determined so far is that these things are not efficient at all for any code that also runs as well or better on a GPU.

They really need to be used for specific code with enough branch complexity that it doesn't run well on GPUs; see the toy sketch below.
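
As a hedged illustration of what "branch complexity" means here (my own toy sketch, not from any benchmark): in the loop below, each iteration takes a data-dependent path. A GPU's lockstep SIMT warps have to execute every divergent path serially, while independent x86 cores each just follow their own path:

```c
/* Toy data-dependent branching: each iteration follows its own path.
 * On SIMT GPUs, divergent lanes within a warp execute both sides of
 * a branch serially; independent x86 cores take no such penalty. */
#include <stddef.h>

double branchy(const double *v, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > 0.5)                   /* data-dependent branch */
            acc += v[i] * v[i];
        else if (v[i] > 0.0)
            acc += v[i] / (1.0 + v[i]);
        else
            acc -= 1.0;
    }
    return acc;
}
```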

Experiences with Intel Xeon Phi in the Max-Planck Society

https://portal.enes.org/ISENES3/archive-1/phase-2/documents/Talks/WS3HH/session-4-hpc-software-challenges-solutions-for-the-climate-community/markus-rampp-mic-experiences-at-mpg

TL;DR:

In moderately parallel applications, traditional fat multicore CPUs (especially Threadripper) murder Xeon Phi; in massively parallel applications, GPGPUs rule. Xeon Phi is only barely competitive when you can perfectly parallelize AND vectorize your application.

If either one is not perfect, forget Xeon Phi.
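
To make "perfectly parallelize AND vectorize" concrete, here's a minimal sketch (my own toy SAXPY, not from the Max Planck slides) of the one loop shape the Phi actually likes: thread-parallel across its ~60 cores, and vectorizable across the 512-bit lanes within each thread:

```c
/* Toy SAXPY: the loop shape Xeon Phi likes -- thread-parallel across
 * the cores AND vectorizable across the 512-bit VPU lanes.
 * Build (one plausible setup): icc -qopenmp -O3 -c saxpy.c */
#include <stddef.h>

void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    /* "parallel for" spreads iterations over cores/threads;
     * "simd" asks the compiler to vectorize each thread's chunk. */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The moment the body picks up data-dependent branches or scattered memory access, either the vectorization or the scaling collapses, and (per the slides above) a fat desktop core or a GPU wins immediately.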

Also, their cache hierarchy sucks BAWLLLS!
Attempts to manually approximate cache coherency are a figment of the designers' imagination.

The thing is dirty


Some playlists of relevance

Xeon Phi Coding & Hardware

Video course: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


Looks like it.

They fell into a bad niche between GPUs and PCIe FPGAs, either of which is rather easy to program for.

Oof. I’m kind of disappointed. Was hoping something fun would come out of these Phi cards. Obviously not great, otherwise people would use them, but maybe something fun.

Looks like it is hard to even get them up and running, at least on Linux.

Looks like Intel had the same issue with Phi that Sony, Toshiba, and IBM had with Cell BE. It is unfortunate; however, Cell BE still seems to have some good use cases compared to Phi (even though they are diminishing to niches).

You’re doing the Lord’s work here sir! I’ve been interested in these Intel coprocessors for years now. I’m really surprised to hear that the software support for them is so unstable.

HPC labs that shelled out the big bucks for these must have had some Intel software engineers on hotline when the scientists were trying to deploy code onto these things. I can picture the conversations now, “We paid you a small fortune for all of these, now get our simulations to f*cking work!”

Oh god, what a time to be alive. This is great.

I never really posted this. So let me add it now.
I actually got somewhat optimized OpenCL code execution working on this thing a while ago.

OpenCL support with the Xeon Phi 5110P under Windows 10.

First test run: you can see it chugging the power. Getting any sort of memory bandwidth out of it with OpenCL kernels is honestly tough; it fights you every step of the way.

Single precision compute numbers

Double precision compute characteristics: you can see it take a shit when running 16-wide doubles (1024-bit); at 8-wide (512-bit) it still managed, which tracks with the VPU natively holding 8 doubles per 512-bit register.

Memory transfer operations (not great, not terrible)

Compute-wise, this is pretty much maxing out what the card is capable of.
Memory-bandwidth-wise, there's a lot left on the table.
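
For anyone curious what the bandwidth fight looks like, the usual microbenchmark is just a streaming copy kernel like this sketch (my own illustration, not the exact kernel behind these screenshots). You time it, divide bytes moved by seconds to get GB/s, and compare against the 5110P's theoretical ~320 GB/s:

```c
/* Streaming copy kernel for a crude device-memory bandwidth test.
 * Bytes moved per pass = 2 * n * sizeof(float) (one read + one write);
 * divide by the measured kernel time to get GB/s. */
__kernel void copy_bw(__global const float *src,
                      __global float *dst,
                      const uint n)
{
    size_t i = get_global_id(0);
    if (i < n)
        dst[i] = src[i];
}
```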

But if you're seriously considering using a Xeon Phi for any modern work:
FOR GOD'S SAKE, STOP.
USE A GPU.


Hello there Phi-Natics :stuck_out_tongue:

Not a Phi-natic, but still interested in how this all comes together. Have you found modern hardware to outperform this, or are there still use cases where this setup would be viable in 2023?

Not viable at all anymore.
Desktop CPUs outperform this now.
Nvidia Tesla K40s outperformed this by A LOT almost a decade ago.

It’s just a novelty item now.


I remember reading about Knights Corner (or Landing?) many years ago, back on the OMPF forum. One of the devs was boasting about "1 billion rays per second" on a test scene rendered with one of these cards.

If it's not too much trouble, could you try to get a ray/path-tracing benchmark running? Maybe something like LuxMark.

I’m quite curious to see how they compare to desktop Ryzen or Threadripper CPUs that we have now.

Thanks for demystifying how this hardware works! I’ve enjoyed reading your posts on it.


I include LuxMark results in this post, and they are not favorable to Xeon Phi, because LuxMark targets SSE2 and so uses only a fraction of the Phi's 512-bit-wide vector units. Intel's OpenCL driver produces even less favorable results.

Mitsuba 2 shows very good results, although it is a pain for artistic use. You also have to use the Intel compiler, modify some makefiles, and so on.

Untrue, if it's used with properly vectorized software, though examples are admittedly few and far between. Generally, if you're making use of Intel libraries like MKL or Embree, you'll see good scaling; see the sketch below.
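
As an illustration of the "use Intel libraries" point: linking against MKL's standard CBLAS interface is usually all it takes, since the library dispatches to whatever vector width and core count the hardware has. A minimal sketch (standard cblas_dgemm call; the build line is one plausible setup, and nothing in the source is Phi-specific):

```c
/* Minimal MKL GEMM: C = A * B for 1024x1024 doubles.
 * MKL picks vectorized, threaded code paths at runtime, so the
 * same source scales on wide-vector hardware without changes.
 * Build (one plausible setup): icc gemm.c -qmkl */
#include <mkl.h>

int main(void) {
    const int n = 1024;
    /* 64-byte alignment to match the 512-bit vector width */
    double *A = mkl_malloc(n * n * sizeof(double), 64);
    double *B = mkl_malloc(n * n * sizeof(double), 64);
    double *C = mkl_malloc(n * n * sizeof(double), 64);

    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}
```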


That kind of sounds like Itanium all over again.

That requirement is no different from a GPU, or from any HPC application. Spaghetti code is going to turn out poor performance numbers no matter where you run it; consumer x86 processors and applications are just less sensitive in this regard.

With minimal changes I was able to make a Knights Landing Phi system blow a 64-core Epyc Milan out of the water, performance-wise, using perfectly normal open-source x86 code while drawing less power. For this reason, Xeon Phi was used in plenty of HPC and Top500 machines to great effect.

Itanium’s performance was not convincing due to compiler shortcomings, and its technologies vanished once it failed to gain significant market share. In contrast:

  1. AVX-512 is still included in enterprise-grade processors and offers tangible benefits to software that takes advantage of its extensions (see the sketch after this list).

  2. The Knights Landing core-to-core "mesh"-type interconnect formed the basis of Skylake and later Intel processors, supplanting the previous ring-bus interconnect.

  3. The implementation of the MCDRAM on-chip memory caching scheme has been reused for the recently released Xeon Max series of processors.
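
To ground point 1 above, here's a minimal AVX-512F sketch in C (my own toy example; compile with -mavx512f). This is the same 512-bit foundation that Knights Landing shipped:

```c
/* Toy AVX-512F AXPY: y[i] = a * x[i] + y[i], 16 floats per instruction.
 * Build (one plausible setup): gcc -O2 -mavx512f -c axpy512.c */
#include <stddef.h>
#include <immintrin.h>

void axpy512(size_t n, float a, const float *x, float *y) {
    __m512 va = _mm512_set1_ps(a);            /* broadcast a to 16 lanes */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);   /* load 16 floats */
        __m512 vy = _mm512_loadu_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);     /* fused multiply-add */
        _mm512_storeu_ps(y + i, vy);          /* store 16 results */
    }
    for (; i < n; i++)                        /* scalar remainder */
        y[i] = a * x[i] + y[i];
}
```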

Because Phi's performance is convincing, and because technologies pioneered with Phi are still in active use, it is unquestionably unlike Itanium in significant ways. Just like with a GPU, "specialized hardware" is not synonymous with "bad hardware."

I don't do HPC yet, but I cannot agree with that statement. I saw Itanium in the wild until about 2017, and the same arguments were made. Don't get me wrong: it was a cool experiment, and a lot of lessons learned from Itanium went into future technologies. Compilers only benefited.

But thanks for the information. It is all really neat. I am doing a master's degree in computer vision, machine learning, and artificial intelligence. I would like to get into research eventually, so HPC actually interests me.