AMD Next Horizon - Livestream

cekim · November 7, 2018, 4:07pm

Tensor cores are actually a matrix FMA at a minimum… not just int4/int8, and they actually do int16 operations with a matrix int32 accumulator.
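For anyone not familiar with the term, a matrix FMA computes D = A×B + C as one fused operation. Here is a rough sketch in plain Python; the function name `matrix_fma` and the 2x2 sizes are made up for illustration, and this is not how the hardware is actually programmed. Python ints are unbounded, so the narrow-input / wide-accumulator widths are not modelled.

```python
def matrix_fma(a, b, c):
    """Return a*b + c for matrices given as lists of lists of ints."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = c[i][j]  # the accumulator starts from the C term
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # multiply narrow inputs, add into acc
            d[i][j] = acc
    return d


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
result = matrix_fma(a, b, c)
print(result)
```

The point of the wider accumulator is that the sum of many small products needs more bits than any single input.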

It was sarcasm. It may or may not be less so in the future.

Zibob · November 7, 2018, 4:10pm

Slightly yes. Every time they have shown something “new”, and everyone fell for it like always, there have been competing or more capable versions of it already existing.

Just using it to point out AMD DO have ray tracing. Though like Nvidia’s, it is not for game use so far.

https://pro.radeon.com/en/software/radeon-rays/

urmamasllama · November 7, 2018, 4:13pm

So they mentioned using a form of IF for a gpu interconnect. I’m really excited that this is actually happening but I haven’t heard many details. Is this just a faster crossfire or does this allow multiple gpus to functionally appear as one compute device to the degree that a game wouldn’t have to do anything special to utilize it?

cekim · November 7, 2018, 4:14pm

AMD has had very limited traction in getting their actually very good, usually much more forward-looking and open architectural ideas adopted. Nvidia has a big and effective hammer in the form of the CUDA tool/ecosystem pipeline that starts in universities and continues on through game and HPC development.

So, the “best” solution has been the one that’s most accessible to the largest number of people which includes familiarity with language/API. Not necessarily the most “forward looking and open” architecture that I think many people would prefer in principle.

cekim · November 7, 2018, 4:23pm

It certainly helps the case for the “high-speed-cache” architecture AMD is trying to push for VRAM. That is to say they are trying to re-brand/design the on-chip/board memory as a cache of off-card bulk-storage vs it just being “the memory pool” (period - no other memory is accessible to the GPU kernel [CL/GL] code).

Having these high-speed links is generally more important for communicating between multiple cards/chips than it is for bulk storage, but it does help the case for bulk-storage + “high speed cache”.

At present even Optane doesn’t provide latencies low enough to not need a very large “cache” (VRAM). It does lower the penalty of moving to/from bulk storage, which is often profound and dominates performance in HPC and “artsy” use cases.
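To make the “VRAM as a cache of bulk storage” idea a bit more concrete, here is a toy model in Python. Everything in it is an assumption for illustration: the class name `VramCache`, the page granularity, and the LRU eviction policy are mine, and nothing here reflects how AMD’s actual implementation decides what to keep on the card.

```python
from collections import OrderedDict


class VramCache:
    """Toy model: a small fast pool caching pages from a large slow store."""

    def __init__(self, capacity, bulk_storage):
        self.capacity = capacity      # number of pages that fit in "VRAM"
        self.bulk = bulk_storage      # dict standing in for off-card storage
        self.pages = OrderedDict()    # page id -> data, least recently used first
        self.hits = 0
        self.misses = 0

    def read(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        self.misses += 1                     # slow path: fetch over the link
        data = self.bulk[page_id]
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return data


bulk = {i: f"page-{i}" for i in range(8)}
cache = VramCache(capacity=2, bulk_storage=bulk)
for page in [0, 1, 0, 2, 1]:
    cache.read(page)
print(cache.hits, cache.misses)
```

Every miss is a trip over the link to bulk storage, which is why both the link speed and the size of the on-card pool matter.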

RevampedTech · November 7, 2018, 7:15pm

So since AMD has ray tracing we should just buy it?

olddellian · November 7, 2018, 7:45pm

Something sort of like this is already happening on IBM POWER chips; for FPGAs and GPUs you connect to the CPU directly with OpenCAPI or NVLink over a separate physical hardware interface (BlueLink/PowerAXON).

Definitely, but my guess is that it won’t really touch the consumer market much. Enterprise, however, may need to deal with the mess of NVLink vs Infinity Fabric for GPUs, and OpenCAPI vs CCIX vs Gen-Z for other peripherals like FPGAs.

It would be even better for AMD than IBM+Nvidia, since the CPU already natively “speaks” Infinity Fabric; on POWER chips, there is a translation layer, IIRC.

Maybe something they carry over from the IBM foundry acquisition? I thought I heard about IBM having really good cache tech.

denstieg · November 7, 2018, 8:04pm

As far as I know they use their GPU library, which is much denser.

FurryJackman · November 7, 2018, 8:06pm

Hah. Radeon Instinct is only going to have official Linux support according to the AMD spec page:

https://www.amd.com/en/products/professional-graphics/instinct-mi60

Now if only it had SR-IOV.

olddellian · November 7, 2018, 8:13pm

To me, it sounds exactly like NVLink; my understanding of the LTT video about NVLink and gaming and the Gamers Nexus benchmark video is that games can’t really use NVLink, but rather run SLI over the NVLink interface.

I assume the same will be true for Infinity Fabric, as it was for NVLink:
Games will not implement InfinityFabric/NVLink support, but rather use the Crossfire/SLI protocol over the InfinityFabric/NVLink physical bridge.

At best, I bet you will get a higher-bandwidth version of Crossfire. And as the GN video mentions, some games like Ashes of the Singularity will support multi-GPU, but not via direct GPU interconnection.

It probably has to do with scene rasterization being less parallelizable than datacenter compute. Maybe if raytracing takes off that might change?

urmamasllama · November 7, 2018, 8:26pm

That’s what I expect as well. I saw the video on NVLink. For compute tasks it can allow the GPUs to function as one device, effectively sharing resources instead of having to duplicate a lot of work. What I’m really hoping for is that eventually they’ll be able to utilize IF on a single board to bring back the likes of the 295x2, but in a way that isn’t a nightmare for developers to support.

Jimster480 · November 8, 2018, 6:47am

The one thing that I think a lot of people are not thinking about in regards to the desktop socket is that there are only 2 memory channels…
As we have seen with Threadripper and 32 cores, there is not enough memory bandwidth for 32 cores while using only 4 memory channels.
Therefore if there were 16 cores and only 2 memory channels, this would effectively run into the same per-core bandwidth bottleneck that exists with the 32c Threadripper.
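The arithmetic behind that, as a quick Python sketch. The DDR4-2933 speed and the 8-byte channel width are assumptions I picked to get a peak theoretical number; real sustained bandwidth is lower, and the function names are just for illustration.

```python
def cores_per_channel(cores, channels):
    """How many cores share each memory channel."""
    return cores / channels


def bandwidth_per_core_gbs(cores, channels, mts=2933, bus_bytes=8):
    """Peak theoretical GB/s per core, assuming DDR4 at `mts` MT/s per channel."""
    per_channel = mts * 1e6 * bus_bytes / 1e9  # transfers/s * bytes per transfer
    return per_channel * channels / cores


threadripper_32c = cores_per_channel(32, 4)  # 32 cores on 4 channels
am4_16c = cores_per_channel(16, 2)           # hypothetical 16 cores on 2 channels
print(threadripper_32c, am4_16c)
print(bandwidth_per_core_gbs(32, 4), bandwidth_per_core_gbs(16, 2))
```

Both cases come out at 8 cores per channel, so the per-core share of peak bandwidth is identical whatever memory speed you plug in.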

urmamasllama · November 8, 2018, 3:17pm

I had worked that out too. I rather think that this chiplet design is going to be used to give us 8c16t APUs toward the end of next year, and beefy ones at that, because they should be able to fit a Vega 20 that way. AM4+ will likely be when we see a 16-core consumer chip, if even then.

2bitmarksman · November 8, 2018, 6:58pm

AMD is sitting real pretty for its CPUs to take a huge chunk of market share. Looking forward to how this affects the consumer market in late 2019/2020. An IO die that size could mean some big things to come.

Marten · November 9, 2018, 12:46am

The EPYC IO chip will be very interesting to see spec-wise. Will it have a large L4 cache and, with the next-gen Infinity Fabric, feed all the cores well?

Are the chiplets 1 CCX now or still 2?

I could see AMD sticking to 8 cores on AM4, but 16 would be insane. Threadripper next year could be nice with an IO die.

MazeFrame · November 9, 2018, 8:08am

8 chiplets and 64 cores total, so 8cpc (cores per chiplet).

Yup, I want to know more about it.

anotherriddle · November 9, 2018, 9:12am

We can only hope! I would love more than 8 cores on AM4 but I think it is more realistic that they will stay at 8 cores on the consumer boards.

Some people are saying that now there are 8 cores per CCX (which means all cores in one chiplet on one compute complex). Do we have confirmation on that?

denstieg · November 9, 2018, 12:41pm

"I had a drink with Scott Aylor of AMD last evening along with Ian Cutress, Paul Alcorn, and Charlie Demerjian.

There are going to be things coming out between now and Rome. Sit tight."

From ServeTheHome, interesting.

cekim · November 9, 2018, 7:30pm

The 64-core EPYC has 8 chiplets with 8 cores each…

It’s possible that each chiplet is now a single CCX, but given that a single chiplet was 2 CCXs, it seems like the theory of 8 cores per CCX came from an assumption that the 64-core EPYC would be 4 dies again… it is not.

This was clearly wrong, so that theory should be called into question as well…
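Spelling out the arithmetic in Python, as a sketch only: the `ccx_per_die=2` default is the previous layout carried over as an assumption, and the function name is made up. None of this is confirmed for the new chiplets.

```python
def cores_per_ccx(total_cores, dies, ccx_per_die=2):
    """Cores in each CCX if the cores are split evenly across dies and CCXs."""
    cores_per_die = total_cores // dies
    return cores_per_die // ccx_per_die


assumed_4_dies = cores_per_ccx(64, dies=4)                    # the old assumption
eight_chiplets_2ccx = cores_per_ccx(64, dies=8)               # 8 chiplets, still 2 CCX each
eight_chiplets_1ccx = cores_per_ccx(64, dies=8, ccx_per_die=1)  # 8 chiplets, 1 CCX each
print(assumed_4_dies, eight_chiplets_2ccx, eight_chiplets_1ccx)
```

So 8 cores per CCX only falls out of the 8-chiplet layout if each chiplet is a single CCX; with 2 CCXs per chiplet you are back to 4.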

There is also the issue that the chiplet evidently relies on the 14nm “north bridge” (my words), so the consumer part would either have to be different or rely on a different north bridge chiplet… I think it is clear the EPYC north bridge is monstrous for a consumer chip… it is physically approaching the size of Intel’s 10-18 core parts… so that’s not gonna happen in AM4…

They are going to need to fab a consumer north bridge and/or a different chiplet. Their strategy has been to share the chiplet design, so my guess would be a smaller north bridge with a single chiplet for AM4, but I’m also guessing…

anotherriddle · November 9, 2018, 7:41pm

Yeah probably. I was not really sure where this was coming from, therefore my question.

Exactly. I am very curious about how they have designed the consumer chip.

However, I am not convinced that it makes sense to go with the same separated IO-die approach as on the server chip. Connecting those dies is not free, but it (obviously) scales favourably with many dies. It might be cheaper to just make a single-die consumer chip. We’ll see.