AMD Next Horizon - Livestream


Somehow I have a feeling that the separate I/O controller may be a necessity. Given that they had to manufacture it on a larger process node, they may not have been able to get reliable yields with it integrated at all. If that's the case, then we are going to keep seeing this general design. Something neat about it is that they basically get to make just one thing on 7nm for their whole catalog; all of the differentiation would be done on the I/O controller.

4 Likes

A consumer north bridge would be a MUCH smaller chiplet than an 8-way crossbar with 128 PCIe lanes and 8 DDR4 channels. So it might make perfect sense for them to fab at least 2 variations for Ryzen and TR/EPYC. The question is whether even TR gets its own.

It seems unlikely that they'd have 2 dead DDR4 IMC connections on the CCX chiplet. The allure of punting high-current gates to a well-proven process (14nm) has to have been pretty high… so an IMC on both processes seems like something everyone would have hated.

Current and area go together… small gates and external drivers don't play well, so it ends up being a large gate (or gates) regardless of process node. So it may not have been a "struggle" or "yield issue" so much as an obvious choice from the get-go, preempting what everyone would have known was going to be a problem.

By external driver, you mean stuff that’s communicating off-module to PCIe/memory?

Wild Speculation:

With both CPU and GPU supporting Infinity Fabric, I wonder if there is a chance that APUs could have the 14nm chiplet absorb components from both the CPU and GPU side?

Example 3-chiplet module:

  • 7 nm - Zen cores
  • 7 nm - Vega cores
  • 14 nm - I/O, hardware video decoding, etc.

I’m assuming that ASIC-style stuff doesn’t really need 7 nm.

Question is, what would memory look like on something like this? Still DDR4?


Even if we don’t see this on APUs, maybe the Zen+Navi console chips would do this?

2 Likes

yes - a “driver” as in a block of gates that communicates to/from off-chip logic.

That's a fair point I had not thought about before. Although it runs counter to the behaviour on past nodes, where I/O and the corresponding circuits tend to tolerate defects better than cores and logic.

That depends on how their cache is laid out, but you are probably right; the design-reuse mantra is one of AMD's huge advantages at the moment.

I agree. I don't think the I/O chiplet itself will be a problem, but it will be very small, and with every cut on the wafer you throw away area because you can't go all the way to the edge. Connecting the chiplets on the substrate also costs money.
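To put rough numbers on that cutting overhead, here's a minimal sketch, assuming square dies and a ~0.1 mm scribe (saw) lane between them; both figures are illustrative guesses, not foundry data:

```python
def scribe_overhead(die_side_mm: float, scribe_mm: float = 0.1) -> float:
    # Each die effectively occupies (side + scribe)^2 of wafer area,
    # so smaller dies lose a larger fraction to the cuts between them.
    return 1 - (die_side_mm / (die_side_mm + scribe_mm)) ** 2

for side in (3, 6, 9, 12):  # assumed die edge lengths in mm
    print(f"{side} mm die: {scribe_overhead(side):.1%} of wafer area lost to cuts")
```

The smaller the die, the bigger the fraction of the wafer that goes to the cuts, which is exactly the worry with a tiny I/O chiplet.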

One of the reasons for the 14nm I/O chiplet is also that I/O doesn't scale well: the power reductions just aren't there, while the yields are. Lisa and others stressed this on the livestream a number of times.

1 Like

I am also wondering how long it will take for AMD to build a more or less fully integrated system with Zen cores and several GPU dies and/or FPGAs or other ICs, all on one board, for specialized compute applications. When everything else failed, AMD always had their semi-custom branch, and you could do amazing things with this technology. Although you would need companies like Google or Amazon to order something like that for it to make financial sense.


The thing is, for Ryzen you don't need octa-channel memory or huge L3/L4 cache sizes, so the I/O is tiny for the consumer chips but quite significant for the server chips. I am not entirely convinced it makes sense to separate cores and I/O on the consumer chip.

We don’t really know. Everyone here makes great points and I don’t disagree with you, I just like to think through all the possible angles.

I think if you look at tooling prices of 7nm vs 14nm, and at yields, the chances of a complete 7nm die with I/O integrated are zero to none. I have no insider info, but logic tells me there will be two I/O chips: one made from EPYC dies with defective blocks for Threadripper, and one for Ryzen. APUs will probably be 12nm shrinks of the 2000 series that kill it on price.

1 Like

yep, that’s the catch.

It's probably the most likely configuration, I agree. I just like to think through different angles on how you could solve that kind of problem. It's not always as obvious as it seems at first glance.

1 Like

It costs $271 million to design a 7nm SoC.
It’s also too expensive to make the wrong choice at 10nm and 7nm. Making the wrong bet can be disastrous.

To help foundry customers get ahead of the curve, Semiconductor Engineering has taken a look at the various tradeoffs at 10nm and 7nm.

The tradeoffs
Not long ago, it was a straightforward and inexpensive effort to migrate from one node to another. But the dynamics appeared to change at 16nm/14nm, when foundry vendors introduced finFET transistors for leading-edge designs.

FinFETs solve the short-channel effects that plague leading-edge planar devices. In finFETs, the control of the current is accomplished by implementing a gate on each of the three sides of a fin.

While finFETs keep the industry on the scaling path, the problem is that fewer and fewer foundry customers can afford the technology. In total, the average IC design cost for a 14nm chip is about $80 million, compared to $30 million for a 28nm planar device, according to Gartner.

(Found after a little google-fu. And this is just design; tooling is in the billions.)
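Just to make the scale of those numbers concrete, a quick break-even sketch; the design costs are the figures quoted above, while the $20 of margin set aside per chip to pay back the design is purely my assumption:

```python
# Break-even sketch: units needed to recoup design cost (NRE) alone.
# Design costs are the figures quoted above; the $20 of gross margin
# per chip earmarked for NRE recovery is purely an assumption.
nre = {"28nm": 30e6, "14nm": 80e6, "7nm": 271e6}
margin_per_chip = 20.0

for node, cost in nre.items():
    print(f"{node}: ~{cost / margin_per_chip / 1e6:.1f}M units to break even")
```

At 7nm you'd need to move on the order of ten million units before the design cost alone is paid off, which is why only the highest-volume designs can justify it - and why reusing one chiplet everywhere is so attractive.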

3 Likes

Depends how they go about it.

I could see that AM4 stays more or less where it is now, but that would be a bad idea.

They could make smaller I/O chips, so instead of handling 8 CCXs and 64 cores, one just deals with a single CCX. It could be designed differently too: instead of shuffling stuff around between dies, it would be optimised for getting data in and out of the CCX as fast as possible on the consumer devices, shooting for Intel's best in gaming titles.

This assumes that the new 7nm CCXs are destined for consumers and not some new version of Zeppelin, or some other way of wiring them. It assumes they are "dumb cores", unlike Zeppelin, with the smaller CCXs reliant on the I/O chip; there could instead be a consumer variant that is just one chip, all 8 cores and the I/O together.

Edit: I also wonder, with the 8-core CCX being so small now, whether there will be the opportunity for a "small" 8c/16t CPU with the majority of the die given over to an on-board GPU.

This is all wild imagination now, but the new Vega stuff is getting to 1080 levels and doing it pretty small. I wonder whether, once they are just nipping at Intel's heels in CPU performance, they could go and drop a 1080 or more onto the CPU too. Probably too expensive to be useful, but provided the motherboards keep up with the requirements, that is an entire PC in a mobo, APU, RAM and PSU. Maybe not for consumers, but all-in-ones and other integrated PCs could go wild on that if they get it all right, at least in my imagination.

AMD probably wouldn’t want to create two separate I/O die designs for mainstream Ryzen and mainstream Ryzen APU, so any design would need to be able to connect to a GPU as well.

I see three possibilities:

  1. I/O die having 2 chiplet connections, one of which would be used for graphics;

  2. splitting the single chiplet connection between core and graphics chiplets for the APUs;

  3. instead of offering the full 24 external PCIe lanes, 8 of those would be used - possibly in IF mode - for the iGPU on APUs.

(1) would open the possibility of a 2-chiplet mainstream Ryzen part, with a maximum of 16 cores; and, depending on cost, of including a minimal (mainly 2D) graphics die (something like 2 or 3 CUs) on some of the mainstream Ryzen parts (even those with 6 or 8 cores). The integrated GPU could work as the primary GPU for desktop use (businesses especially would love this), as a backup GPU, for connecting an additional desktop display, etc.

1 Like

In keeping with form, and not castrating Threadripper like the current gen: I bet Threadripper gets an I/O chip that handles having half the memory channels better, and those lonely cores with no direct memory connection go away.

Still, I'm navel-gazing. Fast forward 3 months, please.

For Threadripper, AMD can use faulty or partially disabled EPYC I/O dies, design another I/O die, or a combination of both (depending on SKU and the number of chiplet connections required for each). (EPYCs will always need a full complement of memory and external I/O controllers, so there will be partially faulty dies left over to use for Threadripper.)

If they go the dual-design route, the exact cut-off as to which SKUs use which I/O design depends on yields and fault rates, but also on sales volumes: if the vast majority of Threadrippers sold are 16-core or below, it may be most cost-effective to have just 2 (or maybe 3) core chiplet connections, 4 memory channels and 64 PCIe lanes on the special Threadripper I/O die, and leave the higher-end SKUs to the partially disabled EPYC one.
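A toy model of that trade-off (every number invented, only the structure of the argument matters): a second, cut-down I/O die costs extra design money up front but saves silicon on every unit, so it only pays off past some volume:

```python
# Toy trade-off: ship salvaged EPYC I/O dies in every Threadripper vs.
# taping out a second, smaller TR-specific I/O die. Every number here
# is invented for illustration; only the shape of the argument matters.
extra_nre = 50e6        # assumed cost of a second 14nm I/O die design
epyc_io_cost = 40.0     # assumed per-unit cost of a full EPYC I/O die
small_io_cost = 15.0    # assumed per-unit cost of a cut-down TR die

break_even = extra_nre / (epyc_io_cost - small_io_cost)
print(f"second design pays off above ~{break_even / 1e6:.1f}M TR units")
```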

https://www.anandtech.com/show/13578/naples-rome-milan-zen-4-an-interview-with-amd-cto-mark-papermaster

2 Likes

Not out yet but the partners are already ordering.

So, I wanted to address some of your comments, but also add to the thread generally. First, let's start with the idea of designing a consumer chip versus the EPYC and TR chips. If AMD removed I/O from the chiplets with the logic cores, they need an I/O chip. The question then is whether it is cheaper to design multiple logic chips, with the consumer chip taking up a larger size on the wafer by including I/O on a new 7nm process with higher defect densities, or to have one logic chiplet design for all product lines and use the mature, low-defect-density 14nm process for multiple I/O chips, chosen depending on which product line is being produced.

I'm siding with the latter: it is mature and increases 7nm yields, while package integration is already in the 99%+ range. Considering the cost of designing a chip for 7nm, it is a much simpler and cheaper solution to design multiple I/O chips for what is needed. So, to answer your question: they will have one logic chiplet for all lines, and multiple I/O chiplet designs depending on the target market segment.

The only real question is with the I/O chiplet size: whether they will cut down the EPYC version for TR, so as to utilize dies with non-critical defects on the TR line and reduce costs that way, or whether they will design a chiplet with fewer memory channels and reduced elements for the TR models. I'm hoping for the former. Mainstream will have its own I/O chiplet, which also means that when DDR5 comes with Zen 3 / EPYC 3, the mainstream AM4 socket will only have a DDR4 IMC on its chiplet (it also means TR3 may see DDR5).
Source:
https://www.anandtech.com/show/13578/naples-rome-milan-zen-4-an-interview-with-amd-cto-mark-papermaster
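For what it's worth, the yield argument above can be illustrated with the simplest defect model, Y = exp(-A × D0); the areas and defect densities below are my guesses, not AMD data, and multiplying the two die yields leans on the 99%+ package integration mentioned above:

```python
import math

def poisson_yield(area_mm2: float, d0_per_mm2: float) -> float:
    # Simplest defect-limited yield model: Y = exp(-A * D0).
    return math.exp(-area_mm2 * d0_per_mm2)

# Illustrative guesses, not AMD data:
d0_7nm, d0_14nm = 0.003, 0.001   # immature 7nm vs mature 14nm
core_area = 75                    # 7nm logic chiplet, mm^2
io_area = 125                     # I/O block, whichever node it lands on

monolithic = poisson_yield(core_area + io_area, d0_7nm)
split = poisson_yield(core_area, d0_7nm) * poisson_yield(io_area, d0_14nm)

print(f"monolithic 7nm die:      {monolithic:.1%}")
print(f"7nm chiplet + 14nm I/O:  {split:.1%}")  # ignores ~1% packaging loss
```

Even with the packaging loss folded in, the split design comes out well ahead whenever the new node's defect density is high, which is exactly the early-7nm situation.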

Next is to analyze some of Papermaster’s comments (@anon46267848, @olddellian).
“IC: Do the chiplets communicate with each other directly, or is all communication through the IO die?
MP: What we have is an IF link from each CPU chiplet to the IO die.”

This may suggest that all information goes through the I/O die, including core-to-core communications. It would standardize latencies, but it leads to other questions (some of which I will get to in a moment, with topology potentially part of this).

The talk about cache on the I/O die is an interesting one. In Zen 1 and Zen+, the L3 was used as a victim cache, but because it is split per CCX, if a core needed something in the L3 of another CCX or die, you paid the inter-die and inter-CCX latencies. This, plus the Canard PC leak of 256MB of L3 cache, Papermaster mentioning that all chiplets are tied to the I/O die, and the size of the I/O chiplet, makes me think that the L3 cache was moved to the I/O chip so that it equalizes the L3 access latency for all chiplets. Not only that, the memory controllers can feed the L3 directly on the I/O chip, somewhat simplifying the routing requirements that would otherwise result (the same could be done with an L4 cache, using the L3 as a localized chiplet victim cache and the L4 as the general victim cache). Either way, Papermaster did not give guidance. But putting cache on the I/O die also allows the L3 or L4 size to be controlled separately from the cores, meaning we could see monster cache sizes, with the 32-core chips matching the 64-core chips for server products.
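A toy illustration of that equalization argument, with invented cycle counts: even if the average barely moves, the centralized layout removes the local/remote spread, which is the part software has trouble scheduling around:

```python
# Toy latency comparison for the "L3 on the I/O die" theory.
# All cycle counts are invented for illustration, not measurements.
local_l3, remote_l3 = 40, 140   # assumed split-L3 (Zen 1 style) latencies
p_local = 0.5                   # assumed share of hits in the local slice

split_avg = p_local * local_l3 + (1 - p_local) * remote_l3
central_l3 = 90                 # assumed uniform latency via the I/O die

print(f"split L3:   avg {split_avg:.0f} cycles, spread {local_l3}..{remote_l3}")
print(f"central L3: avg {central_l3} cycles, identical for every chiplet")
```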

This also makes sense considering the size of the chiplets, which is less than half, almost one-third, of the previous die, yet they pack more in: 256-bit FP, more load/store, increased L1 and L2 cache sizes, new instruction sets, etc., all of which require space on the logic chiplets. Moving the L3 off-die along with the IMCs and I/O would greatly reduce the size, followed by a slight increase from the new features.

Then we have the placement of the I/O, which is interesting and made me think of two videos that AdoredTV published, although the first video and the papers attached to it are the most pertinent.

Basically, the optimal granularity seen for disintegration was 8 cores, and they explored chiplets on interposers with different topologies, with the I/O and memory positioned at the edges of the package interposer. What they saw is that the central channels naturally wind up with more use, almost causing a traffic jam for the inter-chiplet comms. They also had to find workarounds, shortening the jumps and adding repeaters, to reach memory channels on the other side of the package. But since the memory is not on package here, and even though this design doesn't use an interposer, why not centrally place the I/O and memory channels? If the central channel gets bogged down by traffic crossing to the other side and by comms between chiplets, then turn into the skid: set up the MCM with the IF2 controller there, for more efficient routing and central memory access, so everything is fed through there on purpose. This potentially includes a large victim cache to keep the cores fed and to make up for not putting more memory channels on the I/O die: if you have a large enough cache to keep the cores fed, and enough memory bandwidth to keep that larger cache primed, you can counteract the potential for starvation without increasing the number of memory controllers.
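As a crude stand-in for that routing argument, here is a Manhattan-distance sketch on a made-up 8-chiplet layout; the point is only that a centrally placed hub shortens the total and worst-case paths, not that the real package looks like this:

```python
# Manhattan-distance sketch of why a centrally placed I/O hub helps.
# The 8 chiplet coordinates are a made-up package layout, nothing more.
chiplets = [(x, y) for x in (0, 1, 3, 4) for y in (0, 2)]

def total_hops(hub):
    # Sum of |dx| + |dy| from every chiplet to the hub.
    return sum(abs(x - hub[0]) + abs(y - hub[1]) for x, y in chiplets)

print("hub at a corner:", total_hops((0, 0)))  # I/O pushed to the edge
print("hub centered:   ", total_hops((2, 1)))  # I/O between the chiplets
```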

Sure, it is a bit more complicated than this, but just some food for thought.

4 Likes

Awesome!

wait a sec, let me get some coffee, then I’ll start reading

2 Likes