In practice (where I generally live, with some expertise), AMD's approach lets you move software between cores and the software generally doesn't know or care, whereas with Intel's method applications definitely do care and can have issues.
To me, the difference is being able to migrate a virtual machine between two similar CPUs without rebooting it (AMD's approach) versus having to reboot for one compatibility reason or another (Intel's method).
Yes, you're right and that's what I discuss in the last paragraph of my comment. And, yes, E-cores are rough descendants of some processors that were sometimes called Atom. Using Intel marketing names is fraught with peril, though (see, Celeron) as "Atom" has referred to many conceptually different microarchitectures over time; the modern E-Core is of no relation to the original in-order "Atom" processor many remember.
I've had the same experiences as you with Intel mixed-core desktop parts. They're incredibly difficult to optimize for due to the heterogeneous core mixture, whereas AMD mobile parts are generally more reasonable (you're on a slow core or a fast core, basically), and AMD never made a mixed-core desktop part.
However, Intel server parts several years ago switched to E-core only or P-core only, so all of the heterogeneous core mixture issues aren't a thing - you basically have two separate processor generations being sold at once, which isn't particularly surprising or uncommon.
With AMD server processor families (linked in my comment), depending on the part's density you get either "slow" or "fast" cores and either "wide" or "narrow" units, so you do still have to think about things a little bit there too.
Where Intel really screwed up in general, microarchitecture differences aside, is AVX512. That's the wrench that prevents the same compiled code from running across most Intel parts - they just couldn't decide what they wanted to do with it, whereas AMD just chose to support it and stick with it, even though the throughput for the wide instructions is wildly different between processors.
I've never understood why Intel didn't just soft-disable AVX512 on P-cores until the OS writes a value to some MSR that means "I understand that only some cores have AVX512".
From the OS side, the change to support it is pretty simple. On the first #UD trap caused by an AVX512 instruction being missing, pin the process to just the P-cores and end the process's timeslice.
Intel E-cores are basically a different microarchitecture. They often support different instruction sets than their P-cores, have different "instructions-per-clock" rates (IPC), and all sorts of other major differences. They're just very different things, and those differences are responsible for most of the bad reputation that E-cores have.
AMD's dense cores are the same microarchitecture, have the same IPC, and use all the same instruction sets. The only real difference from regular AMD cores is that the dense cores have less cache and lower peak clocks.
There's nothing wrong with E-cores though, their bad reputation is quite undeserved. They pack a lot of compute in tiny area and power constraints compared to P-cores. They're probably not the optimal choice for a single-thread workload, but that's an entirely different matter.
Their bad reputation is fully deserved, it's just also out of date. Reputations are almost always about first impressions, and the first impression with E-cores was bad. They've done a lot to fix the situation though, and they do indeed run pretty well nowadays if you have a more modern Intel CPU.
That said, manually disabling AVX-512 on P-cores just so I can have E-cores is still a *bad* tradeoff as far as I'm concerned, but I get that my use-cases aren't everyone's use-cases.
The P and E cores support different instructions and Intel "fixed" it by disabling instructions on the P-cores. So now they have the same instructions but at the cost of a bunch of wasted silicon.
The Intel server CPUs with P-cores support AVX-512, like all current AMD CPUs, and they also support a few extensions not currently supported by AMD, like AMX (Zen 6 will add FP16 arithmetic support in AVX-512, reducing the differences vs. Intel P-core servers).
The Intel server CPUs with E-cores, both the current Sierra Forest CPUs with Crestmont cores and the future Clearwater Forest CPUs with Darkmont cores, do not support AVX-512, and they are almost identical to the E-cores in Intel laptop/desktop CPUs.
Therefore, for demanding applications you cannot run the same programs on Intel servers with P-cores and E-cores unless they use dynamic dispatch to select at run time between AVX and AVX-512 libraries; the gain from AVX-512 can be very substantial, and server applications not using it would lose money by lowering throughput.
The Intel Darkmont cores of Panther Lake and Clearwater Forest are almost identical with the Skymont cores of Arrow Lake and Lunar Lake (the main difference is that the Skymont cores are made by TSMC, while the Darkmont cores are made by Intel in their new 18A CMOS process) and they are extremely similar in die size and in performance with the ARM Neoverse V3 cores from the newly launched AWS Graviton5 (which are known as Cortex-X4 in their smartphone variant).
Intel has said that they will eliminate this ISA difference between E-cores and P-cores, but a couple of years might pass until this will reach their server CPUs.
While the power draw might be high in absolute terms, the surface area is also quite large. For example, the article's estimates add up to about 2,000 mm² for the Epyc chip. For reference, a Ryzen 9950X (AMD's hottest desktop CPU) has a die area of about 262 mm² and a PPT (maximum power draw) of ~230 W. This means the maximum heat flux at the chip interface will almost certainly be lower on the Epyc chip than on the Ryzen; I don't think we're going to be getting 1,000 W+ PPT/TDP chips.
From that you can infer that liquid cooling shouldn't be needed just to get the heat off the chip.
There still are overall system power dissipation problems, which might lead you to want to use liquid cooling, but not necessarily.
You can move a lot of air with good efficiency even just by using bigger fans that don't need to spin as fast most of the time. Water cooling is a good default for power-dense workloads, but far from an absolute necessity in every case.
Air, almost certainly. They always develop these chips within a thermal envelope, and that envelope should be within what air cooling can do.
PS. Having many cores doesn’t mean a lot more power. Multi core performance can be made very efficient by having many cores running at lower clock rate.
Perhaps the most comparable 1990s system would be the SGI Origin 2800 (https://en.wikipedia.org/wiki/SGI_Origin_2000) with 128 processors in a single shared-memory multiprocessing system. The full system took up nine racks. The successor SGI Origin 3800 was available with up to 512 processors in 2002.
Each core is multiples faster than a 90's CPU for various reasons as well. I think if you look at an entire rack it's easily a multiple of a 90's datacenter.
> this would be the first time that a high core count CCD will have the ability to support a V-Cache die. If AMD sticks to the same ratio of base die cache to V-Cache die cache, then each 32 core CCD would have up to 384MB of L3 cache which equates to 3 Gigabytes of L3 cache across the chip.
Good lord!
> CCD
Core Complex Die - an AMD term for a chiplet that contains the CPU cores and cache. It connects to an IOD (I/O die) that does memory, PCIe etc (≈southbridge?).
Aside: CCX is Core Complex - see Figure 1 of https://www.amd.com/content/dam/amd/en/documents/products/ep...
For any other older fogeys that CCD means something different.
> memory, PCIe etc (≈southbridge?)
northbridge
To further expand on this, "southbridge" is what we now call a chipset expander (or 50 other company- or product-line-specific names).
It's a switch with a bunch of unified PHYs that can do many different tasks (non-primary PCIe lanes, SATA ports, USB ports, etc.), leveraging shared hardware to reduce silicon footprint while increasing utility, and it connects to PCIe lanes on the CPU.
Don’t EPYC CPUs avoid using a chipset altogether? I think in that case, it would be NB+SB.
Yes.
The "northbridge" in modern Zen systems is the IO die, and in Zen 1/+, it's the tiny fractional IO die that was colocated on each die (which means a Zen 1/+ Epyc had the equivalent of 4 tiny northbridges).
However, they just embed the equivalent design of the chipsets into the IO Die SoC on Epycs.
Fun fact: for desktop, since Zen 1 (and AM4-compatible non-Zen CPUs), they included a micro-southbridge in the IO die. It gave you 2 SATA ports and 4 USB ports, usually the only "good" ones on the board. On Epyc, they just put the full-sized one here instead of pairing it with an external one.
This also means, for example, that if you have 4 USB3 10 Gbit ports and they're not handled by a third-party add-on chip, those are wired directly into the CPU and aren't competing for the x4 link that feeds the southbridge.
Also fun fact: the X, B, and A chips are all sibling designs under the name Promontory, made jointly with ASMedia. They're essentially all identical, only updated for newer PCIe and USB versions as time went on, as well as adding more ports and shrinking die size.
The exception is the X570: it's an AMD-produced variant of Promontory that also contains the Zen 2/3 IO die, as they're actually the same chip in this case. The chips that failed to become IO dies had all their Promontory features enabled instead and became chipset chips. The Zen 2/3 Epycs shipped their IO die, at least partly, as two X570s welded together with even more DDR PHY thrown in, as some sort of cost saving.
I don't think that panned out, because the X/B/A 600 and 800 variants (Zen 4 and 5) went back to being straight Promontory again.
Wikipedia has some good charts for this: https://en.wikipedia.org/wiki/List_of_AMD_chipsets
256 Zen 6c cores. I can't wait for cloud vendors to get their hands on it. In a dual-socket config that's 512 cores and 1,024 vCPUs per server node. We could get two nodes in a server; that's 1,024 cores with 2,048 threads.
Even the slowest of all programming languages or frameworks, at 1 request per second per vCPU, that's 2K requests per second.
Pure brute force hardware scaling.
256 cores on a die. Stunning.
32 cores on a die, 256 on a package. Still stunning though
How do people use these things? Map MPI ranks to dies, instead of compute nodes?
Yeah, there's an option to configure one NUMA node per CCD that can speed up some apps.
MPI is fine, but have you heard of threads?
Sure, the conventional way of doing things is OpenMP on a node and MPI across nodes, but
* It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…
* I’m wondering if explicit communication is better from one die to another in this sort of system.
With 2 IO dies aren't there effectively 2 meta NUMA nodes with 4 leaf nodes each? Or am I off base there?
The above doesn't even consider the possibility of multi-CPU systems. I suspect the existing programming models are quickly going to become insufficient for modeling these systems.
I also find myself wondering how atomic instruction performance will fare on these. GPU ISA and memory model on CPU when?
There should be plenty of existing programming models that can be reused because HPC used single-image multi-hop NUMA systems a lot before the Beowulf clusters took over.
Even today, I think very large enterprise systems (where a single kernel runs on a single system that spans multiple racks) are built like this, too.
If you query the NUMA layout tree, you have two sibling hw threads per core, then a cluster of 8 or 12 actual cores per die (up to 4 or 8 dies per socket), then the individual sockets (up to 2 sockets per machine).
Before 8 cores per die (introduced in Zen 3 and retained in 4, 5, and 6), in the Zen 1/+ and 2 series this would have been two sets of four cores instead of one set of eight (and a split L3 instead of a unified one). I can't remember if the split CCX had its own NUMA layer in the tree, or if the CCXs were just iterated in pairs.
What I find myself wondering about is the performance impacts of cross-thread communication in this scenario. With the nested domains it seems like there should be different (and increasingly severe) performance implications for crossing each distinct boundary. Whereas the languages we write in and the programming models we employ don't seem particularly well suited to expressing how we want our code to adapt to such constraints at present, at least not in a generalized manner.
I realize that HPC code can be customized to the specific device it will be run on but more widely deployed software is going to want to abstract these increasingly complex relationships.
It's why, if you want high-performance code in this sort of work, you'll want either C or C-like code. For example, learn how madvise() is used. Also, learn how thread-local storage works in the context of implementing it on a hierarchical SMP system. Also, learn how to make a message-passing system and what "atomic" means (locks are often not your friend here).
Ironically, a lot of people keep shooting themselves in the foot and blindly using MPI or OpenMP or any of the other popular industry supported frameworks, and then thinking that magically bails them out. It doesn't.
The most important thing you need, above all others: make sure the problem you're solving can be parallelized, and CPUs are the right way of doing it. Once you've answered this question, and the answer is yes, you can just write it pretty much normally.
Also, ironically, you can write Java that isn't shit and takes advantage of systems like these. Sun and the post-Sun community put a lot of work into the Hotspot JVM to make it scale alarmingly well on high core count machines. Java used correctly is performant.
Chips and Cheese did some measurements for previous AMD generations; they have a pretty good core-to-core latency measurement a little after halfway down the page.
https://chipsandcheese.com/p/genoa-x-server-v-cache-round-2
640 cores should be enough for anyone
Tell that to Nvidia, Blackwell is already up to 752 cores (each with 32-lane SIMD).
640K cores should be enough for everyone.
B200 is 148 SMs, so no
Each SM cluster contains 4 independent 32-wide compute units, and GB202 has 192 SMs, although only 188 of them are enabled on the largest shipping SKU. IMO that makes for 752 "cores", but depending on where you draw the line it could be 188, 752, or 24064.
SM is the Nvidia definition of a processor, and CUDA device properties return that count, not anything else. If you want a marketing number, use CUDA cores; it doesn't consistently map to anything in the hardware design.
no, you really can't.
Nvidia's use of "cores" is simply wrong, unless you think a core is a simple scalar ALU; but cores haven't been like that for decades.
Or would you like to count cores in a current AMD or Intel CPU that way? Each "core" has half a dozen ALUs/FP pipes, and don't forget to multiply by SIMD width.
That's going to run Cities Skylines 2 ~~really really well~~ as well as it can be run.
Does it actually scale well to that many cores? If so, that's quite impressive; most video game simulations of that kind benefit more from a few fast cores, since parallelizing simulations well is difficult.
No, see https://m.youtube.com/watch?v=44KP0vp2Wvg . You're right, it didn't scale that well.
Looks like it may be capped at 32 cores in that video, if they are hitting 25%-30% of a 96 core CPU?
Here's analysis of a prior LTT video showing 1/3 of cores at 100%, 1/3 of cores at 50%, and 1/3 idle cores:
https://www.youtube.com/watch?v=XqSCRZJl7S0
In any case, CS2 can take advantage of far more cores than most games.
these big high-core systems do scale, really well, on the workloads they're intended for. not games, desktops, web/db servers, lightweight stuff like that. but scientific, engineering - simulations and the like, they fly! enough that the HPC world still tends to use dual-socket servers. maybe less so for AI, where at least in the past, you'd only need a few cores per hefty GPU - possibly K/V stuff is giving CPUs more to do...
> not ... web/db servers, lightweight stuff like that.
They scale very well for web and db servers as well. You just put lots of containers/VMs on a single server.
AMD EPYC has a separate architecture specifically for such workloads. It's a bit weaker, runs at lower frequency and power and takes less silicon area. This way AMD can put more such cores on a single CPU (192 vs 128 for Zen 5c vs 5). So it's the other way round - web servers love high core count CPUs.
not really - you can certainly put lots of lightweight services on it, but they don't scale. because each core doesn't really get that much cache or memory bandwidth. it's not bad, just not better.
Not true. You should look up Siena chips and something like the ASUS S14NA-U12. It has six DDR5-4800 channels, two physical PCIe 5.0 slots, two M.2 ports, and six MCIO x8 ports. All lanes are full-bandwidth. The 8434PN CPU gets you 48 physical cores in a 150 W envelope. Zen 4c really is magic, and there's LOTS of bandwidth to play with.
> not games, desktops, web/db servers, lightweight stuff like that.
Things like games, desktops, browsers, and such were designed for computers with a handful of cores, but the core count will only go up on these devices - a very pedestrian desktop these days has more than 8 cores.
If you want to make software that’ll run well enough 10 years from now, you’d better start using computers from 10 years from now. A 256 core chip might be just that.
why do you think lightweight uses will ever scale to lots of cores?
the standard consumer computer of today has only a few cores that race-to-sleep, because there simply isn't that much to do. where do you imagine the parallel work will come from? even for games, will work shift off the GPU onto the host processor? seems unlikely.
future-proofing isn't about inflating your use of threads, but being smart about memory and IO. those have been the bottleneck for decades now.
> why do you think lightweight uses will ever scale to lots of cores?
Because the cores will be there, regardless. At some point, machines will be able to do a lot of background activity and learn about what we are doing, so that local agentic models can act as better intelligent assistants. I don't know what the killer app for the kilocore desktop will be, and nobody does, but when PARC made a workstation with bit-mapped graphics out of a semi-custom minicomputer that could easily have served a department of text terminals, we got things like GUIs, WYSIWYG, Smalltalk, and a lot of other fancy things nobody imagined back then.
You can try to invent the future using current tech, or you might just try to see what's possible with tomorrow's tools and observe it first hand.
I’m gonna get one of these and I’m just gonna play DOOM on it.
Intel's Clearwater Forest could be shipping even sooner, 288 cores. https://chipsandcheese.com/p/intels-clearwater-forest-e-core...
It's a smaller, denser core, but still incredibly promising and so, so neat.
Someone needs to try running Crysis on that bad boy using the D3D WARP software rasterizer. No GPU, just an army of CPU cores trying their best. For science.
This has already been tried :)
IIRC, in 2016 a quad-core Intel CPU ran the original Crysis at ~15 fps.
I wonder what Ampere (mentioned in that article) is going to do. At this rate they’ll need to release a 1000 cpu chip just to be noticeably “different.”
At some point won't the bandwidth requirements exceed the number of pins you can fit within the available package area? Presumably you'll end up back at a low maximum memory high bandwidth GPU design.
I wonder how many of these you could cram into 1U? And what the maximum next gen kW/U figure looks like.
Unfortunately Ampere has fallen pretty far behind AMD. I don't see much point to their recent CPUs.
"E-cores" are not the same
The 32 core / die AMD products are almost certainly Zen 6c, which is the same "idea" as Intel E-Cores albeit way less crappy.
https://www.techpowerup.com/forums/threads/amd-zen-6-epyc-ve...
EDIT: actually, now that I think about it some more, my characterization of Zen-C cores as the same "idea" as Intel E-cores was pretty unfair too; they do serve the same market idea but the implementation is so much less silly that it's a bit daft to compare them. Intel E-Cores have different IPC, different tuning characteristics, and different feature support (ie, they are usually a different uarch) which makes them really annoying to deal with. Zen C cores are usually the same cores with less cache and sometimes fewer or narrower ports depending on the specific configuration.
I was about to reply with an "well, actually..." comment and then I saw that you beat me to it with your edit.
Fully agreed - they may be targeting a similar goal, but the execution is so different, and Intel screwed up the idea so badly, that it can really mislead people into assuming that dense Zen cores are the same junk as Intel E-cores.
I may be wrong, as I'm not an expert, but Intel E-cores are basically descendants of Intel Atom (I personally really liked the idea of Atom; it was just so nerfed by Intel with its memory limits and platforms), while P-cores are derived from the i-series - two totally different cores. Yes, they generally support the same instruction sets, but they are different cores.
AMD's approach was basically to trim the fat on Zen as far as they possibly could while keeping the core fast and efficient, and you end up with the C-cores.
In practice (where I generally live, with expertise), AMD's approach lets you move software between cores and it generally doesn't care or even know, whereas with Intel's method applications definitely do care and can have issues.
To me, the difference is being able to move a virtual machine between two similar CPUs without rebooting it (AMD's approach) versus having to reboot it for one compatibility reason or another (Intel's method).
Yes, you're right and that's what I discuss in the last paragraph of my comment. And, yes, E-cores are rough descendants of some processors that were sometimes called Atom. Using Intel marketing names is fraught with peril, though (see, Celeron) as "Atom" has referred to many conceptually different microarchitectures over time; the modern E-Core is of no relation to the original in-order "Atom" processor many remember.
I've had the same experiences as you with Intel mixed-core desktop parts. They're incredibly difficult to optimize for due to the heterogeneous core mixture, whereas AMD mobile parts are generally more reasonable (you're on a slow core or a fast core, basically), and AMD never made a mixed-core desktop part.
However, Intel server parts several years ago switched to E-core only or P-core only, so all of the heterogeneous core mixture issues aren't a thing - you basically have two separate processor generations being sold at once, which isn't particularly surprising or uncommon.
With AMD server processor families (linked in my comment), depending on the part's density you get either "slow" or "fast" cores and either "wide" or "narrow" units, so you do still have to think about things a little bit there too.
Where Intel really screwed up in general, microarchitecture differences aside, is AVX512. That's the wrench that prevents the same compiled code from running across most Intel parts - they just couldn't decide what they wanted to do with it, whereas AMD just chose to support it and stick with it, even though the throughput for the wide instructions is wildly different between processors.
I've never understood why Intel didn't just soft-disable AVX512 on P-cores until the OS writes a value to some MSR that means "I understand that only some cores have AVX512".
From the OS side, the change to support it is pretty simple. On the first #UD trap caused by an AVX512 instruction being missing, pin the process to just the P-cores and end the process's timeslice.
ie. marketed as "dense" instead of "efficient"
By what logic?
Intel E-cores are basically a different microarchitecture. They often support different instruction sets than their P-cores, have different "instructions-per-clock" rates (IPC), and all sorts of other major differences. They're just very different things, and those differences are responsible for most of the bad reputation that E-cores have.
AMD's dense-cores are the same microarchitecture, have the same IPC, use all the same instruction sets. The only real difference between them and regular AMD cores is that their dense cores have less cache, and lower peak clocks.
There's nothing wrong with E-cores though, their bad reputation is quite undeserved. They pack a lot of compute in tiny area and power constraints compared to P-cores. They're probably not the optimal choice for a single-thread workload, but that's an entirely different matter.
Their bad reputation is fully deserved, it's just also out of date. Reputations are almost always about first impressions, and the first impression with E-cores was bad. They've done a lot to fix the situation though, and they do indeed run pretty well nowadays if you have a more modern Intel CPU.
That said, manually disabling AVX-512 on P-cores just so I can have E-cores is still a *bad* tradeoff as far as I'm concerned, but I get that my use-cases aren't everyone's use-cases.
Intel is building their new chips on that microarchitecture so it will probably be fine.
>They often support different instruction sets than their P-cores
Do they?
I thought it caused very significant problems (when there's switch between E and P core) and they avoided it
But I cannot find anything about it
The P and E cores support different instructions and Intel "fixed" it by disabling instructions on the P-cores. So now they have the same instructions but at the cost of a bunch of wasted silicon.
The Intel server CPUs with P-cores support AVX-512, like all current AMD CPUs, and they also support a few extensions not currently supported by AMD, like AMX (Zen 6 will add FP16 arithmetic support in AVX-512, reducing the differences vs. Intel P-core servers).
The Intel server CPUs with E-cores, both the current Sierra Forest CPUs with Crestmont cores and the future Clearwater Forest CPUs with Darkmont cores do not support AVX-512 and they are almost identical with the E-cores from Intel laptop/desktop CPUs.
Therefore, for demanding applications you cannot run the same programs on Intel servers with P-cores and on those with E-cores, unless they use dynamic dispatch to select at run time between AVX and AVX-512 code paths; the gain from AVX-512 can be very substantial, and server applications not using it would lose money through lower throughput.
The Intel Darkmont cores of Panther Lake and Clearwater Forest are almost identical to the Skymont cores of Arrow Lake and Lunar Lake (the main difference is that the Skymont cores are made by TSMC, while the Darkmont cores are made by Intel in their new 18A CMOS process). They are also extremely similar in die size and performance to the ARM Neoverse V3 cores of the newly launched AWS Graviton5 (known as Cortex-X4 in their smartphone variant).
Intel has said that they will eliminate this ISA difference between E-cores and P-cores, but a couple of years might pass until this will reach their server CPUs.
> Do they?
Yes?
Ah, I omitted to mention that with 256 cores, you get 512 threads.
I'd just like to take a moment to appreciate chipsandcheese and how they fill the Anandtech-shaped void in my heart <3
random internet feedback:
i really wish the article had spent 2 sec to write in parentheses what 'CCD' is (it's 'Core Complex Die', fyi)
This is a hardcore chip website. All their readers know what it is.
If their goal was to appeal to more casual readers, then I agree.
Well, it could also mean CCD (charge-coupled device), which is also used in this field (or was?)
How is this sort of package cooled? Seems like you'd pretty much need to do some sort of water cooling right?
While the power draw might be high in absolute terms, the surface area is also quite large. For example, the article's estimates add up to just 2000mm2 for the Epyc chip. For reference, a Ryzen 9950X (AMD's hottest desktop CPU) has a surface area of about 262mm2, and a PPT (maximum power draw) of ~230W. This means that the max heat flux at the chip interface will almost certainly be lower on the Epyc chip than on the Ryzen - I don't think we're going to be getting 1000W+ PPT/TDP chips.
From that you can infer that there shouldn't be the need for liquid cooling in terms of getting the heat off the chip.
There still are overall system power dissipation problems, which might lead you to want to use liquid cooling, but not necessarily.
For example, Supermicro will sell you air-cooled 1U servers with options for CPUs up to 400W (https://www.supermicro.com/en/products/system/hyper/1u/as%20...)
You can move a lot of air with good efficiency even just by using bigger fans that don't need to spin as fast most of the time. Water cooling is a good default for power-dense workloads, but far from an absolute necessity in every case.
Air, almost certainly. They always develop these chips within a thermal envelope, and that envelope should be within what air cooling can do.
PS. Having many cores doesn’t mean a lot more power. Multi-core performance can be made very efficient by running many cores at a lower clock rate.
You can cool it however you want but the better the cooling the better the performance. We'll probably see heat pipes at a minimum.
256c/512t off a single package… likely 1024 threads in a 2cpu system.
Basically, we are about to reach the scale where a single rack of these is a whole datacenter from the nineties, or something like that.
Perhaps the most comparable 1990s system would be the SGI Origin 2800 (https://en.wikipedia.org/wiki/SGI_Origin_2000) with 128 processors in a single shared-memory multiprocessing system. The full system took up nine racks. The successor SGI Origin 3800 was available with up to 512 processors in 2002.
Each core is multiples faster than a 90s CPU for various reasons as well. If you look at an entire rack, I think it's easily a multiple of a 90s datacenter.
The new double wide rack looks good
AMD Venice? 2005 is calling!
x86_64 server architecture 256 cores on a die.
Blackwell 100+200 compression spin lock documentation.
Have not checked for a while, but does AMD at this point have software that runs stably and efficiently?
Or are they still building chips no one wants to use because CUDA is the only thing that doesn’t suck balls
ROCm is pretty stable now.