It's interesting that Zen 5's FPUs running in full 512-bit wide mode don't actually seem to cause any trouble, but that lighting up the load/store units does. I don't know enough about hardware-level design to know whether this would be "expected".
The full investigation in this article is really interesting, but the TL;DR is: light up enough of the core, and frequencies will have to drop to maintain the power envelope. The transition period is handled very smartly, but it still exists - but as opposed to the old Intel AVX-512 cores that got endless (deserved?) bad press for their transition behavior, this is more or less seamless.
On the L/S unit impact: data movement is expensive, computation is cheap (relatively).
In "Computer Architecture, A Quantitative Approach" there are numbers for the now old TSMC 45nm process: A 32 bits FP multiplication takes 3.7 pJ, and a 32 bits SRAM read from an 8 kB SRAM takes 5 pJ. This is a basic SRAM, not a cache with its tag comparison and LRU logic (more expansive).
Then I have some 2015 numbers for Intel 22nm process, old too. A 64 bits FP multiplication takes 6.4 pJ, a 64 bits read/write from a small 8 kB SRAM 4.2 pJ, and from a larger 256 kB SRAM 16.7 pJ. Basic SRAM here too, not a more expansive cache.
The cost of a multiplication is quadratic, and it should be more linear for access, so the computation cost in the second example is much heavier (compare the mantissa sizes, that's what is multiplied).
The trend gets even worse with more advanced processes. Data movement is usually what matters the most now, expect for workloads with very high arithmetic intensity where computation will dominate (in practice: large enough matrix multiplications).
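To make those figures concrete, here is a rough back-of-envelope conversion to watts. The 5 GHz rate, the two-pipe shape, and using the 256 kB SRAM number as a stand-in for a cache access are my assumptions for illustration, not figures from the book:

```c
/* Back-of-envelope: per-op energies (pJ) times an assumed issue rate.
 * Illustration only; the pJ values are the 22 nm figures quoted above,
 * the 5 GHz clock and 2-pipe shape are assumptions. */
#include <stdio.h>

int main(void) {
    double f_hz        = 5e9;   /* assumed clock */
    double fmul64_pj   = 6.4;   /* 64-bit FP multiply, Intel 22 nm */
    double sram256k_pj = 16.7;  /* 64-bit access, 256 kB SRAM (cache stand-in) */

    /* a 512-bit FP64 FMA pipe does 8 x 64-bit multiplies; assume 2 pipes busy
     * and 2 x 512-bit loads per cycle feeding them */
    double mul_watts  = 2 * 8 * fmul64_pj   * 1e-12 * f_hz;
    double load_watts = 2 * 8 * sram256k_pj * 1e-12 * f_hz;

    printf("multipliers: ~%.2f W, loads: ~%.2f W (per core, very rough)\n",
           mul_watts, load_watts);
    return 0;   /* prints roughly 0.51 W vs 1.34 W: the loads dominate */
}
```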
Appreciate the detail! That explains a lot of what is going on. It also dovetails with some interesting facts I remember reading about the relative power consumption of the Zen cores versus the Infinity Fabric connecting them - the percentage of package power used simply by running the fabric interconnect was shocking.
Right, but a SIMD single-precision mul is linear (or even sub-linear) in cost relative to its scalar cousin. So a 16x32, 512-bit MUL won't be even 16x the cost of a scalar mul; the decoder only has to do the same amount of work, for example.
That may be true for the calculations within each unit, but routing and data transfer are probably the biggest limiting factor on a modern chip. It should be clear that placing 16x units of non-trivial size means that the average unit will likely be further away from the data source than a single unit would be, and transmitting data over distance can have greater-than-linear costs (not just resistance/capacitance losses, but to hit timing targets you need faster switching, which means higher voltages etc.)
Both Intel and AMD to some extent separate the vector ALUs and the register file into 128-bit (or 256-bit?) lanes, which arithmetic ops never need to cross at all. Of course loads/stores/shuffles still need to, though, making this point somewhat moot.
AFAIK you have to think about how many different 512b paths are being driven when this happens; each cycle in the steady-state case (where you can do two vfmadd132ps per cycle) is simultaneously:
- Capturing 2x512b from the L1D cache
- Sending 2x512b to the vector register file
- Capturing 4x512b values from the vector register file
- Actually multiplying 4x512b values
- Sending 2x512b results to the vector register file
.. and probably more?? That's already like 14*512 wires [switching constantly at 5 GHz!!], and there are probably even more intermediate stages?
… per core. There are eight per compute tile!
I like to ask IT people a trick question: how many numbers can a modern CPU multiply in the time it takes light to cross a room?
Random logic has also had much better area scaling than SRAM since EUV, which implies that gap continues to widen at a faster rate.
> but as opposed to the old Intel AVX-512 cores that got endless (deserved?) bad press for their transition behavior, this is more or less seamless.
The problem with Intel was that the AVX frequencies were secret. They were never disclosed for the later cores where the power envelope got tight, and using AVX-512 killed performance throughout the core. This meant that if one core was using AVX-512, the other cores in the same socket throttled down due to thermal load and the power cap. This led every process on the same socket to suffer, which is a big no-no for cloud or HPC workloads where nodes are shared by many users.
Secrecy and downplaying of this effect made Intel's AVX-512 frequency and behavior infamous.
Oh, doing your own benchmarks on your own hardware which you paid for and releasing the results to the public was verboten, btw.
> Oh, doing your own benchmarks on your own hardware which you paid for and releasing the results to the public was verboten, btw.
Well, Cloudflare did anyway.
To be clear, the problem with the Skylake implementation was that triggering AVX-512 would downclock the entirety of the CPU. It didn’t do anything smart, it was fairly binary.
This AMD implementation instead seems to be better optimized and plug into the normal thermal operations of the CPU for better scaling.
Reading the section under "Load Another FP Pipe?" I'm coming away with the impression that it's not the LSU but rather total overall load that causes trouble. While that section is focused on transition time, the end steady state is also slower…
I haven’t read the article yet, but back when I tried to get to over 100 GB/s IO rate from a bunch of SSDs on Zen4 (just fio direct IO workload without doing anything with the data), I ended up disabling Core Boost states (or maybe something else in BIOS too), to give more thermal allowance for the IO hub on the chip. As RAM load/store traffic goes through the IO hub too, maybe that’s it?
I don't think these things are related, this is talking about the LSU right inside the core. I'd also expect oscillations if there were a thermal problem like you're describing, i.e. core clocks up when IO hub delivers data, IO hub stalls, causes core to stall as well, IO hub can run again delivering data, repeat from beginning.
(Then again, boost clocks are an intentional oscillation anyway…)
Ok, I just read through the article. As I understand it, their tests were designed to run entirely on data in the local core's cache? I only see L1d mentioned there.
Yes, that's my understanding of "Zen 5 also doubles L1D load bandwidth, and I’m exercising that by having each FMA instruction source an input from the data cache." Also, considering the author's other work, I'm pretty sure they can isolate load-store performance from cache performance from memory interface performance.
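For reference, a minimal sketch (my illustration, not the author's actual harness) of that kind of kernel, where every FMA sources one operand from a small L1D-resident buffer:

```c
/* Minimal sketch: each FMA pulls one input from a 4 KiB buffer that stays in
 * L1D, keeping both FMA pipes and the load pipes busy. Compile with e.g.
 * gcc -O2 -mavx512f. Illustration only. */
#include <immintrin.h>

float fma_from_l1(const float *buf /* >= 1024 floats */, long iters) {
    __m512 acc0 = _mm512_setzero_ps();
    __m512 acc1 = _mm512_setzero_ps();
    const __m512 scale = _mm512_set1_ps(1.0001f);

    for (long i = 0; i < iters; i++) {
        /* two independent FMAs per iteration, each with a fresh load; the
         * compiler will usually fold the load into the FMA memory operand
         * (vfmadd...ps zmm, zmm, [mem]) */
        __m512 a = _mm512_loadu_ps(buf + ((i * 32) & 1023));
        __m512 b = _mm512_loadu_ps(buf + ((i * 32 + 16) & 1023));
        acc0 = _mm512_fmadd_ps(a, scale, acc0);
        acc1 = _mm512_fmadd_ps(b, scale, acc1);
    }
    /* a real throughput test would use more accumulators to hide FMA latency */
    return _mm512_reduce_add_ps(_mm512_add_ps(acc0, acc1));
}
```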
It seems even more interesting than the power envelope. It looks like the core is limited by the ability of the power supply to ramp up. So the dispatch rate drops momentarily and then goes back up to allow power delivery to catch up.
I find it irritating that they are comparing clock scaling to the venerable Skylake-X. Surely Sapphire Rapids has been out for almost 2 years by now.
Seemed appropriate to me as comparing the "first core to use full-width AVX-512 datapaths"; my interpretation is that AMD threw more R&D into this than Intel before shipping it to customers…
(It's also not really a comparative article at all? Skylake-X is mostly just introduction…)
> my interpretation is that AMD threw more R&D into this than Intel before shipping it to customers
AMD had the benefit of learning from Intel's mistakes in their first generation of AVX-512 chips. It seemed unfair to compare an Intel chip that's so old (albeit long-lasting due to Intel's scaling problems). Skylake-X chips were released in 2017! [1]
[1] https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Hi...
sure, but AMD's decision to start with a narrower datapath happened without insight from Intel's mistakes and could very well have backfired (if Intel had managed to produce a better-working implementation faster, that could've cost AMD a lot of market share). Intel had the benefit of designing the instructions along with the implementation as well, and also the choice of starting on a 2x256 datapath...
And again, yeah it's not great were it a comparison, but it really just doesn't read as a comparison to me at all. It's a reference.
AMD did not start with a narrower datapath, even if this is a widespread myth. It only had a narrower path between the inner CPU core and the L1 data cache memory.
The most recent Intel and AMD CPU cores (Lion Cove and Zen 5) have identical vector datapath widths, but for many years, for 256-bit AVX Intel had a narrower datapath than AMD, 768-bit for Intel (3 x 256-bit) vs. 1024-bit for AMD (4 x 256-bit).
Only when executing 512-bit AVX-512 instructions was the vector datapath of Intel extended to 1024 bits (2 x 512-bit), matching the datapath used by AMD for any vector instructions.
There were only 2 advantages of Intel AVX-512 vs. AMD executing AVX or the initial AVX-512 implementation of Zen 4.
The first was that some Intel CPU models, but only the more expensive SKUs, i.e. most of the Gold and all of the Platinum, had 2 x 512-bit FMA units, while the cheap Intel CPUs and AMD Zen 4 had only one 512-bit FMA unit (but AMD Zen 4 still had 2 x 512-bit FADD units). Therefore Intel could do 2 FMUL or FMA per clock cycle, while Zen 4 could do only 1 FMUL or FMA (+ 1 FADD).
The second was that Intel had a double width link to the L1 cache, so it could do 2 x 512-bit loads + 1 x 512-bit stores per clock cycle, while Zen 4 could do only 1 x 512-bit loads per cycle + 1 x 512-bit stores every other cycle. (In a balanced CPU core design the throughput for vector FMA and for vector loads from the L1 cache must be the same, which is true for both old and new Intel and AMD CPU cores.)
With the exception of vector load/store and FMUL/FMA, Zen 4 had the same or better AVX-512 throughput for most instructions, in comparison with Intel Sapphire Rapids/Emerald Rapids. There were a few instructions with a poor implementation on Zen 4 and a few instructions with a poor implementation on Intel, where either Intel or AMD were significantly better than the other.
> AMD did not start with a narrower datapath, even if this is a widespread myth. It only had a narrower path between the inner CPU core and the L1 data cache memory.
https://www.mersenneforum.org/node/21615#post614191
"Thus as many of us predicted, 512-bit instructions are split into 2 x 256-bit of the same instruction. And 512-bit is always half-throughput of their 256-bit versions."
Is that wrong?
There's a lot of it being described as "double pumped" going around…
(tbh I couldn't care less about how wide the interface buses are, as long as they can deliver in sum total a reasonable bandwidth at a reasonable latency… especially on the further out cache hierarchies the latency overshadows the width so much it doesn't matter if it comes down to 1×512 or 2×256. The question at hand here is the total width of the ALUs and effective IPC.)
Sorry, but you did not read carefully that good article and you did not read the AMD documentation and the Intel documentation.
The AMD Zen cores had for several generations, until Zen 4, 4 (four) vector execution units with a width of 256 bits, i.e. a total datapath width of 1024 bits.
On a 1024-bit datapath, you can execute either four 256-bit instructions per clock cycle or two 512-bit instructions per clock cycle.
While the number of instructions executed per cycle varies, the data processing throughput is the same, 1024 bits per clock cycle, as determined by the datapath width.
The use of the word "double-pumped" by the AMD CEO has been a very unfortunate choice, because it has been completely misunderstood by most people, who have never read the AMD technical documentation and who have never tested the behavior of the micro-architecture of the Zen CPU cores.
On Zen 4, the advantage of using AVX-512 does not come from a different throughput, but from a better instruction set and from avoiding bottlenecks in the CPU core front-end, at instruction fetching, decoding, renaming and dispatching.
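To illustrate that front-end point with a toy example (mine, not from the article): processing the same 32 floats takes half as many instructions through fetch/decode/rename/dispatch in the 512-bit form, even though on Zen 4 both forms end up on the same 256-bit execution hardware:

```c
/* Same work, different front-end cost (illustrative only). */
#include <immintrin.h>

/* AVX2 form: four 256-bit adds (plus twice the loads/stores) */
void add_avx2(float *d, const float *a, const float *b) {
    for (int i = 0; i < 4; i++) {
        __m256 va = _mm256_loadu_ps(a + 8 * i);
        __m256 vb = _mm256_loadu_ps(b + 8 * i);
        _mm256_storeu_ps(d + 8 * i, _mm256_add_ps(va, vb));
    }
}

/* AVX-512 form: two 512-bit adds cover the same 32 floats */
void add_avx512(float *d, const float *a, const float *b) {
    for (int i = 0; i < 2; i++) {
        __m512 va = _mm512_loadu_ps(a + 16 * i);
        __m512 vb = _mm512_loadu_ps(b + 16 * i);
        _mm512_storeu_ps(d + 16 * i, _mm512_add_ps(va, vb));
    }
}
```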
On the Intel P cores before Lion Cove, the datapath for 256-bit instructions had a width of 768 bits, as they had three 256-bit execution units. For most 256-bit instructions the throughput was 768 bits per clock cycle. However, the three execution units were not identical, so some 256-bit instructions had a throughput of only 512 bits per cycle.
When the older Intel P cores executed 512-bit instructions, the instructions with a 512 bit/cycle throughput remained at that throughput, but most of the instructions with a 768 bit/cycle throughput had their throughput increased to 1024 bit/cycle, matching the AMD throughput, by using an additional 256-bit datapath section that stayed unused when executing 256-bit or narrower instructions.
While what is said above applies to most vector instructions, floating-point multiplication and FMA have different rules, because their throughput is not determined by the width of the datapath, but it may be smaller, being determined by the number of available floating-point multipliers.
Cheap Intel CPUs and AMD Zen 2/Zen 3/Zen 4 had FP multipliers with a total throughput of 512 bits of results per clock cycle, while the expensive Xeon Gold and Platinum had FP multipliers with a total throughput of 1024 bits of results per clock cycle.
The "double-pumped" term is applicable only to FP multiplication, where Zen 4 and cheap Intel CPUs require a double number of clock cycles to produce the same results as expensive Intel CPUs. It may be also applied, even if that is even less appropriate, to vector load and store, where the path to the L1 data cache was narrower in Zen 4 than in Intel CPUs.
The "double-pumped" term is not applicable to the very large number of other AVX-512 instructions, whose throughput is determined by the width of the vector datapath, not by the width of the FP multipliers or by the L1 data cache connection.
Zen 5 doubles the vector datapath width to 2048 bits, so many 512-bit AVX-512 instructions have a 2048 bit/cycle throughput, except FMUL/FMA, which have a 1024 bit/cycle throughput, determined by the width of the FP multipliers. (Because there are only 4 execution units, 256-bit instructions cannot use the full datapath.)
Intel Diamond Rapids, expected by the end of 2026, is likely to have the same vector throughput as Zen 5. Until then, the Lion Cove cores in consumer CPUs, like Arrow Lake S, Arrow Lake H and Lunar Lake, are crippled by having a half-width datapath of 1024 bits, which cannot compete with a Zen 5 executing AVX-512 instructions.
> Sorry, but you did not read carefully that good article and you did not read the AMD documentation and the Intel documentation.
I think you are the one who hasn't read documentation or tested the behavior of Zen cores. Read literally any AMD material about Zen4: it mentions that the AVX512 implementation is done over two cycles because there are 256-bit datapaths.
On page 34 of the Zen4 Software Optimization Guide[^1], it literally says:
> Because the data paths are 256 bits wide, the scheduler uses two consecutive cycles to issue a 512-bit operation.
[^1]: https://www.amd.com/content/dam/amd/en/documents/processor-t...
It is not certain if what AMD writes there is true, because it is almost impossible to determine by testing whether the 2 halves of a 512-bit instruction are executed sequentially in 2 clock cycles of the same execution unit or they are executed in the same clock cycle in 2 execution units.
Some people have attempted to test this claim of AMD by measuring instruction latencies. The results have not been clear, but they tended to support that this AMD claim is false.
Regardless of whether this AMD claim is true or false, it does not change anything for the end user.
For any relevant 512-bit instruction, there are 2 or 4 available execution units. The 512-bit instructions are split into 2 x 256-bit micro-operations, and then either 4 or 2 such micro-operations are issued simultaneously, corresponding to the total datapath width of 1024 bits, or to the partial datapath width available for a few instructions, e.g. FMUL and FMA. The result is a throughput of 1024 bits of results per clock cycle for most instructions (512 bits for FMA/FMUL), the same as for any Intel CPU supporting AVX-512 (with the exception of FMA/FMUL, where the throughput matches only the cheaper Xeon SKUs).
The throughput would be the same, i.e. of 1024 bits per cycle, regardless if what AMD said is true, i.e. that when executing 8 x 256-bit micro-operations in 2 clock cycles, the pair of micro-operations executed in the same execution unit comes from a single instruction, or if the claim is false and the pair of micro-operations executed in the same execution unit comes from 2 distinct instructions.
The throughput depends only on the total datapath width of 1024 bits and it does not depend on the details of the order in which the micro-operations are issued to the execution units.
The fact that one execution unit has a data path of 256 bits is irrelevant for the throughput of a CPU. Only the total datapath width matters.
For instance, an ARM Cortex-X4 CPU core has the datapath width for a single execution unit of only 128 bits. That does not mean that it is slower than a consumer Intel CPU core that supports only AVX, which has a datapath width for a single execution unit of 256 bits.
In fact both CPU cores have the same vector FMA throughput, because they have the same total datapath width for FMA instructions of 512 bits, i.e. 4 x 128 bits for Cortex-X4 and 2 x 256 bits for a consumer Intel P-core, e.g. Raptor Cove.
It is not enough to read the documentation if you do not think about what you read, to assess whether it is correct or not.
Technical documentation is not usually written by the engineers that have designed the device, so it frequently contains errors when the technical writer has not understood what the designers have said, or the writer has attempted to synthesize or simplify the information, but that has resulted in a changed meaning.
It doesn't really matter if the two "halves" are issued in sequence or in parallel¹; either way they use 2 "slots" of execution which are therefore not available for other use — whether that other use be parallel issue, OOE or HT². To my knowledge, AVX512 code tends to be "concentrated", there's generally not a lot of non-AVX512 code mixed in that would lead to a more even spread on resources. If that were the case, the 2-slot approach would be less visible, but that's not really in the nature of SIMD code paths.
But at the same time, 8×256bit units would be better than 4×512, as the former would allow more throughput with non-AVX512 code. But that costs other resources (and would probably also limit achievable clocks, since increasing complexity generally strains timing…) 3 or 4 units seems to be what Intel & AMD engineers decided is the best tradeoff. But all the more notable then that Zen4→Zen5 is not only a 256→512 width change but also a 3→4 unit increase³, even if the added unit is "only" a FADD one.
(I guess this is what you've been trying to argue all along. It hasn't been very clear. I'm not sure why you brought up load/store widths to begin with, and arguing "AMD didn't have a narrower datapath" isn't quite productive when the point seems to be "Intel had the same narrower datapath"?)
¹ the latency difference should be minor in context of existing pipeline depth, but of course a latency difference exists. As you note it seems not very easy to measure.
² HT is probably the least important there, though I'd also assume there are quite a few AVX512 workloads that can in fact load all cores and threads of a CPU.
³ wikipedia claims this, I'm finding conflicting information on how many pipes Zen4 had (3 or 4). [Ed.: this might be an error on wikipedia. ref. https://www.phoronix.com/image-viewer.php?id=amd-zen-5-core&... ]
Isn't it misleading to just add up the output width of all SIMD ALU pipelines and call the sum "datapath width", because you can't freely mix and match when the available ALU pipelines determine which operations you can compute at full width?
You are right that in most CPUs the 3 or 4 vector execution units are not completely identical.
Therefore some operations may use the entire datapath width, while others may use only a fraction, e.g. only a half or only two thirds or only three quarters.
However you cannot really discuss these details without listing all such instructions, i.e. reproducing the tables from the Intel or AMD optimization guides or from Agner Fog's optimization documents.
For the purpose of this discussion thread, these details are not really relevant, because for Intel and AMD the classification of the instructions is mostly the same: the cheap instructions, like addition operations, can be executed in all execution units, using the entire datapath width, while certain more expensive operations, like multiplication/division/square root/shuffle, may be done only in a subset of the execution units, so they can use only a fraction of the datapath width (but when possible they will be coupled with simple instructions using the remainder of the datapath, maintaining a total throughput equal to the datapath width).
Because most instructions are classified by cost in the same way by AMD and Intel, the throughput ratio between AMD and Intel is typically the same both for instructions using the full datapath width and for those using only a fraction.
Like I have said, with very few exceptions (including FMUL/FMA/LD/ST), the throughput for 512-bit instructions has been the same for Zen 4 and the Intel CPUs with AVX-512 support, as determined by the common 1024-bit datapath width, including for the instructions that could use only a half-width 512-bit datapath.
Wouldn't it be 1536-bit for 2 256-bit FMA/cycle, with FMA taking 3 inputs? (applies equally to both so doesn't change anything materially; And even goes back to Haswell, which too is capable of 2 256-bit FMA/cycle)
That is why I have written the throughput "for results", to clarify the meaning (the throughput for output results is determined by the number of execution units; it does not depend on the number of input operands).
The vector register file has a number of read and write ports, e.g. 10 x 512-bit read ports for recent AMD CPUs (i.e. 10 ports can provide the input operands for 2 x FMA + 2 FADD, when no store instructions are done simultaneously).
So a detailed explanation of the "datapath widths", would have to take into account the number of read and write ports, because some combinations of instructions cannot be executed simultaneously, even when there are available execution units, because the paths between the register file and the execution units are occupied.
Even more complicated: some combinations of instructions that would be prohibited by not having enough register read and write ports can actually be executed simultaneously, because there are bypass paths between the execution units that allow sharing some input operands, or using output operands directly as input operands, without passing through the register file.
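A quick sanity check on the 10-read-port figure mentioned above, taking 3 register reads per FMA and 2 per FADD (my arithmetic, ignoring stores and bypasses):

```c
/* Read ports needed for 2 x FMA + 2 x FADD in the same cycle. */
#include <stdio.h>

int main(void) {
    int fma_ops = 2, fma_reads = 3;   /* FMA: a*b + c needs 3 inputs */
    int fadd_ops = 2, fadd_reads = 2; /* FADD: a + b needs 2 inputs */
    printf("read ports needed: %d\n",
           fma_ops * fma_reads + fadd_ops * fadd_reads);  /* prints 10 */
    return 0;
}
```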
The structure of the Intel vector execution units, with 3 x 256-bit execution units, 2 of which can do FMA, goes indeed back to Haswell, as you say.
The Lion Cove core launched in 2024 is the first Intel core that uses the enhanced structure used by AMD Zen for many years, with 4 execution units, where 2 can do FMA/FMUL, but all 4 can do FADD.
Starting with the Skylake Server CPUs, the Intel CPUs with AVX-512 support retain the Haswell structure when executing 256-bit or narrower instructions, but when executing 512-bit instructions, 2 x 256-bit execution units are paired to make a 512-bit execution unit, while the third 256-bit execution unit is paired with an otherwise unused 256-bit execution unit to make a second 512-bit execution unit.
Of these 2 x 512-bit execution units, only one can do FMA. Certain Intel SKUs add a second 512-bit FMA unit, so in those both 512-bit execution units can do FMA (this fact is mentioned where applicable in the CPU descriptions from the Intel Ark site).
So the 1024-bit number is the number of vector output bits per cycle, i.e. 2×FMA+2×FADD = (2+2)×256-bit? Is the term "datapath width" used for that anywhere else? (I guess you've prefixed that with "total " in some places, which makes much more sense)
For most operations, an ALU has a width in 1-bit subunits, e.g. adders, and the same number as the number of subunits is the width in bit lines of the output path and of each of the 2 input paths that are used for most input operands. Some operations use only one input path, while others, like FMA or bit select may need 3 input paths.
The width of the datapath is normally taken to be the number of 1-bit subunits of the execution units, which is equal to the width in bit lines of the output path.
Depending on the implemented instruction set, the number of input paths having the same width as the output path may vary, e.g. either 2 or 3. In reality this is even more complicated, e.g. for 4 execution units you may have 10 input paths whose connections can be changed dynamically, so they may provide 3 input paths for some execution units and 2 input paths for other execution units, depending on what micro-operations happen to be executed there during a clock cycle. Moreover there may be many additional bypass operand paths.
Therefore, if you say that the datapath width for a single execution unit is of 256 bits, because it has 256 x 1-bit ALU subunits and 256 bit lines for output, that does not determine completely the complexity of the execution unit, because it may have a total input path width with values varying e.g. between 512 bit lines to 1024 bit lines or even more (which are selected with multiplexers).
The datapath width of a single execution unit matters very little for the performance of a CPU or GPU. What matters is the total datapath width, summed over all available execution units, which is what determines the CPU throughput when executing some program.
For AVX programs, starting with Zen 2 the AMD CPUs had a total datapath width of 1024 bits vs. 768 bits for Intel, which is why they were beating easily the Intel CPUs in AVX benchmarks.
For 512-bit AVX-512 instructions, Zen 4 and the Intel Xeon CPUs with P-cores have the same total datapath width for instructions other than FMUL/FMA/LD/ST, which has resulted in the same throughput per clock cycle for the programs that do not depend heavily on floating-point multiplications. Because Zen 4 had higher clock frequencies in power-limited conditions, Zen 4 has typically beaten the Xeons in AVX-512 benchmarks, with the exception of the programs that can use the AMX instruction set, which is not implemented yet by AMD.
The "double-pumped" term used about Zen 4 has created a lot of confusion, because it does not refer to the datapath width, but only to the number of available floating-point multipliers, which is half of that of the top models of Intel Xeons, so any FP multiplications must require a double number of clock cycles on Zen 4.
The term "double-pumped" is actually true for many models of AMD Radeon GPUs, where e.g. a 2048-bit instruction (64 wavefront size) is executed in 2 clock cycles as 2 x 1024-bit micro-operations (32 wavefront size).
On Zen 4, it is not at all certain that this is how the 512-bit instructions are executed, because unlike on Radeon, on Zen 4 there are 2 parallel execution units that can execute the instruction halves simultaneously, which results in the same throughput as when the execution is "double-pumped" in a single execution unit.
AMD also has a long history of using half-width, double-pumped SIMD implementations. It worked each time. With Zen 4 the surprise wasn't that it worked at all, but how well it worked and how little the core grew in any relevant metric to support it: size, static and dynamic power.
It makes me wonder why Intel didn't pair two efficiency cores with half width SIMD data paths per core sharing a single full width permute unit between them.
I think it's mostly the lack of comparable research other than the Skylake-X one by Travis Downs. I too would like to see how Zen 4 behaves in the situation with its double-pumping.
True, but who would bother to pay a lot of money for a CPU that is known to be inferior to alternatives, only to be able to test the details of its performance?
With prices in the range of $2,000 - $20,000 for the CPU, plus a couple grand for at least a motherboard, cooler, memory and PSU, a journalist must be very well funded to spend that much to publish one article analyzing the CPU.
I would like to read such an article or to be able to test myself such a CPU, but my curiosity is not so great as to make me spend such money.
For now, the best one can do is to examine the results of general-purpose benchmarks published on sites like:
Cascade Lake improved the situation a bit, but then you had Ice Lake where iirc the hard cutoffs were gone and you were just looking at regular power and thermal steering. IIRC, that was the first generation where we enabled AVX512 for all workloads.
I don't understand why 2x FMAs in CPU design poses such a challenge when GPUs literally have hundreds of such ALUs? Both operate at similar TDP so where's the catch? Much lower GPU clock frequency?
Zen 5 still clocks way higher than GPUs even with the penalties. Additionally, CPUs typically target much lower latency for operations even per-clock, which adds a ton of silicon cost for the same throughput, and especially so at high clock frequency.
The difficulty with transitions that Skylake-X especially suffered from just has no equivalent on GPUs; if you always stay in the transitioned-to-AVX512 state on Skylake-X, things are largely normal. GPUs just are always unconditionally in such a state, but that would be awful on CPUs, as it'd make scalar-only code (not a thing on GPUs, but the main target for CPUs) unnecessarily slow. And so Intel decided that the transitions are worth the improved clocks for code not utilizing AVX-512.
It's not the execution of FMAs that's the challenge, it's the ramp up / down.
And I assure you GPUs do have challenges with that as well. That's just less well known because (1) in GPUs, all workloads are vector workloads, and so there was never a stark contrast between scalar and vector regimes like in Intel's AVX-512 implementation and (2) GPU performance characteristics are in general less broadly known.
Yes, I agree that it was premature to say that GPUs aren't suffering from the same symptoms. There's just not enough evidence but the differences in the compute power are still large.
It's not 2 FMAs, it's AVX-512 (and going with 32-bit words) ⇒ 2*512/32 = 32 FMAs per core, 256 on an 8-core CPU. The unit counts for GPUs - depending on which number you look at - count these separately.
CPUs also have much more complicated program flow control, versatility, and AFAIK latency (⇒ flow control cost) of individual instructions. GPUs are optimized for raw calculation throughput meanwhile.
Also note that modern GPUs and CPUs don't have a clear pricing relationship anymore, e.g. a desktop CPU is much cheaper than a high-end GPU, and large server CPUs are more expensive than either.
1x 512-bit FMA or 2x 256-bit FMAs or 4x 128-bit FMAs is irrelevant here - it's still a single physical unit in a CPU that consumes 512 bits of data bandwidth. The question is why the CPU budget allows for 2x 512-bit or 4x 256-bit while the H100, for example, has 14592 FP32 CUDA cores - in AVX terminology that would translate, if I am not mistaken, to 7296x 512-bit or 14592x 256-bit FMAs per clock cycle. Even considering the obvious differences between GPUs and CPUs, this is still a large difference. Since GPU cores operate at much lower frequencies than CPU cores, that is what made me believe that's where the biggest difference comes from.
AIUI an FP32 core is only 32 bits wide, but this is outside my area of expertise really. Also note that CPUs also have additional ALUs that can't do FMAs, FMA is just the most capable one.
You're also repeating 2×512 / 4×256 — that's per core, you need to multiply by CPU core count.
[also, note e.g. an 8-core CPU is much cheaper than an H100 card ;) — if anything you'd be comparing the highest-end server CPUs here. A 192-core Zen 5c is 8.2~10.5k€ open retail, an H100 is 32~35k€…]
[reading through some random docs, a CPU core seems vaguely comparable to a SM; a SM might have 128 or 64 lanes (=FP32 cores) while a CPU only has 16 with AVX-512, but there is indeed also a notable clock difference and far more flexibility otherwise in the CPU core (which consumes silicon area)]
Nvidia calls them cores to deliberately confuse people, and make it appear vastly more powerful than it really is. What they are in reality is SIMD lanes.
So the H100 (which costs vastly more than a Zen 5..) has 14592 32-bit SIMD lanes, not cores.
A Zen 5 has 16x4 (64) 32-bit SIMD lanes per core, so scale that by core count to get your answer. A higher-end desktop Zen 5 will have 16 cores, so 64x16 = 1024. The Zen 5 also clocks much higher than the GPU, so you can also scale it up by perhaps 1.5-2x.
While this is obviously less than the H100, the Zen 5 chip costs $550 and the H100 costs $40k.
There is more to it than this: GPUs also have transcendental functions, texture sampling, and 16-bit ops (which are lacking in CPUs), while CPUs are much more flexible and have powerful byte & integer manipulation instructions, along with full-speed 64-bit integer/double support.
Thanks for the clarification on the NVidia terminology, I didn't know that. What I also found is that NVidia groups 32 SIMD lanes into what they call a warp. Then 4 warps are grouped into what they're calling a streaming multiprocessor (SM). And lastly the H100 has 114 SMs, so 4*32*114 = 14592 checks out.
> Zen 5 has 16x4(64) 32 bit SIMD lanes per core
Right, 2x FMA and 2x FADD, so the highest-end Zen 5 die with 192 cores would total 12288 32-bit SIMD lanes, or half of that if we are considering only FMA ops. This is then indeed much closer to the 14592 32-bit SIMD lanes of the H100.
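To put rough numbers on that comparison (my arithmetic; lane counts follow the thread above, the clock frequencies are assumptions, and FMA is counted as 2 FLOPs per lane per cycle):

```c
/* Very rough peak FP32 comparison; clocks are assumed, not measured. */
#include <stdio.h>

int main(void) {
    double zen_lanes  = 192.0 * 32.0;  /* 192 cores x 2 x 512-bit FMA pipes */
    double zen_clock  = 3.0e9;         /* assumed all-core AVX-512 clock */
    double h100_lanes = 14592.0;       /* FP32 CUDA cores = SIMD lanes */
    double h100_clock = 1.8e9;         /* assumed boost clock */

    printf("192-core Zen 5: ~%.0f TFLOPS FP32\n",
           zen_lanes * zen_clock * 2 / 1e12);   /* ~37 */
    printf("H100:           ~%.0f TFLOPS FP32\n",
           h100_lanes * h100_clock * 2 / 1e12); /* ~53 */
    return 0;
}
```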
There are x86 extensions for fp16/bf16 ops. e.g. both Zen 4 and Zen 5 support AVX512_BF16, which has vdpbf16ps, i.e. dot product of pairs of bf16 elements from two args; that is, takes a total of 64 bf16 elts and outputs 16 fp32 elts. Zen 5 can run two such instrs per cycle.
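For anyone curious what that looks like in code, a small hedged sketch using the corresponding intrinsics (requires AVX512_BF16 support, e.g. gcc -O2 -mavx512bf16; illustration only):

```c
#include <immintrin.h>

/* One vdpbf16ps step: each operand carries 32 bf16 elements (64 in total),
 * and the instruction accumulates 16 fp32 dot-product partial sums. */
__m512 bf16_dot_step(__m512 acc, const float *a, const float *b) {
    /* pack 2 x 16 fp32 values into one vector of 32 bf16 elements */
    __m512bh va = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(a + 16), _mm512_loadu_ps(a));
    __m512bh vb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(b + 16), _mm512_loadu_ps(b));
    return _mm512_dpbf16_ps(acc, va, vb);
}
```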
Like another poster already said, the power budget of a consumer CPU like the 9950X, which executes programs at about double the clock frequency of a GPU, allows for 16 cores x 2 execution units x 16 lanes = 512 FP32 FMA per clock cycle. That provides the same throughput as an iGPU doing 1024 FP32 FMA per clock cycle in the best laptop CPUs, while consuming 3 times less power than a datacenter GPU, so in power budget and performance it is like a datacenter GPU with 3072 FP32 FMA per clock cycle.
However, because of its high clock frequency a consumer CPU has high performance per dollar, but low performance per watt.
Server CPUs with many cores have much better energy efficiency, e.g. around 3 times higher than a desktop CPU and about the same as the most efficient laptop CPUs. For many generations of NVIDIA GPUs and Intel Xeon CPUs, until about 5-6 years ago, the ratio between their floating-point FMA throughput per watt was only about 3.
This factor of 3 is mainly due to the overhead of various tricks used by CPUs to extract instruction-level parallelism from programs that do not use enough concurrent threads or array operations, e.g. superscalar out-of-order execution, register renaming, etc.
In recent years, starting with NVIDIA Volta, followed later by AMD and Intel GPUs, the GPUs have made a jump in performance that has increased the gap between their throughput and that of CPUs, by supplementing the vector instructions with matrix instructions, i.e. what NVIDIA calls tensor instructions.
However this current greater gap in performance between CPUs and GPUs could easily be removed and the performance per watt ratio could be brought back to a factor unlikely to be greater than 3, by adding matrix instructions to the CPUs.
Intel has introduced the AMX instruction set, besides AVX, but for now it is supported only in expensive server CPUs and Intel has defined only instructions for low-precision operations used for AI/ML. If AMX were extended with FP32 and FP64 operations, then the performance would be much more competitive with GPUs.
ARM is more advanced in this direction, with SME (Scalable Matrix Extension) defined besides SVE (Scalable Vector Extension). SME is already available in recent Apple CPUs and it is expected to also be available in the new Arm cores that will be announced a few months from now, which should appear in the smartphones of 2026, and presumably also in future Arm-based CPUs for servers and laptops.
The current Apple CPUs do not have strong SME accelerators, because they also have an iGPU that can perform the operations whose latency is less important.
On the other hand, an Arm-based server CPU could have a much bigger SME accelerator, providing a performance much closer to a GPU.
I appreciate the response with a lot of interesting details; however, I don't believe it answers the question I had. My question was why the CPU design suffers from clock frequency issues in AVX-512 workloads, whereas GPUs, which have much more compute power, do not.
I assumed that it was due to the fact that GPUs run at much lower clock frequencies and therefore have more power budget available, but as I also discussed with another commenter above, this was probably a premature conclusion, since we don't have enough evidence showing that GPUs indeed do not suffer from the same type of issues. They likely do, but nobody has measured it yet?
The low clock frequency when executing AVX-512 workloads is a frequency where the CPU operates efficiently, with a low energy consumption per operation executed.
For such a workload that executes a very large number of operations per second, the CPU cannot afford to operate inefficiently because it will overheat.
When a CPU core has many execution units that are idle, and thus not consuming power, as when executing only scalar operations or only operations on narrow 128-bit vectors, it can afford to raise the clock frequency by e.g. 50%, even if that increases the energy consumption per operation e.g. 3 times. By executing 4 or 8 times fewer operations per clock cycle, even with 3 times higher energy per operation, the total power consumption is smaller, so the CPU does not overheat. The desktop owner does not care that completing the same workload requires much more energy, because the owner likely cares more about the time to completion.
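Plugging the illustrative ratios from that paragraph into the usual relation power ≈ ops/cycle × energy/op × frequency (my numbers, just to show why the scalar-ish case still fits the budget):

```c
/* Illustration only, using the ratios above: +50% clock, 3x energy per op,
 * 8x fewer operations per cycle. */
#include <stdio.h>

int main(void) {
    double vector_case = 1.0;                     /* relative power baseline */
    double scalar_case = (1.0 / 8.0) * 3.0 * 1.5; /* fewer ops x costlier ops x higher clock */
    printf("scalar-ish case: ~%.2fx the vector-case power\n",
           scalar_case / vector_case);            /* ~0.56x */
    return 0;
}
```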
The clock frequency of a GPU also varies continuously depending on the workload, in order to maintain the power consumption within the limits. However a GPU is not designed to be able to increase the clock frequency as much as a CPU. The fastest GPUs have clock frequencies under 3 GHz, while the fastest CPUs exceed 6 GHz.
The reason is that normally one never launches a GPU program that would use only a small fraction of the resources of a GPU allowing a higher clock frequency, so it makes no sense to design a GPU for this use case.
Designing a chip for a higher clock frequency greatly increases the size of the chip, as shown by the comparison between a normal Zen core designed for 5.7 GHz and a Zen compact core, designed e.g. for 3.3 GHz, a frequency not much higher than that of a GPU.
On Zen compact cores and on normal Zen cores configured for server CPUs with a large number of cores, e.g. 128 cores (with a total of 4096 FP32 ALUs, like a low-to-mid-range desktop GPU, or like a top desktop GPU of 5 years ago; a Zen compact server CPU can have 6144 FP32 ALUs, more than a RTX 4070), the clock frequency variation range is small, very similar to the clock variation range of a GPU.
In conclusion, it is not the desktop/laptop CPUs which drop their clock frequency, but it is the GPUs which never raise their clock frequency much, the same as the server CPUs, because neither GPUs nor server CPUs are normally running programs that keep most of their execution units idle, to allow higher clock frequencies without overheating.
I'm curious how changing some OC parameters would affect those results:
if it's caused by voltage drop, how does load line calibration affect it?
if it's a power constraint, then how would PBO affect it?
Nothing to fix here. The behavior in the transition regimes is already quite good.
The overall throttling is dynamic and reactive based on heat and power draw - this is unavoidable and in fact desirable (the alternative is to simply run slower all the time, not to somehow be immune to physics and run faster all the time)
It's interesting that Zen5's FPUs running in full 512bit wide mode doesn't actually seem to cause any trouble, but that lighting up the load store units does. I don't know enough about hardware-level design to know if this would be "expected".
The fully investigation in this article is really interesting, but the TL;DR is: Light up enough of the core, and frequencies will have to drop to maintain power envelope. The transition period is done very smartly, but it still exists - but as opposed to the old intel avx512 cores that got endless (deserved?) bad press for their transition behavior, this is more or less seamless.
On the L/S unit impact: data movement is expensive, computation is cheap (relatively).
In "Computer Architecture, A Quantitative Approach" there are numbers for the now old TSMC 45nm process: A 32 bits FP multiplication takes 3.7 pJ, and a 32 bits SRAM read from an 8 kB SRAM takes 5 pJ. This is a basic SRAM, not a cache with its tag comparison and LRU logic (more expansive).
Then I have some 2015 numbers for Intel 22nm process, old too. A 64 bits FP multiplication takes 6.4 pJ, a 64 bits read/write from a small 8 kB SRAM 4.2 pJ, and from a larger 256 kB SRAM 16.7 pJ. Basic SRAM here too, not a more expansive cache.
The cost of a multiplication is quadratic, and it should be more linear for access, so the computation cost in the second example is much heavier (compare the mantissa sizes, that's what is multiplied).
The trend gets even worse with more advanced processes. Data movement is usually what matters the most now, expect for workloads with very high arithmetic intensity where computation will dominate (in practice: large enough matrix multiplications).
Appreciate the detail! That explains a lot of what is going on.. It also dovetails with some interesting facts I remember reading about the relative power consumption for the zen cores versus the infinity fabric connecting them - The percentage of package power usage simply from running the fabric interconnect was shocking.
Right, but a SIMD single precision mul is linear (or even sub linear) relative to it's scalar cousin. So a 16x32, 512-bit MUL won't be even 16x the cost of a scalar mul, the decoder has to do only the same amount of work for example.
The calculations within each unit may be, true, but routing and data transfer is probably the biggest limiting factor on a modern chip. It should be clear that placing 16x units of non-trivial size means that the average will likely be further away from the data source than a single unit, and transmitting data over distances can have greater-than-linear increasing costs (not just resistance/capacitance losses, but to hit timing targets you need faster switching, which means higher voltages etc.)
Both Intel and AMD to some extent separate the vector ALUs and the register file in 128-bit (or 256-bit?) lanes, across which arithmetic ops won't need to cross at all. Of course loads/stores/shuffles still need to though, making this point somewhat moot.
AFAIK you have to think about how many different 512b paths are being driven when this happens, like each cycle in the steady-state case is simultaneously (in the case where you can do two vfmadd132ps per cycle):
- Capturing 2x512b from the L1D cache
- Sending 2x512b to the vector register file
- Capturing 4x512b values from the vector register file
- Actually multiplying 4x512b values
- Sending 2x512b results to the vector register file
.. and probably more?? That's already like 14*512 wires [switching constantly at 5Ghz!!], and there are probably even more intermediate stages?
… per core. There are eight per compute tile!
I like to ask IT people a trick question: how many numbers can a modern CPU multiply in the time it takes light to cross a room?
Random logic had also much better area scaling than SRAM since EUV which implies that gap continues to widen at a faster rate.
> but as opposed to the old intel avx512 cores that got endless (deserved?) bad press for their transition behavior, this is more or less seamless.
The problem with Intel was, the AVX frequencies were secrets. They were never disclosed in later cores where power envelope got tight, and using AVX-512 killed performance throughout the core. This meant that if there was a core using AVX-512, any other cores in the same socket throttled down due to thermal load and power cap on the core. This led to every process on the same socket to suffer. Which is a big no-no for cloud or HPC workloads where nodes are shared by many users.
Secrecy and downplaying of this effect made Intel's AVX-512 frequency and behavior infamous.
Oh, doing your own benchmarks on your own hardware which you paid for and releasing the results to the public was verboten, btw.
> Oh, doing your own benchmarks on your own hardware which you paid for and releasing the results to the public was verboten, btw.
Well, Cloudflare did anyway.
To be clear, the problem with the Skylake implementation was that triggering AVX-512 would downclock the entirety of the CPU. It didn’t do anything smart, it was fairly binary.
This AMD implementation instead seems to be better optimized and plug into the normal thermal operations of the CPU for better scaling.
Reading the section under "Load Another FP Pipe?" I'm coming away with the impression that it's not the LSU but rather total overall load that causes trouble. While that section is focused on transition time, the end steady state is also slower…
I haven’t read the article yet, but back when I tried to get to over 100 GB/s IO rate from a bunch of SSDs on Zen4 (just fio direct IO workload without doing anything with the data), I ended up disabling Core Boost states (or maybe something else in BIOS too), to give more thermal allowance for the IO hub on the chip. As RAM load/store traffic goes through the IO hub too, maybe that’s it?
I don't think these things are related, this is talking about the LSU right inside the core. I'd also expect oscillations if there were a thermal problem like you're describing, i.e. core clocks up when IO hub delivers data, IO hub stalls, causes core to stall as well, IO hub can run again delivering data, repeat from beginning.
(Then again, boost clocks are an intentional oscillation anyway…)
Ok, I just read through the article. As I understand, their tests were designed to run entirely on data on the local cores' cache? I only see L1d mentioned there.
Yes, that's my understanding of "Zen 5 also doubles L1D load bandwidth, and I’m exercising that by having each FMA instruction source an input from the data cache." Also, considering the author's other work, I'm pretty sure they can isolate load-store performance from cache performance from memory interface performance.
It seems even more interesting than the power envelope. It looks like the core is limited by the ability of the power supply to ramp up. So the dispatch rate drops momentarily and then goes back up to allow power delivery to catch up.
I find it irritating that they are comparing clock scaling to the venerable Skylake-X. Surely Sapphire Rapids has been out for almost 2 years by now.
Seemed appropriate to me as comparing the "first core to use full-width AVX-512 datapaths"; my interpretation is that AMD threw more R&D into this than Intel before shipping it to customers…
(It's also not really a comparative article at all? Skylake-X is mostly just introduction…)
> my interpretation is that AMD threw more R&D into this than Intel before shipping it to customers
AMD had the benefit of learning from Intel's mistakes in their first generation of AVX-512 chips. It seemed unfair to compare an Intel chip that's so old (albeit long-lasting due to Intel's scaling problems). Skylake-X chips were released in 2017! [1]
[1] https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Hi...
sure, but AMD's decision to start with a narrower datapath happened without insight from Intel's mistakes and could very well have backfired (if Intel had managed to produce a better-working implementation faster, that could've cost AMD a lot of market share). Intel had the benefit of designing the instructions along with the implementation as well, and also the choice of starting on a 2x256 datapath...
And again, yeah it's not great were it a comparison, but it really just doesn't read as a comparison to me at all. It's a reference.
AMD did not start with a narrower datapath, even if this is a widespread myth. It only had a narrower path between the inner CPU core and the L1 data cache memory.
The most recent Intel and AMD CPU cores (Lion Cove and Zen 5) have identical vector datapath widths, but for many years, for 256-bit AVX Intel had a narrower datapath than AMD, 768-bit for Intel (3 x 256-bit) vs. 1024-bit for AMD (4 x 256-bit).
Only when executing 512-bit AVX-512 instructions, the vector datapath of Intel was extended to 1024-bit (2 x 512-bit), matching the datapath used by AMD for any vector instructions.
There were only 2 advantages of Intel AVX-512 vs. AMD executing AVX or the initial AVX-512 implementation of Zen 4.
The first was that some Intel CPU models, but only the more expensive SKUs, i.e. most of the Gold and all of the Platinum, had 2 x 512-bit FMA units, while the cheap Intel CPUs and AMD Zen 4 had only one 512-bit FMA unit (but AMD Zen 4 still had 2 x 512-bit FADD units). Therefore Intel could do 2 FMUL or FMA per clock cycle, while Zen 4 could do only 1 FMUL or FMA (+ 1 FADD).
The second was that Intel had a double width link to the L1 cache, so it could do 2 x 512-bit loads + 1 x 512-bit stores per clock cycle, while Zen 4 could do only 1 x 512-bit loads per cycle + 1 x 512-bit stores every other cycle. (In a balanced CPU core design the throughput for vector FMA and for vector loads from the L1 cache must be the same, which is true for both old and new Intel and AMD CPU cores.)
With the exception of vector load/store and FMUL/FMA, Zen 4 had the same or better AVX-512 throughput for most instructions, in comparison with Intel Sapphire Rapids/Emerald Rapids. There were a few instructions with a poor implementation on Zen 4 and a few instructions with a poor implementation on Intel, where either Intel or AMD were significantly better than the other.
> AMD did not start with a narrower datapath, even if this is a widespread myth. It only had a narrower path between the inner CPU core and the L1 data cache memory.
https://www.mersenneforum.org/node/21615#post614191
"Thus as many of us predicted, 512-bit instructions are split into 2 x 256-bit of the same instruction. And 512-bit is always half-throughput of their 256-bit versions."
Is that wrong?
There's a lot of it being described as "double pumped" going around…
(tbh I couldn't care less about how wide the interface buses are, as long as they can deliver in sum total a reasonable bandwidth at a reasonable latency… especially on the further out cache hierarchies the latency overshadows the width so much it doesn't matter if it comes down to 1×512 or 2×256. The question at hand here is the total width of the ALUs and effective IPC.)
Sorry, but you did not read carefully that good article and you did not read the AMD documentation and the Intel documentation.
The AMD Zen cores had for several generations, until Zen 4, 4 (four) vector execution units with a width of 256 bits, i.e. a total datapath width of 1024 bits.
On an 1024-bit datapath, you can execute either four 256-bit instructions per clock cycle or two 512-bit instructions per clock cycle.
While the number of instructions executed per cycle varies, the data processing throughput is the same, 1024 bits per clock cycle, as determined by the datapath width.
The use of the word "double-pumped" by the AMD CEO has been a very unfortunate choice, because it has been completely misunderstood by most people, who have never read the AMD technical documentation and who have never tested the behavior of the micro-architecture of the Zen CPU cores.
On Zen 4, the advantage of using AVX-512 is not from a different throughput, but it is caused by a better instruction set and by the avoiding of bottlenecks in the CPU core front-end, at instruction fetching, decoding, renaming and dispatching.
On the Intel P cores before Lion Cove, the datapath for 256-bit instructions had a width of 768 bits, as they had three 256-bit execution units. For most 256-bit instructions the throughput was of 768 bits per clock cycle. However the three execution units were not identical, so some 256-bit instructions had only a throughput of 512 bits per cycle.
When the older Intel P cores executed 512-bit instructions, the instructions with a 512 bit/cycle throughput remained at that throughput, but most of the instructions with a 768 bit/cycle throughput had their throughput increased to 1024 bit/cycle, matching the AMD throughput, by using an additional 256-bit datapath section that stayed unused when executing 256-bit or narrower instructions.
While what is said above applies to most vector instructions, floating-point multiplication and FMA have different rules, because their throughput is not determined by the width of the datapath, but it may be smaller, being determined by the number of available floating-point multipliers.
Cheap Intel CPUs and AMD Zen 2/Zen 3/Zen 4 had FP multipliers with a total throughput of 512 bits of results per clock cycle, while the expensive Xeon Gold and Platinum had FP multipliers with a total throughput of 1024 bits of results per clock cycle.
The "double-pumped" term is applicable only to FP multiplication, where Zen 4 and cheap Intel CPUs require a double number of clock cycles to produce the same results as expensive Intel CPUs. It may be also applied, even if that is even less appropriate, to vector load and store, where the path to the L1 data cache was narrower in Zen 4 than in Intel CPUs.
The "double-pumped" term is not applicable to the very large number of other AVX-512 instructions, whose throughput is determined by the width of the vector datapath, not by the width of the FP multipliers or by the L1 data cache connection.
Zen 5 doubles the vector datapath width to 2048 bits, so many 512-bit AVX-512 instructions have a 2048 bit/cycle throughput, except FMUL/FMA, which have an 1024 bit/cycle throughput, determined by the width of the FP multipliers. (Because there are only 4 execution units, 256-bit instructions cannot use the full datapath.)
Intel Diamond Rapids, expected by the end of 2026, is likely to have the same vector throughput as Zen 5. Until then, the Lion Cove cores from consumer CPUs, like Arrow Lake S, Arrow Lake H and Lunar Lake, are crippled, by having a half-width datapath of 1024 bits, which cannot compete with a Zen 5 that executes AVX-512 instructions.
> Sorry, but you did not read carefully that good article and you did not read the AMD documentation and the Intel documentation.
I think you are the one who hasn't read documentation or tested the behavior of Zen cores. Read literally any AMD material about Zen4: it mentions that the AVX512 implementation is done over two cycles because there are 256-bit datapaths.
On page 34 of the Zen4 Software Optimization Guide[^1], it literally says:
> Because the data paths are 256 bits wide, the scheduler uses two consecutive cycles to issue a 512-bit operation.
[^1]: https://www.amd.com/content/dam/amd/en/documents/processor-t...
It is not certain if what AMD writes there is true, because it is almost impossible to determine by testing whether the 2 halves of a 512-bit instruction are executed sequentially in 2 clock cycles of the same execution unit or they are executed in the same clock cycle in 2 execution units.
Some people have attempted to test this claim of AMD by measuring instruction latencies. The results have not been clear, but they tended to support that this AMD claim is false.
Regardless whether this AMD claim is true or false, this does not change anything for the end user.
For any relevant 512-bit instruction, there are 2 or 4 available execution units. The 512-bit instructions are split into 2 x 256-bit micro-operations, and then either 4 or 2 such micro-operations are issued simultaneously, corresponding to the total datapath width of 1024 bits, or to the partial datapath width available for a few instructions, e.g. FMUL and FMA, resulting in a throughput of 1024 bits of results per clock cycle for most instructions (512 bits for FMA/FMUL), the same as for any Intel CPU supporting AVX-512 (with the exception of FMA/FMUL, where the throughput matches only the cheaper Xeon SKUs).
The throughput would be the same, i.e. of 1024 bits per cycle, regardless if what AMD said is true, i.e. that when executing 8 x 256-bit micro-operations in 2 clock cycles, the pair of micro-operations executed in the same execution unit comes from a single instruction, or if the claim is false and the pair of micro-operations executed in the same execution unit comes from 2 distinct instructions.
The throughput depends only on the total datapath width of 1024 bits and it does not depend on the details of the order in which the micro-operations are issued to the execution units.
The fact that one execution unit has a data path of 256 bits is irrelevant for the throughput of a CPU. Only the total datapath width matters.
For instance, an ARM Cortex-X4 CPU core has the datapath width for a single execution unit of only 128 bits. That does not mean that it is slower than a consumer Intel CPU core that supports only AVX, which has a datapath width for a single execution unit of 256 bits.
In fact both CPU cores have the same vector FMA throughput, because they have the same total datapath width for FMA instructions of 512 bits, i.e. 4 x 128 bits for Cortex-X4 and 2 x 256 bits for a consumer Intel P-core, e.g. Raptor Cove.
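As a back-of-the-envelope check of that (my own arithmetic; the clock values are illustrative assumptions, not specs, and the per-cycle lane count is the point being made above):

    /* Peak FP32 FMA rate from "total datapath width" = units x per-unit width. */
    #include <stdio.h>

    static void peak(const char *name, int fma_units, int unit_bits, double ghz) {
        int lanes = fma_units * unit_bits / 32;       /* FP32 FMA lanes per cycle */
        printf("%-26s %2d lanes/cycle, ~%5.0f GFLOP/s at %.1f GHz (assumed)\n",
               name, lanes, lanes * 2.0 * ghz, ghz);  /* FMA = 2 FLOPs per lane */
    }

    int main(void) {
        peak("Cortex-X4, 4 x 128-bit",   4, 128, 3.3);
        peak("Raptor Cove, 2 x 256-bit", 2, 256, 5.5);
        return 0;
    }

Both come out at 16 FP32 FMA lanes per cycle; only the (assumed) clock frequency separates them in GFLOP/s.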
It is not enough to read the documentation if you do not think about what you read, to assess whether it is correct or not.
Technical documentation is usually not written by the engineers who designed the device, so it frequently contains errors, either because the technical writer has not understood what the designers said, or because the writer has tried to synthesize or simplify the information in a way that changed its meaning.
It doesn't really matter if the two "halves" are issued in sequence or in parallel¹; either way they use 2 "slots" of execution which are therefore not available for other use — whether that other use be parallel issue, OOE or HT². To my knowledge, AVX512 code tends to be "concentrated", there's generally not a lot of non-AVX512 code mixed in that would lead to a more even spread on resources. If that were the case, the 2-slot approach would be less visible, but that's not really in the nature of SIMD code paths.
But at the same time, 8×256-bit units would be better than 4×512, as the former would allow more throughput with non-AVX512 code. But that costs other resources (and would probably also limit achievable clocks, since increasing complexity generally strains timing…) 3 or 4 units seems to be what Intel & AMD engineers decided is the best tradeoff. But all the more notable then that Zen4→Zen5 is not only a 256→512 width change but also a 3→4 unit increase³, even if the added unit is "only" a FADD one.
(I guess this is what you've been trying to argue all along. It hasn't been very clear. I'm not sure why you brought up load/store widths to begin with, and arguing "AMD didn't have a narrower datapath" isn't quite productive when the point seems to be "Intel had the same narrower datapath"?)
¹ the latency difference should be minor in context of existing pipeline depth, but of course a latency difference exists. As you note it seems not very easy to measure.
² HT is probably the least important there, though I'd also assume there are quite a few AVX512 workloads that can in fact load all cores and threads of a CPU.
³ wikipedia claims this, I'm finding conflicting information on how many pipes Zen4 had (3 or 4). [Ed.: this might be an error on wikipedia. ref. https://www.phoronix.com/image-viewer.php?id=amd-zen-5-core&... ]
Isn't it misleading to just add up the output width of all SIMD ALU pipelines and call the sum "datapath width", because you can't freely mix and match when the available ALU pipelines determine which operations you can compute at full width?
You are right that in most CPUs the 3 or 4 vector execution units are not completely identical.
Therefore some operations may use the entire datapath width, while others may use only a fraction, e.g. only a half or only two thirds or only three quarters.
However you cannot really discuss these details without listing all such instructions, i.e. reproducing the tables from the Intel or AMD optimization guides or from Agner Fog's optimization documents.
For the purpose of this discussion thread, these details are not really relevant, because Intel and AMD classify the instructions mostly the same way: cheap instructions, like additions, can be executed in all execution units, using the entire datapath width, while certain more expensive operations, like multiplication/division/square root/shuffle, may be done only in a subset of the execution units, so they can use only a fraction of the datapath width (though when possible they will be paired with simple instructions using the remainder of the datapath, maintaining a total throughput equal to the datapath width).
Because most instructions are classified by cost in the same way by AMD and Intel, the throughput ratio between AMD and Intel is typically the same both for instructions using the full datapath width and for those using only a fraction.
Like I have said, with very few exceptions (including FMUL/FMA/LD/ST), the throughput for 512-bit instructions has been the same for Zen 4 and the Intel CPUs with AVX-512 support, as determined by the common 1024-bit datapath width, including for the instructions that could use only a half-width 512-bit datapath.
Wouldn't it be 1536-bit for 2 256-bit FMA/cycle, with FMA taking 3 inputs? (applies equally to both so doesn't change anything materially; And even goes back to Haswell, which too is capable of 2 256-bit FMA/cycle)
That is why I have written the throughput "for results", to clarify the meaning (the throughput for output results is determined by the number of execution units; it does not depend on the number of input operands).
The vector register file has a number of read and write ports, e.g. 10 x 512-bit read ports for recent AMD CPUs (i.e. 10 ports can provide the input operands for 2 x FMA + 2 FADD, when no store instructions are done simultaneously).
So a detailed explanation of the "datapath widths" would have to take into account the number of read and write ports, because some combinations of instructions cannot be executed simultaneously, even when execution units are available, because the paths between the register file and the execution units are occupied.
To complicate things further, some combinations of instructions that would be prohibited by not having enough register read and write ports can actually be executed simultaneously, because there are bypass paths between the execution units that allow sharing some input operands, or using output operands directly as input operands, without passing through the register file.
The structure of the Intel vector execution units, with 3 x 256-bit execution units, 2 of which can do FMA, goes indeed back to Haswell, as you say.
The Lion Cove core launched in 2024 is the first Intel core that uses the enhanced structure used by AMD Zen for many years, with 4 execution units, where 2 can do FMA/FMUL, but all 4 can do FADD.
Starting with the Skylake Server CPUs, the Intel CPUs with AVX-512 support retain the Haswell structure when executing 256-bit or narrower instructions, but when executing 512-bit instructions, 2 x 256-bit execution units are paired to make a 512-bit execution unit, while the third 256-bit execution unit is paired with an otherwise unused 256-bit execution unit to make a second 512-bit execution unit.
Of these 2 x 512-bit execution units, only one can do FMA. Certain Intel SKUs add a second 512-bit FMA unit, so in those both 512-bit execution units can do FMA (this fact is mentioned where applicable in the CPU descriptions from the Intel Ark site).
So the 1024-bit number is the number of vector output bits per cycle, i.e. 2×FMA+2×FADD = (2+2)×256-bit? Is the term "datapath width" used for that anywhere else? (I guess you've prefixed that with "total " in some places, which makes much more sense)
"Datapath width" is somewhat ambiguous.
For most operations, an ALU has a width in 1-bit subunits, e.g. adders, and the same number as the number of subunits is the width in bit lines of the output path and of each of the 2 input paths that are used for most input operands. Some operations use only one input path, while others, like FMA or bit select may need 3 input paths.
The width of the datapath is normally taken to be the number of 1-bit subunits of the execution units, which is equal to the width in bit lines of the output path.
Depending on the implemented instruction set, the number of input paths having the same width as the output path may vary, e.g. either 2 or 3. In reality this is even more complicated, e.g. for 4 execution units you may have 10 input paths whose connections can be changed dynamically, so they may provide 3 input paths for some execution units and 2 input paths for other execution units, depending on which micro-operations happen to be executed there during a clock cycle. Moreover there may be many additional bypass operand paths.
Therefore, if you say that the datapath width for a single execution unit is 256 bits, because it has 256 x 1-bit ALU subunits and 256 bit lines for output, that does not completely determine the complexity of the execution unit, because it may have a total input path width varying e.g. between 512 bit lines and 1024 bit lines or even more (selected with multiplexers).
The datapath width for a single execution unit matters very little for the performance of a CPU or GPU. What matters is the total datapath width, summed over all available execution units, which is what determines the CPU throughput when executing a program.
For AVX programs, starting with Zen 2 the AMD CPUs had a total datapath width of 1024 bits vs. 768 bits for Intel, which is why they easily beat the Intel CPUs in AVX benchmarks.
For 512-bit AVX-512 instructions, Zen 4 and the Intel Xeon CPUs with P-cores have the same total datapath width for instructions other than FMUL/FMA/LD/ST, which has resulted in the same throughput per clock cycle for the programs that do not depend heavily on floating-point multiplications. Because Zen 4 had higher clock frequencies in power-limited conditions, Zen 4 has typically beaten the Xeons in AVX-512 benchmarks, with the exception of the programs that can use the AMX instruction set, which is not implemented yet by AMD.
The "double-pumped" term used about Zen 4 has created a lot of confusion, because it does not refer to the datapath width, but only to the number of available floating-point multipliers, which is half of that of the top models of Intel Xeons, so any FP multiplications must require a double number of clock cycles on Zen 4.
The term "double-pumped" is actually true for many models of AMD Radeon GPUs, where e.g. a 2048-bit instruction (64 wavefront size) is executed in 2 clock cycles as 2 x 1024-bit micro-operations (32 wavefront size).
On Zen 4, it is not at all certain that this is how the 512-bit instructions are executed, because unlike on Radeon, on Zen 4 there are 2 parallel execution units that can execute the instruction halves simultaneously, which results in the same throughput as when the execution is "double-pumped" in a single execution unit.
oops, haswell has only 3 SIMD ALUs, i.e. 768 bits of output per cycle, not 1024.
AMD also has a long history of using half-width, double-pumped SIMD implementations. It worked each time. With Zen 4 the surprise wasn't that it worked at all, but how well it worked and how little the core grew in any relevant metric to support it: size, static and dynamic power.
It makes me wonder why Intel didn't pair two efficiency cores with half width SIMD data paths per core sharing a single full width permute unit between them.
I think it's mostly the lack of comparable research other than the Skylake-X one by Travis Downs. I too would like to see how Zen 4 behaves in the situation with its double-pumping.
True, but who would bother to pay a lot of money for a CPU that is known to be inferior to the alternatives, only to be able to test the details of its performance?
A well funded investigative tech journalist?
With prices in the range $2,000 - $20,000 for the CPU, plus a couple of grand for at least a motherboard, cooler, memory and PSU, the journalist must be very well funded to spend that much just to publish one article analyzing the CPU.
I would like to read such an article or to be able to test myself such a CPU, but my curiosity is not so great as to make me spend such money.
For now, the best one can do is to examine the results of general-purpose benchmarks published on sites like:
https://www.servethehome.com/
https://www.phoronix.com/
where Intel, AMD or Ampere send some review systems with Xeon, Epyc or Arm-based server CPUs.
These sites are useful, but a more thorough micro-architectural investigation would have been nice.
AWS, dedicated EC2 instance. A few dollars an hour.
Depending on which CPU category. I think Intel HEDT stops at Cascade Lake, which is essentially Skylake-X Refresh from 2019?
Whereas AMD has full-fat AVX512 even in gaming laptop CPUs.
Cascade Lake improved the situation a bit, but then you had Ice Lake where iirc the hard cutoffs were gone and you were just looking at regular power and thermal steering. IIRC, that was the first generation where we enabled AVX512 for all workloads.
I don't understand why 2x FMAs in CPU design poses such a challenge when GPUs literally have hundreds of such ALUs? Both operate at similar TDP so where's the catch? Much lower GPU clock frequency?
Zen 5 still clocks way higher than GPUs even with the penalties. Additionally, CPUs typically target much lower latency for operations even per-clock, which adds a ton of silicon cost for the same throughput, and especially so at high clock frequency.
The difficulty with transitions that Skylake-X suffered from especially just has no equivalent on GPUs; if you always stay in the transitioned-to-AVX512 state on Skylake-X, things are largely normal. GPUs are simply always in such a state, but that would be awful on CPUs, as it would make scalar-only code (not a thing on GPUs, but the main target for CPUs) unnecessarily slow. And so Intel decided that the transitions are worth the improved clocks for code not utilizing AVX-512.
It's not the execution of FMAs that's the challenge, it's the ramp up / down.
And I assure you GPUs do have challenges with that as well. That's just less well known because (1) in GPUs, all workloads are vector workloads, and so there was never a stark contrast between scalar and vector regimes like in Intel's AVX-512 implementation and (2) GPU performance characteristics are in general less broadly known.
Yes, I agree that it was premature to say that GPUs aren't suffering from the same symptoms. There's just not enough evidence but the differences in the compute power are still large.
If you normalize for cost, the difference isn't nearly as large as you might think, around 4-5x.
It's not 2 FMAs, it's AVX-512 (and going with 32-bit words) ⇒ 2*512/32 = 32 FMAs per core, 256 on an 8-core CPU. The unit counts for GPUs - depending on which number you look at - count these separately.
CPUs also have much more complicated program flow control, versatility, and AFAIK latency (⇒ flow control cost) of individual instructions. GPUs are optimized for raw calculation throughput meanwhile.
Also note that modern GPUs and CPUs don't have a clear pricing relationship anymore, e.g. a desktop CPU is much cheaper than a high-end GPU, and large server CPUs are more expensive than either.
1x 512-bit FMA or 2x 256-bit FMAs or 4x 128-bit FMAs is irrelevant here - it's still a single physical unit in a CPU that consumes 512 bits of data bandwidth. The question is why the CPU budget allows for 2x 512-bit or 4x 256-bit while the H100, for example, has 14592 FP32 CUDA cores - in AVX terminology that would translate, if I am not mistaken, to 7296x 512-bit or 14592x 256-bit FMAs per clock cycle. Even considering the obvious differences between GPUs and CPUs, this is still a large difference. Since GPU cores operate at much lower frequencies than CPU cores, that is what made me believe this is where the biggest difference comes from.
AIUI an FP32 core is only 32 bits wide, but this is outside my area of expertise really. Also note that CPUs also have additional ALUs that can't do FMAs, FMA is just the most capable one.
You're also repeating 2×512 / 4×256 — that's per core, you need to multiply by CPU core count.
[also, note e.g. an 8-core CPU is much cheaper than a H100 card ;) — if anything you'd be comparing the highest end server CPUs here. An 192-core Zen5c is 8.2~10.5k€ open retail, an H100 is 32~35k€…]
[reading through some random docs, a CPU core seems vaguely comparable to a SM; a SM might have 128 or 64 lanes (=FP32 cores) while a CPU only has 16 with AVX-512, but there is indeed also a notable clock difference and far more flexibility otherwise in the CPU core (which consumes silicon area)]
Nvidia calls them cores to deliberately confuse people, and make it appear vastly more powerful than it really is. What they are in reality is SIMD lanes.
So the H100 (which costs vastly more than a Zen 5...) has 14592 32-bit SIMD lanes, not cores.
A Zen 5 core has 16x4 (= 64) 32-bit SIMD lanes, so scale that by core count to get your answer. A higher-end desktop Zen 5 will have 16 cores, so 64x16 = 1024. The Zen 5 also clocks much higher than the GPU, so you can also scale it up by perhaps 1.5-2x.
While this is obviously less than the H100, the Zen 5 chip costs $550 and the H100 costs $40k.
There is more to it than this, GPUs also have transcendental functions, texture sampling, and 16 bit ops(which are lacking in CPUs). While CPUs are much more flexible, and have powerful byte & integer manipulation instructions, along with full speed 64 bit integer/double support.
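To put rough numbers on that comparison (a sketch of my own; the clock frequencies are assumptions chosen only to get the order of magnitude, not measured values):

    /* Rough FP32 FMA throughput comparison following the lane counting above. */
    #include <stdio.h>

    int main(void) {
        /* Zen 5 desktop: 16 cores, 2 FMA-capable 512-bit pipes per core
           = 16 x 32 = 512 FP32 FMA lanes (1024 lanes counting the FADD pipes). */
        int    zen5_fma_lanes = 16 * 2 * 16;
        double zen5_ghz       = 5.0;      /* assumed all-core AVX-512 clock */

        /* H100 PCIe: 14592 FP32 "CUDA cores" = 14592 FMA-capable lanes. */
        int    h100_lanes = 14592;
        double h100_ghz   = 1.8;          /* assumed boost clock */

        /* An FMA counts as 2 FLOPs per lane per cycle. */
        printf("Zen 5, 16 cores: %5d lanes -> ~%4.1f TFLOP/s FP32\n",
               zen5_fma_lanes, zen5_fma_lanes * 2.0 * zen5_ghz / 1000.0);
        printf("H100:            %5d lanes -> ~%4.1f TFLOP/s FP32\n",
               h100_lanes, h100_lanes * 2.0 * h100_ghz / 1000.0);
        return 0;
    }

With those assumed clocks that is roughly 5 TFLOP/s vs roughly 50 TFLOP/s, i.e. about a 10x gap rather than the ~900x the raw "core" counts suggest.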
Thanks for the clarification on the NVidia, I didn't know that. What I also found is that NVidia groups 32 SIMD lanes into what they call a warp. Then 4 warps are grouped into what they're calling a streaming multiprocessor (SM). And lastly H100 has 114 SMs, so 4 x 32 x 114 = 14592 checks out.
> Zen 5 has 16x4(64) 32 bit SIMD lanes per core
Right, 2x FMA and 2x FADD so the highest-end Zen 5 die with 192 cores would total to 12288 32-bit SIMD lanes or half of that if we are considering only FMA ops. This is then indeed much closer to 14592 32-bit SIMD lanes of H100.
There are x86 extensions for fp16/bf16 ops. e.g. both Zen 4 and Zen 5 support AVX512_BF16, which has vdpbf16ps, i.e. dot product of pairs of bf16 elements from two args; that is, takes a total of 64 bf16 elts and outputs 16 fp32 elts. Zen 5 can run two such instrs per cycle.
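In intrinsics form that looks roughly like the sketch below (my own minimal example, assuming a compiler with AVX512BF16 support, e.g. gcc -O2 -mavx512f -mavx512bf16; 1.5 is used because it is exactly representable in bf16):

    /* Minimal vdpbf16ps sketch: dot products of bf16 pairs accumulated
       into fp32 lanes. Needs AVX512F + AVX512BF16 (e.g. Zen 4 / Zen 5). */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* cvtne2ps packs 2 x 16 fp32 values into one register of 32 bf16 values. */
        __m512   x = _mm512_set1_ps(1.5f);
        __m512bh a = _mm512_cvtne2ps_pbh(x, x);
        __m512bh b = _mm512_cvtne2ps_pbh(x, x);

        /* vdpbf16ps: each of the 16 fp32 output lanes gets
           acc + a[2j]*b[2j] + a[2j+1]*b[2j+1]. */
        __m512 acc = _mm512_setzero_ps();
        acc = _mm512_dpbf16_ps(acc, a, b);

        float out[16];
        _mm512_storeu_ps(out, acc);
        printf("lane 0 = %.2f (expected 1.5*1.5 + 1.5*1.5 = 4.50)\n", out[0]);
        return 0;
    }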
Like another poster already said, the power budget of a consumer CPU like the 9950X, which runs programs at roughly double the clock frequency of a GPU, allows for 16 cores x 2 execution units x 16 lanes = 512 FP32 FMA per clock cycle. That provides the same throughput as a 1024 FP32 FMA per clock cycle iGPU from the best laptop CPUs, while consuming 3 times less power than a datacenter GPU, so its power budget and performance are like those of a datacenter GPU with 3072 FP32 FMA per clock cycle.
However, because of its high clock frequency a consumer CPU has high performance per dollar, but low performance per watt.
Server CPUs with many cores have much better energy efficiency, e.g. around 3 times higher than a desktop CPU and about the same as the most efficient laptop CPUs. For many generations of NVIDIA GPUs and Intel Xeon CPUs, until about 5-6 years ago, the ratio between their floating-point FMA throughput per watt was only about 3.
This factor of 3 is mainly due to the overhead of various tricks used by CPUs to extract instruction-level parallelism from programs that do not use enough concurrent threads or array operations, e.g. superscalar out-of-order execution, register renaming, etc.
In recent years, starting with NVIDIA Volta, followed later by AMD and Intel GPUs, the GPUs have made a jump in performance that has increased the gap between their throughput and that of CPUs, by supplementing the vector instructions with matrix instructions, i.e. what NVIDIA calls tensor instructions.
However this current greater gap in performance between CPUs and GPUs could easily be removed and the performance per watt ratio could be brought back to a factor unlikely to be greater than 3, by adding matrix instructions to the CPUs.
Intel has introduced the AMX instruction set, besides AVX, but for now it is supported only in expensive server CPUs and Intel has defined only instructions for low-precision operations used for AI/ML. If AMX were extended with FP32 and FP64 operations, then the performance would be much more competitive with GPUs.
ARM is more advanced in this direction, with SME (Scalable Matrix Extension) defined besides SVE (Scalable Vector Extension). SME is already available in recent Apple CPUs and it is expected to also be available in the new Arm cores that will be announced in a few months from now, which should appear in the smartphones of 2026, and presumably also in future Arm-based CPUs for servers and laptops.
The current Apple CPUs do not have strong SME accelerators, because they also have an iGPU that can perform the operations whose latency is less important.
On the other hand, an Arm-based server CPU could have a much bigger SME accelerator, providing a performance much closer to a GPU.
I appreciate the response with a lot of interesting details; however, I don't believe it answers the question I had. My question was why CPU designs suffer from clock frequency drops in AVX-512 workloads, whereas GPUs, which have much more compute power, do not.
I assumed it was because GPUs run at much lower clock frequencies and therefore stay within the available power budget, but as I also discussed with another commenter above, this was probably a premature conclusion, since we don't have enough evidence showing that GPUs indeed do not suffer from the same type of issues. They likely do, but nobody has measured it yet?
The low clock frequency when executing AVX-512 workloads is a frequency where the CPU operates efficiently, with a low energy consumption per operation executed.
For such a workload that executes a very large number of operations per second, the CPU cannot afford to operate inefficiently because it will overheat.
When a CPU core has many execution units that are idle, so they do not consume power, like when executing only scalar operations or only operations with narrow 128-bit vectors, it can afford to raise the clock frequency, e.g. by 50%, even if that increases the energy consumption per operation e.g. 3 times. Because it executes 4 or 8 times fewer operations per clock cycle, the total power consumption is still lower even at 3 times the energy per operation, so the CPU does not overheat. The desktop owner does not care that completing the workload takes much more energy, because the owner likely cares more about the time to completion.
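A crude numeric sketch of that trade-off (the scaling factors here are illustrative assumptions of mine, chosen only to show why the arithmetic works out in favour of the higher clock):

    /* Total power is roughly (ops per cycle) x (clock) x (energy per op).
       All factors below are illustrative assumptions, not measurements. */
    #include <stdio.h>

    int main(void) {
        /* AVX-512 workload at the base clock: 16 FP32 ops/cycle,
           normalized clock = 1, normalized energy per op = 1. */
        double wide_power = 16.0 * 1.0 * 1.0;

        /* Scalar/narrow workload: 8x fewer ops per cycle, clock raised by 50%,
           which at the higher voltage costs ~3x the energy per op. */
        double narrow_power = (16.0 / 8.0) * 1.5 * 3.0;

        printf("wide   (AVX-512, base clock) : relative power %4.1f\n", wide_power);
        printf("narrow (scalar, +50%% clock)  : relative power %4.1f\n", narrow_power);
        /* 16.0 vs 9.0: even at triple the energy per operation, the narrow
           workload draws less total power, so the higher clock still fits
           the same thermal budget. */
        return 0;
    }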
The clock frequency of a GPU also varies continuously depending on the workload, in order to maintain the power consumption within the limits. However a GPU is not designed to be able to increase the clock frequency as much as a CPU. The fastest GPUs have clock frequencies under 3 GHz, while the fastest CPUs exceed 6 GHz.
The reason is that normally one never launches a GPU program that would use only a small fraction of the resources of a GPU allowing a higher clock frequency, so it makes no sense to design a GPU for this use case.
Designing a chip for a higher clock frequency greatly increases the size of the chip, as shown by the comparison between a normal Zen core designed for 5.7 GHz and a Zen compact core, designed e.g. for 3.3 GHz, a frequency not much higher than that of a GPU.
On Zen compact cores and on normal Zen cores configured for server CPUs with a large number of cores, e.g. 128 cores (with a total of 4096 FP32 ALUs, like a low-to-mid-range desktop GPU, or like a top desktop GPU of 5 years ago; a Zen compact server CPU can have 6144 FP32 ALUs, more than a RTX 4070), the clock frequency variation range is small, very similar to the clock variation range of a GPU.
In conclusion, it is not the desktop/laptop CPUs which drop their clock frequency, but it is the GPUs which never raise their clock frequency much, the same as the server CPUs, because neither GPUs nor server CPUs are normally running programs that keep most of their execution units idle, to allow higher clock frequencies without overheating.
It's the GPU frequency: 5.5 GHz costs roughly 4x the heat and power of the ~2.5 GHz that GPUs run at.
I'm curious how changing some OC parameters would affect those results. If it is caused by voltage drop, how does load line calibration affect it? If it is a power constraint, then how would PBO affect it?
In practice everyone turns off AVX512 because they're afraid of the frequency throttling.
The damage was made by Skylake-X and won't be healed for years.
Everyone who? Surely anyone interested would do the research after buying a shiny new CPU.
Do you deploy your software on a single CPU?
I wonder if this will be improved or fixed in Zen 6. Although personally I'd much rather they focus on IPC.
Nothing to fix here. The behavior in the transition regimes is already quite good.
The overall throttling is dynamic and reactive based on heat and power draw - this is unavoidable and in fact desirable (the alternative is to simply run slower all the time, not to somehow be immune to physics and run faster all the time)
They should put forward the fact that 512 bits is the "sweet spot", as it matches a data cache line!
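For what it's worth, a tiny sketch of that point (assuming the usual 64-byte cache lines, and -mavx512f to build): a 64-byte-aligned 512-bit load or store touches exactly one line.

    /* 512 bits = 64 bytes = one cache line on CPUs with 64-byte lines,
       so an aligned zmm load/store never straddles two lines. */
    #include <immintrin.h>
    #include <stdalign.h>
    #include <stdio.h>

    int main(void) {
        alignas(64) float line[16];                 /* 16 x 4 B = 64 B */
        for (int i = 0; i < 16; i++) line[i] = (float)i;

        __m512 v = _mm512_load_ps(line);            /* one aligned 512-bit load  */
        v = _mm512_add_ps(v, _mm512_set1_ps(1.0f));
        _mm512_store_ps(line, v);                   /* one aligned 512-bit store */

        printf("buffer spans %p .. %p, all inside a single 64-byte line\n",
               (void *)&line[0], (void *)&line[15]);
        return 0;
    }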