Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000
by Dr. Ian Cutress on December 3, 2020 10:00 AM EST
Conclusions: SMT On
I wasn’t too sure what we were going to see when I started this testing. I know the theory behind implementing SMT, what it means for instruction streams having access to core resources, and how cores designed with SMT in mind from the start are built differently from cores built for one thread per core. But theory only gets you so far. Aside from all the forum messages over the years about performance gains/losses when a product has SMT enabled, and the few demonstrations of server processors running focused workloads with SMT disabled, it is worth testing on real workloads to find out if there is a difference at all.
Results Overview
In our testing, we covered three areas: Single Thread, Multi-Thread, and Gaming Performance.
In single threaded workloads, where each thread has access to all of the resources in a single core, we saw no change in performance when SMT is enabled – all of our workloads were within 1% either side.
In multi-threaded workloads, we saw an average uplift in performance of +22% when SMT was enabled. Most of our tests scored a +5% to +35% gain in performance. A couple of workloads scored worse, mostly due to resource contention from having so many threads in play – the limit here is memory bandwidth per thread. One workload scored +60%: a computational workload with little-to-no memory requirements. That test scored even better in AVX2 mode, showing that there is still some bottleneck that gets alleviated with fewer instructions.
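For readers who want to probe this on their own machine, the sketch below shows one way to measure SMT uplift on Linux: time the same compute-bound kernel with one worker per physical core, then with two workers per core on SMT sibling pairs. The logical-CPU numbering here is an assumption (Linux enumerates siblings differently across platforms), so verify it against /sys/devices/system/cpu/cpu*/topology/thread_siblings_list first.

```python
# Minimal sketch (Linux-only; the CPU numbering below is assumed, not universal).
import os
import time
from multiprocessing import Process

ITERS = 20_000_000

def kernel(cpu):
    """Compute-bound loop with almost no memory traffic, pinned to one logical CPU."""
    os.sched_setaffinity(0, {cpu})
    x = 0
    for i in range(ITERS):
        x ^= i * 2654435761  # integer ALU work that stays out of main memory

def throughput(cpus):
    """Run one worker per logical CPU in `cpus`; return total iterations/second."""
    procs = [Process(target=kernel, args=(c,)) for c in cpus]
    t0 = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return len(cpus) * ITERS / (time.perf_counter() - t0)

if __name__ == "__main__":
    physical = [0, 1, 2, 3]      # assumed: one logical CPU per physical core
    siblings = [16, 17, 18, 19]  # assumed: the SMT siblings of those four cores
    one_thread = throughput(physical)              # 1 thread per core
    two_threads = throughput(physical + siblings)  # 2 threads per core
    print(f"SMT uplift: {two_threads / one_thread - 1:+.1%}")
```

Because the kernel is pure integer ALU work with almost no memory traffic, it represents the best case for SMT; substituting a memory-heavy kernel would show the contention effects described above.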
On gaming, overall there was no difference between SMT On and SMT Off, although some games may show differences in CPU-limited scenarios. Deus Ex was down almost 10% when CPU limited, while Borderlands 3 was up almost 10%. As we moved to a more GPU-limited scenario, those discrepancies were neutralized, with a few games still gaining a single-digit percentage improvement with SMT enabled.
For power and performance, we tested two examples where running two threads per core either gave no performance improvement (Agisoft) or a significant improvement (3DPMavx). In both cases, SMT Off mode (1 thread/core) ran at higher temperatures and higher frequencies. For the benchmark where performance was about equal, the power consumed was a couple of percentage points lower when running one thread per core. For the benchmark where running two threads per core gave a big performance increase, the power in that mode was also lower, resulting in a significant +91% performance-per-watt improvement from enabling SMT.
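The perf-per-watt arithmetic is worth spelling out, since the gain comes from both sides of the ratio at once. The inputs below are illustrative stand-ins rather than our measured figures, chosen only to show how a large performance gain plus a small power drop compounds into a +91% result:

```python
# Illustrative perf-per-watt arithmetic (inputs here are made-up examples,
# not the measured values from our power pages).
perf_smt_on   = 1.78  # assumed: relative performance with 2 threads/core
perf_smt_off  = 1.00  # baseline: 1 thread/core
power_smt_on  = 0.93  # assumed: relative package power with 2 threads/core
power_smt_off = 1.00

gain = (perf_smt_on / power_smt_on) / (perf_smt_off / power_smt_off) - 1
print(f"perf/W change: {gain:+.0%}")  # -> +91%
```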
What Does This Mean?
I mentioned at the beginning of the article that SMT performance gains can be seen from two different viewpoints.
The first is that if SMT enables more performance, then it’s an easy switch to use, and some users consider that if you can get perfect scaling, then SMT is an effective design.
The second is that if SMT enables too much performance, then it’s indicative of a bad core design. If you can get perfect scaling with SMT2, then perhaps something is wrong about the design of the core and the bottleneck is quite bad.
Having poor SMT scaling doesn’t always mean that the SMT is badly implemented – it can also imply that the core design is very good and leaves few gaps to fill. If an effective SMT design can be interpreted as a poor core design, then it’s quite easy to see that vendors can’t have it both ways. Every core design has deficiencies (that much is true), and both Intel and AMD will tell their users that SMT enables the system to pick up extra bits of performance where workloads can take advantage of it, and that for real-world use cases there are very few downsides.
We’ve known for many years that having two threads per core is not the same as having two cores – in a worst case scenario, there is some performance regression as more threads try and fight for cache space, but those use cases seem to be highly specialized for HPC and Supercomputer-like tasks. SMT in the real world fills in the gaps where gaps are available, and this occurs mostly in heavily multi-threaded applications with no cache contention. In the best case, SMT offers a sizeable performance per watt increase. But on average, there are small (+22% on MT) gains to be had, and gaming performance isn’t disturbed, so it is worth keeping enabled on Zen 3.
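For anyone who wants to A/B test this themselves, recent Linux kernels (4.19+) expose the SMT state through sysfs, so a quick check doesn’t require a trip into the BIOS. A minimal Python sketch, assuming a mainline kernel:

```python
# Quick check of the kernel's SMT state on Linux (kernel 4.19+ exposes these
# files; on other OSes or older kernels they won't exist).
from pathlib import Path

smt = Path("/sys/devices/system/cpu/smt")
if smt.exists():
    print("SMT active: ", (smt / "active").read_text().strip())   # "1" or "0"
    print("SMT control:", (smt / "control").read_text().strip())  # on/off/forceoff/...
else:
    print("No SMT control interface found on this kernel.")
```

Writing "off" to the control file as root disables the sibling threads at runtime, which makes quick SMT On/Off comparisons possible without rebooting.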
126 Comments
MrSpadge - Thursday, December 3, 2020 - link
> We’ve known for many years that having two threads per core is not the same as having two cores

True, and I still read this as an argument against SMT in forums. IMO it should be pointed out clearly that the cost of implementing either also differs drastically: +100% core size for another core and ~5% for SMT.
WaltC - Thursday, December 3, 2020 - link
Intel began its HT journey in order to pull more efficiency from each core--basically, as performance was being left on the table. Interestingly enough, after Athlon and A64, AMD roundly criticized Intel because the SMT thread was not done by a "real core"...and then proceeded to drop cores with two integer units--which AMD then labeled as "cores"...;) Intel's HT approach proved superior, obviously. IIRC. It's been awhile so the memories are vague...;)

The only problem with this article is that it tries to make calls about SMT hardware design without really looking hard at the software, and the case for SMT is a case for SMT software. Games will not use more than 4-8 threads simultaneously, so of course there is little difference between SMT on and off when running most games on a 5950. You would likely see near the same results on a 5600 in terms of gaming. SMT on or off when running these games leaves most of the CPU's resources untouched. Programs designed and written to utilize a lot of threads, however, show a robust, healthy scaling with SMT on versus no SMT. So--without a doubt--SMT CPU design is superior to no SMT from the standpoint of the hardware's performance. The outlier is the software--not the hardware. And of course the hardware should never, ever be judged strictly by the software one arbitrarily decides to run on it. We learn a lot more about the limits of the software tested here than we learn about SMT--which is a solid performance design in CPU hardware.

WarlockOfOz - Friday, December 4, 2020 - link
Very valid point about how games won't see a difference between 16 and 32 threads when they only use 6. Do you know if this type of analysis has been done at the lower end of the market?

WaltC - Friday, December 4, 2020 - link
It's been common knowledge established a few years ago when AMD started pushing 8 core (and greater) CPUs that games don't require that many cores and that 6 cores is optimal for gaming right now. And if you do more than game, occasionally, and need more than 6 threads then SMT is there for you. As the new consoles are 8-core CPU designs, over time the number of cores required for optimal game performance will increase.

Flying Aardvark - Friday, December 4, 2020 - link
Consoles are 8-core now, with 2 reserved for the OS. Count on 6-cores being optimal for gaming for quite some time.

Kangal - Friday, December 4, 2020 - link
Those Jaguar cores were more like a 4c/8t processor, to be fair. And they weren't that much better than Intel's Atom cores, a far cry from Intel's Core-i SkyLake architecture. And current-gen consoles were very light on the OS, so maybe using 1 full core (or 2 threads shared), leaving only 3 cores for games – but much better than the 2-core-optimised games from the PS3/360 era.

The new-gen consoles will be somewhat similar, using only 1 full core (2 threads) reserved for the OS. But this time we have an architecture that's on par with Intel's Core-i SkyLake, with a modern full 8-core processor (SMT/HT optional). This time that leaves a healthy 7 cores dedicated to games. Optimisations should come sooner than later, and we'll see the effects on PC ports by 2022. So we should see a widening gap between 4 vs 6 cores, and to a lesser extent 6 vs 8 cores, in the future. I wouldn't future-proof my rig by going for a 5700x instead of a 5600x, I would do that for the next round (ie 2022 Zen4).
AntonErtl - Sunday, December 6, 2020 - link
The 8 Jaguar cores are in no way like 4c/8t CPUs; if you use only half of them, you get half the performance (unless your application is memory/L2-bandwidth-limited). Their predecessor Bobcat is about twice as fast as a Bonnell core (Atom proper), a little slower than Silvermont (the core that replaced Bonnell), about half as fast as Goldmont+ (all at the clock rates at which they were available in fanless mini-ITX boards), one third as fast as a 3.5GHz Excavator core, and one sixth as fast as a 4.2GHz Skylake.

Oxford Guy - Sunday, December 6, 2020 - link
Worse IPC than Bulldozer as far as I know. Certainly worse than Piledriver.

Really sad. The "consoles" should have used something better than Jaguar. It's bad enough that the "consoles" are a parasitic drain on PC gaming in the first place. It's worse when they not only drain life with their superfluous walled gardens but also by foisting such a low-grade CPU onto the art.
Kangal - Thursday, December 24, 2020 - link
The Jaguar cores share a lot of DNA with Bulldozer, but they aren't the same. It's like Intel's Atom chips compared to Intel Core-i chips. With that said, 2015's Puma+ was a slight improvement over 2013's Jaguar, which was a modest improvement over the initial 2011 Bobcat lineup. All this started in 2006 with AMD choosing to evolve their earlier Phenom2 cores, which are derivatives of the AMD Athlon-64.

So just by their history, we can see they're in line with Intel's Atom architecture evolution, and basically a direct competitor. Where Intel had slightly less performance but much lower power draw... making them the obvious winner, leaving AMD to fill in the budget segments of the market.
As for the core arrangement, they don't have full proper cores as people expect them. Like the Bulldozer architecture, each core had to share resources like the decoder and floating-point unit. So in many instances, one core would have to wait for the other core. This boosts multithreaded performance with simple calculations in orderly patterns. However, with more complex calculations and erratic/dynamic patterns (ie regular PC use), it causes a hit to single-thread performance and notable hiccups. So my statement was true. This is more like a 4c/8t chipset, and it is less like a Core-i and much more like an Atom. But don't take my word for it, take Dr Ian Cutress. He said the same thing during the deep dive into the Jaguar microarchitecture, and recently in the Chuwi Aerobox (Xbox One S) article.
https://www.anandtech.com/show/16336/installing-wi...
Now, there have been huge benefits to the Gaming PC industry, and game ports, due to the PS4/XB1. The first being the x86-64bit direct compatibility. Second was the cross-compatibility thanks to Vulkan and DirectX (moreso with PS4 Pro and XB1X). The third being that it forced game developers to innovate their game engines, so that they're less narrow and more multi-threaded. With PS5/XseX we now see a second huge push with this philosophy, and the improvements of fast single-thread performance and fast flash-storage access. So I think while we had legitimate reasons to groan about the architecture (especially in the PS4) upon release, we do have to recognize the conveniences that they also brought (especially in the XB1X). This is just to show that my stance wasn't about console bashing.
at_clucks - Monday, December 7, 2020 - link
@Kangal, Jaguar APUs in consoles are definitely not "like a 4c/8t processor" because they don't use CMT. They are full 8 cores. Their IPC may be comparable with some newer Atoms, although it's hard to benchmark how the later "Evolved Jaguar" cores in the mid-generation console refresh compare against the regular Jaguar or Atom.