2014-01-15



It was almost mythical, this idea AMD had – to build a processor that incorporated a GPU onto the same die and have the two communicate over a shared bus, working seamlessly on workloads that other hardware would have to split between separate chips, but which either half of this one could pick up. That was AMD’s goal when they started out with the APU, and before they could even get the first one out the door, Intel saw the future potential and leapfrogged them. APUs make a lot of sense in the low-end markets and even more sense in low-end workstations and computers that aren’t necessarily built to play games.

But building such a product was a long way off – AMD’s first attempt with Llano was good, but held back by a memory bus that limited official speeds to DDR3-1600. Trinity came along with the Piledriver architecture (a refinement of Bulldozer) and DDR3-1866 speeds and made the case more solid. Richland upped the clock speeds, powered down like a champ and opened up extra bandwidth thanks to its slightly tweaked memory controller. But none of these products, and certainly nothing from Intel, has managed to hit the notes that AMD has been trying so hard for.
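To put those memory speeds in perspective, here is a rough sketch of the theoretical peak bandwidth each DDR3 grade offers. It assumes a dual-channel setup with the standard 64-bit (8-byte) DDR3 channel width; the function name is my own, and real-world throughput will land below these ceilings.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes).

    Assumes a 64-bit (8-byte) DDR3 channel and dual-channel operation
    by default. This is an upper bound, not a measured figure.
    """
    bytes_per_second = mega_transfers_per_s * 1e6 * bus_bytes * channels
    return bytes_per_second / 1e9


if __name__ == "__main__":
    for name, rate in [("DDR3-1600", 1600), ("DDR3-1866", 1866), ("DDR3-2133", 2133)]:
        print(f"{name}: {peak_bandwidth_gbs(rate):.1f} GB/s peak (dual channel)")
```

Since the integrated GPU shares this bandwidth with the CPU cores, every step up in memory speed goes straight into graphics headroom.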

The question is, will Kaveri be your next budget processor and if so, why? I think this analysis might go a ways to help you figure that out.

Steamroller and Graphics Core Next



Kaveri is the first commercially available socketed APU to incorporate two architectures that have been designed with interoperability in mind – Steamroller, a derivative of the Bulldozer architecture, and Graphics Core Next, a replacement for the VLIW designs that characterised the Radeon HD6000 family and finally saw their end with the Richland family of APUs.

Note that Bulldozer’s initial design and evolution over the years were all a result of the improvements that AMD was making on their HSA-compatible designs, which include the shift to modules instead of cores, the removal of hardware that could be utilised in existing GPUs instead and the lowering of the amount of floating-point performance, which was always going to be better on a GPU anyway. Instead of a brute force approach, APUs are designed to be balanced in terms of hardware, performance and power usage.

With Kaveri, compared to earlier APU designs, AMD is doing away with the convention of listing their APUs with traditional CPU cores and graphics units. Now, they refer to cores and the individual shader modules in an APU as “Compute Cores,” something they started doing when they began to talk about Graphics Core Next back in December 2012 (for comparison, Nvidia has been using “CUDA Cores” to denote their individual shader unit count since the GeForce G80 era).

The idea behind this is that because HSA-ready software can utilise all the available hardware to perform a specific task and can intelligently assign jobs based on hardware configuration, the means to distinguish between cores and the units inside the GPU section become blurred; their differences glossed over, if you will, in favour of making the chip easier to market to consumers who merely look at clock speeds and core counts. In this regard, AMD’s marketing department has made discussing and selling APUs and APU-based machines much easier for the salesperson.
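As a quick illustration of how the new counting works, here is a small sketch. It assumes AMD's published definition, where each CPU core and each GCN compute unit counts as one Compute Core, and that a GCN compute unit holds 64 shaders. The chip configurations below are AMD's launch specs as I understand them, not something measured here, and the helper names are mine.

```python
SHADERS_PER_GCN_CU = 64  # assumption: standard GCN compute unit width


def compute_cores(cpu_cores, gpu_compute_units):
    """Marketing-style 'Compute Core' count: CPU cores plus GPU compute units."""
    return cpu_cores + gpu_compute_units


def shader_count(gpu_compute_units):
    """Individual shader (stream processor) count for a GCN GPU."""
    return gpu_compute_units * SHADERS_PER_GCN_CU


if __name__ == "__main__":
    # (name, CPU cores, GPU compute units) as listed at launch
    for name, cpu, cus in [("A10-7850K", 4, 8), ("A8-7600", 4, 6)]:
        print(f"{name}: {compute_cores(cpu, cus)} compute cores, "
              f"{shader_count(cus)} shaders")
```

The point is that the headline number mixes two very different kinds of hardware, which is exactly why it is easier to sell and harder to compare.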

What is this HSA you speak of?



Heterogeneous System Architecture (HSA) is a collection of technologies that AMD has crammed together into the same product, giving Kaveri a number of tricks that it can pull off if the software environment supports it. The first of these is unlocking the entirety of the APU’s theoretical performance. If you can have the CPU and GPU working on the same workload, it could be completed much quicker and the portions of the code that run better on a GPU can be assigned to it, speeding things up even more and making the setup a lot more efficient. This has enormous implications for professional applications where the GPU hardware, which is normally idle in 2D mode if the software uses CPU cores, can be used instead to speed up work that can be parallelised easily.

In this case, HSA is one of the first attacks the computing industry is performing on Amdahl’s law, which states that if a portion of a workload cannot be parallelised it will not matter how many more hardware units you add into the picture; the minimum amount of computing time will always be limited to however long it takes to finish that single-threaded workload on a single core. Instead of letting the CPU work on its own, we’re using the GPU’s resources to get us closer to that minimum single-thread execution time.
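Amdahl's law is easy to play with in a few lines of code. The sketch below uses the textbook form of the law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelised and n is the number of hardware units. The function names are mine, and the 90% figure is only an example, not a measurement of any real workload.

```python
def amdahl_speedup(parallel_fraction, units):
    """Speedup over a single unit when only part of the work parallelises."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / units)


def speedup_ceiling(parallel_fraction):
    """Best possible speedup with unlimited hardware units."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0 else 1.0 / serial_fraction


if __name__ == "__main__":
    p = 0.90  # example only: 90% of the job can run in parallel
    for n in (1, 4, 12, 512):
        print(f"{n:>3} units: {amdahl_speedup(p, n):.2f}x")
    print(f"ceiling: {speedup_ceiling(p):.1f}x")
```

However many units you throw at it, the serial 10% in this example caps the gain at roughly ten times, which is why shrinking or speeding up the serial portion matters so much.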

HSA also incorporates heterogeneous Uniform Memory Access (hUMA), which allows the GPU and CPU inside an APU to have their work open for the other to play with while the information is stored in system memory. No longer will programmers have to manage separate amounts of memory in separate memory pools – it’s all one big party in the jacuzzi. The CPU can work on whatever the GPU is doing at the same time and there is no time lost in copying the results of one dataset to the other.
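To get a feel for what skipping those copies is worth, here is a deliberately simple toy model. It is not how any driver or runtime actually schedules work: it only adds the time to push a dataset across a link to a discrete GPU and pull the results back, then compares that with a shared-memory setup where no copy happens. The function names and every number in it are made up for illustration.

```python
def copy_overhead_s(data_gb, link_gb_per_s, round_trips=1):
    """Seconds spent purely on moving data to a separate memory pool and back."""
    if link_gb_per_s <= 0:
        raise ValueError("link_gb_per_s must be positive")
    return 2 * round_trips * data_gb / link_gb_per_s


def discrete_total_s(data_gb, compute_s, link_gb_per_s, round_trips=1):
    """Separate pools: compute time plus the copy in and the copy out."""
    return compute_s + copy_overhead_s(data_gb, link_gb_per_s, round_trips)


def shared_total_s(compute_s):
    """Shared pool (hUMA-style): both sides touch the same memory, no copy."""
    return compute_s


if __name__ == "__main__":
    # Hypothetical figures: 2 GB dataset, 8 GB/s link, 1 s of compute.
    print(f"separate pools: {discrete_total_s(2.0, 1.0, 8.0):.2f} s")
    print(f"shared pool:    {shared_total_s(1.0):.2f} s")
```

In practice the compute time will differ between an APU and a big discrete card, so treat this purely as a way to isolate the copy cost, which grows with every extra round trip.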

On a related note, this is why Sony went with a giant pool of fast memory in the PlayStation 4, while Microsoft went with separate memory pools in the Xbox One. The former solution is much easier to manage, exploit and implement in code; the latter takes more work but may yield similar performance through more tinkering and software optimisation. Cheap, Fast, Good – pick any two.

I have to point out at this time that Kaveri and HSA are a major, major benefit to the high-performance computing (HPC) market. The largest amount of memory you can cram into a discrete graphics card for HPC applications is limited to 12GB of GDDR5 memory. Not only do you have to optimise your program’s data sets to fit into that memory size, you also have to run an insane amount of system memory to have enough space for all the copy operations you’ll be doing if you have multiple GPUs. With an HSA-enabled APU, you could have up to 32GB of error-correcting memory per system and not have to worry about how to optimise workloads to fit into the GPU’s memory. Sure, it’ll run slower, but you can cluster together systems to overcome the performance deficit.

Lastly, HSA’s final trick in Kaveri is that the APU is now considered to be a sum of compute units, not processors and individual shader cores, as discussed in my introduction. It all looks the same to the software, but the hardware will be able to figure out where things go and how they should be handled. That means less complexity for the developers and more time to figure out how to optimise the workload better.

Kaveri is made by GlobalFoundries on the 28nm SHP process

This point needs to be expounded upon because it says a lot about AMD’s plans for future APUs. Kaveri is made using a bulk silicon process that GlobalFoundries calls SHP, or “Super High Performance.” SHP is designed for products that have a high transistor density, with the tradeoff that clock speeds will be a bit neutered to reduce leakage and instability. It’s a process that works really well for Kaveri’s GPU hardware because AMD uses a similar 28nm process from TSMC for its discrete GPUs based on the GCN architecture, and just over 47% of Kaveri’s die space is dedicated to the GPU resources. However, this doesn’t bode well for the processor side of things.

This is because as you move to smaller and smaller production processes, you have to balance the clock speeds and thermal headroom of your processors so that the chip doesn’t become unstable. Smaller processes do wonders for power consumption, especially with low clock speeds and lower TDPs (just ask Intel), but clocking things up higher will create issues, especially if you have hundreds of shader cores in the same space. This is why Intel doesn’t beef up their integrated graphics too much between generations and why Iris Pro is sold on a multiplier-locked processor that typically runs at 3.2GHz on all cores. Adding in more cores simply creates more heat and wastes die space, especially if the jobs you expect them to be performing can be done much more quickly and efficiently by the integrated graphics.

Simply put: don’t hold your breath for a six or eight-core FX processor on the socket FM2+ platform just yet. It can’t happen in the current circumstances that the market finds itself in.

This is also why most results you’ll find on the net now are of the A8-7600, the lowest-end chip in the Kaveri range. AMD is concentrating on power use and heat generation in the Kaveri lineup and the A8-7600 in 45W mode ends up being on par in most cases with the outgoing A10-6800K. That’s quite impressive, but also highlights the issues with the A10-7850K – you need a good cooler and faster-clocked memory to take advantage of the extra GCN cores inside it.

720p/60fps or 1080p/30fps with Medium settings

That’s the kind of performance that AMD is targeting with Kaveri and to be honest, it’s pretty similar to what Microsoft is offering with the Xbox One hardware. Some games on the One are light enough/optimised enough to hit 60fps at 1080p, but that’s limited to a handful of titles like Forza 5 and EA Sports’ games for the console. Battlefield 4, running at 720p with medium settings and at 60fps, offers the closest comparison to the experience you’d have on an Xbox One.
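Those two targets are closer to each other than they look. As a back-of-the-envelope check, the sketch below multiplies resolution by frame rate to get raw pixels drawn per second. It ignores everything else that affects performance (detail settings, bandwidth, CPU load), so treat it as a rough comparison only; the function name is mine.

```python
def pixels_per_second(width, height, fps):
    """Raw pixel throughput needed to hit a resolution and frame-rate target."""
    return width * height * fps


if __name__ == "__main__":
    target_720p60 = pixels_per_second(1280, 720, 60)
    target_1080p30 = pixels_per_second(1920, 1080, 30)
    print(f"720p at 60fps:  {target_720p60 / 1e6:.1f} million pixels/s")
    print(f"1080p at 30fps: {target_1080p30 / 1e6:.1f} million pixels/s")
    print(f"ratio: {target_1080p30 / target_720p60:.3f}")
```

A 1080p frame has 2.25 times the pixels of a 720p one, so halving the frame rate roughly cancels that out, and the two targets ask for a similar amount of pixel-pushing from the GPU.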

But AMD has achieved one thing with Kaveri that it promised early on – the ability to run most games in 1080p with good enough detail settings. In a slide shown to journalists at the GPU 14 event held in Hawaii in September 2013, AMD’s expectations for Kaveri’s performance were pretty specific – high or maximum settings for indie games, MOBAs and some smaller AAA titles, while it was expected that many other, more demanding titles would be run at medium settings to provide an average framerate around 30fps.

In tests by Tech Report (one of the few sites that had enough time for frametime benchmarking), the A8-7600 performed reasonably well in Batman: Arkham Origins running at 1080p resolution with the Normal preset. The game is pretty taxing but yields some playability around 25fps. It would run a lot better at 720p because the game taxes memory bandwidth a lot, but the results clearly show what AMD was hoping to achieve – better gaming capability than the comparably priced Haswell-based Core i3-4330 from Intel, as well as generally better performance than the integrated graphics in the Core i7-4770K.

Battlefield 4 shows off the same capability, running mostly above 25fps without dipping too far down the frame time graphs. There are regular framedrops in the frame time results and this is probably because Tech Report left Ambient Occlusion and Antialiasing Post enabled, which can hit memory bandwidth a bit. Still, that’s pretty playable at 1080p and far ahead of what Intel’s HD4400 and HD4600 graphics engines are capable of. At 720p the performance gulf should be even bigger, as the memory subsystem isn’t hammered as hard, leaving some breathing room for the GCN cores to work their magic.

In Tomb Raider, the trend continues. The A8-7600 saves you the hassle of having to buy a discrete GPU if you’re fine with playing your games at lower details, because Tomb Raider ends up being playable at 1080p with the Normal detail preset. Interestingly, performance in the 45W and 65W modes remains largely similar in all three of Tech Report’s tests, which means that AMD’s power-saving measures aren’t throttling GPU or CPU performance in games. There’s some devilish magic going on in there.

Against Intel Iris Pro graphics

It’s the match-up that everyone asks for, but doesn’t actually expect. Well, here’s the reality on Intel’s Iris Pro graphics – as the resolution scales up, performance falls off a cliff because Iris Pro is intended for compute work rather than gaming. Still, it’s a good showing by Intel, even though the A8-7600 and the A10-7850K (Anandtech was one of the few reviewers to receive a sample) trail it very closely. AMD dominates in Bioshock Infinite and Tomb Raider and isn’t too far from the performance of a Core i7-4770K paired with a discrete GPU. Anandtech ran into issues with F1 2013, which is why there isn’t an entry for Iris Pro in that graph.

That’s incredible when you consider that the 45W and 65W versions of the A8-7600 are right in the middle of things. In fact, the 45W part would be able to fit into quite a few laptop chassis and, paired with a 1366 x 768 or 1600 x 900 monitor, would yield some pretty playable gaming performance for a third of what Intel charges OEMs for the Core i7-4770R.

Conclusions

Once again, AMD has set the tone for what people can expect in terms of graphics performance from their APUs. If their low-end quad-core A8-7600 can keep pace with some of the bigger and more expensive processors and still retail for around $120, it’s a perfect fit for anyone who wants a new rig but doesn’t mind playing their games at 720p to get the most performance out of the system. This goes for anyone intending to use it for work purposes too – as a low-budget chip made to run programs like Photoshop, Musemage, Handbrake and Adobe’s suite of video editors, it’s a no-brainer because of its compatibility with OpenCL software. Shove in an SSD with 8 or 16GB of DDR3-2133 memory and you’d be hard pressed to find anything really wrong with such a setup, especially given the price.

Kaveri is also an especially good fit for a Steam Machine, considering that at 720p with high settings in most games it pushes out around 60fps. That’s as good as the performance from an Xbox One. Put that into a small chassis, fit it with high-speed memory and a 500GB hard drive and either run SteamOS or Big Picture mode (on Windows) to enjoy the same kind of benefit. Again, there would be really little to complain about because it’ll be both cheaper and a lot more flexible in terms of what you can do with it.

And let’s not forget what’s in the pipeline for 2014. We have TrueAudio, embedded into every Kaveri chip and taking the sound processing load off the CPU cores. We have Mantle, promising more performance and higher efficiency on lower-end GCN hardware. We also have dual graphics on the cards, enabling you to pair up a cheap R7-series Radeon for higher performance. Don’t forget HSA and OpenCL either – those are sure to explode later this year with software support and it’s not just AMD that benefits – this also applies to the other partners in the HSA Foundation, which includes ARM.

Honestly, it looks rather good from where I’m sitting and I hope that AMD capitalises on their chance to dominate the low-end market. They have a new slate in the form of socket FM2+, they have brand new hardware and they have only a Haswell refresh to worry about for now. Let’s see how they do.

Sources: Tech Report, Anandtech, PC Perspective, Bit-Tech.net, Phoronix
