2015-10-12

Blue, red and green are the three primary colors in the world of optics and the only three colors which have corresponding photoreceptors in our retinas. So it’s almost poetic that Intel, AMD and Nvidia are the three primary PC component makers in the world of consumer PCs and PC gaming.

What’s even more fascinating is the realization that while these three companies are quite different, they share a singular vision for the future of this industry. In this editorial we’re going to explore what these three companies have been doing to bring that vision to life. To achieve that we have to understand the philosophies that each of them imparts to the products they make and how those decisions came to be.

To Understand Where We’re Going We Must Look At Where We’ve Been

Let’s start at the very beginning, with the first consumer chip to integrate a CPU (Central Processing Unit) with a GPU (Graphics Processing Unit). In 2010 Intel released the first processor to feature integrated graphics, but just like the company’s first dual and quad core efforts, the resulting product was a multi-chip module, or MCM for short, in which multiple discrete chips are connected together and sold as one. Intel called this design Clarkdale, and it was not a fully integrated monolithic solution. The CPU and graphics portions were manufactured separately as two discrete chips that were later assembled together on a single circuit board.


A truly integrated solution did not come from Intel until Sandy Bridge was released a year later, in January 2011.
In that very same month AMD released its first ever APU, a processor with an integrated graphics solution, code named Brazos. Unlike Intel’s first effort with Clarkdale, the graphics in these two chips was truly integrated: the graphics engine and the CPU shared a single die of silicon.

AMD had begun working towards this in 2006, when the company made the decision to buy ATI with the vision of fusing GPU cores and CPU cores into a single piece of silicon called the Accelerated Processing Unit, or APU for short. AMD called this effort “Fusion” and it was the company’s main goal in acquiring ATI.
So in a sense, Brazos and Sandy Bridge marked the beginning of the “Fusion” era for the industry.



Fast forward to today and you’ll find that almost all processors have some sort of integrated graphics solution. Intel’s entire range of consumer processors has integrated graphics, except for the niche that socket LGA 2011 addresses. All the processors that AMD launched in the past three years have been APUs and thus feature integrated graphics. All mobile processors, spanning notebooks to handheld devices, also feature integrated graphics. It has become strikingly clear that this is the direction the entire industry is heading in, but to what purpose?

Evolution Of Intel’s And AMD’s Integrated Graphics

Today integrated graphics processors take up a considerable percentage of a chip’s real estate. The GPU in Kaveri, AMD’s most recent desktop APU, takes up 47% of the chip’s real estate, almost half the die. GPU real estate has been growing consistently for the past several years; for instance, AMD doubled the GPU portion from Llano, the company’s first generation mainstream APU, to Kaveri.


We see a very similar trend with Intel as well, increasing the GPU portion of the die with each generation. Going from Sandy Bridge, Intel’s first processor with integrated graphics built on a single monolithic die, to mainstream GT2 Haswell parts, the GPU real estate nearly doubled. And with Intel’s “Iris” GT3 graphics parts, the real estate quadrupled compared to Sandy Bridge.

This trend continued with Broadwell in 2014 and Skylake in 2015. Mainstream quad core variants of both generations now have half of the chip’s real estate dedicated to the integrated graphics engine, and mainstream dual core variants have nearly two thirds of the real estate dedicated to the graphics engine.

Integrated graphics processors are no longer responsible just for graphics; they can accelerate a plethora of workloads using OpenCL and other compute languages. QuickSync is a great example of how iGPUs can be used for things other than graphics processing. AMD has also worked on enabling integrated GPU acceleration for a variety of workloads like photo and video editing, stock analytics and media conversion.

We’ve witnessed a significant increase in the research and development resources being poured into enabling graphics acceleration in the past few years. Both Intel and AMD have continued to invest in this field and in the graphics engines they build into their chips. However, this focus on integrated visualization technologies and CPU+GPU acceleration is relatively new; the fact that the earliest consumer products to feature truly integrated graphics engines arrived only in 2011, four years ago, highlights this.

For the past several decades engineers relied primarily on Moore’s Law to get better performance out of their designs. Each year engineers were handed more transistors to play with, transistors that were faster and consumed less power. But in recent years Moore’s Law has slowed down considerably.

Power consumption continued to decline under the recent progression of Moore’s Law, but frequency stopped scaling as it used to, and it became exponentially more difficult to squeeze greater performance out of a single serial processing unit (CPU). So engineers resorted to adding more CPU cores instead. However, the more cores they added, the harder it became to extract performance out of these chips: writing code that distributes a workload across an increasing number of cores is very challenging, and it only gets more challenging as the core count grows. And that’s all besides the fact that some workloads simply can’t run in parallel.
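To make the scaling problem concrete, here’s a minimal C++ sketch of our own (not from the original piece): the parallel portion of a workload splits cleanly across threads, but the serial portion runs on one core no matter how many are available, capping the total speedup (Amdahl’s law).

// Illustrative sketch: why adding cores has diminishing returns.
// The parallel part of a workload splits cleanly across threads,
// but the serial part runs on one core no matter how many exist.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;
    std::vector<double> data(n, 1.0);
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());

    // Parallel portion: each thread sums its own slice of the data.
    std::vector<double> partial(cores, 0.0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < cores; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t chunk = n / cores;
            const std::size_t begin = t * chunk;
            const std::size_t end = (t == cores - 1) ? n : begin + chunk;
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto &th : pool) th.join();

    // Serial portion: combining the results cannot be split further.
    // If the serial fraction of a program is s, total speedup is bounded
    // by 1/s regardless of how many cores you throw at it.
    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::cout << "sum = " << total << " on " << cores << " cores\n";
}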

What the computing world ended up with are multicore CPUs that could not effectively accelerate all types of code the way their single core predecessors had for decades. This meant that, all of a sudden, designers’ jobs got a lot harder. Simply keep adding cores and you end up with wasted transistors, inflating the cost of manufacturing the processor without any tangible gain. The industry hit a brick wall trying to improve the performance of CPU designs without blowing power budgets or resorting to extremely complex and difficult design methods. If the industry was to continue its journey towards making faster processors, the processor design game had to change. What the world needed was a technology that did not rely primarily on frequency or complexity to scale.

The Solution To The Problem

The answer was GPUs: parallel processors which can easily continue to scale with Moore’s Law.
The industry could continue its reliance on the growth of transistor counts and densities enabled by Moore’s Law to improve the performance of each design by increasing the number of parallel processors, instead of attempting to push the frequencies or complexity of a handful of CPU cores. This meant that computing performance could continue to scale for the foreseeable future.

Parallel processors had existed for years but were limited to a few fields, chief among them High Performance Computing (HPC) and graphics. Graphics was an obvious target for parallel processors: if you need to compute colors for millions of pixels tens of times every second, a CPU is simply not going to cut it, while thousands of smaller, slower and more efficient processing units are perfect for the job.

But there was a catch: not all code maps well to GPUs, and the vast majority of computer programmers in the world were either used to programming for a single fast serial processor or had learned to program on such a device. After all, the entire industry had relied on CPUs for several decades, and if the industry was going to turn to parallel computing it had to make writing code for these types of processors less challenging and more accessible.

From A Beautifully Simple Concept To An Industry-Wide Vision: Heterogeneous Computing

All of Intel’s past actions and future roadmaps are strong indicators that this is the future the company envisions: a future where CPUs and GPUs work seamlessly together to address the new challenges of CPU performance scaling, and to tackle problems that CPUs and GPUs simply cannot solve separately.
AMD made it very clear from 2006 onwards that its goal was to build the ultimate heterogeneous processor.
The company was able to put its vision into practice through the Heterogeneous System Architecture Foundation and mold it into an industry-wide strategy that it hopes will bear fruit. The HSA Foundation was brought into existence as a collective industry effort to chase the untapped potential of heterogeneous designs and begin a new era of computing in which performance would scale again at the rate of golden-age Silicon Valley.

AMD Forms The HSA Foundation

HSA stands for Heterogeneous System Architecture. To know what this foundation is all about, we need to take a few steps back. AMD’s goal to build the ultimate heterogeneous processor meant that it had its work cut out for it. The company’s vision for the next era of computing has a far-reaching effect on the entire industry, which made an industry-wide collaboration crucial to the success of any effort to bring this vision to reality. Luckily for AMD, many companies shared its aspirations. Industry giants such as Samsung, Qualcomm, ARM, Imagination Technologies, Mediatek and Texas Instruments joined AMD in its efforts, and the HSA Foundation was born.

So what is HSA, and how does it solve the problem?

HSA is a relatively old concept based on a simple idea: run the code on the processor that would be fastest and most efficient at executing it. Serial code with a lot of branches and conditionals is well suited to run on the CPU, because that’s the fastest and most efficient processor for this type of code. On the other hand, code that is fairly short, less conditional and massively parallel, such as the code used in graphics to calculate what color each pixel on the screen should be, is well suited for a graphics processor.


GPUs differ from traditional CPUs in several key characteristics. CPUs generally have a lot more decode and branch prediction resources because they tend to deal with more complex, branchy code. GPUs, on the other hand, are designed with a heavy emphasis on execution resources, because they deal with code that is relatively less complex and data that is massively parallel. The weight therefore falls on the execution engines rather than on a front end that has to deal with the complexity of serial code.

A great yet simple example of a heterogeneous system is a gaming computer. The graphics processor does all the graphical heavy lifting, while the CPU deals with API communication, audio processing, artificial intelligence and gameplay physics such as bullet trajectories, hit boxes, etc. Now think of HSA as a significantly more sophisticated and versatile system based on the very same concept. Instead of the GPU and CPU working on two completely different tasks, such as graphics and AI, the processors can now work on and share the exact same task, such as physics, with each processor taking care of a different stage of the task. The stages that would be completed faster on the CPU are done by the CPU, and the stages which are more appropriate for the GPU are handled by the GPU.
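To illustrate the idea, here’s a conceptual C++ sketch of our own, not actual HSA code: one physics-style task whose stages are split by their character, with gpu_stage standing in for work a real HSA runtime would dispatch to the GPU.

// Conceptual sketch of HSA-style task sharing (illustration only; a real
// HSA runtime would dispatch gpu_stage to the GPU rather than run it on
// host threads).
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

struct Body { float x, y, vx, vy; };

// Branchy, serial-friendly stage: cull bodies that left the play area.
// Lots of conditionals -> best handled by the CPU.
void cpu_stage(std::vector<Body> &bodies) {
    bodies.erase(std::remove_if(bodies.begin(), bodies.end(),
                                [](const Body &b) {
                                    return std::abs(b.x) > 1e3f ||
                                           std::abs(b.y) > 1e3f;
                                }),
                 bodies.end());
}

// Uniform, massively parallel stage: integrate every body one time step.
// The same short math on every element -> ideal for the GPU.
void gpu_stage(std::vector<Body> &bodies, float dt) {
    for (auto &b : bodies) {   // stand-in for a GPU kernel over all bodies
        b.x += b.vx * dt;
        b.y += b.vy * dt;
    }
}

int main() {
    std::vector<Body> bodies(10000, Body{0.f, 0.f, 1.f, 0.5f});
    for (int step = 0; step < 100; ++step) {
        gpu_stage(bodies, 0.016f); // parallel stage of the same task
        cpu_stage(bodies);         // branchy stage of the same task
    }
    std::cout << bodies.size() << " bodies remain\n";
}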

Luckily, this concept works exceptionally well because the majority of software out there has a healthy mix of serial and parallel workloads, making the heterogeneous processor an ideal candidate for a wide range of applications.

An example of such a task is building a suffix array. Suffix arrays are used in a variety of workloads, such as full-text index search, lossless data compression and bioinformatics.
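For reference, here is a naive suffix array construction in C++ — our own illustration of the data structure itself, not an HSA-accelerated version (production code uses faster algorithms and, per the article, can split construction stages between CPU and GPU).

// Naive suffix array construction (illustration of the data structure only).
// A suffix array lists the starting indices of all suffixes of a string in
// lexicographic order; it underpins full-text search and compression.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

std::vector<int> build_suffix_array(const std::string &s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);                // indices 0..n-1
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        // Compare the suffix starting at a with the suffix starting at b.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

int main() {
    for (int i : build_suffix_array("banana"))
        std::cout << i << ' ';   // prints: 5 3 1 0 4 2
    std::cout << '\n';
}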

Though it sounds simple, there are multiple challenges that need to be overcome first, on both the hardware and software sides.

The Software

Heterogeneous systems won’t show their true benefits until capable and compatible software is available. Members of the HSA foundation don’t expect developers to write code directly in the foundation’s intermediate instruction set language, which is called HSAIL. That is why they’re going to rely on three major languages to deliver the benefits of HSA: OpenCL 2.0, Java and Microsoft’s C++ AMP.

OpenCL 2.0 has already been ratified and it takes advantage of many aspects of HSA systems, such as memory coherency between the CPU and GPU via hUMA, which completely eliminates the need for copies between the CPU and GPU to maintain coherency. Thus, code written in OpenCL 2.0 that takes that into account will show greatly improved performance.
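To make that concrete, here’s a sketch of the host-side pattern, assuming the usual OpenCL platform/device/kernel boilerplate (omitted here) has already run: with coarse-grained SVM, a single allocation is visible to both processors, so the explicit clEnqueueWriteBuffer/clEnqueueReadBuffer copies disappear.

// Sketch of OpenCL 2.0 shared virtual memory (SVM); assumes `context`,
// `queue` and `kernel` were created by the usual OpenCL setup code,
// omitted for brevity, and that error checking is elided.
#include <CL/cl.h>

void run_svm_example(cl_context context, cl_command_queue queue,
                     cl_kernel kernel, size_t n) {
    // One allocation, visible to both CPU and GPU -- no staging buffers.
    float *data = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE,
                                      n * sizeof(float), 0);

    // Coarse-grained SVM: map before the host touches the memory...
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data,
                    n * sizeof(float), 0, NULL, NULL);
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes in place
    clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);  // ...and unmap after.

    // Hand the same pointer to the GPU kernel -- no clEnqueueWriteBuffer.
    clSetKernelArgSVMPointer(kernel, 0, data);
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);
    clFinish(queue);

    // The CPU reads results through the same pointer (map again on
    // coarse-grained systems; fine-grained SVM skips even the maps).
    clSVMFree(context, data);
}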

Java will support HSA systems via APARAPI, which compiles Java code to OpenCL. Much like Java, C++ AMP doesn’t take advantage of HSA systems directly. However, a C++ AMP compiler that generates HSAIL has been in the works for quite some time and was finally released in July of this year. Java will address HSA directly with Java 9 by generating HSAIL straight from Java bytecode; Java 9’s final release and public availability is set for September of next year.
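For a flavor of the C++ AMP model just mentioned, here’s a minimal sketch using Microsoft’s documented API (compiles with Visual C++); the restrict(amp) lambda is what gets offloaded to the accelerator, typically the GPU.

// Minimal C++ AMP sketch: array_view manages the data's visibility to the
// accelerator, and parallel_for_each runs one lightweight thread per element.
#include <amp.h>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v(1024, 1);
    concurrency::array_view<int, 1> av(static_cast<int>(v.size()), v);

    // Offloaded to the accelerator; one thread per element of av.
    concurrency::parallel_for_each(av.extent,
        [=](concurrency::index<1> idx) restrict(amp) {
            av[idx] *= 2;
        });

    av.synchronize();          // copy results back to host memory
    std::cout << v[0] << '\n'; // prints 2
}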

The Hardware

Updated memory protocols were among the first orders of business for the foundation. Heterogeneous Uniform Memory Access (hUMA) was developed to significantly reduce the overhead that was previously needed for adequate communication between the various processors in a heterogeneous system. The technology completely eliminates the need for data copies between the CPU and GPU to achieve full memory coherency. The need for data copies between the various processors in a heterogeneous system was undoubtedly the most significant bottleneck.

This means that various HSA compatible processors (GPUs, CPUs and audio processors, for example) can share data seamlessly. The seemingly insurmountable wall of overhead that was standing in front of the concept of HSA was finally torn down. The relevant code can now travel freely between different execution engines inside the system without any penalty. This was a crucial step in allowing the most appropriate processor for the task to gain quick access to the data it needs, and it is what will eventually lead to the best possible performance and efficiency in an HSA compatible processor.

AMD delivered the first HSA-capable hardware with Kaveri in early 2014. This year, the company introduced Carrizo, which is the first HSA 1.0 compliant design in the industry.

It’s not just the HSA foundation that has been working on bringing its hardware and software up to par; Intel and Nvidia have also been working toward the same goal. Right behind the HSA foundation, both Nvidia and Intel have independently developed their own implementations of unified memory access.

Nvidia’s Push For Heterogeneous Software & Hardware

Nvidia introduced the capability to share virtual memory between the CPU and GPU with CUDA 6 and its Maxwell graphics architecture. The technology doesn’t offer hardware-level unified memory access like AMD’s hUMA, which was introduced with Kaveri APUs and the GCN 1.1 graphics architecture. What it does is simplify the method by which programmers address CPU and GPU memory. What it does not do, however, is allow the processors to share data via pointers, which is essential to addressing the most significant bottleneck in a heterogeneous design. Without reducing the huge latency dictated by data copies, no tangible performance or efficiency gains can be realized from a heterogeneous processor. So despite the memory pool being “virtually” unified in software, costly data copies still have to be made between the CPU and GPU.
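Here’s a minimal sketch of that CUDA 6 managed-memory model (our illustration): one cudaMallocManaged allocation is addressed by both CPU and GPU code, while the runtime performs the migrations behind the scenes — which is exactly the copying the paragraph above says still happens under the hood on pre-Pascal hardware.

// Minimal CUDA 6 managed ("unified") memory sketch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    // One allocation addressable by both CPU and GPU; no explicit
    // cudaMemcpy calls, though the runtime still migrates pages for you.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // GPU uses same pointer
    cudaDeviceSynchronize();   // wait before the CPU touches the data again
    printf("data[0] = %f\n", data[0]);              // expect 2.0
    cudaFree(data);
    return 0;
}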

Full-blown hardware-level unified memory is something Nvidia still has in the works; it will officially debut with the Pascal GPU architecture and the Tegra SOC that integrates it next year. It’s important to remember that unified memory is an SOC feature: both the CPU and the GPU have to be on the same chip to achieve this functionality. Sharing memory across two physically distant processors is not feasible due to the great amount of latency involved.

The code name for Nvidia’s SOC with unified memory is still unknown. What we do know is that it’s going to succeed the Tegra X1, most likely sometime next year.

Intel’s Foray Into Heterogeneous Computing

Intel currently offers an extension to its Haswell SOCs that enables unified virtual memory, very much like what Nvidia has done with CUDA 6 and Maxwell. Intel even released a whitepaper on heterogeneous system architectures using OpenCL, very much in line with what the HSA foundation is aiming to achieve with OpenCL 2.0. However, unlike Nvidia, Intel managed to follow AMD’s Kaveri last year with its own fully memory coherent design implementation in Broadwell processors featuring the company’s Gen8 graphics architecture.

From Intel’s whitepaper, The Compute Architecture of Intel Processor Graphics Gen8:

“These new mechanisms can be used to maintain memory coherency and consistency for fine grained sharing throughout the memory hierarchy between CPU cores and devices. Moreover, the same virtual addresses can be shared seamlessly across devices. Such memory sharing is application programmable through emerging heterogeneous compute APIs such as the shared virtual memory (SVM) features specified in OpenCL 2.0. The net effect is that pointer-rich data structures can be shared directly between application code running on CPU cores and application code running on Intel® Processor Graphics, without programmer data structure marshalling or cumbersome software translation techniques.”

This is very much the essence of heterogeneous CPU/GPU compute, and the introduction of memory coherency with Kaveri and Broadwell last year signals the beginning of the second phase of the next era of computing. The industry began with the principal premise of combining GPU and CPU on the same piece of silicon in 2011 and progressed to full memory coherency between the two engines in 2014.

What does all of this mean?

Once you examine all the major players carefully, a crystal-clear image of the entire industry moving towards heterogeneous computing appears. AMD, Nvidia and Intel are all addressing the same challenges. As is usual in such cases, AMD chose the open standard, industry-wide route, in which the entire industry (or as much of it as possible) collaborates to achieve a common goal. Nvidia chose the proprietary route, while Intel took a more awkward position in the middle: the company is making sure its hardware is up to snuff, but is leaving many of the industry-wide software challenges for the industry to deal with rather than addressing them directly as the HSA foundation is doing. Of course, there are exceptions to this, but they remain very specific and quite limited in scope.

To summarize what all this means, we only need to take a look at one of AMD’s more intriguing press releases, in which the company talks about improving the power efficiency of its processors by a factor of 25X, or the equivalent of 2500%, in just five-and-a-half years’ time. Heterogeneous computing is listed as the number one breakthrough that will drive this enormous power efficiency improvement.

It can be quite difficult to wrap our heads around a figure as large as 25X. But what we’re essentially being told is that by 2020, a 4W processor will have the computational capacity to rival the 100W processors of today. And that’s certainly a future worth getting excited about.

