Intel has kicked off IDF15 in San Francisco and fully disclosed the details of the architecture behind its 6th generation Core processors. Codenamed Skylake, the 6th generation processor family was introduced on 5th August at Gamescom 2015, when Intel officially launched the Z170 platform with two unlocked processors, the Core i7-6700K and Core i5-6600K. Today, Intel has unwrapped the full architectural details of its 6th generation Skylake processors.
Featuring improved performance, better power efficiency and faster graphics cores, the Skylake processors are the first mainstream family from any chip maker to be fully compatible with DDR4 memory, while future processors will integrate faster eDRAM-based cache that should significantly boost performance and reduce latency so that applications load and run faster. Several key technologies are part of the Skylake family, including faster memory and storage, more PCI-E lanes, and support for future NVMe-based SSDs and even Optane (3D XPoint) SSDs. This article gives a brief insight into what Skylake is and what architectural changes were made to make it faster and more efficient than its predecessor.
Intel’s latest System-on-Chip (SoC) microarchitecture, code name Skylake. It is built on Intel’s leading 14nm process technology. This “tock” microarchitecture was completely redesigned to bring new IPs and integrations, great performance and reduced power consumption. via Intel
The analysis will revolve around the CPU microarchitecture details, followed by the GPU architecture which was detailed yesterday, the hardware P-states, and finally an insight into the new platform features of the Skylake desktop and mobility processors that are bound to launch in a few months' time. For those interested in details on specific SKUs in the Skylake family, you may want to head over to the links for the Skylake-U, Skylake-H and Skylake-S families. Those who want to read on the architecture details should visit the respective pages from the drop-down menu below:
[nextpage title=”Intel Skylake CPU Architecture Analysis”]
We have all been waiting for the details of Intel’s 6th generation Skylake processors, and the time has finally come to fully unwrap Intel’s latest 14nm core architecture. To start off, Skylake focuses on four areas: scalability, power, performance, and media and graphics. Skylake development started with just a 3x TDP scale, a 2x form factor range and a classic set of PC I/O features, but ended with a 20x TDP scale, a 4x form factor range and a wide range of I/O features across the PC and tablet devices that are going to launch in Q4 2015.
Technically, each Skylake core is bigger and wider, and features better instructions per clock and improved power efficiency. The basic core block of a Skylake chip includes four of these cores, which share an LLC through an enhanced interconnect ring known as the SOC Ring. The die includes graphics processors that range from GT1, GT2 and GT3 to more advanced, performance-oriented designs such as GT3e and GT4e with embedded DRAM (L4 cache of up to 128 MB), with support for OpenCL 2.0, DirectX 12 and OpenGL 4.4. The system agent includes the dual-channel DDR4 memory controller and the display system for embedded and external displays, while the PCI-Express lanes can be used to connect a discrete graphics card for higher performance on desktop setups. Audio DSPs and sensor hubs also get an update with Skylake, and a single integrated camera ISP can be found inside the die for better imaging quality. Lastly, Skylake delivers extended overclocking capabilities, which we will talk about in a moment.
Those are the general details of the Skylake microarchitecture, but the core deserves a closer look. Skylake features a vastly improved front end with a higher-capacity branch predictor than Haswell, a wider instruction supply, deeper buffers and faster prefetch. The deeper out-of-order buffers extract more instruction-level parallelism, while the improved EUs (Execution Units) have lower latency, come in greater numbers, can power down when idle and improve AES-GCM performance by 17% and AES-CBC by 33%. Load and store bandwidth is also higher thanks to prefetcher improvements, a deeper store buffer, better L2 cache miss bandwidth, improved page miss handling and new instructions for better cache management. Hyper-Threading performance improves with wider retirement, and the allocation queue grows to 64 entries per thread compared to 56 on Haswell and 28 per thread on Sandy Bridge. To extract more parallelism out of the core, Skylake enlarges the integer register file to 180 entries versus 168 on Haswell and the scheduler to 97 entries versus 60 on Haswell. The out-of-order window has grown to 224 entries, compared to 168 on Sandy Bridge and 192 on Haswell.
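To put those buffer sizes in perspective, here is a small sketch that turns the Haswell and Skylake figures quoted above into percentage growth. The dictionary keys are our own labels for the quoted numbers, not official Intel names.

```python
# Out-of-order resource sizes quoted in the paragraph above (entries).
# The labels are illustrative; only figures the article states for both
# Haswell and Skylake are included.
RESOURCES = {
    "out-of-order window": {"Haswell": 192, "Skylake": 224},
    "scheduler entries": {"Haswell": 60, "Skylake": 97},
    "per-thread queue": {"Haswell": 56, "Skylake": 64},
}


def growth_percent(old: int, new: int) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


for name, sizes in RESOURCES.items():
    pct = growth_percent(sizes["Haswell"], sizes["Skylake"])
    print(f"{name}: {sizes['Haswell']} -> {sizes['Skylake']} (+{pct:.1f}%)")
```

The scheduler sees by far the largest relative jump, which fits Intel's emphasis on extracting more parallelism from the core.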
The new interconnect ring delivers double the throughput without sacrificing power. LLC (Last Level Cache) throughput is also doubled for cache miss handling, DDR4 delivers vastly improved memory bandwidth to the system, and eDRAM-equipped chips can further reduce latency and raise transfer speeds within the package. The whole system is fully coherent, sharing and managing memory, data transfers and stores across the processor. We won’t dive into the power details here since they are covered on a later page, but let’s look at the overclocking architecture built into Skylake.
Intel Skylake Receives Full Range BCLK Improvements and Finer Grain Tuning For DDR4 Memory
With Skylake, Intel is expanding overclocking support on its processors. Intel already added significant overclocking features on Haswell, with real-time overclocking software, ratio-based base clock overclocking and the latest Intel XMP modes through the Intel Extreme Tuning Utility. With Skylake, Intel adds full-range BCLK tuning and improved DDR4 memory overclocking. With a fully unlocked turbo design that is controllable through software and BIOS, BCLK can now be tuned across the full range in 1 MHz increments, where Haswell offered only ratio-based steps of 100/125/166 MHz. The unlocked core ratios can be tuned up to 83 in 100 MHz increments, with complete turbo overrides for voltage, power limits and IccMax. Skylake also fully supports DDR4 overclocking with overrides of up to 4133 MT/s and finer-grained memory steps of 100/133 MHz compared to 200/266 MHz before. Even the graphics clock can be tuned with ratios up to 60 in 50 MHz increments, with full turbo voltage controls.
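The relationship between BCLK, ratio and final clock is simple multiplication, and a short sketch makes the new tuning range easier to picture. The 40x ratio and 103 MHz BCLK below are illustrative values of our own, and the computed ceilings are limits of the settings rather than clocks anyone should expect to reach.

```python
MAX_CORE_RATIO = 83  # upper core ratio limit quoted in the article
MAX_GT_RATIO = 60    # upper graphics ratio limit quoted in the article
GT_STEP_MHZ = 50     # graphics ratio step quoted in the article


def clock_mhz(bclk_mhz: float, ratio: int) -> float:
    """Resulting clock in MHz: base clock (BCLK) multiplied by the ratio."""
    return float(bclk_mhz) * ratio


# A 100 MHz BCLK with a 40x ratio (illustrative values).
print(clock_mhz(100, 40))                    # 4000.0
# Same ratio with BCLK raised in 1 MHz steps, which Skylake allows.
print(clock_mhz(103, 40))                    # 4120.0
# Ceiling implied by the core ratio limit at the default 100 MHz BCLK.
print(clock_mhz(100, MAX_CORE_RATIO))        # 8300.0
# Ceiling implied by the graphics ratio limit and its 50 MHz step.
print(clock_mhz(GT_STEP_MHZ, MAX_GT_RATIO))  # 3000.0
```

The second call shows why 1 MHz BCLK steps matter: a small base clock bump moves the final frequency by 40 MHz per step at this ratio, which is much finer than jumping between 100, 125 and 166 MHz straps.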
Intel has also focused on the Top-down Microarchitecture Analysis Method (TMAM), an industry-proven systematic approach that identifies performance bottlenecks in out-of-order cores. Identifying true bottlenecks lets developers focus software tuning on them and improve efficiency on the same hardware. TMAM simplifies cycle accounting using microarchitecture-independent metrics organized in a single hierarchy, which makes analysis simple. Using TMAM, the steep learning curve associated with each microarchitecture generation is replaced by a structured drill-down that guides the user to the true performance limiters.
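To give a feel for what the top of that hierarchy looks like, here is a minimal sketch of the first-level breakdown as we understand it from Intel's published descriptions of the Top-Down method: every issue slot is classified as front-end bound, bad speculation, retiring or back-end bound. The 4-wide pipeline, the field names and the sample numbers are assumptions for illustration and were not part of Intel's IDF material.

```python
from dataclasses import dataclass

PIPELINE_WIDTH = 4  # assumption: four issue slots per cycle


@dataclass
class Counters:
    """Hypothetical per-run counter totals (names are illustrative)."""
    cycles: int                # unhalted core cycles
    uops_issued: int           # uops issued into the back end
    uops_retired_slots: int    # issue slots that ended up retiring uops
    fe_undelivered_slots: int  # slots the front end left empty while the back end was ready
    recovery_cycles: int       # cycles spent recovering from mispredictions


def top_level(c: Counters) -> dict:
    """Split all issue slots into the four top-level categories."""
    slots = PIPELINE_WIDTH * c.cycles
    frontend = c.fe_undelivered_slots / slots
    retiring = c.uops_retired_slots / slots
    bad_spec = (c.uops_issued - c.uops_retired_slots
                + PIPELINE_WIDTH * c.recovery_cycles) / slots
    # Whatever is left over is attributed to the back end.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }


sample = Counters(cycles=1000, uops_issued=2200, uops_retired_slots=2000,
                  fe_undelivered_slots=400, recovery_cycles=50)
for category, share in top_level(sample).items():
    print(f"{category}: {share:.0%}")
```

The largest category tells the developer which branch of the hierarchy to drill into next, which is the structured drill-down the method promises.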
[nextpage title=”Intel Skylake GPU Architecture Analysis”]
Moving on, we have the graphics architecture analysis, which is based around Intel’s Gen9 GPU. This page covers the key information and insights on Skylake’s GPU architecture. Right now, we can show you the full details of the Skylake Gen9 graphics architecture, which is divided into three *initial* tiers: GT2, GT3/e and GT4/e.
The details Intel provides on the Skylake iGPUs include a block diagram of the Core i7-6700K, which launched earlier this month as the flagship offering on the Z170 platform. The chip, which Intel terms an SOC (System on Chip), houses four CPU cores with a shared LLC, interconnected through the SOC Ring, a bi-directional, 32-byte wide bus that also connects to the iGPU and the system agent. All memory transactions to and from the CPU cores and the iGPU are handled by this ring, through the system agent and the unified DRAM controller. Intel would like to call it an SOC, but it is in fact a semi-SOC, since the PCH (southbridge) is still housed on the motherboard while the northbridge moved into the CPU quite some time ago. Intel has saved a lot of die space with its 14nm CPUs, but moving the PCH onto the chip itself would require a lot of room, even more so if eDRAM is featured on the side of the chip package. A suitable example is the Broadwell Core i7-5775C, which houses 128 MB of eDRAM cache on the package alongside the main die. That, along with the transistor count of the main die, takes up a lot of room and leaves little space for any further additions.
Some of the key improvements and changes for the Skylake Gen9 graphics include:
Intel Skylake Gen9 Graphics Features:
Gen9 Memory Hierarchy Refinements:
Coherent SVM write performance is significantly improved via new LLC cache management policies.
The available L3 cache capacity has been increased to 768 Kbytes per slice (512 Kbytes for application data).
The sizes of both L3 and LLC request queues have been increased. This improves latency hiding to achieve better effective bandwidth against the architecture peak theoretical.
In Gen9 EDRAM now acts as a memory-side cache between LLC and DRAM. Also, the EDRAM memory controller has moved into the system agent, adjacent to the display controller, to support power efficient and low latency display refresh.
Texture samplers now natively support an NV12 YUV format for improved surface sharing between compute APIs and media fixed function units.
Gen9 Compute Capability Refinements:
Preemption of compute applications is now supported at a thread level, meaning that compute threads can be preempted (and later resumed) midway through their execution.
Round robin scheduling of threads within an execution unit.
Gen9 adds new native support for the 32-bit float atomics operations of min, max, and compare/exchange. Also the performance of all 32-bit atomics is improved for kernel scenarios that issued multiple atomics back to back.
16-bit floating point capability is improved with native support for denormals and gradual underflow.
Gen9 Product Configuration Flexibility:
Gen9 has been designed to enable products with 1, 2 or 3 slices.
Gen9 adds new power gating and clock domains for more efficient dynamic power management. This can particularly improve low power media playback modes.
Intel Skylake GT2/e Graphics With 24 EUs and Optional eDRAM
The first Gen9 graphics part we are going to talk about is the GT2 core found in the HD Graphics 530 featured on both unlocked Skylake desktop processors. As was the case with Haswell, slices of the new graphics block can be added or removed to form different graphics SKUs for a range of products. Each slice is composed of several subslices, which in turn contain the foundation of the Gen9 graphics block, the EU (Execution Unit). Each EU combines SMT and IMT (Simultaneous and fine-grained Interleaved Multi-Threading) and drives multiple SIMD ALUs across multiple threads. Each EU thread gets 128 general purpose registers of 32 bytes each, for a 4 Kbyte general purpose register file (GRF) per thread. Since Gen9 has 7 threads per EU, this amounts to 28 Kbytes of GRF on each EU. Computation is handled by a pair of SIMD FPUs that can each execute up to four 32-bit floating-point or integer operations, or up to eight 16-bit floating-point or integer operations, while retaining FP64 double precision compute capability. The 16-bit floating-point path, improved in Skylake, runs at twice the rate of FP32, a similar route to what NVIDIA is planning for its Pascal graphics processors next year.
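The register file figures are easy to verify with a few lines of arithmetic. This sketch reads the 128 registers as a per-thread figure, which is the only reading under which the 4 Kbyte and 28 Kbyte totals work out.

```python
REGISTERS_PER_THREAD = 128  # general purpose registers (read as per EU thread)
BYTES_PER_REGISTER = 32     # each register stores 32 bytes
THREADS_PER_EU = 7          # hardware threads per Gen9 EU

grf_bytes_per_thread = REGISTERS_PER_THREAD * BYTES_PER_REGISTER
grf_bytes_per_eu = grf_bytes_per_thread * THREADS_PER_EU

print(grf_bytes_per_thread // 1024, "Kbytes of GRF per thread")  # 4
print(grf_bytes_per_eu // 1024, "Kbytes of GRF per EU")          # 28
```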
The GT2 graphics chip has three subslices with 8 EUs per subslice. This makes a 24 EU slice, which is connected to the L3 cache through the SOC ring we talked about earlier. Each subslice comes with a local thread dispatcher which connects the different subslices as a single unified slice. Each also has a sampler (a read-only memory fetch unit) that is used for sampling tiled (or not tiled) texture and image surfaces, with its own sampler L1 and L2 cache. The data port is a memory load/store unit, while the GTI (Graphics Technology Interface) works as a gateway between the Gen9 iGPU and the rest of the chip.
The HD Graphics 530 housed inside the Core i7-6700K has a base clock of 350 MHz and a boost clock of 1150 MHz. With 24 execution units and an improved design, we have seen an increase in overall graphics performance. More surprisingly, the 6700K die also houses an optional eDRAM controller that can drive 64 MB to 128 MB of eDRAM (L4) cache at frequencies of up to 1.6 GHz to increase bandwidth, reduce latency and improve performance on faster iGPUs. Such high-bandwidth memory is hardly necessary on a 6700K-class processor, but the optional controller does suggest that we may find some GT2 chips with embedded DRAM.
Intel Skylake GT3/e Graphics With 48 EUs and eDRAM
The Skylake GT3 graphics also come in two variants, one with eDRAM and one without it. The chip houses 48 Execution units that are partitioned in two slices, each slice consisting of three subslices with 8 EUs per subslice. Each slice has its own L3 data cache and a unified memory interface. The chip will house up to 64 MB of L4 cache with the GT3e variants that come later this year.
Intel Skylake GT4/e Graphics With 72 EUs and eDRAM
Finally, we have the fastest graphics chip Intel has ever made, the GT4-class integrated graphics chip. This Gen9 core combines three slices of 24 EUs, where each slice is composed of three subslices of 8 EUs each. With three L3 data caches combined through the unified local memory interface in a large package, the chip will house up to 128 MB of L4 cache (eDRAM). It will be featured on Iris Pro class processors with increased graphics performance compared to traditional iGPU-based processors. At a 1 GHz clock, the GT4e graphics chip can pump out 1152 GFlops of compute performance, and that is without taking into account the performance of the CPU cores.
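The 1152 GFlops figure can be reproduced from the EU layout described on this page. The sketch below assumes each of the two FPUs in an EU performs a fused multiply-add on four 32-bit lanes per clock and that a multiply-add counts as two floating-point operations; that counting convention is our assumption, though it does reproduce the quoted number.

```python
FPUS_PER_EU = 2      # pair of SIMD FPUs per EU (from the article)
FP32_LANES = 4       # 32-bit operations per FPU (from the article)
FLOPS_PER_LANE = 2   # assumption: a multiply-add counts as 2 flops


def eu_count(slices: int, subslices_per_slice: int = 3,
             eus_per_subslice: int = 8) -> int:
    """Total EUs for a Gen9 part built from the given number of slices."""
    return slices * subslices_per_slice * eus_per_subslice


def peak_gflops(eus: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 throughput in GFlops at the given clock."""
    return eus * FPUS_PER_EU * FP32_LANES * FLOPS_PER_LANE * clock_ghz


for name, slices, clock_ghz in [("GT2", 1, 1.15), ("GT3", 2, 1.0), ("GT4", 3, 1.0)]:
    eus = eu_count(slices)
    print(f"{name}: {eus} EUs, {peak_gflops(eus, clock_ghz):.1f} GFlops at {clock_ghz} GHz")
```

With the HD Graphics 530 at its 1150 MHz boost clock, the same formula gives roughly 442 GFlops for the 24 EU GT2 part. Keep in mind these are theoretical peaks, not measured performance.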
We have seen the iGPUs of the Core i7-6700K and Core i5-6600K come close to really low-end discrete graphics cards such as the Radeon R5 230 or the GeForce GT 720/730, while neither NVIDIA nor AMD has released any entry-level discrete parts in the latest Radeon 300 series and GeForce 900 series lineups as of yet. AMD has a reason for holding back, since the APUs it ships already come with GCN-powered graphics, but discrete graphics card market share in major markets such as China has fallen drastically for both GPU makers, and more people are buying or aiming at high-end graphics cards. This raises the question: is the low-end discrete graphics market going to end in a couple of years?
Neither AMD nor NVIDIA has focused on performance improvements in the low-end sector. Their cards below the Radeon R7 360/260 or the GTX 950/750 are quite poor in terms of performance and retail at around $50 US, so a user who goes with integrated graphics instead is better off, since that $50 US is saved for additional upgrades. The market here was wide open for AMD and Intel to leverage their own iGPUs, and they have done so, but Intel seems to be doing it better. The chips Intel currently offers are only found on expensive processors, but within a few generations we could see the same or better performance scale down to mid-tier chips that are the equivalent of an AMD APU retailing at around $150 US – $200 US. This would give users a decent processor with graphics performance on par with a GeForce GTX 750 Ti class card, making for an ideal budget PC build with low power consumption.
The concept of onboard graphics has been around for a long time. There was a time when all three giants used to integrate these chips on motherboards, but with the shrinking desktop market, they had to revise their GPU strategies and narrow their focus. NVIDIA went the discrete and mobility route, ATi merged with AMD and offered discrete solutions, mobility solutions and APUs, while Intel went all out on integrated graphics. When we look at discrete market share, we see NVIDIA dominating with around 70-75% and AMD holding 25-30% up till Q4 ’14 (based on figures compiled by Beyond3D). However, when we look at overall GPU market share, which includes discrete, integrated and mobility chips, we see a bigger divide. The figures from Jon Peddie Research up till Q4 ’14 show AMD at 13.61%, NVIDIA at 15% and Intel dominating the market with an insanely high 71.39% share.
Now we know these numbers are not representative of the current market, as we have since seen AMD and NVIDIA launch insanely high-performance graphics cards and Intel introduce two new architectures with new graphics capabilities. But even if AMD and NVIDIA do gain graphics share, they won’t come close to Intel’s dominance, and do note that Intel is known for its processors and chipsets, not its graphics chips. So what led to this gain? Integration across the board: Intel has secured lots of design wins in the mobility world, and that is where the market growth is. In a short span of time, Intel has drawn level with AMD’s Carrizo APU, which houses the latest Excavator cores and GCN 1.2 graphics. Intel has a recipe for disaster cooked up for AMD and NVIDIA in the entry to mid-range mobility market, and possibly even in the discrete GPU market when we look ahead 3-5 years. But don’t get carried away: Intel has limits to what it can do on existing silicon. Unlike discrete GPUs, integrated solutions must share the processor’s cooling and power budget, and graphics demands keep increasing. Intel certainly won’t be able to tackle AMD or NVIDIA in the mid to high-end discrete GPU market unless it comes up with its own discrete solution, unlike the failed Larrabee, which now serves as the foundation of its Xeon Phi (MIC HPC accelerator) line.
| Chip Name | GPU Core | GFlops (GPU Only) | GFlops (Whole Package) |
| --- | --- | --- | --- |
| AMD Radeon R7 360 | Tobago Pro | 1536 GFlops | N/A |
| NVIDIA GeForce GTX 750 Ti | Maxwell GM107 | 1389 GFlops | N/A |
| AMD Radeon R7 250X | Cape Verde XT | 1216 GFlops | N/A |
| Intel Skylake Gen9 GT4/e | Intel Iris Pro 580 | 1152 GFlops @ 1 GHz | TBC |
| NVIDIA GeForce GTX 750 | Maxwell GM107 | 1044 GFlops | N/A |
| AMD Radeon R9 M370X | Venus XT | 992 GFlops | N/A |
| Intel Skylake Gen9 GT3/e | Intel Iris 560/570? | 884 GFlops (Estimation) | TBC |
| AMD Carrizo FX-8800P | GCN 1.2 | 819 GFlops | 1070 GFlops |
| Intel Core i7-5775C | Intel Iris Pro 6200 | 768 GFlops @ 1 GHz | 883 GFlops |
| AMD Kaveri A10-7850K | GCN 1.1 | 737 GFlops | 856 GFlops |
| Intel Core i7-5557U | Intel Iris 6100 | 724 GFlops | 845 GFlops |
| AMD Richland A10-6800K | VLIW4 | 648 GFlops | 779 GFlops |
| Intel Skylake Core i7-6700K | Intel HD 530 | 442 GFlops | TBC |
| Intel Haswell Core i7-4790K | Intel HD 4600 | 400 GFlops | 512 GFlops |
All this goes to show that the integrated graphics cores from AMD and Intel might actually change the discrete market as we know it by reducing the need for entry-level discrete graphics. AMD also has future plans to scale multi-TFlops GPUs down into APUs, termed HPC APUs. These high-end “TFlops class” SOCs are designed specifically for the server space and could come from Intel, AMD and even NVIDIA, if it manages to pair enhanced GPGPU performance with its Denver CPUs, to rival the likes of high-end discrete cards. Call me skeptical, but I don’t expect to see these parts for several years. Once they do arrive, things are going to become a lot more interesting in the graphics world.
[nextpage title=”Intel Skylake Speed Shift Technology – Power Performance and Energy Efficiency”]
Over the generations leading up to Skylake, Intel has made dramatic changes to the power management of its chips. With Skylake, Intel takes another step forward, reducing power consumption and enhancing efficiency by removing a few things and integrating a few new ones. We know that a Skylake SOC can consist of 2-4 CPU cores, graphics cores, media IPs, a ring interconnect, cache and the ISA (Integrated System Agent). An SOC design can furthermore include an on-package PCH and eDRAM. Intel adds a new unit known as the PCU (Package Control Unit), which combines power management logic and controller firmware. It tracks internal statistics of the SOC, collects internal and external power telemetry (iMon, Psys) and can interface with higher power management hierarchies such as the OS, BIOS, EC, graphics driver and DPTF.
Until now, P-states have divided energy and frequency demands into several tiers controlled by the operating system. This demand-based P-state algorithm is somewhat slow, as it steps through several states, from the energy-efficient Pn (minimum voltage) state through intermediate states such as P2 and P1 to the P0 one-core or P0 two-core states that deliver the highest frequency on demand. With Skylake, Intel introduces Speed Shift technology (hardware P-states), a highly dynamic power management system that can account for multi-core designs, AVX and accelerators to enhance efficiency. Power gating is also far more aggressive on Skylake and can even shut down the AVX2 hardware completely when it’s not in use.
By exposing the entire frequency range, Skylake delivers smarter power management: small form factors get a larger turbo frequency range, and finer-grained microarchitectural observability lets hardware and software share power and performance control. The lowest frequency under the new management scheme is now 100 MHz on Skylake.
Intel Skylake Hardware P-States:
• Operating system directed minimum quality of service, directive of desired performance and energy performance balancing
• Autonomous algorithms that self-manage the P-states within the OS directed range (normally full min to max range)
• Detecting user interaction and accelerating responsiveness
• Energy efficient race to halt
• Core duty cycling
• The power and power delivery controls including: PL1/2/3/TDC/Psys
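To illustrate the division of labour in the list above, here is a toy model in which the operating system supplies a range and an energy/performance bias, and an autonomous policy picks a frequency inside that range. This is purely our own illustration of the idea and not Intel's algorithm; the class, the function and the scaling rule are all hypothetical.

```python
from dataclasses import dataclass

FLOOR_MHZ = 100  # lowest frequency the article quotes for Skylake


@dataclass
class OsHints:
    """Hypothetical stand-in for the hints the OS hands to the hardware."""
    min_mhz: int        # OS-directed minimum (quality-of-service floor)
    max_mhz: int        # OS-directed maximum
    energy_bias: float  # 0.0 favours performance, 1.0 favours energy saving


def pick_frequency(hints: OsHints, utilization: float) -> int:
    """Toy autonomous policy: scale with load, but stay in the OS range."""
    lo = max(hints.min_mhz, FLOOR_MHZ)
    hi = max(hints.max_mhz, lo)
    util = min(max(utilization, 0.0), 1.0)
    # A performance-biased setting reaches the top of the range at lower load.
    headroom = 1.0 + (1.0 - hints.energy_bias)
    target = lo + (hi - lo) * min(util * headroom, 1.0)
    return int(round(target))


performance = OsHints(min_mhz=800, max_mhz=4000, energy_bias=0.0)
efficiency = OsHints(min_mhz=800, max_mhz=4000, energy_bias=1.0)
print(pick_frequency(performance, 0.5))  # 4000
print(pick_frequency(efficiency, 0.5))   # 2400
```

The point of the sketch is the split in responsibility: the OS only sets the bounds and the bias, while the choice of operating point within them is made without waiting for an OS decision.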
The post [IDF15]Intel’s 6th Gen Skylake Unwrapped – CPU Microarchitecture, Gen9 Graphics Core and Speed Shift Hardware P-State by Hassan Mujtaba appeared first on WCCFtech.