2014-05-16

The Graphics Core Next (GCN) architecture powers the AMD Radeon™ R7 series, R9 series, and most HD 7000 series graphics cards, including the world’s fastest graphics card1, the AMD Radeon™ R9 295X2, and the popular AMD Radeon™ R9 290X. With GCN also being used in next-generation consoles, optimizing shaders for this architecture is a high priority for game developers.

Previously, game developers could use GPU ShaderAnalyzer (GSA) to analyze DirectX® HLSL shader performance. However, GSA currently lacks support for GCN. The recently released CodeXL 1.4 now provides this functionality via its command-line tool, CodeXLAnalyzer, which outputs disassembly of the generated GCN hardware shader. It can also provide useful shader statistics such as general-purpose register (GPR) usage. And because the Instruction Set Architecture (ISA) for GCN is publicly available (for the ASIC family codenames “Southern Islands” and “Sea Islands”), game developers can gain a deeper understanding of how a shader executes and optimize their HLSL to achieve more efficient results.

CodeXLAnalyzer overview

CodeXLAnalyzer is a command-line tool that supports offline compilation of HLSL shaders into GCN hardware shaders. It uses a live-driver model, meaning it utilizes the compilers of the DirectX runtime and AMD driver installed on the local machine. First, it uses the D3DCompile API to compile HLSL into DirectX binary. Because of this step, CodeXLAnalyzer requires certain arguments that correspond to parameters in the D3DCompile function. For example, it requires the shader function name (the entry point parameter in D3DCompile) and the profile (the target parameter in D3DCompile, e.g. ps_5_0). All errors and warnings from the D3DCompile API will appear in the console output.

After compiling into DirectX binary, CodeXLAnalyzer will compile this result into a GCN hardware shader using AMD’s DirectX driver and write the disassembled GCN code listing to a file. You can specify which GCN ASIC to target, independent of the graphics card installed on the local machine. You can also have the tool output shader resource usage statistics to a comma-separated value (CSV) file. Let’s look at some examples.

Example 1: Print the help text

Use the -h option to get a list of available options.
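For example (assuming CodeXLAnalyzer is run from the CodeXL install directory or is on your path):

```
CodeXLAnalyzer -h
```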

Note that CodeXL also supports OpenCL. For HLSL, skip the OpenCL section and go to the section below marked “DX Instructions & Options (Windows Only).”



Example 2: The basics
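A minimal pixel shader for this example might look like the following. The texture, sampler, and function body here are hypothetical stand-ins; only the entry-point name PsExample matters for the command line:

```hlsl
Texture2D    g_Texture : register(t0);
SamplerState g_Sampler : register(s0);

struct PsInput
{
    float4 position : SV_POSITION;
    float2 uv       : TEXCOORD0;
};

// Entry point referenced by the -f option below
float4 PsExample(PsInput input) : SV_TARGET
{
    return g_Texture.Sample(g_Sampler, input.uv);
}
```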

Assume a simple pixel shader with an entry point named PsExample is in a source file named Example.hlsl. The following command line will compile that pixel shader for the Hawaii ASIC (e.g. the R9 290X):
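Putting together the options described below, the full command line looks like this:

```
CodeXLAnalyzer -c Hawaii -f PsExample -s HLSL -p ps_5_0 -a PsExampleStats.csv --isa PsExampleISA.txt Example.hlsl
```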

The output will look like this:



Let’s break down the options one by one.

-c Hawaii

Compile for the Hawaii ASIC. This does not have to match the card in your machine.

-f PsExample

Compile the function named PsExample.

-s HLSL

Specify that the source is HLSL.

-p ps_5_0

Use the ps_5_0 profile for D3DCompile.

-a PsExampleStats.csv

Perform shader analysis and write the resulting shader statistics to the specified file.

--isa PsExampleISA.txt

Write the GCN disassembly to the specified file.

Example.hlsl

Specify the HLSL source file.

Example 3: D3DCompile DLL location

Note that, by default, CodeXLAnalyzer will use the D3DCompiler_*.DLL installed in %WINDIR%\System32 (and %WINDIR%\SysWOW64). This is where the legacy DirectX SDK installed D3DCompile. To use a different version (e.g. from the Win8.x SDK), use the --DXLocation command-line option.
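For example (the DLL path below is hypothetical; point it at whichever compiler DLL your engine actually ships with):

```
CodeXLAnalyzer -c Hawaii -f PsExample -s HLSL -p ps_5_0 --isa PsExampleISA.txt --DXLocation "C:\Program Files (x86)\Windows Kits\8.1\bin\x86\d3dcompiler_47.dll" Example.hlsl
```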

Example 4: D3DCompile flags

You can use the --DXFlags option in CodeXLAnalyzer to compile the HLSL source with the same flags that your engine uses. The argument of the --DXFlags option is passed directly to the Flags1 parameter in D3DCompile. This requires you to determine the integer value for the set of flags that you use. This is relatively easy to do in the Visual Studio debugger. One option is to put your flags into a variable and then set a breakpoint and look at the resulting value. For example:

Alternatively, you can use the left-shift expressions from the D3DCOMPILE defines in d3dcompiler.h directly in a watch window. For example, here are the defines for the flags used in the above example taken directly from d3dcompiler.h:
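For instance, here are two of those defines verbatim from d3dcompiler.h (which ones you need depends on the flags your engine uses):

```
#define D3DCOMPILE_PACK_MATRIX_ROW_MAJOR  (1 << 3)
#define D3DCOMPILE_OPTIMIZATION_LEVEL3    (1 << 15)
```

For that pair of flags, the watch expression (1 << 3) | (1 << 15) evaluates to 32776.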

And here is the resulting watch expression (along with one for the CompileFlags variable from the above example):



And lastly, here is the DXFlags option added to the command line for our simple pixel shader:
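For example, assuming a flag value of 32776 (D3DCOMPILE_PACK_MATRIX_ROW_MAJOR | D3DCOMPILE_OPTIMIZATION_LEVEL3, i.e. (1 << 3) | (1 << 15)):

```
CodeXLAnalyzer -c Hawaii -f PsExample -s HLSL -p ps_5_0 --DXFlags 32776 -a PsExampleStats.csv --isa PsExampleISA.txt Example.hlsl
```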

If you do not use the DXFlags option, CodeXLAnalyzer passes zero to D3DCompile, which corresponds to D3DCOMPILE_OPTIMIZATION_LEVEL1.

Example 5: Precompiler defines

Assume a compute shader in a source file named ExampleCS.hlsl that is similar to compute shader code in the TiledLighting11 Radeon SDK sample: it takes the depth buffer as input and calculates per-screen-tile depth bounds (min and max depth). For simplicity, this example only calculates the max.
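Such a shader might look like the following sketch. The resource names, bindings, entry-point name (CSExample), and output indexing are assumptions for illustration; the single groupshared uint and the numthreads parameters match the discussion below:

```hlsl
// Hypothetical sketch of a per-tile depth-max compute shader.
Texture2D<float>   g_DepthTexture : register(t0);
RWTexture2D<float> g_MaxDepth     : register(u0);

groupshared uint ldsZMax;

[numthreads(NUM_THREADS_X, NUM_THREADS_Y, 1)]
void CSExample(uint3 globalIdx : SV_DispatchThreadID,
               uint3 localIdx  : SV_GroupThreadID,
               uint3 groupIdx  : SV_GroupID)
{
    uint localFlatIdx = localIdx.x + localIdx.y * NUM_THREADS_X;

    // One thread initializes the shared max before anyone reads it.
    if (localFlatIdx == 0)
    {
        ldsZMax = 0;
    }
    GroupMemoryBarrierWithGroupSync();

    // Reinterpret depth as uint so InterlockedMax can be used
    // (valid because depth values are non-negative).
    float depth = g_DepthTexture.Load(int3(globalIdx.xy, 0)).x;
    InterlockedMax(ldsZMax, asuint(depth));
    GroupMemoryBarrierWithGroupSync();

    // One thread writes the per-tile result.
    if (localFlatIdx == 0)
    {
        g_MaxDepth[groupIdx.xy] = asfloat(ldsZMax);
    }
}
```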

For this shader, the screen tile size maps to the X and Y parameters of the numthreads attribute. Note that NUM_THREADS_X and NUM_THREADS_Y are not defined in the shader file. They need to be specified during compilation. You can use the -D option in CodeXLAnalyzer to specify defines for D3DCompile:
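For example, with a hypothetical entry point named CSExample and a 16×16 tile size (the NAME=VALUE syntax follows the usual compiler convention):

```
CodeXLAnalyzer -c Hawaii -f CSExample -s HLSL -p cs_5_0 -D NUM_THREADS_X=16 -D NUM_THREADS_Y=16 -a ExampleCSStats.csv --isa ExampleCSISA.txt ExampleCS.hlsl
```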

GCN hardware shader statistics

The -a option causes certain shader statistics to be written to a comma-separated value (CSV) file. For CodeXL 1.4, details are given for the following hardware resources: Scalar General Purpose Registers (SGPRs), Vector General Purpose Registers (VGPRs), Local Data Share (LDS), and scratch memory.

Here are the results for the simple pixel shader in Example 2:

VGPRsUsed:

For pixel shaders, you will often be concerned about the number of VGPRs used. To understand why, let’s briefly review basic GCN architecture. A GCN GPU is made up of several compute units (CUs) operating in parallel. A CU contains four SIMDs, each consisting of a vector ALU and 64 KB of memory for vector GPRs. Work is performed on the SIMDs in groups of 64 work-items (i.e. 64 threads) called wavefronts.

The SIMDs hide memory latency by having several wavefronts in flight at the same time, allowing the compute unit scheduler to switch between different wavefronts. For example, while one wavefront is waiting for results from memory, other wavefronts can issue memory requests. Each SIMD supports a maximum of 10 simultaneous wavefronts in flight. However, whether a particular kernel (i.e. shader) can achieve the maximum depends on several factors. For HLSL pixel shaders, the limiting factor is often VGPR usage.

Recall that each SIMD has 64 KB of local memory for VGPRs. Each VGPR used by a shader (VGPRsUsed in the CodeXLAnalyzer shader statistics) holds a single 32-bit value (as opposed to a four-component vector). 64 KB divided by 4 bytes per VGPR and 64 threads per wavefront gives a limit of 256 VGPRs per thread (i.e. per shader). But at that 256-VGPR limit, the entire 64 KB of VGPR memory is consumed by a single wavefront, so you cannot have multiple wavefronts in flight. This limits latency hiding and can lead to lower performance.

The following table shows VGPR usage vs. the resulting limit on simultaneous wavefronts in flight:

For example, if VGPRsUsed in the CodeXLAnalyzer shader statistics is in the range of 49 to 64, the shader is limited to 4 simultaneous wavefronts per SIMD. Note that if your shader is currently near the next higher VGPR threshold, a seemingly minor change in the HLSL can send it into the next range and drop your waves per SIMD down. This can result in a noticeable performance decrease and leave you wondering why your seemingly minor change had a not-so-minor performance impact. CodeXLAnalyzer can help you diagnose these situations.
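The thresholds in that table follow directly from the 256-VGPR budget. A small sketch (assuming the 4-register allocation granularity described in the GCN ISA documents):

```python
def max_waves_per_simd_vgpr(vgprs_used, budget=256, granularity=4, cap=10):
    """Max simultaneous wavefronts per SIMD, as limited by VGPR usage alone."""
    # Round the shader's usage up to a whole allocation granule.
    allocated = -(-vgprs_used // granularity) * granularity
    return min(cap, budget // allocated)

# Threshold behavior: 64 VGPRs still allows 4 waves; 65 drops to 3.
print(max_waves_per_simd_vgpr(64), max_waves_per_simd_vgpr(65))
```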

SGPRsUsed:

VGPRs are only one factor in determining the maximum number of simultaneous wavefronts per SIMD. SGPRs are another potentially limiting factor. Each SIMD has 2 KB of memory for SGPRs (8 KB total across the four SIMDs in a compute unit). The value in a particular SGPR is shared across all threads in a wavefront (whereas each thread has its own value in a VGPR). Like VGPRs, each SGPR holds a single 32-bit value. 2 KB per SIMD with 4 bytes per SGPR means a limit of 512 SGPRs per wavefront. However, only 104 of these are available to the shader (and two of those are reserved by the AMD driver’s shader compiler, leaving 102). This leads to the following table:
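The same arithmetic applies to SGPRs, with a 512-register budget per SIMD. A sketch (the 8-register allocation granularity is an assumption based on the GCN ISA documents):

```python
def max_waves_per_simd_sgpr(sgprs_used, budget=512, granularity=8, cap=10):
    """Max simultaneous wavefronts per SIMD, as limited by SGPR usage alone."""
    allocated = -(-sgprs_used // granularity) * granularity  # round up to a granule
    return min(cap, budget // allocated)

# Even at the 102-SGPR maximum available to a shader,
# SGPRs alone still allow multiple waves per SIMD.
print(max_waves_per_simd_sgpr(102))
```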

There are two other factors that determine the number of simultaneous wavefronts for a shader. The first is local data share (LDS) usage, which is given by UsedLDS in the CodeXLAnalyzer shader statistics. In DirectX terminology, LDS maps to Thread Group Shared Memory (TGSM). That is, shader variables marked with the groupshared modifier will result in non-zero UsedLDS in the statistics. We will discuss UsedLDS more in the next section.

The last remaining factor is the number of wavefronts in a work-group. A work-group is a collection of work-items executing on a compute unit and can consist of one or more wavefronts. This is the same concept as a thread group in HLSL compute shaders. Thus, the X, Y, and Z parameters of the numthreads attribute determine the work-group size for a compute shader. With multiple wavefronts per work-group, there is a maximum of 16 work-groups per GCN compute unit, caused by a limit on the number of simultaneously active hardware barriers. When there is only one wavefront per work-group, barriers are unnecessary and the barrier limit does not apply. If, however, a compute shader resulted in 2 wavefronts per work-group, that would mean a maximum of 32 (2 times 16) wavefronts per compute unit, or 8 waves per SIMD.

The lowest max waves per SIMD from these four factors (VGPRs, SGPRs, LDS, and work-group size) will determine the number of simultaneous wavefronts for a shader. Note that for pixel shaders (and all other HLSL shader types other than compute shaders), you do not have control over LDS usage or work-group size. Thus, when analyzing these types of shaders, you need only focus on VGPRs and SGPRs. Furthermore, in practice, it is rare for SGPRs to be the limiting factor in HLSL pixel shaders, and so your focus will often be VGPR usage.

UsedLDS:

As mentioned in the previous section, compute shader variables marked with the groupshared modifier will result in non-zero UsedLDS in the statistics. Here are the results for the simple compute shader in Example 5:

Note that only a single uint was declared groupshared in the example shader, but 16 bytes is the minimum amount of LDS that can be allocated. The MaxLDS value of 32768 in the shader statistics represents the 32 KB maximum size for the groupshared storage class in D3D11.

To determine how LDS usage affects the number of simultaneous wavefronts, note that the LDS used by a shader is shared by all work-items in a work-group (or in HLSL parlance, all threads in a thread group). A GCN compute unit has 64 KB LDS, so divide 64 KB (not the 32 KB D3D11 limit) by UsedLDS and floor the result. This gives you the maximum number of work-groups that can fit into LDS. Our simple example of 16 bytes UsedLDS isn’t very interesting, so assume 8 KB was used instead. That results in 8 work-groups fitting in LDS. Converting that to wavefronts requires knowing the work-group size (i.e. the number of wavefronts per work-group).

Recall the numthreads declaration from Example 5:

NUM_THREADS_X and NUM_THREADS_Y were both defined to be 16 in our example command line, resulting in 256 threads per thread group (work-items per work-group). With 64 threads per wavefront, we have 4 wavefronts per work-group. 8 work-groups fitting in LDS with 4 wavefronts per work-group results in a maximum of 32 wavefronts on the compute unit (CU), or 8 simultaneous wavefronts in flight per SIMD (32 divided by 4, due to 4 SIMDs per CU). Thus, even if your GPR usage allowed the maximum of 10 waves per SIMD, using 8 KB of LDS in your shader would reduce that maximum to 8.
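The steps above can be put into a sketch (the 16-work-group hardware-barrier limit from the earlier section is included for completeness):

```python
def max_waves_per_simd_lds(lds_bytes_per_group, threads_per_group,
                           lds_size=64 * 1024, wave_size=64,
                           simds_per_cu=4, cap=10):
    """Max wavefronts per SIMD, as limited by LDS usage and work-group size."""
    waves_per_group = -(-threads_per_group // wave_size)  # ceil
    groups_per_cu = lds_size // lds_bytes_per_group       # groups fitting in LDS
    if waves_per_group > 1:          # barriers needed -> 16-group hardware limit
        groups_per_cu = min(groups_per_cu, 16)
    return min(cap, (groups_per_cu * waves_per_group) // simds_per_cu)

# The 8 KB example from the text: 8 groups x 4 waves = 32 waves per CU, 8 per SIMD.
print(max_waves_per_simd_lds(8 * 1024, 256))
```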

usedScratchBytes:

The last item in the shader statistics is usedScratchBytes. When the AMD driver’s shader compiler needs to allocate more than 256 VGPRs, it spills into scratch memory. Main video memory is used for scratch, backed by the L1 and L2 caches.

Having to spill VGPRs into scratch does not come up often in practice for HLSL shaders. However, when it does, it will almost surely result in sub-optimal performance and your shader should be modified to reduce VGPR pressure and eliminate scratch usage. For example, temporary arrays (i.e. local arrays) use VGPRs and potentially can cause high VGPR usage. (As an aside, indexing into local arrays is an expensive operation on GCN, so you should be wary of relying heavily on local arrays already, beyond the potential for high VGPR usage.)

More about GPR pressure:

If more waves per SIMD is better for latency hiding, and this is often achieved through lower VGPR usage, a question arises: Why doesn’t the AMD driver’s shader compiler just schedule instructions such that VGPR usage is minimized? Consider the following simple pseudo-code example of two possible ways to schedule instructions:
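In pseudo-code, the two schedules might look like this (the loads and uses are placeholders):

```
// Schedule 1: all three loads issued up front (needs v0, v1, v2)
v0 = load(A)
v1 = load(B)
v2 = load(C)
use(v0)   // only the first use has to wait on memory
use(v1)
use(v2)

// Schedule 2: one register reused (needs only v0)
v0 = load(A)
use(v0)   // each load must complete before the next can reuse v0
v0 = load(B)
use(v0)
v0 = load(C)
use(v0)
```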

Schedule 1 requires 3 VGPRs, whereas Schedule 2 uses only a single VGPR to do the same work. However, although Schedule 2 uses fewer registers, it is not necessarily better, because you have to wait for each load to complete before the data is available for use by the VALU.

Grouping loads together as in Schedule 1 is better for latency hiding within a particular wavefront, but Schedule 2 is better for GPR pressure, which could mean more waves per SIMD, which also hides latency (by switching between wavefronts). So basically, it’s complicated.

Still, it can often be beneficial to try to reduce GPR pressure when optimizing HLSL shaders, particularly if your shader is currently just over a waves-per-SIMD threshold (e.g. 68 VGPRs). Potential causes for increased GPR pressure include deep nesting, local array declarations, and long-lived temps. But as always, profile any changes to ensure performance actually improved.

Requirements

The new CodeXLAnalyzer command-line functionality for HLSL requires CodeXL 1.4 or later, available for download from the CodeXL page.

You will also need the AMD Catalyst Beta driver 14.4 RC v1.0 or later. Alternatively, you can use the AMD Catalyst 14.4 (or later) WHQL driver.

More information

The Southern Islands and Sea Islands Instruction Set Architecture documents include explanations for every GCN shader instruction, as well as additional details on VGPRs, SGPRs, and LDS.

More GCN information from AMD is available here:

GCN Performance Tweets

The AMD GCN Architecture: A Crash Course

There is also useful GCN-related material from non-AMDers, including the following:

Michal Drobot: Low Level Optimizations for GCN: pdf, pptx

Emil Persson: Low-Level Shader Optimization for Next-Gen and DX11: pdf, pptx

Bart Wronski: GCN – Two Ways of Latency Hiding and Wave Occupancy: blog

In addition, CodeXL includes a help file that, while currently still largely focused on OpenCL, contains a few pages relevant to DirectX. First, there is a page on kernel occupancy (search for occupancy in the Search tab). It provides more details about the various limitations on waves per SIMD.

The help file also contains documentation about the newly added DirectX functionality. The easiest way to find it is to search for DirectX in the Search tab.

The help file is located here: %CodeXLDir%\Help (e.g. C:\Program Files (x86)\AMD\CodeXL\Help).

Feel free to ask questions in the comments. Thanks for reading.

-Jason Stewart

Jason Stewart is a Developer Technology Engineer at AMD, working with game developers to integrate technologies and optimize performance on AMD graphics hardware. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

1In a 3DMark Fire Strike benchmark test in 1080p, the AMD Radeon™ R9 295X2 outperforms the Titan Black, Nvidia’s highest performing graphics card as of March 12, 2014, by a score of 15,862 to 9,878 in the Performance preset, and 8,764 to 4,725 using the Extreme preset. Test system: Intel i7 4960X CPU, 16GB memory, Nvidia driver 334.89, AMD Catalyst driver 14.10 and Windows 8.1. GRDT-36
