Most people think a GPU is only made to render millions of triangles. But GPU power isn’t reserved for a few graphics folks: any general programmer can use it. Today we’re gonna outperform our oldskool image converter tool using a GPU! Isn’t that cool?

Audience

This post is made for programmers who don’t know much about GPU programming but want to optimize some of their CPU code.

Starting with numbers…

For those hesitant to use the power of the GPU for non-graphical stuff, I will start with numbers that should motivate you. I wrote all of my latest Cycle-op Amiga demo using my good old 2016 MacBook Pro (Core i5 5287U, 4 threads). Even with the optimised brute-force HAM converter (read the previous blog post for details), processing a single image on the old Mac took 62 seconds! On the same laptop, just by using its old integrated Iris Graphics GPU, I can convert the same image in… 4 seconds! Around 15 times faster!

Apple MacBook Pro 2016           Amiga HAM image converter time
CPU: Intel Core i5 5287U         61.59 sec
GPU: Intel Iris Graphics 6100    3.91 sec
Speed-up factor                  x15.7

So why would you turn down a 15.7x speed-up when you’re working on your next Amiga demo? :)

Cycle-op demo main intro pic: SHAM converted image, 4 seconds on GPU, 62 on CPU

A bit of History

GPU stands for “Graphics Processing Unit”. A broad definition of a GPU would be some piece of silicon that helps the central processing unit (CPU) with graphics-specific tasks. By this definition, you can consider that the first GPUs were made for 2D graphics in the 70s, for the first arcade machines. You could also say the Amiga Blitter from 1985 is a GPU, because it can draw lines using Bresenham’s algorithm in hardware. Then, in the 90s, came GPUs capable of 3D and triangle rasterization (boomers will recall the Sega Saturn or the Sony PlayStation 1).

In the 2000s, video games became more and more realistic, and GPU vendors made their chipsets more flexible by introducing programmable shaders (NVIDIA GeForce 3). Today you’re no longer stuck in the graphics world: you can process almost whatever data you want with a GPU. This is called “GPGPU”, for “General-Purpose computing on Graphics Processing Units”.

Single Instruction, Multiple Data

You may know the SIMD concept on CPUs. Instead of computing one scalar value at a time, CPU vendors introduced wider registers. The most common are 128-bit registers that can handle four floating-point values. Back then, many people created 3D vector maths libraries using these new “vector” registers.

This approach has some pros but many cons:

  • If your algorithm deals with scalar values, you don’t benefit from the new wider registers
  • If your algorithm deals with vector3, it’s better, but you still run at ¾ of the theoretical performance of a vector4 register (see the sketch after this list)
  • When your hardware supports wider registers, like vector8 or even vector16 with AVX-512, you have to change your programming model!
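
Here is a minimal sketch of that classic “vector3 in a 128-bit register” pattern, written with SSE intrinsics (the struct and function names are just for illustration, not from the converter): one instruction adds four lanes, but only three of them carry useful data.

// A padded vector3 processed with SSE: 4 lanes per instruction, only 3 useful
#include <xmmintrin.h>

struct Vec3 { float x, y, z, pad; };	// padded to 16 bytes for SSE loads/stores

void addSpeed(Vec3& pos, const Vec3& speed)
{
	__m128 p = _mm_loadu_ps(&pos.x);	// loads x, y, z and the unused pad lane
	__m128 s = _mm_loadu_ps(&speed.x);
	p = _mm_add_ps(p, s);			// one instruction, 4 lanes, only 3 carry data
	_mm_storeu_ps(&pos.x, p);
}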

GPU programming model: scalar for the win!

GPU hardware is a SIMD monster. It’s common to have 1024-bit wide registers in GPUs: one single register can operate on 32 float values at a time! If wasting ¾ of the power when working on vector3 was acceptable in 2000, it would be totally stupid to do the same thing with a vector32!

Today, when you compile HLSL shader source code, even if you’re still using “float3” or “float4”, the compiler will treat everything as scalar values. Let’s imagine you’re updating some particle positions like:

pos.xyz += speed.xyz;

The compiler will allocate 3 registers for pos, as if it were 3 scalar values, like:

pos_x += speed_x;
pos_y += speed_y;
pos_z += speed_z;

But the real gain here is that these registers could contain up to 32 scalar values! So you can update 32 particles at a time! In reality, it’s just SIMD: one instruction, and 32 independent lanes. Lane 0 of register pos_x contains the x position of the first particle. Lane 1 of the same register contains the x position of the second particle, and so on.

When SIMD instructions are executed in sequence, 32-lane registers are updated. You can see it as 32 virtual threads running at the exact same speed, totally in sync: each virtual thread updates a scalar value, and they all run in lock-step. This block of 32 threads is called a “wavefront” or a “warp”, depending on the GPU vendor’s naming convention.

A modern GPU has tons of wide registers, and tons of SIMD units. In our example, a modern GPU could update thousands of particles at a time. This is where it outperforms a CPU: you can’t find a CPU with thousands of cores at your local store!

You can watch a funny MythBusters video about CPU/GPU here

Programming environment

There are several options to program your GPU: CUDA, OpenCL, OpenGL, Vulkan, DirectX, etc. Coming from the game industry, I prefer using a graphics API and directly writing what’s called a “compute shader”. A compute shader is a small piece of code (called a “kernel”). Nowadays we program GPUs using high-level languages; we’ll use HLSL for our HAM converter. HLSL syntax is very close to C and it’s really easy to use.

A kernel only operates on data located in GPU memory. So a typical GPGPU round trip looks like this (a Direct3D 11 sketch follows the list):

  1. The CPU prepares the input data and copies it from main memory to GPU memory
  2. The CPU issues a GPU “Dispatch” command to start the GPU. Your kernel function will be executed over as many items as you requested in the Dispatch command
  3. During this time, the CPU has to wait for the GPU to finish (or do other CPU tasks if needed)
  4. Once the GPU work is done, the CPU can read GPU memory to get the result back to main memory
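
Here is what that round trip can look like with Direct3D 11, as a minimal sketch: the device, the compiled compute shader, and the buffers/views are assumed to be created already, and all the names below are hypothetical placeholders rather than the tool’s actual code.

#include <d3d11.h>
#include <cstring>

void RunKernelRoundTrip(ID3D11DeviceContext* ctx,
                        ID3D11ComputeShader* shader,
                        ID3D11Buffer* inputBuffer, const void* srcData,
                        ID3D11ShaderResourceView* inputSRV,
                        ID3D11UnorderedAccessView* outputUAV,
                        ID3D11Buffer* outputBuffer,	// GPU-only (default usage) buffer
                        ID3D11Buffer* stagingBuffer,	// D3D11_USAGE_STAGING, CPU-readable
                        void* result, size_t resultSize,
                        UINT groupCountX)
{
	// 1. copy the input data from main memory to GPU memory
	ctx->UpdateSubresource(inputBuffer, 0, nullptr, srcData, 0, 0);

	// 2. bind the kernel and its buffers, then issue the Dispatch command
	ctx->CSSetShader(shader, nullptr, 0);
	ctx->CSSetShaderResources(0, 1, &inputSRV);
	ctx->CSSetUnorderedAccessViews(0, 1, &outputUAV, nullptr);
	ctx->Dispatch(groupCountX, 1, 1);

	// 3/4. copy the result into a CPU-readable staging buffer; Map() blocks
	// until the GPU has finished, then we read the data back to main memory
	ctx->CopyResource(stagingBuffer, outputBuffer);
	D3D11_MAPPED_SUBRESOURCE mapped;
	if (SUCCEEDED(ctx->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped)))
	{
		std::memcpy(result, mapped.pData, resultSize);
		ctx->Unmap(stagingBuffer, 0);
	}
}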

Scalar and vector registers

Ultra-wide registers are called VGPRs (Vector General Purpose Registers). In our particle example, we can manage the positions of 32 particles at the same time using 3 vector registers. But what about values that are constant across all particles? Let’s suppose we need the current clock during the simulation, as a scalar float value “m_clock”. Obviously it will be the same value for all particles during a frame. If we use a VGPR for that, our 32 lanes will contain the exact same value. You will get correct results, but what a waste of precious registers! GPU vendors introduced another kind of register: the SGPR (Scalar General Purpose Register). It’s like a vector register but with a single lane: you know this register will hold the same value for every GPU thread running.

The compiler knows at compile time whether a variable needs a scalar or a vector register. Most of the time, vector registers are needed, and they are precious. You may have heard about “VGPR pressure reduction” in some articles about GPU optimization: it consists of techniques that help the compiler use more scalar registers instead of VGPRs, when possible.
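
As an illustration (this is not the converter’s kernel, and the exact allocation is always up to the compiler and the target GPU), here is how a small kernel would typically be split between scalar and vector registers:

cbuffer simInfo : register(b0)
{
	float	m_clock;	// same value for every thread -> candidate for an SGPR
};

[numthreads(128, 1, 1)]
void ParticleKernel(uint3 TGid : SV_GroupThreadID)
{
	uint particleIndex = TGid.x;			// different in every lane -> VGPR
	float phase = m_clock * 0.5f;			// uniform across the wavefront -> scalar register
	float wobble = sin(phase + particleIndex);	// mixes scalar and per-lane data -> VGPR
}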

You don’t need to know that many details to start doing GPGPU. Anyway, if you want more details about VGPR/SGPR and lock-step magic, you can read this great article by my friend Francesco.

Types of memory

Dealing with CPU memory is straightforward: you have one big memory and just load and store instructions. A GPU is kind of weird regarding memory access: you access memory through different kinds of “buffers”. For our purpose we’ll just describe 3 types of buffers:

CB (Constant buffer)

A constant buffer is a GPU read-only buffer. Typically small, it’s used to store constant arguments to pass to the GPU program. Its content won’t change during the complete GPU call. It’s like read-only global variables for your kernel code. The compiler knows that all variables in the CB are constant across all GPU threads, so these variables will be fetched into SGPRs instead of VGPRs. We’ll use that to pass some info to our GPU code, such as the image width, height, and the current palette entry to brute force.

SRV (Shader Resource View)

This is a read-only buffer too, generally used to store large data such as graphics textures. It can be very large. You can use complex read operations (like texture filtering), but you can also “just” read a 32-bit value. We’ll use that kind of buffer to store the original image to convert.

UAV (Unordered Access View)

Very similar to an SRV, but the GPU can read from and write to it. We’ll use it to store the 16-color palette of each line. When the GPU has completed the work, the CPU will read this buffer back to get the resulting best palettes.

Kernel and thread group

The shader code executed on the GPU is called a kernel and has an entry point function. A thread group is a fixed number of GPU threads that will process the data. In most graphics APIs you fix the thread group size at compile time using the numthreads directive. For our converter we decided to use a thread group size of 128. As most GPUs have 32-lane SIMD units, it’s always a good idea to use at least 32 threads in a thread group. (The max value in DirectX is 1024.)

[numthreads(128, 1, 1)]
void ShamKernel(uint3 DTid : SV_GroupID, uint3 TGid : SV_GroupThreadID)
{
  
}

NOTE: The thread group size is defined as a triplet (a,b,c). It’s quite confusing at first, but it makes things easier when dealing with 2D or 3D data. Anyway, just keep in mind the number of threads in a thread group is simply a*b*c (so here, 128*1*1).

A thread group is the minimal amount of work a GPU can process. When the CPU wakes up the GPU with a Dispatch command, you don’t specify the number of items to process, but the number of thread groups. Let’s say we want to run a simulation on 2048 particles and we have a thread group of size 128: we will call Dispatch(16). (It means the GPU will process 16 thread groups of 128 threads each, so 2048 items in total.)
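
When the item count is not a multiple of the thread group size, a common pattern is to round up (this is a hypothetical helper, not taken from the tool’s code; the extra threads then need a bounds check inside the kernel):

const UINT THREAD_GROUP_SIZE = 128;
UINT numItems  = 2048;
UINT numGroups = (numItems + THREAD_GROUP_SIZE - 1) / THREAD_GROUP_SIZE;	// = 16 here
m_pImmediateContext->Dispatch(numGroups, 1, 1);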

NOTE: Several thread groups run at the same time. You shouldn’t assume anything about how many of them run in parallel, nor about the thread group execution order.

Thread Group Local Memory

GPUs also have a small, fast, local memory region that you can use in your algorithms. This memory is called LDS (Local Data Share) and is only visible to threads within a thread group. When a thread group starts or ends its work, the LDS content is considered “undefined”.
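
Here is a minimal LDS illustration (not the converter’s code): each thread writes its own slot, and after the barrier every thread can safely read a neighbour’s value.

groupshared float sharedData[128];

[numthreads(128, 1, 1)]
void LdsExample(uint3 TGid : SV_GroupThreadID)
{
	sharedData[TGid.x] = (float)TGid.x;			// each thread fills its own slot
	GroupMemoryBarrierWithGroupSync();			// wait until all 128 threads have written
	float neighbour = sharedData[(TGid.x + 1) & 127];	// now safe to read another thread's slot
}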

Execution flow

To start the GPU processing, you call a Dispatch(N) command. N is the number of times a complete thread group (of 128 threads in our case) will be executed. As for the thread group size, this N is also a triplet (d,e,f). In the end, your kernel HLSL function will be executed a*b*c*d*e*f times!

When the kernel function is executed, you can retrieve which item is being processed using specific shader system values: SV_GroupID and SV_GroupThreadID. Both are uint3 (because of the triplets). The former is the current position within the Dispatch (from 0 to N-1), and the latter is the current thread index within the thread group (from 0 to 127 in our case).
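
A common pattern (just an illustration, not the converter’s kernel) is to combine the two system values into a single global item index:

[numthreads(128, 1, 1)]
void MyKernel(uint3 DTid : SV_GroupID, uint3 TGid : SV_GroupThreadID)
{
	uint itemIndex = DTid.x * 128 + TGid.x;		// which particle/pixel this thread handles
}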

Memory & buffers

Kernel code should explicitly declare the memory buffers it works with. Let’s look at our SHAM Amiga brute-force kernel:

#define	MY_THREAD_GROUP_SIZE		128
#define	MY_BRUTE_FORCE_PER_THREAD	(4096/MY_THREAD_GROUP_SIZE)		// 4096 possible Amiga colors

RWByteAddressBuffer inOutPalettes : register(u0);
ByteAddressBuffer inImage : register(t0);

groupshared uint sharedBestError;
groupshared uint sharedBestColor[MY_THREAD_GROUP_SIZE];

cbuffer processInfo : register(b0)
{
	uint	m_w;
	uint	m_h;
	uint	m_palEntry;
	uint	m_pad;
};

NOTE: In HLSL, RWByteAddressBuffer means UAV (read-write), ByteAddressBuffer means SRV (read-only, like a common texture), and cbuffer means CB. The groupshared prefix is used to declare variables in LDS (local thread-group memory).

The algorithm

If you have no idea what Amiga HAM graphics mode is, you should read my previous post.

Let’s study the SHAM case (one HAM palette per raster line). We’ll store the original input image in a texture (32 bits per pixel). For a 320*256 pixel image, we’ll brute force 256 palettes of 16 entries each. As each palette entry depends on the previous one, the GPU will brute force the palette entries one by one, and the CPU will issue one Dispatch() per palette entry. The high-level CPU loop will be:

for (int palEntry = 1; palEntry < 16; palEntry++)
{
	int processInfo[4] = { w, h, palEntry, 0 };
	constantBuffer.UpdateData(m_pDx11Context, processInfo, 16);
	m_pImmediateContext->Dispatch(h, 1, 1);
}

We’re using a small constant buffer to store the current palette entry to brute force (palEntry), and we call Dispatch with the number of lines of the image (h).

void ShamKernel(uint3 DTid : SV_GroupID, uint3 TGid : SV_GroupThreadID)
{
	uint scanline = DTid.x;		// one thread group per image line
	uint colorChunk = TGid.x;	// each thread tests its own chunk of the 4096 colors

	if (0 == TGid.x)
		sharedBestError = 0xffffffff;	// a single thread initializes the shared best error

	GroupMemoryBarrierWithGroupSync();

	lineErrorCompute(scanline, colorChunk);

	GroupMemoryBarrierWithGroupSync();
	if (TGid.x == 0)
	{
		// thread 0 unpacks the winning chunk and stores the best color for this palette entry
		const uint bestIndex = sharedBestError & uint(MY_THREAD_GROUP_SIZE-1);
		inOutPalettes.Store((scanline * 16 + m_palEntry) * 4, sharedBestColor[bestIndex]);
	}
}

NOTE: You can’t assume anything about the GPU thread execution order within a thread group. When several threads need to access a variable in LDS, you should sync them using the GroupMemoryBarrierWithGroupSync() function. After this call, you’re guaranteed that all the threads of the thread group have reached this point.

The kernel function will brute force the 4096 possible Amiga colors for the current palette entry. For each of the 4096 possible colors, an error is computed over the image scanline. At the end, we keep the color with the smallest error.

void	lineErrorCompute(int scanline, in uint colorChunk)
{

	const uint bruteForceColorStart = colorChunk * MY_BRUTE_FORCE_PER_THREAD;
	uint bestColor;
	uint bestError = 0xffffffff;
	[loop]
	for (uint bruteForceColor = bruteForceColorStart; bruteForceColor < bruteForceColorStart + MY_BRUTE_FORCE_PER_THREAD; bruteForceColor++)
	{
		uint prevColor = inOutPalettes.Load((scanline * 16) * 4);	// current color on first pixel is background color
		uint imgAd = scanline * m_w * 4;
		uint err = 0;
		[loop]
		for (uint x = 0; x < m_w; x++)
		{
			// search best fit in palette
			uint pixelErr;
			uint pixelColor = inImage.Load(imgAd);
			imgAd += 4;
			prevColor = getBestHAMColor(pixelColor, prevColor, pixelErr, scanline, bruteForceColor);
			err += pixelErr;
		}
		if (err < bestError)
		{
			bestError = err;
			bestColor = bruteForceColor;
		}
	}

	sharedBestColor[colorChunk] = bestColor;
	bestError = (bestError * MY_THREAD_GROUP_SIZE) | colorChunk;
	InterlockedMin(sharedBestError, bestError);
}

As our thread group size is 128, we spread these 4096 tests into 128 “chunks” of 32 colors each. Each thread tests 32 colors (within the inner for loop) and stores the best color of its chunk in shared memory (LDS). Each thread also keeps the shared smallest error and best chunk up to date (using the atomic InterlockedMin instruction at the end of the function).

When the thread group is done, we can get the best palette entry by looking at the best color of the best chunk.
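
The single InterlockedMin works because the error and the chunk index are packed into one uint: the error sits in the upper bits and the chunk index in the low 7 bits, so the minimum of the packed values also identifies the winning chunk. Rewritten as a small sketch (same variables as in the kernel above):

uint packed   = (bestError * MY_THREAD_GROUP_SIZE) | colorChunk;	// same as (bestError << 7) | colorChunk
uint chunkIdx = packed & uint(MY_THREAD_GROUP_SIZE - 1);		// unpacks the winning chunk index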

Sample source code

I know many parts of GPGPU haven’t been covered in this blog post. So for technical implementation details (which can be a headache, especially DirectX buffer creation flags), I just released my bitmap converter tool code on GitHub. All the GPGPU stuff is located in dx11.h and dx11.cpp.

Conclusion

It’s not easy to cover GPGPU in one blog post, and you may be sceptical about using it for offline tools. But if you’re already comfortable with the C language, you can jump into it easily. When programming my oldskool retro demo stuff, I enjoy using all the available hardware of my 7-year-old laptop.

Also please note that no effort has been made to optimise the shader code (as we did for the CPU code in the previous post). Optimising GPU code is a vast subject, totally out of the scope of this post. But look: with a first naive GPU version, we process lovely Amiga pixels 15 times faster than with our optimised CPU version.

So you have no excuse anymore: use your GPU!

Follow me on Twitter

ABC (AmigAtari Bitmap Converter) tool source code