The amount of control that Vulkan offers is not always welcome to users who just want to run a simple shader to compute something quickly, and the effort required for the "first good run" is often a deterrent. To ease the struggle, this tutorial provides a small "bootstrap" piece of code that should allow you to quickly run a compute shader on actual data. In short, we walk through the following steps:
- Opening a device and finding good queue families and memory types
- Allocating memory and buffers
- Compiling a shader program and filling up the structures necessary to run it:
  - specialization constants
  - push constants
  - descriptor sets and layouts
- Making a command buffer and submitting it to the queue, efficiently running the shader
using Vulkan
instance = Instance([], [])
Instance(Ptr{Nothing} @0x0000000009b5b6d0)
Take the first available physical device (you might check that it is an actual GPU, using get_physical_device_properties).
physical_device = first(unwrap(enumerate_physical_devices(instance)))
PhysicalDevice(Ptr{Nothing} @0x000000000921fdd0)
At this point, we need to choose a queue family index to use. For this example, have a look at the output of the vulkaninfo command and pick a suitable queue family manually from the list of VkQueueFamilyProperties; you want one that has QUEUE_COMPUTE in the flags. In a production environment, you would use get_physical_device_queue_family_properties to find a good index.
qfam_idx = 0
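Instead of reading the index off vulkaninfo, the choice can be automated. The following is a minimal sketch of the selection logic only, written in plain Julia. The flag masks are hard-coded stand-ins for the queue flags that get_physical_device_queue_family_properties would report, and pick_compute_family is a hypothetical helper, not part of Vulkan.jl. The numeric bit values are those of VkQueueFlagBits in the Vulkan specification.

```julia
# Bit values from the Vulkan specification (VkQueueFlagBits).
const GRAPHICS = 0x00000001
const COMPUTE  = 0x00000002
const TRANSFER = 0x00000004

# Hypothetical helper: return the 0-based index of the first queue family whose
# flags contain the compute bit, or `nothing` if no family supports compute.
function pick_compute_family(family_flags::Vector{UInt32})
    i = findfirst(f -> (f & COMPUTE) != 0, family_flags)
    return i === nothing ? nothing : i - 1   # Vulkan indices start at 0
end

# Made-up example device: a transfer-only family, then a general-purpose one.
example_families = UInt32[TRANSFER, GRAPHICS | COMPUTE | TRANSFER]
example_qfam_idx = pick_compute_family(example_families)
```

In this made-up example the helper skips the transfer-only family and selects the second one, which has index 1 in Vulkan's 0-based numbering.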
Create a device object and make a queue for our purposes.
device = Device(physical_device, [DeviceQueueCreateInfo(qfam_idx, [1.0])], [], [])
Device(Ptr{Nothing} @0x0000000006798d60)
Similarly, you need to find a good memory type. Again, you can find one using vulkaninfo or with get_physical_device_memory_properties. For compute, you want something that is both on the device (contains MEMORY_PROPERTY_DEVICE_LOCAL_BIT) and visible from the host (..._HOST_VISIBLE_BIT).
memorytype_idx = 0
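The same kind of search works for memory types. This is a sketch under stated assumptions: the property masks below are invented example data standing in for what get_physical_device_memory_properties reports, pick_memory_type is a hypothetical helper, and the bit values are those of VkMemoryPropertyFlagBits in the Vulkan specification.

```julia
# Bit values from the Vulkan specification (VkMemoryPropertyFlagBits).
const DEVICE_LOCAL  = 0x00000001
const HOST_VISIBLE  = 0x00000002
const HOST_COHERENT = 0x00000004

# Hypothetical helper: 0-based index of the first memory type that has all
# of the `wanted` property bits set, or `nothing` if no type qualifies.
function pick_memory_type(type_flags::Vector{UInt32}, wanted::UInt32)
    i = findfirst(f -> (f & wanted) == wanted, type_flags)
    return i === nothing ? nothing : i - 1   # Vulkan indices start at 0
end

# Made-up example: device-only memory, host-only memory, and a type with both.
example_types = UInt32[
    DEVICE_LOCAL,
    HOST_VISIBLE | HOST_COHERENT,
    DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT,
]
example_memtype_idx = pick_memory_type(example_types, DEVICE_LOCAL | HOST_VISIBLE)
```

A complete version would also restrict the candidates to the memory types allowed by the buffer's memory requirements; that step is left out here to keep the sketch short.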
Let's create some data. We will work with 100 flimsy floats.
data_items = 100
mem_size = sizeof(Float32) * data_items
Allocate the memory of the correct type:
mem = DeviceMemory(device, mem_size, memorytype_idx)
DeviceMemory(Ptr{Nothing} @0x0000000008c300f8)
Make a buffer that will be used to access the memory, and bind it to the memory. (Memory allocations may be quite demanding, so it is often better to allocate a single big chunk of memory and create multiple buffers that view it as smaller arrays.)
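buffer = Buffer(
    device,
    mem_size,
    BUFFER_USAGE_STORAGE_BUFFER_BIT,  # the shader accesses it as a storage buffer
    SHARING_MODE_EXCLUSIVE,           # used from a single queue family
    [qfam_idx],
)
bind_buffer_memory(device, buffer, mem, 0)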
First, map the memory and get a pointer to it.
memptr = unwrap(map_memory(device, mem, 0, mem_size))
Ptr{Nothing} @0x0000000008bedc00
Here we make Julia look at the mapped data as a vector of Float32s, so that we can access it easily:
data = unsafe_wrap(Vector{Float32}, convert(Ptr{Float32}, memptr), data_items, own = false);
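To see what unsafe_wrap does without involving a GPU, here is a small self-contained sketch that uses a plain host allocation from Libc.malloc as a stand-in for the pointer returned by map_memory. The names raw, floats, snapshot and second are illustrative only.

```julia
n = 4
raw = Libc.malloc(n * sizeof(Float32))   # untyped pointer, like the mapped memory

# View the raw bytes as a Vector{Float32}. With `own = false`, Julia will not
# try to free the memory when the array is garbage-collected.
floats = unsafe_wrap(Vector{Float32}, convert(Ptr{Float32}, raw), n; own = false)

floats .= 0          # writes go straight to the underlying memory
floats[2] = 1.5f0

# Reading through the raw pointer sees the same bytes as the array does.
second = unsafe_load(convert(Ptr{Float32}, raw), 2)
snapshot = copy(floats)   # independent copy that stays valid after the free

Libc.free(raw)       # from here on, `floats` must not be accessed anymore
```

The array is only a view: its lifetime is tied to the allocation (or, in the tutorial, to the memory mapping), which is why the sketch copies the contents before releasing the memory.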
For now, let's just zero out all the data, and flush the changes to make sure the device can see the updated data. This is the simplest way to move array data to the device.
data .= 0
unwrap(flush_mapped_memory_ranges(device, [MappedMemoryRange(mem, 0, mem_size)]))
SUCCESS::Result = 0
The flushing is not required if you have verified that the memory is host-coherent (i.e., has MEMORY_PROPERTY_HOST_COHERENT_BIT).
Eventually, you may need to allocate memory types that are not visible from the host, because these provide better capacity and speed on discrete GPUs. At that point, you may need to use the transfer queue and memory transfer commands to get the data from host-visible to GPU-local memory, using e.g. cmd_copy_buffer.
Now we need to make a shader program. We will use glslangValidator, packaged in a JLL, to compile a GLSL program from a string into SPIR-V bytecode, which is later passed to the GPU drivers.
shader_code = """
#version 430

layout(local_size_x_id = 0) in;
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

layout(constant_id = 0) const uint blocksize = 1; // manual way to capture the specialization constants

layout(push_constant) uniform Params
{
    float val;
    uint n;
} params;

layout(std430, binding=0) buffer databuf
{
    float data[];
};

void
main()
{
    uint i = gl_GlobalInvocationID.x;
    if(i < params.n) data[i] = params.val * i;
}
"""
"#version 430\n\nlayout(local_size_x_id = 0) in;\nlayout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;\n\nlayout(constant_id = 0) const uint blocksize = 1; // manual way to capture the specialization constants\n\nlayout(push_constant) uniform Params\n{\n float val;\n uint n;\n} params;\n\nlayout(std430, binding=0) buffer databuf\n{\n float data[];\n};\n\nvoid\nmain()\n{\n uint i = gl_GlobalInvocationID.x;\n if(i < params.n) data[i] = params.val * i;\n}\n"
Push constants are small packs of variables that are used to quickly send configuration data to the shader runs. Make sure that this structure corresponds to what is declared in the shader.
struct ShaderPushConsts
    val::Float32
    n::UInt32
end
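One way to check that the Julia structure matches the Params block in the shader is to inspect its memory layout. The sketch below uses a separate, hypothetical struct ExamplePushConsts with the same fields, so that it can be run on its own; the expected offsets follow from float and uint both being 4 bytes wide in GLSL.

```julia
# Hypothetical stand-in with the same fields as the push-constant structure.
struct ExamplePushConsts
    val::Float32   # corresponds to `float val;` in the shader
    n::UInt32      # corresponds to `uint n;` in the shader
end

# The struct must be plain data, so that a pointer to it can be handed to Vulkan.
plain_data = isbitstype(ExamplePushConsts)

# Total size and per-field byte offsets as Julia lays them out.
total_size = sizeof(ExamplePushConsts)
offsets = [Int(fieldoffset(ExamplePushConsts, i)) for i in 1:fieldcount(ExamplePushConsts)]

# The default constructor converts its arguments to the field types.
pc = ExamplePushConsts(1.234, 100)
```

If a field is added on one side only, or the order differs, the size or offsets stop matching and the shader reads garbage, so a check like this is cheap insurance.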
Specialization constants are similar to push constants, but less dynamic: You can change them before compiling the shader for the pipeline, but not dynamically. This may have performance benefits for "very static" values, such as block sizes.
struct ShaderSpecConsts
    local_size_x::UInt32
end
Let's now compile the shader to SPIR-V with glslang. We can use the artifact glslang_jll, which provides the binary through the Artifact system. First, make sure to ] add glslang_jll, then we can do the shader compilation through:
using glslang_jll: glslangValidator
glslang = glslangValidator(identity)
shader_bcode = mktempdir() do dir
    inpath = joinpath(dir, "shader.comp")
    outpath = joinpath(dir, "shader.spv")
    # write the GLSL source to a temporary file
    open(f -> write(f, shader_code), inpath, "w")
    # compile it as a compute shader (-S comp) targeting Vulkan (-V)
    status = run(`$glslang -V -S comp -o $outpath $inpath`)
    @assert status.exitcode == 0
    # SPIR-V is a stream of 32-bit words
    reinterpret(UInt32, read(outpath))
end
325-element reinterpret(UInt32, ::Vector{UInt8}):
 0x07230203
 0x00010000
 0x0008000a
 0x00000030
 0x00000000
 0x00020011
 0x00000001
 0x0006000b
 0x00000001
 0x4c534c47
 ⋮
 0x0003003e
 0x0000002b
 0x00000029
 0x000200f9
 0x0000001d
 0x000200f8
 0x0000001d
 0x000100fd
 0x00010038
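The first five words of the output above form the SPIR-V module header: magic number, version, generator, ID bound, and a reserved schema word. A quick sanity check of compiled bytecode can therefore look at just those words. The following sketch hard-codes the five words printed above; spirv_header is a hypothetical helper, not part of Vulkan.jl or glslang_jll.

```julia
# The first five words from the compiled module shown above.
header_words = UInt32[0x07230203, 0x00010000, 0x0008000a, 0x00000030, 0x00000000]

const SPIRV_MAGIC = 0x07230203

# Hypothetical helper: validate the magic number and decode a few header fields.
function spirv_header(words::AbstractVector{UInt32})
    length(words) >= 5 || error("too short to be a SPIR-V module")
    words[1] == SPIRV_MAGIC || error("bad magic number, not SPIR-V")
    return (
        major = Int((words[2] >> 16) & 0xff),  # version is stored as 0x00MMmm00
        minor = Int((words[2] >> 8) & 0xff),
        bound = Int(words[4]),                 # all IDs in the module are below this
    )
end

hdr = spirv_header(header_words)
```

For the module above this decodes to SPIR-V version 1.0 with an ID bound of 48. A wrong magic number usually means that the file was read with the wrong byte order or is not SPIR-V at all.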
We can now make a shader module with the compiled code:
shader = ShaderModule(device, sizeof(UInt32) * length(shader_bcode), shader_bcode)
ShaderModule(Ptr{Nothing} @0x0000000008c78b58)
A descriptor set layout describes how many resources of what kind will be used by the shader. In this case, we only use a single buffer:
dsl = DescriptorSetLayout(
    device,
    [
        DescriptorSetLayoutBinding(
            0,                               # matches `binding=0` in the shader
            DESCRIPTOR_TYPE_STORAGE_BUFFER,
            SHADER_STAGE_COMPUTE_BIT;
            descriptor_count = 1,
        ),
    ],
)
DescriptorSetLayout(Ptr{Nothing} @0x00000000081aeb58)
Pipeline layout describes the descriptor set together with the location of push constants:
pl = PipelineLayout(
    device,
    [dsl],
    [PushConstantRange(SHADER_STAGE_COMPUTE_BIT, 0, sizeof(ShaderPushConsts))],
)
PipelineLayout(Ptr{Nothing} @0x0000000008db40a8)
Shader compilation can use "specialization constants" that get propagated (and optimized) into the shader code. We use them to make the shader workgroup size "dynamic" in the sense that the size (32) is not hardcoded in GLSL, but instead taken from here.
const_local_size_x = 32
spec_consts = [ShaderSpecConsts(const_local_size_x)]
1-element Vector{Main.ShaderSpecConsts}:
 Main.ShaderSpecConsts(0x00000020)
Next, we create a pipeline that can run the shader code with the specified layout:
pipeline_info = ComputePipelineCreateInfo(
    PipelineShaderStageCreateInfo(
        SHADER_STAGE_COMPUTE_BIT,
        shader,
        "main", # this needs to match the function name in the shader
        specialization_info = SpecializationInfo(
            [SpecializationMapEntry(0, 0, 4)],  # constant_id 0, at offset 0, 4 bytes long
            UInt64(4),                          # total size of the constant data
            Ptr{Nothing}(pointer(spec_consts)),
        ),
    ),
    pl,
    -1,
)
ps, _ = unwrap(create_compute_pipelines(device, [pipeline_info]))
p = first(ps)
Pipeline(Ptr{Nothing} @0x000000000841e328)
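The three numbers in SpecializationMapEntry(0, 0, 4) are the constant ID used in the shader, the byte offset of the value within the data block, and its size in bytes. With more than one constant, writing them by hand gets error-prone, so they can be derived from the struct instead. This is a sketch under the assumption that the fields are declared in constant_id order starting at 0; ExampleSpecConsts and spec_map_entries are hypothetical names, and the second field exists only to make the example less trivial.

```julia
# Hypothetical struct with two specialization constants.
struct ExampleSpecConsts
    local_size_x::UInt32   # would be constant_id = 0
    scale::Float32         # would be constant_id = 1 (not in the tutorial shader)
end

# Hypothetical helper: one (constant_id, offset, size) triple per field,
# assuming field order matches the constant IDs.
function spec_map_entries(::Type{T}) where {T}
    return [
        (constant_id = i - 1,
         offset = Int(fieldoffset(T, i)),
         size = sizeof(fieldtype(T, i)))
        for i in 1:fieldcount(T)
    ]
end

entries = spec_map_entries(ExampleSpecConsts)
data_size = sizeof(ExampleSpecConsts)   # what goes in place of UInt64(4)
```

Each triple could then be splatted into a SpecializationMapEntry, with data_size passed as the total size of the data block.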
Now make a descriptor pool to allocate the buffer descriptors from (not a big one, just 1 descriptor set with 1 descriptor in total), ...
dpool = DescriptorPool(device, 1, [DescriptorPoolSize(DESCRIPTOR_TYPE_STORAGE_BUFFER, 1)],
                       flags = DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT)
DescriptorPool(Ptr{Nothing} @0x0000000006f4d288)
... allocate the descriptors for our layout, ...
dsets = unwrap(allocate_descriptor_sets(device, DescriptorSetAllocateInfo(dpool, [dsl])))
dset = first(dsets)
DescriptorSet(Ptr{Nothing} @0x0000000006d563c0)
... and make the descriptors point to the right buffers.
update_descriptor_sets(
    device,
    [
        WriteDescriptorSet(
            dset,
            0,                               # binding
            0,                               # first array element
            DESCRIPTOR_TYPE_STORAGE_BUFFER,
            [],
            [DescriptorBufferInfo(buffer, 0, WHOLE_SIZE)],
            [],
        ),
    ],
    [],
)
Let's create a command pool in the right queue family, and take a command buffer out of that.
cmdpool = CommandPool(device, qfam_idx)
cbufs = unwrap(
    allocate_command_buffers(
        device,
        CommandBufferAllocateInfo(cmdpool, COMMAND_BUFFER_LEVEL_PRIMARY, 1),
    ),
)
cbuf = first(cbufs)
CommandBuffer(Ptr{Nothing} @0x00000000081ae5f0)
Now that we have a command buffer, we can fill it with commands that cause the kernel to be run. Basically, we bind and fill everything, and then dispatch enough workgroups of the shader to span the whole array.
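begin_command_buffer(
    cbuf,
    CommandBufferBeginInfo(flags = COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT),
)
cmd_bind_pipeline(cbuf, PIPELINE_BIND_POINT_COMPUTE, p)
const_buf = [ShaderPushConsts(1.234, data_items)]
cmd_push_constants(
    cbuf,
    pl,
    SHADER_STAGE_COMPUTE_BIT,
    0,
    sizeof(ShaderPushConsts),
    Ptr{Nothing}(pointer(const_buf)),
)
cmd_bind_descriptor_sets(cbuf, PIPELINE_BIND_POINT_COMPUTE, pl, 0, [dset], [])
# one workgroup per block of const_local_size_x items, rounded up
cmd_dispatch(cbuf, div(data_items, const_local_size_x, RoundUp), 1, 1)
end_command_buffer(cbuf)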
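The workgroup count passed to cmd_dispatch deserves a closer look, because the array length is not a multiple of the workgroup size. The arithmetic can be checked in plain Julia; the variable names here are illustrative and the numbers are the ones used in this tutorial.

```julia
n_items    = 100   # length of the array
local_size = 32    # workgroup size passed as the specialization constant

# Round up so that the last, partially filled block still gets a workgroup.
n_groups = div(n_items, local_size, RoundUp)   # equivalent to cld(n_items, local_size)

n_invocations = n_groups * local_size          # shader invocations actually launched
n_idle        = n_invocations - n_items        # invocations with i >= n that do nothing
```

With these numbers, 4 workgroups launch 128 invocations, and the `i < params.n` test in the shader keeps the extra 28 from writing past the end of the buffer. Rounding down instead would leave the last 4 elements untouched.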
Finally, find a handle to the compute queue and send the command to execute the shader!
compute_q = get_device_queue(device, qfam_idx, 0)
unwrap(queue_submit(compute_q, [SubmitInfo([], [], [cbuf], [])]))
SUCCESS::Result = 0
After submitting the command buffer to the queue, the data is being crunched in the background. To get the resulting data, we need to wait for completion and invalidate the mapped memory (so that whatever data updates happened on the GPU get transferred to the mapped range visible to the host).
While queue_wait_idle will wait for computations to be carried out, we need to make sure that the required data is kept alive during queue operations. In non-global scopes, such as functions, the compiler may skip the allocation of unused variables or garbage-collect objects that the runtime thinks are no longer used. When garbage-collected, objects call their finalizers, which destroy the underlying Vulkan objects (via vkDestroy...). In this particular case, the runtime is not aware that, for example, the pipeline and buffer objects are still in use until the command returns, so we tell it manually.
GC.@preserve buffer dsl pl p const_buf spec_consts begin
    unwrap(queue_wait_idle(compute_q))
end
SUCCESS::Result = 0
Free the command buffers and the descriptor sets. These are the only handles that are not cleaned up automatically (see Automatic finalization).
free_command_buffers(device, cmdpool, cbufs)
free_descriptor_sets(device, dpool, dsets)
Just as with flushing, the invalidation is only required for memory that is not host-coherent. You may skip this step if you check that the memory has the host-coherent property flag.
unwrap(invalidate_mapped_memory_ranges(device, [MappedMemoryRange(mem, 0, mem_size)]))
SUCCESS::Result = 0
Finally, let's have a look at the data created by your compute shader!
data # WHOA
100-element Vector{Float32}:
   0.0
   1.234
   2.468
   3.702
   4.936
   6.17
   7.404
   8.638
   9.872
  11.106
   ⋮
 112.294
 113.528
 114.76199
 115.995995
 117.229996
 118.464
 119.698
 120.932
 122.166
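To convince yourself that the numbers above are what the shader was supposed to produce, you can compute a reference on the CPU. This is only a sketch: reference is a hypothetical helper, and it assumes that the GPU multiplies in ordinary 32-bit floating point, so a comparison with a tolerance is safer than exact equality.

```julia
# Hypothetical helper: CPU reference for what the shader writes, i.e.
# data[i] = val * i for i = 0, 1, ..., n-1, computed in 32-bit floats.
reference(val::Float32, n::Integer) = [val * Float32(i) for i in 0:(n - 1)]

expected = reference(1.234f0, 100)

# With the tutorial's variables in scope, one could then check for example:
#   data ≈ expected
```

The first element is zero because the shader uses the 0-based invocation index, while Julia arrays are 1-based.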
This page was generated using Literate.jl.