Clover: Using Clang and LLVM to Launch Kernels

Unlike OpenGL and its GLSL language, OpenCL uses a subset of C easily implementable with little compiler support.

It is known that at least Apple and nVidia use Clang and LLVM to compile OpenCL programs. Furthermore, Intel recently released a SDK advertised as using LLVM (with the possibility to also use an in-house JIT).

This widespread usage of Clang is very good for Clover, as Clang already supports the vast majority of what is needed in order to have a good OpenCL compiler. LLVM is very efficient at using vectors (it isn't used by Apple and Mesa GL for their software acceleration path without reason), and its API is very well done. It takes an average of one or two days to implement of modify something using LLVM in Clover.

This page explains how Clang and LLVM are used in Clover to compile, manage and launch OpenCL kernels.

Compiling OpenCL C to LLVM IR

The first step when one wants to launch a program is to compile it. It is done API-wise by the clCreateProgramWithSource() and clBuildProgram() functions.

The first function creates a Coal::Program object, using Coal::Program::loadSources(). You can see at the top of this function a line prepending the OpenCL C standard header to the source being built. This function consists mainly of a concatenation of the strings given (they may be zero-terminated or not).

Once the Coal::Program objects holds the source, clBuildProgram() can be used to compile it. It does so by invoking Coal::Program::build().

This big function compiles and links the program, so it will be explained later. The actual compilation job is done by Coal::Compiler. It does that in Coal::Compiler::compile(), and then keeps the compilation log and options at hand for future use.

When a program is compiled, the client application can retrieve it by using clGetProgramInfo().

Linking the program

The compilation step produced an "unlinked" module, that needs to be linked with the OpenCL C standard library, but only if the device for which the program is being built needs to. It's also possible to load a previously-compiled binary in a Coal::Program using Coal::Program::loadBinaries(). Doing this also loads an unlinked binary.

The separation between the unlinked binary and the linked one is the reason for the existence of Coal::Program::DeviceDependent::unlinked_binary. The source is compiled to LLVM IR in a module (temporarily stored in linked_module, though it isn't linked yet), that is dumped to unlinked_binary and then linked to form a full executable program.

So, Coal::Program::build() runs its code for every device for which a program must be built. These devices are either given at Coal::Program::loadBinaries(), or as arguments to Coal::Program::build().

The first step is to see if the program was loaded with sources. If it's the case, they have to be compiled (see Compiling OpenCL C to LLVM IR).

Then, if the device for which the program is being built asks for that (Coal::DeviceProgram::linkStdLib(), Coal::CPUDevice does so), the program is linked with the OpenCL C standard library of Clover. An hardware-accelerated device normally will not want to have stdlib linked, as it's easier to convert LLVM IR to hardware-specific instructions when OpenCL built-ins functions are left in the form "call foo" instead of being inlined with inefficient CPU-centric code.

After this linking pass, optimization passes are created. The first ones are created by Coal::Program itself. They remove all the functions that are not kernels and are not called by a kernel. It allows LLVM to remove all the unused stdlib functions.

Then, the device is allowed to add more optimization or analysis passes. Coal::CPUProgram::createOptimizationPasses() adds standard link-time optimizations, but hardware-accelerated devices could add autovectorizing, lowering, or analysis passes.

Finally, Coal::DeviceProgram::build() is called. It's a no-op function for Coal::CPUDevice as it uses directly the module with a LLVM JIT, but hardware devices could use this function to actually compile the program for the target device (LLVM to TGSI transformation for example).

The program is now built and ready to be usable !

Finding kernels

Now that the program is built, it's time to get its kernels. The functions declared as kernel in OpenCL C (with the __kernel attribute) are registered by Clang using the !opencl.kernels metadata. They are read from the LLVM module by Coal::Program::kernelFunctions(). Note that this function is device dependent, as it have to use the LLVM IR generated for the specified device.

When the kernels are found, Coal::Kernel objects can be instantiated. These objects are again device-independent as requested by the OpenCL spec. A Coal::Kernel object is mainly a name and a list of device-specific information. There is for instance the llvm::Function object that will be called.

Once the Coal::Kernel object is created, Coal::Kernel::addFunction() is called for every device for which the Coal::Program is built. This function has the responsibility to explore the arguments of the function and to create a list of device-independent Coal::Kernel::Arg objects (kernel arguments). For instance, when it sees an argument of type <4 x i32>, it converts it to a Coal::Kernel::Arg of kind Coal::Kernel::Arg::Int32 and vector dimension 4.

Setting kernel arguments

The Coal::Kernel::Arg objects are interesting. They are an abstraction layer between the host CPU and the device. They also enable Coal::Kernel to implement its Coal::Kernel::setArg() function, that performs checks on the value given as argument.

This class also contains semantic information specific to OpenCL. For instance, in LLVM, the address space qualifiers like __global or __local are represented as address spaces (0 = private, etc). Coal::Kernel::addFunction() translates these address spaces into Coal::Kernel::Arg::File values.

When the users call clSetKernelArg(), the execution flow arrives at Coal::Kernel::setArg(). This function puts the correct value in the Coal::Kernel::Arg object, and does some checks. It is also his responsibility to recognize Coal::Sampler objects.

Samplers are a bit special as they are pointers to Coal::Sampler objects on the host CPU, and plain uint32_t values on the kernel side. This makes their translation from LLVM type to Coal::Kernel::Kind a bit difficult, as Clover only sees an LLVM i32 type for a sampler and also for a normal uint32.

The trick used in Clover is to store in memory a list of the known samplers. When a Coal::Sampler object is created, it is registered in this list. When it is deleted, its index is removed from the list. This in implemented in Coal::Object and shared between all the Coal classes. It allows the implementation of functions like Coal::Object::isa(), very useful to check that arguments given by the user are sane.

So, Coal::Object::isa() is used to recognize when an argument passed to Coal::Kernel::setArg() is in fact a sampler. When it is the case, the pointer to Coal::Sampler is replaced by the sampler's "bitfield" representation, using Coal::Sampler::bitfield().

  // Special case for samplers (pointers in C++, uint32 in OpenCL).
  if (size == sizeof(cl_sampler) && arg_size == 4 &&
      (*(Object **)value)->isA(T_Sampler))
  {
      unsigned int bitfield = (*(Sampler **)value)->bitfield();

      arg.refineKind(Arg::Sampler);
      arg.alloc();
      arg.loadData(&bitfield);

      return CL_SUCCESS;
  }

This trick is described in more detail at the end of this blog post : http://steckdenis.wordpress.com/2011/08/07/when-easy-is-difficult-and-vice-versa/ .

Queuing an event

Once the Coal::Kernel object is created and its args set, the client application can call clEnqueueTask() or clEnqueueNDRangeKernel(). These functions create a Coal::KernelEvent object responsible for telling the device to execute the kernel.

When the event arrives on the device (see Command Queues, Events and Worker Threads), Coal::CPUDevice initializes the LLVM JIT (Coal::CPUProgram::initJIT()) and then does that in src/core/cpu/worker.cpp :

 KernelEvent *e = (KernelEvent *)event;
 CPUKernelEvent *ke = (CPUKernelEvent *)e->deviceData();

 // Take an instance
 CPUKernelWorkGroup *instance = ke->takeInstance();
 ke = 0;     // Unlocked, don't use anymore

 if (!instance->run())
     errcode = CL_INVALID_PROGRAM_EXECUTABLE;

 delete instance;

The first step is to use Coal::Event::deviceData() to get a Coal::CPUKernelEvent object. See Coal::Event::setDeviceData() and Coal::DeviceInterface::initEventDeviceData().

This Coal::CPUKernelEvent holds information about the event needed by Coal::CPUDevice.

How Clover handles work-groups and work-items

The next line is interesting : Coal::CPUKernelEvent::takeInstance() is called. This function works in pair with Coal::CPUKernelEvent::reserve() called from Coal::CPUDevice::getEvent().

A kernel is run in multiple "work groups", that is to say batches of work items. The worker threads (see Command Queues, Events and Worker Threads) take work-groups one at a time, so there can be multiple work groups of a single kernel running concurrently on a multicore CPU.

Command Queues, Events and Worker Threads gives more details about that, but the main principle is that there is a list of events a worker thread can execute. For Coal::KernelEvent, a worker thread calls Coal::CPUKernelEvent::reserve() to see if there is a work-group available for execution (that is to say if the work groups aren't yet all executed). If there is one available, a mutex is locked and the function returns. Then, the worker thread calls Coal::CPUKernelEvent::takeInstance() to actually get the work-group, and runs it through Coal::CPUKernelWorkGroup::run().

Passing arguments to the kernel

Once the work-group is taken, it is run and must call the kernel function (using the JIT) for every work-item. This is done very simply by getting a function pointer to the kernel using llvm::ExecutionEngine::getPointerToFunction(). This function must now be called with the needed arguments.

The difficult thing is that C++ doesn't allow to give arbitrary arguments to a function. A function can receive arbitrary arguments, using void foo(int argc, ...), but an arbitrary list of arguments cannot be passed like in foo(va_build(std_vector));. They must be known at compilation-time.

The solution used by LLVM is to use a function like llvm::JIT::runFunction(function, vector of args). This function internally creates a "stub" function taking zero arguments but calling the target function itself with its arguments passed as constants. That is to say, when we want to call bar(3, 4);, a stub is compiled like this :

 void stub() {
     bar(3, 4);
 }

This stub is then JITed and run directly :

 void (*stub)() = getPointerToFunction(stub_function);
 stub();

LLVM then destroys the stub. This is a waste of time as a stub is slow to generate and JIT compile.

Clover also uses stubs, and not libffi (for Foreign Function Interface). Libffi is a library allowing to call a function with arguments only known at run-time, but it is too slow (the arguments are re-built for every call) and doesn't fully support vectors (it supports XMM registers, but slowly).

The solution retained was to mimic the way LLVM does its stub. As a kernel can be run multiple times with different arguments (when the application explicitly does so, or when there are __local pointers needing to be reallocated between each work-group), the stub function cannot simply use constants, because rebuilding it for each set of arguments would be too slow.

So, the Clover's stub takes exactly one parameter: a void* pointer. This pointer contains the actual parameters, carefully aligned by Coal::CPUKernel::typeOffset(). The stub itself is built by Coal::CPUKernel::callFunction() and is like that :

 void stub(void *args) {
     // We know the args the kernel takes and their types.
     kernel(
         *(int *)args, // For an int argument
         *(float **)((char *)args + 8), // For a float* argument, after the int in args, and aligned to sizeof(void*)
         *(sampler_t *)((char *)args + 16));
 }

Each argument is simply built in LLVM IR like this :

 param = load(bitcast(getelementptr(args, offset_in_args) to param_type*))

Built-ins and Thread Local Storage

The OpenCL C language provides built-ins that can be called from the kernels. For the most of them, there is no problem: they can either be implemented as LLVM instructions and then compiled for the CPU, or the standard library (src/core/runtime/stdlib.c) provides an implementation.

But there are cases where information outside the kernel is needed. For example, the get_work_dim() builtin takes no argument, but has to return a value dependent of the current Coal::KernelEvent being run.

In order to handle that, a call is made from the kernel to the Clover library. It's made possible by a very handy LLVM function: llvm::ExecutionEngine::InstallLazyFunctionCreator() called by Coal::CPUProgram::initJIT(). This function allows Clover to register a function that will resolve function names to function addresses. This way, a function called "get_work_dim" in the kernel will be passed to this function creator, that will return a pointer to get_work_dim() in src/core/cpu/builtins.cpp.

 void *getBuiltin(const std::string &name)
 {
     if (name == "get_global_id")
         return (void *)&get_global_id;
     else if (name == "get_work_dim")
         return (void *)&get_work_dim;

     return (void *)&unimplemented_stub;
 }

It's good, but one problem remains: get_work_dim() doesn't take any argument, but has to return an information about the currently-running kernel. Here, the internal structure of Coal::CPUDevice has to be taken into account. The device creates one worker thread per CPU core, and each of these worker threads can run only one work-group at a time, but multiple worker threads can run different kernels and work groups concurrently.

So, the solution retained is a Thread-Local variable. Such a variable is like a global variable (shared among all the classes and functions of a project), but its value is private to the currently-running thread. As a thread always handles only one work-group, a TLS variable is what is needed, and what Clover uses. It's named g_work_group.

One of these built-ins is particularly interesting, see Implementing barriers.

The call

Finally, the work-items can be called in sequence. The stub and its kernel function are JITed only once, it's fast :

 do
 {
     // Simply call the "call function", it and the builtins will do the rest
     p_kernel_func_addr(p_args);
 } while (!p_had_barrier &&
          !incVec(p_work_dim, p_dummy_context.local_id, p_max_local_id));

This code can be found in Coal::CPUKernelWorkGroup::run(). The incVec() call is there to handle the 3D global and local IDs. It returns true when the vector we are incrementing reaches p_max_local_id.

More explanation of this part can be found on the Implementing barriers page.