Clover: Implementing barriers

barrier() is an OpenCL C built-in function allowing synchronization between the work-items of a work-group. When a work-item encounters a barrier(), it must wait for all the other work-items of its work-group to also encounter the same barrier() call.

An example of use-case could be encrypting a buffer. A work-item would be the operation run on a single byte, and work-groups blocks of data that must be processed in chunks. Each work-item will first read its byte, then encrypt it using maybe the next and previous bytes in the work-group. After that, the data must be written back into the buffer, but if the work-item does so, another one could try to read what it have just written (because the work-item reads the next and previous bytes), and that will not be correct.

So, the work-item first calls barrier(), to let the others read their sane data before overwriting it with encrypted data.

How Clover handles work-groups and work-items

Before beginning with the explanation of barrier(), a first note about how Clover handles the work-groups and work-items.

A work-group is a set of work-items. In the spec, work-groups can be run in parallel, and their work-items can also be run in parallel. This allows massively parallel GPUs to launch kernels efficiently (they are slower than a CPU but made of thousands of cores). A CPU isn't very parallel, so it makes no sense to have one thread per work-item, it would require up to hundreds of thousands of threads for kernels running on a huge amount of data (for example converting an image from RGB to sRGB, it's the same computation for every pixel, so each pixel can be run in parallel).

Clover uses another technique: each work-group is run in parallel, but the work-items are run sequentially, one after the other. This allows Clover to be pretty fast for most of the cases, as all the CPU cores are used and no time is lost in thread switch and synchronization primitives. An interesting function here is Coal::CPUKernel::guessWorkGroupSize(). It tries to divide the number of work-items by a number the closest possible to the amount of CPU cores, falling back to one work-group if the number of work-items is prime (or not easily divisible, to avoid having 517583527 work-groups of only one work-item).

In short, the work-items are run sequentially in Clover.

The problem

The problem is that barrier() must oblige the current work-item to wait for the others. But as they are run sequentially, a work-item cannot wait for the others as it must have finished for them to be run.

The solution is to pause the current work-item and to jump into the next. When the next also encounters barrier(), it also jumps to the next, and so forth. When the last work-item encounters a barrier(), it jumps back to the first one, that can continue its execution past its barrier() call.

/home/steckdenis/programs/Clover/doc/html/inline_dotgraph_1.dot

This graphs shows a special case: when an item finishes, the next resumes where it left, not at its beginning. This explain why there is a test for barriers in Coal::CPUKernelWorkGroup::run() :

 do
 {
     p_kernel_func_addr(p_args);
 } while (!p_had_barrier &&       // <== here
          !incVec(p_work_dim, p_dummy_context.local_id, p_max_local_id));

Technical solution

Now that the problem is solved on paper, a working solution has to be found. What Clover wants to achieve is stopping a function in the middle of it, and then resuming it.

Some people may know the setjmp() and longjmp() functions. They do nearly what is needed but are not considered secure to use for resuming a function (that means that we can longjmp() from a function to another, but we cannot then resume the function that called longjmp()).

Another solution is POSIX contexts, managed by the functions setcontext(), getcontext() and swapcontext(). These are what Clover uses. When a barrier() call is encountered, the current work-item is saved in its context, and then the next is executed (using swapcontext()). This is done in Coal::CPUKernelWorkGroup::barrier().

The problem of stacks

In fact, it's a bit more complex than that. The first reason is that Clover doesn't want to slow down kernels not using barrier() (the majority), so the barrier() overhead in the case this built-in is never called must be as low as possible. Creating a context for each work-item takes time, it's not good.

Another thing to keep in mind is that a function (and kernels are functions) stores parameters, local variables and temporaries on the stack. If a work-item is halted, its stack mustn't be clobbered by another work-item. So, each work-item must have a separate stack, and these stacks must be created.

Clovers uses for that mmap(), a function that can be used to allocate large chunks of data, way faster than malloc() (malloc() uses mmap() internally, with also a memory pool). Stacks are in fact "large", currently 8 KB (but future version of OpenCL will run an analysis pass to count the maximum amount of data alloca'ed() by the kernel and its sub-functions), and each work-item must have its one.

For kernels designed for barrier(), that is to say with a known number of work-items per work-group (usually low), there is no problem. Even 512 work-items take only 4 MB of stack, a single Huge Page on x86, and nothing with regard to the amount of RAM currently found on modern computers.

But the problem is for kernels not designed for barrier(). These ones use higher work-groups, or even let Clover decide how to split the work-items into work-groups (using Coal::CPUKernel::guessWorkGroupSize()). For a 1024x1024 image, with one work-item per pixel, and a 4-core CPU, Clover will create work-groups of 262144 work-items ! If each of them must have its own 8KB stack, that means a memory usage of 2 GB !

So, Clover cannot use independent work-items when barrier() is never called. This is achieved by using a tiny dummy context at the beginning of Coal::CPUKernelWorkGroup::run(), holding only the data needed by the built-in functions like get_global_id(), that is to say the work-item index. Then, a while() loop is used to execute the work-items sequentially, incrementing the work-item index of the dummy context at each loop iteration. This makes the whole thing working when no barrier() calls are issued, and without any slowdown (maybe about ten CPU cycles, but JITing the caller functions takes way more time than that).

Implementation

How is barrier() implemented then ? Simple, when the first call to barrier() is made, Coal::CPUKernelWorkGroup::barrier() begins by allocating memory for the stacks (and puts its pointer in a TLS memory location). This function is also able to reuse the memory allocated by a previous work-group run on the same thread if it is large enough to hold the stacks. This way, it's a bit faster.

A note about this memory : it doesn't contain only the stacks, but also a Coal::CPUKernelWorkGroup::Context structure. This memory is accessed using Coal::CPUKernelWorkGroup::getContextAddr() that is given an index and returns a pointer to a Coal::CPUKernelWorkGroup::Context. The stack of this context is located just after the Context structure.

Once the memory is allocated, Coal::CPUKernelWorkGroup::barrier() can proceed as if all was normal. It first tries to take the next context, and checks if it is initialized (that is to say its work-item has already begun and is currently halted somewhere in its execution). It's easy to do as mmap() zero-initializes the memory. If the work-item isn't yet initialized, it is created and initialized to point to the kernel function, with the correct args.

After the context creation, all that is needed is to swap the contexts, that is to say to save the current context in the memory location, and to jump to the next. If this is the first time barrier() is called, the current context is simply the "main context" of the thread, and it gets saved like any other context: barrier() successfully achieved to work even when a dummy context is used.

The context being swapped, the execution can begin or continue in the next work-item.

End of execution

One thing remains to be done, as pointed out at the end of The problem : when a barrier() has been encountered and a work-item finishes, we cannot just launch the next, as it has already begun before. So, we need to separately handle the case of a barrier() having been called. Coal::CPUKernelWorkGroup::run(), if a barrier() was called and when the first work-item finishes, doesn't launch the next one but goes directly in another loop.

This loop simply calls swapcontext() for each remaining work-item. The other work-items will each terminate. An interesting property of contexts is used here: when a context finishes its execution, the execution returns to the context that have created it, the context given here in Coal::CPUKernelWorkGroup::barrier() :

 Context *main = getContextAddr(0);  // The context not created with makecontext
 
 if (next->initialized == 0)
 {
     // [...]
     next->context.uc_link = &main->context;  // <== here

That means for clover that when a work-item finishes, the execution flow will return to Coal::CPUKernelWorkGroup::run() where it has left, that is to say at the swapcontext() call. This allows Clover to simply terminate all the work-items.