09:50 loryruta[d]: Hello all, I have a low level question although I’m not sure nouveau channel is the right one to ask… 😅
09:50 loryruta[d]: I’m writing a Vulkan abstraction for which I have to cope with automatic flushing of commands. What I mean is that the user of this abstraction doesn’t have to care about recording and submitting command buffers manually.
09:50 loryruta[d]: This is pretty much what I assume OpenGL/OpenCL drivers are doing right? Command flushing is implicit and you’d optionally call clFlush().
09:50 loryruta[d]: So what’s the algorithm for doing that? How many commands would you record and when would you submit?
10:15 karolherbst[d]: There are sometimes hardware limitations on how big a command submission can be, but then drivers break it up. The bigger constraint is the timeout the hardware is configured to use to bail on a submission, but in theory you can record until you run out of memory (or time)
10:27 x512[m]: It would be nice to have a Vulkan extension to control long command buffer execution (disable the timeout on GPUs with preemption support, cancel execution without causing a lost device).
10:28 x512[m]: GPU acceleration API is kinda still in Windows 3.11 era with cooperative multitasking and watchdog.
10:53 loryruta[d]: karolherbst[d]: So what would an OpenGL/OpenCL driver do? Since we're in the Discord channel #nouveau🔗, let's talk about NVIDIA discrete GPUs. Are you saying GL commands are submitted when their GPU execution is believed to be approaching the timeout?
10:55 karolherbst[d]: normally a driver just picks a timeout that is good enough
10:55 karolherbst[d]: or none
12:29 karolherbst[d]: Though because GL doesn't necessarily have a defined command buffer, some drivers also submit at somewhat arbitrary points, like the old nouveau driver does when the buffer gets big enough
12:33 loryruta[d]: karolherbst[d]: yea, that’s the thing, the GL API doesn’t expose command buffers, so - most notably for VK backends - you have to figure out when to create those buffers, record and submit
12:34 karolherbst[d]: the nouveau gl driver goes by size
12:34 karolherbst[d]: but not quite sure what zink is doing
12:34 loryruta[d]: karolherbst[d]: size of the command buffer? you mean number of recordings or something else?
12:34 karolherbst[d]: loryruta[d]: raw buffer size
12:35 karolherbst[d]: it's all memory in the end and the buffer is finitely sized. So once that gets close to the end, it gets submitted on the spot
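A minimal sketch of that size-threshold strategy (C++; all names here are hypothetical — a real driver tracks the fill level of its kernel-visible pushbuffer, not a `std::vector`, and `submit()` would be an ioctl):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical recorder: commands are appended to a finite buffer and the
// whole thing is submitted once the fill level approaches the end.
struct Recorder {
    std::vector<uint32_t> buf;   // the finite command buffer
    size_t capacity;             // total size in words
    size_t watermark;            // flush once the fill level passes this
    int submissions = 0;

    explicit Recorder(size_t cap) : capacity(cap), watermark(cap - cap / 8) {}

    void submit() {              // stand-in for the real kernel submission
        ++submissions;
        buf.clear();
    }

    void record(const uint32_t *cmds, size_t n) {
        if (buf.size() + n > watermark)  // close to the end: flush on the spot
            submit();
        buf.insert(buf.end(), cmds, cmds + n);
    }
};
```

The watermark leaves headroom so a single `record()` call never overruns the buffer mid-command.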
12:36 karolherbst[d]: but not saying that this is a good strategy, because it causes random multi threading and state synchronization issues with multiple GL contexts
12:36 loryruta[d]: okay, I guess this is something you can do when you work “beneath” the Vulkan layer, as nouveau does. There's no way of querying the raw buffer size from VK as far as I can tell :/
12:37 loryruta[d]: karolherbst[d]: yeah…
12:38 karolherbst[d]: anyway, vulkan drivers will just split it up internally
12:54 loryruta[d]: karolherbst[d]: Yes, I wouldn't say it's the best way. I believe commands whose GPU execution time is long can take buffer space away from commands whose execution time is short.
12:54 loryruta[d]: Ideally on hardware with a max execution cap, you want to record enough commands to approach that, right? That would be a good solution. But you can't predict how long a command will take... unless you've already seen it (?)
12:55 karolherbst[d]: yeah, that's why timeouts are generally quite high. There will be many submissions per frame, so unless the workload is very heavy it's not an issue. All those things become way more critical with compute workloads that might even run for a couple of seconds/minutes
13:08 loryruta[d]: I'm indeed working with compute workloads, but not heavy ones... surely I'm not using Vulkan the standard way (for graphics, frames, etc.) 😅
13:08 loryruta[d]: My issue is that I'm developing a library where the user can leverage VK to dispatch compute kernels, transparently from the VK API. They're not responsible for command buffer recording nor queue submissions. It's basically a CUDA/OpenCL but backed by VK. Hence I'm finding myself trying to understand the driver's logic for recording/submission. I've set up a test where the user dispatches 10'000
13:08 loryruta[d]: sequential, 1M-dimensional compute kernels, and I found the OpenCL implementation to be faster than mine by 3-4 seconds.
13:08 loryruta[d]: The way I've designed it is: my dispatcher has a ring of **N** command buffers. I start recording the first and place full memory barriers between each dispatch. If **1ms** has passed since the last recording I stop the command buffer and submit it. Between adjacent command buffers I have an ever-increasing timeline semaphore. When I reach the first command buffer of the ring again, I host-wait on it.
13:08 loryruta[d]: However I have two hardcoded hyperparameters there: N and 1ms. I was wondering if drivers were being smarter. I was also keen on looking at the ANGLE CL implementation.
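The ring scheme described above can be sketched structurally like this (C++; the real `vkQueueSubmit`/`vkWaitSemaphores` calls are stubbed out, and the GPU completion model is simulated — slot bookkeeping is the point of the sketch, not timing):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int N = 4;  // ring size, one of the hardcoded hyperparameters

// Structural sketch: a ring of N command buffers, each submission
// signalling an ever-increasing timeline semaphore value.
struct Dispatcher {
    uint64_t next_value = 1;             // timeline value for the next submit
    uint64_t gpu_completed = 0;          // simulated: what the GPU has retired
    std::array<uint64_t, N> signaled{};  // value each ring slot will signal
    int slot = 0;
    int host_waits = 0;

    void host_wait(uint64_t value) {     // stand-in for vkWaitSemaphores
        ++host_waits;
        if (gpu_completed < value) gpu_completed = value;  // pretend it retired
    }

    void flush() {                       // end recording + submit this slot
        signaled[slot] = next_value++;
        slot = (slot + 1) % N;
        // Before reusing the next slot, its previous submission must retire.
        if (signaled[slot] != 0 && gpu_completed < signaled[slot])
            host_wait(signaled[slot]);
    }
};
```

The host-wait only ever happens when the ring wraps onto a slot whose submission hasn't retired yet, which is the scheme the message above describes.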
13:12 karolherbst[d]: have you checked if you are GPU or CPU bound?
13:12 karolherbst[d]: and also often it's just slow because the CPU stalls waiting on the GPU
13:14 karolherbst[d]: like you don't want to wait on any submissions on the thread that does the submissions
13:14 karolherbst[d]: because then you'll starve the GPU
13:15 karolherbst[d]: have you checked how rusticl + zink + nvk perform compared to your impl? I know that nvidia is a lot faster because they also do usermode command submission, so it might not be a good comparison for a microbenchmark submitting tons of things
13:16 karolherbst[d]: because nvk doesn't do usermode submission yet
13:16 loryruta[d]: karolherbst[d]: I was CPU bound, originally on alloc/update/bind of descriptors. I switched to bindless and it got substantially faster. Now I'm measuring CPU time during the execution of those 10'000 1M kernel dispatches. The kernels are 1-1 with OpenCL so I wouldn't say there's much of a difference in GPU execution
13:16 karolherbst[d]: right, but the GPU might be idle between dispatches
13:17 karolherbst[d]: and a native CL driver can do a lot more optimized submissions and threading
13:17 loryruta[d]: karolherbst[d]: uhm, I'm actually waiting on main thread 🤔 maybe I shouldn't...
13:17 karolherbst[d]: rusticl spawns two threads per cl queue
13:17 karolherbst[d]: one to submit and the other to wait
13:18 karolherbst[d]: it was using one thread previously, but Rob noticed on freedreno that the GPU gets starved when the thread doing the submissions also waited
13:19 loryruta[d]: karolherbst[d]: well actually I wrote in the nouveau channel but what I'm comparing is OpenCL with ~~shitty~~ proprietary nvidia drivers and VK with proprietary nvidia drivers
13:19 loryruta[d]: I really think multi-threading is the way to go
13:19 karolherbst[d]: I have a prototype to run rusticl on top of CUDA btw 🙃
13:20 loryruta[d]: and wait on main thread e.g. if you want to readback values on gpu
13:20 loryruta[d]: karolherbst[d]: I'm not familiar with rusticl 🥲
13:20 loryruta[d]: oh okay
13:20 karolherbst[d]: loryruta[d]: yeah... the main reason it's two threads in rusticl is because you have cl_event objects and you need to tell as quickly as possible if the GPU is done, because the application thread might want to check it
13:20 karolherbst[d]: so depending on the API it might not have to be its own thread
13:21 karolherbst[d]: rusticl uses condvars to synchronize and the "wait" thread just signals those, so the app thread wakes up or skips waiting altogether
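A toy C++ illustration of that two-thread split (all names invented; the "GPU" is simulated by the wait thread retiring one fence per submission): the submit thread never blocks on the GPU, and the wait thread signals a condvar so application threads can check or wait on completion cheaply.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

struct Queue {
    std::mutex m;
    std::condition_variable cv;
    uint64_t submitted = 0;  // last fence value handed to the "GPU"
    uint64_t completed = 0;  // last fence value seen to retire

    void submit() {                         // submit thread: never blocks
        std::lock_guard<std::mutex> l(m);
        ++submitted;
        cv.notify_all();                    // wake the wait thread
    }

    void wait_thread_body(uint64_t until) { // wait thread: the only one blocking
        std::unique_lock<std::mutex> l(m);
        while (completed < until) {
            cv.wait(l, [&] { return submitted > completed; });
            ++completed;                    // stand-in for a real fence wait
            cv.notify_all();                // wake app threads polling events
        }
    }

    void app_wait(uint64_t fence) {         // app thread: cheap condvar wait
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return completed >= fence; });
    }
};
```

The key property is the one mentioned above: the thread doing submissions never sits in a blocking wait, so the GPU doesn't get starved while the app thread can still ask "is event X done?" at any time.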
13:22 loryruta[d]: it's very clever, thank you for the suggestion 🙂
13:22 karolherbst[d]: loryruta[d]: just a Mesa-based CL impl I started to replace clover
13:23 loryruta[d]: it's so cool how much software you're developing here!
13:23 loryruta[d]: btw I think OpenCL should burn in hell nowadays
13:24 karolherbst[d]: yeah maybe.... but there is no viable alternative really, so it's what we have
13:25 x512[m]: Open source CUDA for NVIDIA?
13:25 karolherbst[d]: though more and more companies spend time on CL, so it might become a lot better (there are e.g. command buffer extensions now and a lot of other things)
13:25 x512[m]: Not only for AMD...
13:25 karolherbst[d]: nah.. CUDA isn't a great API actually 🙃
13:25 karolherbst[d]: the thing why CUDA is great is because nvidia is a software company providing great tooling
13:26 karolherbst[d]: but the API itself is kinda janky in many ways
13:26 karolherbst[d]: at least the low level driver API. I think the C++ abstraction is a lot better
13:26 karolherbst[d]: but never looked into it much
13:28 karolherbst[d]: but anyway.. CUDA isn't a great API for cross vendor stuff, because a lot of nvidia specific things are baked into PTX and it's going to be sub-optimal for many other devices.
13:31 loryruta[d]: karolherbst[d]: Vulkan compute?
13:31 karolherbst[d]: that's just Vulkan tho
13:31 karolherbst[d]: I mean it's a good idea, but also it's not the easiest API to just use
13:32 x512[m]: Vulkan compute has no UVM.
13:33 karolherbst[d]: but it could have
13:33 loryruta[d]: x512[m]: UVM?
13:33 snowycoder[d]: loryruta[d]: Unified Virtual Memory
13:33 x512[m]: Unified Video Memory.
13:34 karolherbst[d]: tldr: you get the same address on the host and the GPU to the same logical memory allocation
13:35 loryruta[d]: I'm not using it, but don't you have `HOST_VISIBLE + DEVICE_LOCAL` (depending on the hw)?
13:35 karolherbst[d]: well it's more than that
13:35 loryruta[d]: In VK
13:35 karolherbst[d]: it's kinda like bda but more advanced
13:36 karolherbst[d]: so you can just have e.g. a linked list on the host side that points into mapped GPU allocations, and you can read/write to it on the host
13:36 karolherbst[d]: and then the GPU can just take those without you having to convert pointers
13:36 x512[m]: It means the same virtual address space between CPU and GPU. You use the same pointers for GPU and CPU. Memory can dynamically migrate between sysmem and VRAM.
13:36 karolherbst[d]: well dynamic migration is kinda an optional feature and kinda a big massive pita
13:36 loryruta[d]: oh... this reminds me of `cudaMallocManaged` lol
13:37 snowycoder[d]: It should also be faster for non-dedicated GPUs (mobile) because you remove all the memcpys and you already share the RAM
13:37 karolherbst[d]: loryruta[d]: yeah basically that
13:37 karolherbst[d]: but it's slower 🙃
13:37 karolherbst[d]: like managed allocs are slower
13:38 loryruta[d]: yes I recall people were avoiding that 😆
13:38 karolherbst[d]: snowycoder[d]: yeah.. with unified memory (not to be confused with unified virtual memory) it's pretty trivial to support
13:38 karolherbst[d]: you just map the buffer at placed virtual addresses and do some mmap magic
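Roughly this kind of mmap trick, in a Linux-only toy form (anonymous memory stands in for the driver's shared allocation; the relevant part is placing memory at a previously reserved virtual address with `MAP_FIXED`, so CPU and GPU views can agree on the pointer value):

```cpp
#include <cstring>
#include <sys/mman.h>

// Returns 0 on success. Sketch only: a real driver would place a
// device-shared allocation here, not a second anonymous mapping.
int placed_mapping_demo() {
    const size_t len = 4096;
    // 1. Reserve a virtual address range without backing it.
    void *reserved = mmap(nullptr, len, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (reserved == MAP_FAILED)
        return 1;
    // 2. Place real memory at exactly that address with MAP_FIXED.
    void *placed = mmap(reserved, len, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (placed != reserved)
        return 1;
    // 3. The host writes through the placed pointer; a GPU mapping at the
    //    same VA would see the same logical allocation.
    std::memcpy(placed, "svm", 4);
    int ok = std::memcmp(placed, "svm", 4) == 0 ? 0 : 1;
    munmap(placed, len);
    return ok;
}
```

The reserve-then-place step is what lets a runtime hand out addresses that are valid on both sides before any memory is actually committed.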
13:38 karolherbst[d]: I've implemented SVM (because why call it unified if you can call it shared) in rusticl that does handle it
13:39 karolherbst[d]: but I haven't added a proper optimization for unified memory yet
13:40 karolherbst[d]: anyway, it's a lot less magic than people assume it is
13:41 karolherbst[d]: anyway, would need an ext in vulkan for it and any of those exts will be complex, because you are messing with virtual memory 🙃
13:42 karolherbst[d]: loryruta[d]: so the big issue with managed is that it does implicit migrations, which means it relies on page faults on the GPU, which means it interrupts the GPU from doing work. A lot of it can be mitigated by migration hints and telling the runtime to move the memory ahead of time.
13:44 karolherbst[d]: a good use case is e.g. huge databases, where accesses are not predictable and doing ad-hoc migration is cheaper than migrating tons of GiB of data that might not even be used
13:46 karolherbst[d]: I think you can also oversubscribe?
13:48 loryruta[d]: karolherbst[d]: I don't have much knowledge about memory management so I can't really comment on that... the way I work is that memory is either host or device, it could be host visible, if it is I can map it 😄
13:48 loryruta[d]: So when you map you get "new" addresses for the host to work with. In UVM you would have a single address which can be accessed by both host and GPU, right? 🤔
13:48 loryruta[d]: In a discrete GPU setup, where would this buffer live?
13:48 karolherbst[d]: yes
13:49 karolherbst[d]: loryruta[d]: depends
13:49 karolherbst[d]: that's why I said "logical allocation" above
13:49 karolherbst[d]: it might be that the runtime reserves space on the host _and_ the gpu
13:49 karolherbst[d]: and just migrates content on the fly
13:49 karolherbst[d]: or it maps it
13:49 karolherbst[d]: or...
13:50 karolherbst[d]: like the CPU can access VRAM, and the VRAM can access host memory, but you also have the PCIe BAR in between for host mapped VRAM and that's limited
13:50 karolherbst[d]: and for speed you want to operate on VRAM
13:50 karolherbst[d]: so there are a lot of heuristics at play
13:50 karolherbst[d]: Intel has a cl_intel_unified_shared_memory extension that makes that more explicit
13:50 karolherbst[d]: like you can decide how it's placed and how it's migrated
13:51 karolherbst[d]: it really depends how the memory is used
13:54 loryruta[d]: karolherbst[d]: this reminds me of the managed memory in CUDA. As far as I understand, "migrating" memory is no different from a CPU<->GPU copy (is it?). Hence I've always preferred to work with explicit host- and device-only buffers, and occasionally allocate staging buffers for sharing data. However as I'm working on mobile devices I'm sure it's worth looking into unified memory
13:54 karolherbst[d]: loryruta[d]: I think you can pin it, but yeah
13:55 karolherbst[d]: if the GPU and CPU share the same memory all of this becomes way less relevant, because it's just the same memory
13:56 loryruta[d]: yes, but I could avoid duplicating buffers when having to access data on host (resp gpu) 😅
13:56 karolherbst[d]: yeah
13:56 loryruta[d]: I should look into it!
13:56 karolherbst[d]: in CL you can e.g. also do HOST_PTR allocations, but then you get different addresses
13:56 karolherbst[d]: there is also a CL bda extension that guarantees you the same address without any of the other UVM/SVM stuff
13:56 karolherbst[d]: or rather.. it gives you a stable address
13:59 loryruta[d]: that's great, there is a bda extension also for vk
13:59 loryruta[d]: but didn't look into it yet...
13:59 karolherbst[d]: yeah.. the virtual cl_mem handle CL is doing is a bit awkward for modern times
14:02 karolherbst[d]: there is also chipstar that tries to implement Cuda on top of CL: https://github.com/CHIP-SPV/chipStar and the paper about it was published recently (and I even participated in writing that one, so shameless plug kinda 🙃 )
14:03 karolherbst[d]: the CL bda ext is one thing that was created as a result of it
14:12 loryruta[d]: karolherbst[d]: wow that's an insane amount of work... congrats 😅
17:35 mhenning[d]: anyone want to do a quick review for https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39812 and https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39977 ?
18:39 karolherbst[d]: mhenning[d]: that .reliable flag doesn't do anything
18:39 karolherbst[d]: it's metadata for compilers/debuggers
19:41 airlied[d]: loryruta[d]: at least for graphics apps there is usually an indicator to flush, like a buffer swap or explicit sync stuff. Though I think otherwise it's just arbitrary size limits. Nobody queues up stuff close to the 10s timeout because of latency, and also that would be nuts
19:52 airlied[d]: zink also has a flush queue thread
23:21 jannau: can someone test https://gitlab.freedesktop.org/mesa/mesa/-/issues/14975#note_3356323
23:22 jannau: I fixed only the copy in hk and not the source in nvk