07:30 karolherbst: pmoreau, RSpliet: so apparently non-uniform indirect const buffer accesses hurt performance quite a bit, to the point where accessing the data via g[] can actually be faster... do you know anything about the specifics here?
07:31 karolherbst: according to HdkR the hardware has to serialize on each thread in order to fetch the data
07:31 karolherbst: just wondering if any of you looked into it at some point
07:32 HdkR: wark wark wark
07:34 karolherbst: HdkR: I am also wondering why nvidia doesn't optimize this const buffer access to global memory if it's indeed faster to use global memory in this case...
07:34 karolherbst: could improve perf inside compute shaders especially
07:35 karolherbst: or maybe vertex shaders as well
07:36 HdkR: Well if you can't see into how it is indirectly accessing the UBO, then maybe more often than not it's actually uniform in regular workloads
07:36 karolherbst: right, but you should know in most cases
07:36 karolherbst: if you depend on texture data then yes... it can become quite difficult to tell
07:37 karolherbst: but sometimes you also know it's totally not uniform
07:37 karolherbst: like if you access with the thread id
07:37 karolherbst: which I am sure is not _that_ uncommon
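A minimal OpenCL-style sketch of the distinction being discussed here — a uniform indirect index versus a thread-id-based one; the kernel and variable names are invented for illustration:

```c
__kernel void cbuf_index_example(__constant float *cb,
                                 __global float *out,
                                 int material_id)
{
    int tid = get_global_id(0);

    /* Uniform indirect access: every thread reads the same constbuf
       element, so the hardware can broadcast it in a single fetch. */
    float a = cb[material_id];

    /* Non-uniform indirect access: the index differs per thread, so the
       constbuf read reportedly gets serialized per distinct index. */
    float b = cb[tid];

    out[tid] = a + b;
}
```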
07:38 HdkR: Yea, could probably have a heuristic there. Maybe the more reliable response is that if any app is doing that then... it's bad practice, so fix your app? :P
07:38 karolherbst: mhhh
07:38 karolherbst: wouldn't think so
07:38 karolherbst: imagine you write an OpenCL kernel
07:38 karolherbst: and you use global* memory for your input at first
07:39 karolherbst: and at some point you get smart and replace it with constant* because you think it will be faster
07:39 karolherbst: and it's quite common to just do input[tid]
07:39 HdkR: yea
07:39 HdkR: I can definitely understand hitting the pitfall
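A hedged sketch of the pitfall described above, assuming a typical OpenCL kernel; the identifiers are invented, and the only change between the two versions is the address space qualifier on the input:

```c
/* First version: input in __global memory, indexed by thread id. */
__kernel void scale_global(__global const float *input,
                           __global float *output, float k)
{
    size_t tid = get_global_id(0);
    output[tid] = input[tid] * k;
}

/* "Optimized" version: same input[tid] pattern, but moved to __constant.
   Because the index is per-thread (non-uniform), this can end up slower
   on hardware that serializes divergent constbuf reads. */
__kernel void scale_constant(__constant float *input,
                             __global float *output, float k)
{
    size_t tid = get_global_id(0);
    output[tid] = input[tid] * k;
}
```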
07:41 karolherbst: pmoreau: btw, did you see this nice PR? https://github.com/KhronosGroup/SPIRV-LLVM-Translator/pull/244
07:42 karolherbst: with that we can tell the translator what extensions we support
07:42 karolherbst: and what spir-v version
10:08 RSpliet: karolherbst: "non-uniform indirect const buffer accesses" is quite a mouthful. I've never really looked at this, no. But I guess from a hardware perspective I can come up with reasons why this might be true
10:08 karolherbst: yeah
10:08 karolherbst: it seems like the scheduler has to serialize the access
10:08 karolherbst: but it's kind of interesting
10:08 RSpliet: Presumably the whole point of hardware-backed constbuffers is to reduce latency, not necessarily to increase throughput.
10:09 RSpliet: Can you confirm this on a GPU with a 64-bit DRAM bus?
10:09 karolherbst: if I find time to do that :p
10:09 karolherbst: but yeah, it's kind of interesting
10:09 karolherbst: would allow us to optimize a few cases in compute shaders
10:09 karolherbst: where we have that c0[tid] access
10:10 RSpliet: No worries if not. But I imagine that indexed accesses from g[] might be faster if we can increase parallelism on the DRAM bus. Which "multiple DRAM channels" can definitely provide.
10:10 karolherbst: yeah
10:10 RSpliet: So it'd be interesting if g[] is still faster on a card with a single channel :-)
10:10 karolherbst: mhh, yeah.. no idea
10:11 karolherbst: my GPU has a 128-bit wide bus :/
10:11 karolherbst: or higher
10:11 karolherbst: ohh
10:11 karolherbst: I have a 1030
10:11 karolherbst: this only has 64
10:11 RSpliet: Yeah you only find this on the lowest of the lowest end cards
10:11 karolherbst: RSpliet: sure that 64 is single channel?
10:11 karolherbst: because there are cards with a 96-bit bus
10:11 RSpliet: Hmm...
10:12 karolherbst: or 192
10:12 karolherbst: or 352 :p
10:12 karolherbst: you get the idea
10:12 RSpliet: Yeah
10:12 RSpliet: That's a good point.
10:13 RSpliet: But if 2 is the lowest number of channels you can go (hence the least parallelism), it might give you the best comparison point nonetheless. Don't sweat it if you think the data coming out is too ehh.. "academic"? :-P
10:19 karolherbst: RSpliet: there was some Unity developer looking into that; he had some vec4 loads within a loop with 256 iterations and 256 threads per group, and he showed that ubo accesses are significantly slower than plain ssbo
10:19 karolherbst: so that's where this is coming from
10:19 karolherbst: but it's dx only
10:19 karolherbst: so I can't just run it locally :(
10:21 RSpliet: Okay. That sounds like a bit of a corner case though. If you limit the # of threads per group, you limit the GPU's ability to mask latency. That could matter if the shader is more than just that loop.
10:22 RSpliet: "Could", because I'm guessing they did determine the loop is the bottleneck, and that'll remain so with bigger workgroups
10:22 karolherbst: yeah.. no idea
10:22 RSpliet: But those were vec4 loads in a loop to random (data dependent) locations?
10:22 karolherbst: no
10:22 karolherbst: well
10:23 karolherbst: there are several access modes
10:23 karolherbst: and the worst one was where the thread id was used
10:23 karolherbst: worst for ubo perf
10:23 karolherbst: in a linear manner
10:23 RSpliet: Okay, because that's where DRAM will clearly achieve peak performance. That makes sense :-)
10:23 karolherbst: RSpliet: https://github.com/sebbbi/perftest/blob/master/perftest/loadConstantBody.hlsli
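A rough OpenCL-style approximation of the access pattern being described — a vec4 load per iteration over 256 iterations, indexed off the thread id, with 256 threads per group. This is an assumption about the benchmark's shape, not a translation of the linked HLSL:

```c
#define ITERATIONS 256

__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void load_constant_linear(__constant float4 *cb, __global float4 *out)
{
    int tid = get_local_id(0);
    float4 acc = (float4)(0.0f);

    /* "Linear" mode: each thread reads a different cb element every
       iteration, so none of the loads is uniform across the warp. */
    for (int i = 0; i < ITERATIONS; i++)
        acc += cb[(tid + i) & (ITERATIONS - 1)];

    out[get_global_id(0)] = acc;
}
```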
13:54 pmoreau: karolherbst: Yes, I quickly saw the PR: that will be very useful!
19:25 RSpliet: karolherbst: actually thinking about it, I'm not sure if it really matters that constmem is slower in this case. The shader you linked is a very static example that hopefully will not rock up in actual games
19:25 karolherbst: yeah...
19:26 karolherbst: but I am still wondering if a direct c[tid] access might still be bad enough
19:26 RSpliet: even if the individual buffer access is slower, the fact that reads are not using up DRAM bandwidth means they leave more bandwidth for other threads
19:26 karolherbst: right
19:27 RSpliet: So then it depends on whether the shader is DRAM bound (in which case slower c[] loads could still speed up the program) or compute bound
19:27 sravn: Anyone here who knows if the drmP removal patches have landed in nouveau? They were not in the pull that was sent out earlier to upstream
19:31 pmoreau: sravn: I can see your patches in Ben’s master branch: https://github.com/skeggsb/nouveau/commits/master?after=82889637213c3386cff8faa2137722a0fd7fa376+34
19:32 pmoreau: Looking at his 5.2 branch, it cuts off a bit before your series.
20:11 sravn: pmoreau: thanks. Wanted to know that they were not lost in space somewhere