02:33airlied[d]: gfxstrand[d]: your blackwell rebase put the sm >= 100 texture handle in the wrong place in lower_tex
02:33gfxstrand[d]: oops
02:34gfxstrand[d]: That merge conflict was pretty bad and there were like 5 patches that moved that around so I'm not too surprised.
02:34airlied[d]: now instead of vkcube crashing I get an mmu fault at an impossible address 🙂
02:34gfxstrand[d]: Feel free to throw me a squash
02:35airlied[d]: fixup is on my nvk/blackwell, I'll just try and get vkcube to draw
02:43gfxstrand[d]: ok
02:47airlied[d]: I can haz cube
02:47gfxstrand[d]: \o/
02:49airlied[d]: linear texturing was broken, and vkcube uses it
02:49airlied[d]: (that and I forced Z32 depth)
02:49gfxstrand[d]: airlied[d]: Oops. 😳 I've pulled your fix in and squashed it and pushed my branch
02:51airlied[d]: the tic fix is on my branch, may as well grab and squash it as well
02:55airlied[d]: okay that gets a few Sascha demos going
02:58airlied[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1372045390429294694/gb203.png?ex=68255858&is=682406d8&hm=0d659ec11d1727b65c70080b57e2c72b338c86e99caf41e4095d563edc435311&
03:45airlied[d]: // In non-uniform control-flow, we can't collect uniform vectors so
03:45airlied[d]: // we need to insert copies to warp regs which we can collect.
03:45airlied[d]: I think that is biting ldcu/tex combos in a few places
03:47airlied[d]: killing that fixes that parallaxmapping demo
04:09airlied[d]: I put a hack in my branch which undoes that with r2ur, but we should probably do better
04:20gfxstrand[d]: airlied[d]: Yeah, for now we need to not use uniform tex sources outside of uniform control-flow.
04:20gfxstrand[d]: Or just not use them at all for now
04:21airlied[d]: we can't not use them on blackwell
04:21airlied[d]: they are mandatory for the handle
04:23airlied[d]: since we can't do c[][] in the tex instr anymore
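[A rough sketch of the change being discussed. Mnemonics and operand syntax below are illustrative only, not verified SASS; the point is that pre-Blackwell the tex instruction could reference its descriptor handle straight out of the constant bank, while on Blackwell the handle must first be loaded into uniform registers (a UR pair, per the discussion above) with a uniform load.]

```
// Pre-Blackwell: handle comes straight from the constant bank
tex.b  R0..R3, R4, c[0x2][0x10]     // illustrative syntax, not exact SASS

// Blackwell: handle must first be loaded into uniform registers
ldcu   UR4..UR5, c[0x2][0x10]       // uniform load of the combined handle
tex.b  R0..R3, R4, UR4              // tex now takes the handle from URs
```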
04:26gfxstrand[d]: Right. So we'll have to use bindless in some cases
04:28gfxstrand[d]: We can probably still use them in non-uniform control-flow in a bunch of cases as long as we hoist the load and pin the result. We have a pass for that for bindless cbufs already and we can extend it.
04:28gfxstrand[d]: But for anything where the texture is non-uniform or where we don't have enough regs to pin, we need to fall back to bindless.
04:32gfxstrand[d]: The really annoying part is that this means texture lowering may need to happen way later in the pipeline, after we've determined how much is pinnable. That's gonna be tricky so for now we can just use bindless whenever we're in non-uniform control-flow.
04:40airlied[d]: Okay, I'll maybe look at hoisting the loads; might be good enough that we don't use bindless anywhere we don't currently
05:31gfxstrand[d]: We can't just hoist. We need to also update the pinning pass and move tex lowering around so we can fall back to bindless if we run out of registers.
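[The plan sketched in the messages above (prefer uniform registers; hoist and pin the load when in non-uniform control flow; fall back to bindless when the handle is non-uniform or uniform registers run out) can be summarized as a small decision function. Everything here is a hypothetical sketch for illustration; none of these names are NVK API, and the two-UR cost is taken from the discussion above.]

```python
def pick_tex_handle_path(handle_is_uniform: bool,
                         in_uniform_cf: bool,
                         free_uregs: int,
                         uregs_needed: int = 2) -> str:
    """Decide how a tex instruction gets its descriptor handle.

    Mirrors the plan discussed above: prefer the uniform-register
    path when the handle is uniform; in non-uniform control flow,
    only use it if the load can be hoisted and its result pinned;
    otherwise fall back to bindless. Illustrative only, not NVK code.
    """
    if not handle_is_uniform:
        return "bindless"          # non-uniform handle: URs can't hold it
    if in_uniform_cf:
        return "uniform_regs"      # safe to load and use URs directly
    # Non-uniform control flow: only OK if we can hoist the ldcu out
    # and keep (pin) the URs live across the divergent region.
    if free_uregs >= uregs_needed:
        return "hoisted_pinned_uniform_regs"
    return "bindless"              # out of uniform registers: give up
```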
05:53airlied[d]: starting to sound above my paygrade 🙂
06:06karolherbst[d]: can't you use bound textures/samplers?
06:06karolherbst[d]: though I guess that doesn't really fit vulkans model very well...
06:06airlied[d]: no that's the problem, since bound ones need two UR registers now whereas before they didn't
06:06karolherbst[d]: uhhhh
06:07karolherbst[d]: annoying
06:08airlied[d]: the dodgy hacker in me says we should just reserve two and make the ldcu and tex instructions never move 😛
06:08airlied[d]: and that's why I'm not a compiler engineer
07:55asdqueerfromeu[d]: airlied[d]: As good as Terakan 🔺
08:09gfxstrand[d]: airlied[d]: Because the hardware can be executing multiple basic blocks simultaneously for the same warp so there's no guarantee that doing so won't still cause a stomp. Compilers are fun!
08:40gfxstrand[d]: How much should I trust extension implementations that I type at 4 in the morning? ☕
08:42tiredchiku[d]: gfxstrand[d]: very
08:42tiredchiku[d]: :D
08:43gfxstrand[d]: Very trusting or very skeptical?
08:45tiredchiku[d]: very trusting
08:45tiredchiku[d]: 4am manic codewriting is how arcane code is written
08:56phomes_[d]: if it is VK_EXT_zero_initialize_device_memory then it looks correct. I was poking at the same one 🙂 I guess this needs a new build of proton to test?
09:16karolherbst[d]: gfxstrand[d]: yes
10:20snowycoder[d]: Kepler image storage Draft MR open: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34975
10:20snowycoder[d]: There are still a couple bugs to fix, but it's more stable
15:52mhenning[d]: gfxstrand[d]: I've actually been wondering if we can do this and prevent yields as long as the temporary uregs are live
15:53mhenning[d]: It would simplify things if we could use uregs in some cases in non-uniform control flow
15:53mhenning[d]: another case where this would help is using redux, which always writes to a ugpr
16:08gfxstrand[d]: I don't think it's just about preventing yield. I think on Turing+, they can actually be parallel. But I also don't fully understand the details.
16:13mhenning[d]: Right, I know they can be parallel, but I think that counts as a yield?
16:13mhenning[d]: We'd probably need to understand the hardware better to make anything like that work
16:17gfxstrand[d]: So there's "can preempt each other" and then there's "can run in parallel". I'm not sure which they do. But I wouldn't be surprised if the SMs are now actually 4 SIMT8 SMs which can be locked together if needed or something like that.
16:17gfxstrand[d]: I also don't know how to turn it on and off. (I've heard a rumour there's a bit.)
16:19mhenning[d]: Okay. My understanding is that it's "can preempt each other" but yeah we need to understand the hardware more clearly
16:33mhenning[d]: From the Volta whitepaper: "Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, ..." https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
16:34mhenning[d]: They depict it as interleaving the execution of diverged threads
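[A toy model of the hazard being discussed: if the scheduler interleaves the two diverged sides of a branch, as the whitepaper depicts, a write to a warp-shared uniform register on one path can stomp the value the other path is about to read. This is purely an illustrative sketch, not a model of the actual hardware.]

```python
def run(schedule):
    """Execute a list of (path, op) steps against one shared 'uniform'
    register and record what each diverged path reads back."""
    ureg = None          # the single warp-shared uniform register
    observed = {}        # value each path actually read from ureg
    for path, op in schedule:
        if op == "write":
            ureg = path  # each path writes its own handle value
        else:            # "read"
            observed[path] = ureg
    return observed

# Serialized: each path writes then reads before the other runs -> safe.
safe = run([("A", "write"), ("A", "read"), ("B", "write"), ("B", "read")])

# Interleaved: B's write lands between A's write and A's read -> stomp.
stomped = run([("A", "write"), ("B", "write"), ("A", "read"), ("B", "read")])
```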
16:41gfxstrand[d]: Okay. That's good to know
16:41gfxstrand[d]: So if we can temporarily prevent yield, we should be good.
16:41gfxstrand[d]: I wonder if there's a control reg or something for that
16:42mhenning[d]: setting the yield bit might be enough
16:44mhenning[d]: I'd be curious if we can get the cuda compiler to use a ureg in non-uniform control flow - that might be informative
17:34snowycoder[d]: Is it ok if I bump up the size of `nvk_push_descriptor_set` from 512 bytes to 4KiB?
17:34snowycoder[d]: It's unfortunately kinda needed for Kepler support unless we specialize it into two structs.
23:43gfxstrand[d]: Probably fine, TBH.