00:46 olivia_fl[d]: mohamexiety[d]: I don't see anything in nvk host image copy to deal with (de)interleaving D24_UNORM_S8_UINT. How do you avoid that?
00:46 olivia_fl[d]: (or am I missing where you do it)
02:15 gfxstrand[d]: olivia_fl[d]: We don't support host copies on Z/S
02:17 gfxstrand[d]: skeggsb9778[d]: Why is the Kernel doing anything with tiling for CPU maps? That should always be linear.
02:18 gfxstrand[d]: Or maybe they are? I honestly can't disentangle CPU mapping from VM mapping on the GPU.
02:20 gfxstrand[d]: On Kepler the mangling is so bad we've just shut off VRAM maps entirely. On Maxwell, linear BOs work well enough that we can keep them mapped for things like descriptors. But today's experimentation kinda makes me doubt that.
02:21 gfxstrand[d]: Unless we're getting different GPU kinds on VRAM vs. GART... But I'm pretty sure we're always using uncompressed kinds, even for VRAM.
02:22 gfxstrand[d]: But on Kepler, our maps are screwed up even for totally linear BOs.
02:41 airlied[d]: CPU maps in the VRAM bar can go via detiling vm access
02:42 airlied[d]: probably have to see what happens if you hacked the kernel a bit to stop that
02:44 airlied[d]: probably have to drop the kind args in nouveau_ttm_io_mem_reserve/nvif_object_map_handle
02:44 airlied[d]: but that might also explode in its own way
05:32 skeggsb9778[d]: gfxstrand[d]: Primarily because of what I mentioned above about older GPUs that do additional reordering *beyond* the PTE kind (tied to the *physical* vram address)
05:33 skeggsb9778[d]: Nouveau's TTM/GEM code is *ancient*, mostly due to sheer lack of time to keep up beyond people keeping it "working" as TTM changed. I'd love to fix that even despite nova, as I expect even the GPUs left behind there would benefit despite low clocks
05:51 skeggsb9778[d]: airlied[d]: I *think* from Turing onwards (perhaps a little earlier) - that might be ok
05:52 skeggsb9778[d]: Older uvm versions had code to handle "big page swizzling" (that doesn't appear to be in current versions), which only affected <maxwell iirc
06:38 airlied: anarsoul[m]: https://paste.centos.org/view/a4383038 remove all other patches, let me know if that one is more stable :(
07:25 asdqueerfromeu[d]: airlied[d]: I thought some extension for vkd3d(-proton) got merged 🐸
08:58 mohamexiety[d]: olivia_fl[d]: Yeah depth stencil on pre-Blackwell HW is really really really messed up layout-wise (different swizzle for each format, and each swizzle is some big-brained cursed interleaving of depth and stencil in horrible ways), so I just disabled it (especially since it doesn't have any users anyways). With Blackwell it's a lot easier, so I might get back to it later just for completeness
09:37 djdeath3483[d]: Curious, are the swizzles different for each msaa variant?
10:15 mohamexiety[d]: I didn't check but I don't think so. on NV before blackwell, GOB* swizzling was just dependent on format except for colors where all formats had the same swizzle. with blackwell, color also has a different swizzle depending on block size
10:15 mohamexiety[d]: * the tiling is tiered. an image is split into linear blocks, which are themselves split into linear GOBs, and then finally each GOB is split into sectors and the sector layout is the swizzle.
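A rough C sketch of the tiered walk described above (image into linear blocks, blocks into linear GOBs, GOBs into swizzled sectors), for illustration only: the GOB dimensions are the commonly documented pre-Blackwell 64 bytes x 8 rows, and the sector table stands in for the per-format swizzle, so this is not NVK's actual code.
```c
#include <stdint.h>

/* Illustrative only: 64-byte x 8-row GOBs (512 bytes = 16 sectors of
 * 32 bytes) as commonly documented for pre-Blackwell; block width is
 * assumed to be one GOB and block depth is ignored. */
#define GOB_W_B  64u
#define GOB_H    8u
#define SECTOR_B 32u

static uint64_t
block_linear_offset(uint32_t x_b, uint32_t y,         /* byte/row coords */
                    uint32_t pitch_gobs,              /* image width in GOBs */
                    uint32_t block_h_gobs,            /* block height in GOBs */
                    const uint8_t sector_swizzle[16]) /* per-format table */
{
   /* Tier 1: blocks are laid out linearly across the image. */
   uint32_t gob_x = x_b / GOB_W_B, gob_y = y / GOB_H;
   uint32_t blk_y = gob_y / block_h_gobs;
   uint32_t gob_in_blk = gob_y % block_h_gobs;

   /* Tier 2: GOBs are laid out linearly within each block. */
   uint64_t gob_idx = ((uint64_t)blk_y * pitch_gobs + gob_x) * block_h_gobs +
                      gob_in_blk;

   /* Tier 3: sectors within the GOB are permuted by the swizzle. */
   uint32_t in_gob = (y % GOB_H) * GOB_W_B + (x_b % GOB_W_B);
   return gob_idx * (GOB_W_B * GOB_H) +
          sector_swizzle[in_gob / SECTOR_B] * SECTOR_B + in_gob % SECTOR_B;
}
```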
11:01 djdeath3483[d]: Compare that with intel Xe2 where each format-bpp + msaa gives you a different swizzle...
11:10 mohamexiety[d]: :bleakekw:
12:11 gfxstrand[d]: djdeath3483[d]: Nope. MSAA is just a bigger 2D texture with exactly the same swizzle. Think Sandy Bridge or pre-DG2 depth.
12:13 gfxstrand[d]: skeggsb9778[d]: Turing+ is fine. Maps just work there.
12:13 gfxstrand[d]: No funny business
12:14 gfxstrand[d]: It's Kepler-Volta where things get messed up.
12:15 gfxstrand[d]: Though if there's additional swizzling in VRAM based on physical address, that would make sense. It would also make sparse and VM_BIND kinda bogus, which is concerning.
12:20 mohamexiety[d]: kind of confused though, is this additional swizzling something the HW does by itself or is it kernel managed?
13:52 gfxstrand[d]: If it's based on physical address, it's probably hardware, with the kernel doing stuff to make it look sane to userspace.
13:52 gfxstrand[d]: Intel used to do shenanigans like that
14:02 snowycoder[d]: What are the recommended games to test Vulkan performance (minimizing headaches)?
14:02 snowycoder[d]: I want to test if instruction scheduling really gives Kepler a boost.
15:13 gfxstrand[d]: pixmark piano through Zink is good for showing scheduling boosts
15:14 gfxstrand[d]: https://www.geeks3d.com/gputest/
15:17 karolherbst[d]: the best part of it is that you can reliably detect +-0.1% changes well...
15:17 karolherbst[d]: yeah.. on kepler
15:17 karolherbst[d]: on GSP you'd have the issue that changing GPU clocks can skew the results 🙃
15:18 karolherbst[d]: that reminds me.. we should wire up the perf controls of GSP in debugfs or something
15:18 karolherbst[d]: it's better for benchmarks if you can have stable clocks
15:19 djdeath3483[d]: I thought piano also had unstable rendering
15:20 karolherbst[d]: might be.. but in my testing with nouveau back then, the result only varied by 1 frame between runs
15:21 karolherbst[d]: so at least for perf numbers it's stable enough that even small improvements show up
15:21 karolherbst[d]: did a "use ffma instead of f2f for abs/neg" perf test a long time ago; it gave a stable +0.2% perf improvement
15:25 gfxstrand[d]: djdeath3483[d]: I don't think it's intrinsically unstable. It's just very sensitive to minor NIR changes. But instruction scheduling shouldn't matter.
15:26 djdeath3483[d]: On the image diff it's always the edges that are different
15:27 djdeath3483[d]: I thought that was overdraw: sometimes you pick one triangle or the other on the edge and you get slightly different results (admittedly not visible to the eye and likely inconsequential for benchmarking)
15:57 snowycoder[d]: Oh wow The Talos Principle runs on my GT 710!
15:57 asdqueerfromeu[d]: karolherbst[d]: Why not outside of debugfs (for nouveau overclocking for example)?
15:57 karolherbst[d]: asdqueerfromeu[d]: because we don't want users to rely on the interface being stable
15:57 karolherbst[d]: everything else is UAPI
17:31 mhenning[d]: airlied[d]: Do you have instruction latencies for the volta+ MATCH instruction?
17:38 karolherbst[d]: mhenning[d]: decoupled
17:40 karolherbst[d]: decoupled agu for ampere
17:40 karolherbst[d]: maybe just decouple for ampere.. does it even matter..
17:41 karolherbst[d]: ohh it does..
17:41 karolherbst[d]: decoupled then
17:42 mhenning[d]: uh, decoupled for each of RegLatencySM75, RegLatencySM80, and PredLatencySM80 ?
17:42 karolherbst[d]: oh right it writes a pred...
17:43 gfxstrand[d]: Ooh! Does that instruction do what I think it does?
17:44 gfxstrand[d]: `vote_eq`, effectively?
17:45 karolherbst[d]: yeah
17:45 karolherbst[d]: ohh wait
17:45 karolherbst[d]: well
17:45 karolherbst[d]: it's across the warp
17:46 mhenning[d]: gfxstrand[d]: yes, it's a warp-level vote_eq. currently wiring it up
17:46 karolherbst[d]: you are aware of the differences between .ALL and .ANY?
17:47 mhenning[d]: yes, it's documented in ptx
17:47 karolherbst[d]: ahh
17:47 mhenning[d]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-match-sync
17:47 karolherbst[d]: heh
17:47 karolherbst[d]: ~~I just read that text already~~
17:48 gfxstrand[d]: mhenning[d]: Neat!
17:49 karolherbst[d]: mhenning[d]: sounds like it's decoupled for all
17:49 gfxstrand[d]: Yeah, that'll be a lot better than having to first do a `broadcast_first`
17:49 mhenning[d]: yep, that's the hope
17:49 karolherbst[d]: impressively it has a .U64 flag
17:52 gfxstrand[d]: I mean, that's kind of important. You can't just do it on the two halves and then boolean-combine them.
17:52 gfxstrand[d]: Well, .all might work
17:53 gfxstrand[d]: Okay, this is fun:
17:53 gfxstrand[d]: > `.any`: d is set to mask of non-exited threads in membermask that have same value of operand a.
17:54 gfxstrand[d]: So each thread gets the mask of what other threads match it.
17:54 gfxstrand[d]: IDK what I'd use that for but it seems useful
17:55 mhenning[d]: yeah, not sure where you'd use that
17:56 gfxstrand[d]: I mean, if you had masked versions of things like scan/reduce, I could see it being useful. Compare one thing and then use that to clump scan/reduce on another thing.
17:57 cwabbott: it's used for parallel sorting
17:57 cwabbott: the sorting library we imported into mesa for BVH building has an nvidia path that uses that match thing
17:57 cwabbott: that's currently unused
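As a CPU-side reference model of the `.any` semantics quoted above (purely illustrative, not driver code; each active lane gets the mask of active lanes holding its value):
```c
#include <stdint.h>

/* Reference model of match.any over a 32-lane warp: out_mask[i] is the
 * set of active lanes whose value equals lane i's value.  match.all is
 * then "out_mask[i] == active_mask for every active i", i.e. vote_eq. */
static void
match_any_u32(const uint32_t val[32], uint32_t active_mask,
              uint32_t out_mask[32])
{
   for (unsigned i = 0; i < 32; i++) {
      if (!(active_mask & (1u << i)))
         continue;
      uint32_t m = 0;
      for (unsigned j = 0; j < 32; j++) {
         if ((active_mask & (1u << j)) && val[j] == val[i])
            m |= 1u << j;
      }
      out_mask[i] = m;
   }
}
```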
18:03 mhenning[d]: oh, huh. guess there's an extension for that GL_NV_shader_subgroup_partitioned
18:05 marysaka[d]: btw have we wired up RED?
18:05 mhenning[d]: yes, they're represented in the ir as atomics without destinations
18:09 gfxstrand[d]: mhenning[d]: Oh, right. I couldn't remember if that ever landed. I know they tried to make it EXT/KHR but I guess it's just NV. We can totally implement
18:11 gfxstrand[d]: mhenning[d]: Still not 100% sure if that was the right call or not. They potentially have slightly different semantics but IDK that those differences are observable
18:11 anarsoul: airlied: extra sleep for synchronization? :) I'll try it later today
18:12 mhenning[d]: gfxstrand[d]: eh, seems reasonable to me
18:12 gfxstrand[d]: airlied[d]: I've assigned both of your outstanding MRs (minus one patch) to Marge. I'm working my way through your nvk/blackwell branch now
18:16 gfxstrand[d]: I'm a little confused by `P_IMMD(SET_TEXTURE_HEADER_VERSION, 1)` when the header is called `V2`. :frog_upside_down:
18:18 gfxstrand[d]: It's expressly a hopper thing, though, so maybe Hopper supports v1 and v2?
18:18 gfxstrand[d]: Feels a bit like the Maxwell A bit
18:20 marysaka[d]: gfxstrand[d]: if you look at it in openrm https://github.com/NVIDIA/open-gpu-kernel-modules/blob/fade1f7b2056a637ea3b7d6d4e78cc680cf358dc/src/common/unix/nvidia-3d/src/nvidia-3d-hopper.c#L39
18:20 marysaka[d]: (codepath taken by blackwell as well)
18:21 marysaka[d]: but it seems not like the MAXWELL_A bit, as it stays valid on Blackwell
18:21 gfxstrand[d]: Okay, yeah, the new header is Hopper+. The new tiling is Blackwell+
18:21 marysaka[d]: (where it was removed on Pascal)
19:06 snowycoder[d]: Kepler instruction scheduling on PixMark makes the bench 2.5x faster, now we're slightly faster than our current gallium OpenGL driver! (56 points to 61 points).
19:06 snowycoder[d]: Zink is awesome!
19:06 karolherbst[d]: mhenning[d]: ... soo.. turns out that using pack/unpack for the vector phis prevents quite a bunch of optimizations and that's why those parallel_rdp shaders regress.. They do quite a lot of control flow based on the values and it seems that it prevents nir from tidying up quite a few blocks and ifs...
19:08 mhenning[d]: karolherbst[d]: is it possible to fix that by changing the optimization order? the new pass can probably run pretty late, right?
19:08 karolherbst[d]: CSE and copy_prop seem to be the biggest passes affected
19:08 karolherbst[d]: mhenning[d]: mhhhhhhh
19:08 karolherbst[d]: why haven't I thought of that 🙃
19:08 karolherbst[d]: yeah. atm I put `nir_lower_phis_to_pack` right before `nir_lower_phis_to_scalar`
19:09 karolherbst[d]: I don't want `nir_lower_phis_to_scalar` to optimize it... but I think it will take a few iterations of opts before everything is cleaned up? dunno...
19:10 mhenning[d]: yeah, you want it before lower_phis_to_scalar, but maybe both can be moved later?
19:10 karolherbst[d]: yeah.. will play around with it a bit
19:21 karolherbst[d]: ~~writing an AI/ML thing to tell me the best pass order~~
19:24 karolherbst[d]: oh wow...
19:25 karolherbst[d]: anyway.. yeah.. it doesn't help.. mhh
19:25 karolherbst[d]: I have a silly idea 🙂
19:30 karolherbst[d]: everything can be fixed by running even more passes
19:31 karolherbst[d]: this terrible idea only increases compile times by like 15%, it's great
19:33 mhenning[d]: If you still have it in the opt loop, it might make sense to move it out of the opt loop so everything can run to fixed point at least once before that pass triggers
19:34 gfxstrand[d]: snowycoder[d]: Nice!
19:40 karolherbst[d]: mhenning[d]: ... I duplicated the opt loop, once without phi stuff and once with 🙂
19:40 karolherbst[d]: the results are _good_
19:40 karolherbst[d]: well..
19:40 karolherbst[d]: except that the one shader still regresses 😢
19:41 karolherbst[d]: but it helps a bunch of random stuff, it's interesting
19:41 karolherbst[d]: https://gist.githubusercontent.com/karolherbst/24266f07afaf45b1101620649bd797c6/raw/835b094574495b57e96663b9f282ed4dea990291/gistfile1.txt
19:43 karolherbst[d]: maybe I need to look more at this shader... but I think there is some optimization which really only triggers through a scalar phi and it's a bit odd tbh
19:44 karolherbst[d]: maybe I need to do the phi handling even later
19:44 karolherbst[d]: `nak_optimize_nir` is called a couple of times so...
19:45 karolherbst[d]: maybe I just do it in the last one called inside `nak_postprocess_nir` and see if that helps
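Roughly the ordering being discussed, as a sketch (pass names and signatures approximate; `nir_lower_phis_to_pack` is the WIP pass from this thread, not an upstream NIR API):
```c
#include "nir.h"

/* Sketch: let the opt loop reach a fixed point while the vector phis
 * are still intact, then lower the phis and clean up the pack/unpack
 * chains with a second loop.  Real drivers run a much larger pass list. */
static void
optimize_with_phi_lowering(nir_shader *nir)
{
   bool progress;

   do { /* round 1: opts see the original vector phis */
      progress = false;
      NIR_PASS(progress, nir, nir_copy_prop);
      NIR_PASS(progress, nir, nir_opt_cse);
      NIR_PASS(progress, nir, nir_opt_dce);
   } while (progress);

   NIR_PASS(_, nir, nir_lower_phis_to_pack);        /* WIP pass */
   NIR_PASS(_, nir, nir_lower_phis_to_scalar, true);

   do { /* round 2: clean up what the lowering introduced */
      progress = false;
      NIR_PASS(progress, nir, nir_copy_prop);
      NIR_PASS(progress, nir, nir_opt_cse);
      NIR_PASS(progress, nir, nir_opt_dce);
   } while (progress);
}
```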
19:49 gfxstrand[d]: Okay, now I'm down to the tricky patches.
19:49 gfxstrand[d]: Like the ones where Dave added his own headers.
19:49 mhenning[d]: gfxstrand[d]: openrm has stubs for some stuff
19:50 gfxstrand[d]: Not enough stubs, sadly
19:50 airlied[d]: hopefully when the real ones show up I'm not too wrong 😛
19:50 gfxstrand[d]: lol
20:04 karolherbst[d]: okay, this was a bad idea: `SLM Size: 879816 -> 926864 (+5.35%); split: -0.00%, +5.35%` 😢
20:12 mohamexiety[d]: airlied[d]: double down on it. open a MR against openrm correcting the real ones
20:40 gfxstrand[d]: Okay, I think this fswzadd patch is just bogus
20:40 airlied[d]: the one that makes the test pass? 🙂
20:41 airlied[d]: the one that sets the .dxy bit
20:41 gfxstrand[d]: The bit I can't find?
20:41 gfxstrand[d]: That doesn't disassemble?
20:42 airlied[d]: it's two bits
20:42 gfxstrand[d]: Do you remember what test it made pass? Because that's not recorded anywhere and I'm having trouble finding failing derivative tests
20:43 airlied[d]: *compute*derivative* might match it
20:43 airlied[d]: dang it, add more *
20:45 airlied[d]: oh it might be bogus, maybe it just needs one bit and I copied too much from tmml
20:45 airlied[d]: it might just need bit 77 set
20:48 gfxstrand[d]: Yeah, I think it just wants .ndv
20:48 gfxstrand[d]: Yeah, that makes it pass
20:49 airlied[d]: I was overzealous in thinking nvidia might be consistent 😛
20:50 gfxstrand[d]: So what does `.ndv` do?
20:51 airlied[d]: Force the quad to be treated as non-divergent.
20:51 airlied[d]: that should be quoted, it's from the docs
20:52 airlied[d]: oh if the quad is divergent, you get 0.0 or +inf depending on ShaderControl.DefaultPartial
20:52 gfxstrand[d]: ah
20:52 gfxstrand[d]: I wonder if we should be setting that on texture ops, too. Probably?
20:52 gfxstrand[d]: We're not
20:53 gfxstrand[d]: Well, if it takes an implicit derivative that is
20:54 airlied[d]: so NDV is also there on the older GPUs, but on blackwell it changed into a 2-bit field
20:54 airlied[d]: cargo run --bin nvfuzz SM120 76..78 00047f69 000004ff 080000ff 002f6200
20:54 airlied[d]: Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
20:54 airlied[d]: Running `target/debug/nvfuzz SM120 76..78 00047f69 000004ff 080000ff 002f6200`
20:54 airlied[d]: With 0x0: tmml.lod rz, r4, r0, ur4, 1d, 0x0 &req={1} &wr=0x5 ?trans1
20:54 airlied[d]: With 0x1: tmml.lod.ndv rz, r4, r0, ur4, 1d, 0x0 &req={1} &wr=0x5 ?trans1
20:54 airlied[d]: With 0x2: tmml.lod.invalid2 rz, r4, r0, ur4, 1d, 0x0 &req={1} &wr=0x5 ?trans1
20:54 airlied[d]: With 0x3: tmml.lod.dxy rz, r4, r0, ur4, 1d, 0x0 &req={1} &wr=0x5 ?trans1
20:54 airlied[d]: and for at least tmml 0x3 was needed to pass tests
20:54 airlied[d]: but the docs aren't updated for that
20:55 gfxstrand[d]: Right, I remember seeing dxy
20:56 anarsoul: airlied: it didn't explode on reboot, so extra sleep is helping with PM. I'll run it for a while to see if it fails later
20:56 gfxstrand[d]: I wonder if dxy is quad and ndv is warp, maybe?
20:56 airlied: anarsoul: thanks! just have to calculate a value now, though burning 100ms probably isn't insane
20:57 airlied: 50ms also appears stable, 10ms was not
20:57 anarsoul: airlied: I wonder if there are any clues in the blob
20:58 airlied: nope they don't sleep at all, but they have a lot of code running between those two points
20:58 anarsoul: *sigh* and I suppose no docs or nothing in the docs?
20:58 airlied: whereas it's back to back for nouveau
20:58 airlied: this isn't known by nvidia at all, they are looking into it
20:59 skeggsb9778[d]: i'm going to check the fw side of things later on this morning too
20:59 skeggsb9778[d]: to see if i can find an explanation why the delay might be needed
20:59 anarsoul: thanks for looking into that
20:59 airlied: theory is firmware bug, missing barrier or just use after signalling finished, or it could be we overwrite something
21:00 airlied: skeggsb9778[d]: I was going to at least dump all the msgq indices we use in suspend to see if we could have the overwrite
21:01 skeggsb9778[d]: yeah, at this point i'm mostly suspicious of that option
21:02 gfxstrand[d]: How is anything working if we're setting NDV to false?!?
21:04 airlied[d]: maybe on the older gpus we aren't diverging 🙂
21:05 gfxstrand[d]: I wonder if .ndv is faster?
21:06 gfxstrand[d]: nv50 appears to be unconditionally setting it on a bunch of stuff
21:07 gfxstrand[d]: But as usual it's impossible to actually tell. 😩
21:08 karolherbst[d]: mhhhhh
21:08 karolherbst[d]: so the divergency relates to quads
21:09 gfxstrand[d]: I'm gonna set ndv = true everywhere and see if the CTS blows up
21:09 karolherbst[d]: it will
21:09 karolherbst[d]: but yeah, would be fun
21:10 gfxstrand[d]: Yeah, but this seems like a broader issue than just Blackwell enabling
21:12 karolherbst[d]: anyway.. .NDV is just an override
21:12 gfxstrand[d]: overriding what?
21:12 karolherbst[d]: it makes the operation be treated as non-divergent
21:12 gfxstrand[d]: Yes but what does that mean?
21:13 karolherbst[d]: no idea 🙂
21:13 gfxstrand[d]: Well, that's not very helpful. 😛
21:14 mhenning[d]: does blackwell still have that reconvergence issue? maybe they're related?
21:17 gfxstrand[d]: Yes, blackwell can still diverge
21:26 gfxstrand[d]: Okay, I'll let the CTS keep running but I think .NDV is a deep enough rabbit hole that it's becoming a tomorrow project.
21:27 mohamexiety[d]: huh, how do we reset query pools?
21:27 gfxstrand[d]: It just means memset to 0
21:27 mohamexiety[d]: yeah on other drivers they obtain the address and just zero it, but looking in nvk we send methods in CmdResetQueryPool :thonk:
21:28 gfxstrand[d]: `nvk_ResetQueryPool()`
21:28 mohamexiety[d]: I did not even notice that being a thing :blobcatnotlikethis:
21:28 mohamexiety[d]: sorry
21:30 gfxstrand[d]: no worries
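For context, the host-side reset in `nvk_ResetQueryPool()` amounts to zeroing the queries' slots in the pool's CPU-mapped storage; a minimal sketch with hypothetical field names, not NVK's actual layout:
```c
#include <stdint.h>
#include <string.h>

/* Hypothetical pool layout: a CPU mapping plus a fixed per-query
 * stride covering the availability word and the result data. */
struct sketch_query_pool {
   void *mem_map;
   uint32_t query_stride;
};

static void
sketch_reset_query_pool(struct sketch_query_pool *pool,
                        uint32_t first_query, uint32_t query_count)
{
   /* Resetting a query just means zeroing its slot. */
   memset((char *)pool->mem_map + (size_t)first_query * pool->query_stride,
          0, (size_t)query_count * pool->query_stride);
}
```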
21:31 gfxstrand[d]: I have a feeling `.ndv` is more useful on DX where the rules around texture ops are somewhat more relaxed.
21:32 gfxstrand[d]: In GL/Vulkan, the texop has to be in quad-uniform control-flow if it takes a derivative. In DX, the op itself can be in divergent control-flow as long as the coordinates are present and valid in all lanes.
21:34 gfxstrand[d]: So you can do
21:34 gfxstrand[d]: vec2 coord = blah();
21:34 gfxstrand[d]: if (divergent)
21:34 gfxstrand[d]: x = tex(t, s, coord);
21:34 gfxstrand[d]: and that's still allowed to work.
21:34 gfxstrand[d]: In Vulkan, that's invalid
21:36 karolherbst[d]: are there special rules in regards to dead threads?
21:37 gfxstrand[d]: In DX? Nah... DX doesn't have rules. If it looks like it should work, it should work.
21:37 karolherbst[d]: heh
21:38 karolherbst[d]: all I know is that ndv doesn't mess with the thread mask, so if a thread is dead, it's dead and won't be promoted to active for the tex op
21:38 gfxstrand[d]: Yeah, I think it's just so you don't end up in the default derivative case by accident
21:38 gfxstrand[d]: Which, again, for Vulkan should never happen
21:40 gfxstrand[d]: Though I suspect we may find app bugs where it's useful as a quirk
21:42 karolherbst[d]: I wonder if dxvk lowers it..
21:42 gfxstrand[d]: I'm sure they do something
21:44 karolherbst[d]: I wonder if it matters for performance
21:50 pendingchaos: dxvk doesn't fix non-quad-uniform texture samples
21:52 karolherbst[d]: so.. what's the end result of this then?
21:56 pendingchaos: the spir-v has sample/ddx/ddy in quad-divergent control flow and ac_nir_lower_tex detects and fixes this case if the coordinates/sources are constants or input loads
21:57 airlied[d]: gfxstrand[d]: have you pushed out a rebased branch?
21:57 pendingchaos: it moves the constants or input loads outside control flow and uses aco-specific intrinsics to keep the coordinates valid for inactive invocations
21:58 gfxstrand[d]: airlied[d]: nvk/blackwell in my tree
21:59 gfxstrand[d]: airlied[d]: Did we ever figure out what `ldc.tex_unpack` actually does?
22:00 gfxstrand[d]: Does it just unpack the 12.20 to two u32s?
22:00 gfxstrand[d]: I bet that's what it does
22:01 airlied[d]: yes
22:01 airlied[d]: header index then sampler index
22:01 airlied[d]: you can also unpack 2 of them into 4 regs if you specify .64
22:01 gfxstrand[d]: Because that's useful?!? 😂
22:02 airlied[d]: URd 19:0 URd+1 31:20
22:02 gfxstrand[d]: So does it shift the sampler index down or leave it in the top 12 bits?
22:03 airlied[d]: I assume it shifts it down, but the docs aren't explicit
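Under that assumption, the unpack is just bit math on the packed 12.20 handle (header index in bits 19:0, sampler index in bits 31:20); the shift-down of the sampler index is the unconfirmed part:
```c
#include <stdint.h>

static inline void
unpack_tex_handle_12_20(uint32_t handle,
                        uint32_t *hdr_idx, uint32_t *samp_idx)
{
   *hdr_idx  = handle & 0xfffff; /* bits 19:0: texture header index */
   *samp_idx = handle >> 20;     /* bits 31:20: sampler index, assumed
                                  * shifted down; the docs aren't explicit */
}
```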
22:04 gfxstrand[d]: I should figure out how to unit test it
22:05 gfxstrand[d]: I'm actually a little tempted to model it as a `MemTyp`
22:05 gfxstrand[d]: Because it's pretty similar to how, say, `.i16` is handled
22:05 airlied[d]: as long as it doesn't end up in a non-UR register 🙂
22:06 gfxstrand[d]: Is it only available on `uldc`? Or can you set it on `ldc` too?
22:07 gfxstrand[d]: (Yes, I know you can only use the ureg form of the tex ops with it but I'm thinking of more nefarious potential uses.)
22:09 gfxstrand[d]: Also, the ureg form sounds really useful even without `uldc.tex_unpack` because it lets us avoid some annoying ALU
22:10 gfxstrand[d]: And it may be faster in general
22:10 gfxstrand[d]: Since it's less data sent from the shader to the texture unit
22:14 gfxstrand[d]: It's so annoying that they change the data layout so much based on uniform vs. not, though.
22:15 gfxstrand[d]: Although a quick "Is it in uniform control flow?" should get us 95% of the cases
22:16 mhenning[d]: gfxstrand[d]: possibly. there's a paper that claims a uniform-address ldg is a few cycles cheaper than a non-uniform ldg, which I'm guessing is because of the ureg ldg form
22:16 airlied[d]: no, only on ldcu
22:16 gfxstrand[d]: Is that ureg form or just a uniform address? Because a uniform address is going to be faster. I guarantee there's a HW optimization that only sends one memory transaction when it detects uniform.
22:17 gfxstrand[d]: Also, there's less memory to load in that case
22:17 mhenning[d]: gfxstrand[d]: iirc the paper was using cuda c++ so it might depend on the details of how they were writing it
22:20 gfxstrand[d]: gfxstrand[d]: So either I failed at CTSing or this is fine.
22:21 gfxstrand[d]: I didn't fail at CTSing
22:23 airlied[d]: https://tenor.com/view/this-is-fine-gif-24177057
22:24 gfxstrand[d]: Nah, makes sense. It should do nothing for valid Vulkan shaders since they're always quad-uniform.
22:24 gfxstrand[d]: Now the question is why it matters on Blackwell...
22:26 airlied[d]: I did take a brief look for shader header/qmd bits that might matter, but couldn't spot anything
22:27 airlied[d]: and we still have that failing maximal reconvergence test
22:27 gfxstrand[d]: Yeah, but it's failing even in straight-line shaders with no control flow
22:29 gfxstrand[d]: Oh, I have a theory...
22:30 gfxstrand[d]: Two theories, actually
22:33 mhenning[d]: gfxstrand[d]: Okay, I found it: "We can see that global memory accesses are faster if instructions use uniform registers for computing their addresses rather than regular registers. This difference is due to a faster address calculation. When using uniform registers, all threads in a warp share the same register, and thus, a single memory address needs to be computed. On the other hand, when using regular registers, each thread needs to compute a potentially different memory address." https://arxiv.org/pdf/2503.20481 section 5.4
22:34 mhenning[d]: so it's a ureg thing
22:34 mhenning[d]: they even find the ureg form has shorter WAR hazards (table 2)
22:36 gfxstrand[d]: Right. Yes, the uniform integer pipe is faster
22:36 gfxstrand[d]: It's just annoying because it's so easy to fall off the uniform pipe pre-Blackwell
22:39 mohamexiety[d]: what's NV's usage of uniform regs like btw? I dabbled a bit with microbenchmarking a while back along with others and we couldn't really get NV to emit uniform instructions at all
22:39 mohamexiety[d]: opencl and vulkan at the time
22:39 gfxstrand[d]: There's no uniform float pre-Blackwell so it's pretty limited
22:40 mohamexiety[d]: the weird thing is that even for address generation and such we couldn't get uniform instructions, and that's like the ideal case for it
22:44 airlied[d]: the shader linked in https://gitlab.freedesktop.org/mesa/mesa/-/issues/12817 uses them in a few places
22:46 mohamexiety[d]: hmm
22:47 airlied[d]: a lot of address calcs end up taking in some system values and derailing it, though in my hacks I managed to get most of it to optimise out fine to ur + const offset + per-lane reg
22:47 airlied[d]: I think for proper address calcs keeping all 3 fields separate until the consumer was optimal
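A sketch of that three-way split, with hypothetical types rather than NAK's actual IR; the memory op consumes all three fields directly instead of pre-adding them into per-lane registers:
```c
#include <stdint.h>

/* Hypothetical address representation: a warp-uniform base (UR pair),
 * a compile-time constant offset (folded into the instruction), and a
 * per-lane offset (regular GPR).  Kept separate until the consumer so
 * the backend can emit the ldg [urN + imm + rM] style form. */
struct split_addr {
   uint64_t ubase;    /* warp-uniform base */
   int32_t imm;       /* constant offset */
   uint32_t lane_off; /* per-lane offset */
};
```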
23:05 gfxstrand[d]: Annoyingly, they all have different signedness and bit sizes.
23:42 gfxstrand[d]: Got rid of all `.ndv` and `.dxy` except on `fswzadd` and...
23:42 gfxstrand[d]: dEQP-VK.api.driver_properties.conformance_version,Fail
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.sparse_binding_buffer.multiple.compute_comp_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.sparse_binding_buffer.multiple.graphics_comp_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.sparse_binding_buffer.multiple.graphics_frag_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.sparse_binding_buffer.multiple.graphics_vert_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.sparse_residency_buffer.multiple.compute_comp_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.sparse_residency_buffer.multiple.graphics_comp_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.sparse_residency_buffer.multiple.graphics_frag_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.sparse_residency_buffer.multiple.graphics_vert_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.traditional_buffer.multiple.compute_comp_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.traditional_buffer.multiple.graphics_comp_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.traditional_buffer.multiple.graphics_frag_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.binding_model.descriptor_buffer.traditional_buffer.multiple.graphics_vert_buffers32_sets1,Timeout
23:42 gfxstrand[d]: dEQP-VK.info.device_extensions,Fail
23:42 gfxstrand[d]: dEQP-VK.spirv_assembly.instruction.compute.compute_shader_derivatives.compute.lod_op.sample.linear.16_1_1.mip_1,Fail
23:42 gfxstrand[d]: dEQP-VK.spirv_assembly.instruction.compute.compute_shader_derivatives.compute.lod_op.sample.linear.4_4_1.mip_1,Fail
23:42 gfxstrand[d]: dEQP-VK.spirv_assembly.instruction.compute.compute_shader_derivatives.compute.lod_op.sample.quads.4_4_1.mip_1,Fail
23:42 gfxstrand[d]: 3 fails
23:48 gfxstrand[d]: So I think .dxy overrides the compute shader behavior that forces tex to `.lz` implicitly and says "No, actually do a derivative"
23:52 gfxstrand[d]: pre-Turing, NVIDIA hardware forced `.lz` in non-fragment shaders. It looks like maybe that's controlled per-op now with `.dxy`
23:52 gfxstrand[d]: Starting on Turing, derivatives in CS are okay
23:52 gfxstrand[d]: But I think it still overrides in all the geometry stages
23:52 gfxstrand[d]: I wonder if `tex.dxy` works in vertex shaders now. 🤔
23:59 mhenning[d]: OpMatch is here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35778