00:00gfxstrand[d]: Like I think WaW doesn't matter for FP64 ops because they're all on the same unit, just it might be slow compared to other stuff.
00:01gfxstrand[d]: I'll try to pull that back out and dust it off tomorrow.
00:05mhenning[d]: gfxstrand[d]: are you sure? on eg. x86 it's common for denorms to cost extra cycles, not sure if nv has that for fp64 or not
00:05gfxstrand[d]: I'm not actually sure about the FP64 example.
00:06gfxstrand[d]: I think there are units which are a FIFO with respect to themselves but not with respect to the integer/float unit.
00:06mhenning[d]: also worth noting that some of what you're working on might conflict with https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33306
00:07mhenning[d]: gfxstrand[d]: yeah, that sounds plausible
00:07gfxstrand[d]: And I'm pretty sure some of the workstation cards have "full rate" FP64 where it's a fixed cycle count and I expect that consumer cards have the same hardware, just less of it.
00:07gfxstrand[d]: mhenning[d]: Yeah, I'm well aware. Hopefully it's actually easier / better abstracted once I'm done with it, though.
00:10mhenning[d]: gfxstrand[d]: btw, are you back to work enough now that I can start pestering you for code review again?
00:12gfxstrand[d]: I am. But it's gonna take a bit to clock all the way up so be patient. 💜
00:14mhenning[d]: sure! https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33105 is a pretty easy one, when you get a chance
10:47gfxstrand[d]: karolherbst[d]: What do .EF/.EL/.LU etc. do on tex ops?
10:48karolherbst[d]: cache eviction hints
10:48karolherbst[d]: e == evict
10:48karolherbst[d]: F == first
10:48karolherbst[d]: l == last
10:49karolherbst[d]: u == use
10:49karolherbst[d]: there is a special .NA which means "no caching/streaming"
10:49karolherbst[d]: ehh it's for streaming, so it does _not allocate_
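A small sketch of how the suffix letters above expand, for reference. The letter meanings are taken only from the explanation in this log; the table and the `expand_hint` helper are hypothetical illustrations, not part of NAK, codegen, or any NVIDIA documentation.

```python
# Letter meanings as given in the chat: E = evict, F = first, L = last, U = use.
LETTERS = {"E": "evict", "F": "first", "L": "last", "U": "use"}

# .NA is the special case described in the chat: streaming, does not allocate.
SPECIAL = {"NA": "no allocate"}


def expand_hint(suffix):
    """Expand a suffix such as '.EF' into words (hypothetical helper)."""
    name = suffix.lstrip(".").upper()
    if name in SPECIAL:
        return SPECIAL[name]
    # Raises KeyError for letters that were not explained in the chat.
    return " ".join(LETTERS[c] for c in name)


for s in (".EF", ".EL", ".LU", ".NA"):
    print(s, "=", expand_hint(s))
```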
10:50karolherbst[d]: gfxstrand[d]: if you want to optimize tex, you need to deal with .NDV
10:50karolherbst[d]: fyi
10:50karolherbst[d]: ehh..
10:50karolherbst[d]: and .NODEP
10:50karolherbst[d]: I mean .NODEP, not .NDV
10:50gfxstrand[d]: Yeah, looking at nodep now.
10:51karolherbst[d]: it's for derivatives
10:51gfxstrand[d]: Looks like codegen never sets it. :frog_upside_down:
10:51karolherbst[d]: if there are no further ones to calculate, you can specify .NODEP
10:51karolherbst[d]: well
10:51karolherbst[d]: it can break stuff if you aren't careful
10:51gfxstrand[d]: It has a bit in the tex op which is plumbed through but the bit is never set.
10:51karolherbst[d]: I had a patch _almost_ working, it sped up a lot of things
10:51karolherbst[d]: but got random issues here and there :blobcatnotlikethis:
10:52karolherbst[d]: sooo
10:52gfxstrand[d]: Wait, what do you mean by "no further ones to calculate"?
10:52karolherbst[d]: .NODEP means that killed threads won't participate
10:52gfxstrand[d]: Right
10:52karolherbst[d]: so if future derivatives calculations don't depend on it, you can specify .NODEP
10:53karolherbst[d]: or something like that
10:53gfxstrand[d]: ugh
10:53gfxstrand[d]: right
10:53karolherbst[d]: yeah.....
10:53karolherbst[d]: it's a pita optimization
10:53karolherbst[d]: but apparently it matters
10:53gfxstrand[d]: Actually, IDK that it's that bad in NIR.
10:53gfxstrand[d]: Seems like a generally useful analysis pass
10:54karolherbst[d]: yeah.. I looked at it before I moved nouveau to nir 🙃
10:55karolherbst[d]: .NDV forces the op to be considered non-divergent, whatever that might help with
10:57karolherbst[d]: but yeah.. .NODEP helps a lot, because it makes killed threads not do any loads saving memory bw
10:57gfxstrand[d]: Ugh... Just not doing things in helpers is tricky
10:57gfxstrand[d]: Because they might still `txf` a loop bound
10:59gfxstrand[d]: But we could probably come up with something "good enough" that works in 95% of cases.
10:59karolherbst[d]: mhhhh
11:00karolherbst[d]: yeah... might be enough to only set it for now in cases where it's easy to determine
11:00karolherbst[d]: to be safe to set
11:00karolherbst[d]: obviously the big gain here is inside helper invocations
11:01karolherbst[d]: I think...
11:01karolherbst[d]: maybe it only matters for explicitly killed threads.. mhhh
11:02karolherbst[d]: maybe I left some notes in my commits I've written..... 8? years ago
11:04karolherbst[d]: I guessed the 8 years right without looking 😭, kinda weird how time flies by at some point
11:04karolherbst[d]: ahh yes
11:04karolherbst[d]: "improves performance in gputest furmark by roughly 12.5%" :3
11:04karolherbst[d]: gfxstrand[d]: ^^
11:05karolherbst[d]: no wonder I thought we were closer to nvidia there
11:05karolherbst[d]: https://github.com/karolherbst/mesa/commit/20d79f1b982f4e4f664c377199f700845a12d8ba
11:05karolherbst[d]: mhhhh...
11:06karolherbst[d]: `if (getType() == TYPE_FRAGMENT)` _interesting_
11:06karolherbst[d]: but checks out
11:07karolherbst[d]: but sadly I haven't written down what my stuff broke.. but yeah.. it wasn't really finished at that point
11:27gfxstrand[d]: Is it valid to set nodep on a tex that does an implicit derivative?
11:27gfxstrand[d]: I would assume so as you'd still want to optimize the memory traffic for those
11:29karolherbst[d]: as long as it's not used for anything and doesn't influence the calculations in other threads it should be fine
11:29mohamexiety[d]: given the earlier subchannel switching talk, this may be interesting?
11:29mohamexiety[d]: > When an application submits a sequence of different work types (e.g. Draw, then Dispatch) within a single queue, the hardware may insert an implicit barrier between them. This implicit barrier is called a Subchannel Switch; it involves a pipeline flush and wait-for-idle at the Front End, preventing parallelism across the barrier. To identify where these occurred on the timeline, under the
11:29mohamexiety[d]: “Overlays” menu, enable the “Subchannel Switches” checkbox. This feature is available on NVIDIA Ampere and Ada Architecture GPUs. On NVIDIA Blackwell Architecture GPUs and newer; subchannel switches do not occur between 3D and compute workloads, the overlay is therefore unavailable for those architectures.
11:29mohamexiety[d]: source: https://docs.nvidia.com/nsight-graphics/UserGuide/index.html#subchannel-switch-overlay
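To make the quoted description concrete, here is a minimal sketch that counts where a subchannel switch could occur in one queue's submission order. It assumes the work is already reduced to a list of type names; the function name and the boolean flag are made up for illustration. Since the docs only say the hardware "may" insert the barrier, the result is an upper bound and not a prediction.

```python
def max_subchannel_switches(work, switches_between_3d_and_compute=True):
    """Upper bound on implicit barriers for a sequence of work types.

    work: list of strings, each "draw" (3D) or "dispatch" (compute).
    switches_between_3d_and_compute: per the quoted docs this would be
    True on Ampere/Ada and False on Blackwell and newer.
    """
    if not switches_between_3d_and_compute:
        return 0
    # One possible switch at every boundary where the work type changes.
    return sum(1 for prev, cur in zip(work, work[1:]) if prev != cur)


queue = ["draw", "draw", "dispatch", "draw"]
print(max_subchannel_switches(queue))
print(max_subchannel_switches(queue, switches_between_3d_and_compute=False))
```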
11:30karolherbst[d]: ohhh the blackwell part is interesting but also...
11:30karolherbst[d]: uhhhh
11:30karolherbst[d]: sounds like work
11:30karolherbst[d]: could imply that 3d+compute works quite differently, maybe it's one class now.. 😄
11:31karolherbst[d]: or maybe they fixed the state sharing in such a way, that the state won't become invalid or something
11:32mohamexiety[d]: yeah the wording is vague but I think the implication is that they're one class now given they say "subchannel switches do not occur"
11:32karolherbst[d]: yeah...
11:32karolherbst[d]: maybe 3D is all compute now, lol
11:33karolherbst[d]: I should check some git
11:36karolherbst[d]: mhh the headers in the open source kernel repo are all empty. sad
11:36karolherbst[d]: though not sure the kernel would do compute stuff
11:36karolherbst[d]: so at least it seems like 3d is identical
11:36karolherbst[d]: or well..
11:36karolherbst[d]: not changed much
11:37karolherbst[d]: should ask nvidia to release class headers for blackwell
11:37mohamexiety[d]: yup
12:02gfxstrand[d]: karolherbst[d]: What's NDV?
12:03karolherbst[d]: non divergent
12:03gfxstrand[d]: Ah
12:04gfxstrand[d]: Does that mean the destination is non-divergent or that the handle is non-divergent?
12:04karolherbst[d]: unclear to me
12:04gfxstrand[d]: Is it new on Turing?
12:05karolherbst[d]: it exists on Volta, no idea about older gens
12:05gfxstrand[d]: I've got it in my SM50 code but IDK if that's a lie
12:06karolherbst[d]: looks like it's a maxwell+ thing yeah...
12:06karolherbst[d]: envytools also has it
12:06karolherbst[d]: maybe was called something different on older gens
12:06gfxstrand[d]: Looks like it's mapped to deriveAll
12:07karolherbst[d]: ahh
12:07karolherbst[d]: I think it's more about control flow than anything else
12:08karolherbst[d]: but not sure
12:08gfxstrand[d]: Oh, that could be
12:08karolherbst[d]: but yeah.. I think it matters if the quad itself is divergent and you want to force it not to be
12:28gfxstrand[d]: furmark you say?
12:40gfxstrand[d]: How on earth is `.nodep` causing faults?!?
12:40gfxstrand[d]: It must not mean quite what you think it means
12:41karolherbst[d]: gfxstrand[d]: well... yeah.. 🙂
12:42karolherbst[d]: that's part of the details I haven't figured out
12:42karolherbst[d]: though, if address calculations depend on derivatives, and .nodep messes with that, it can mess up stuff
12:43karolherbst[d]: anyway, I know it only impacts killed threads
12:45karolherbst[d]: maybe notthatclippy[d] can share something there? dunno... but given it's perf critical we might get better docs on it, because the things I have don't really tell me more than what we kinda used to know already
12:47notthatclippy[d]: No, he can't, because my first thought was exactly Faith's quoted message above.
12:48notthatclippy[d]: Maybe ahuillet or skeggsb9778[d] would know? Otherwise, start an email thread and we'll find someone.
12:48karolherbst[d]: I think the fault is just calculations being different
12:48karolherbst[d]: and then ending up with an invalid address or something
12:48karolherbst[d]: who knows what furmark is doing
12:50karolherbst[d]: I just know it's important for good perf 😄
12:52karolherbst[d]: might also make sense to check how nvidia uses .NODEP in furmark shaders
12:55karolherbst[d]: notthatclippy[d]: can you explain what "killed" threads continue doing? maybe that would help...
12:55karolherbst[d]: it just stops outputting a color or more?
12:56karolherbst[d]: well and no global memory writes
12:56karolherbst[d]: and to other memory
12:56karolherbst[d]: gfxstrand[d]: sooo.. I think the thing is, that killed threads continue to execute the shader, however they don't cause any side-effects. So if a killed shader executes tex.nodep, and uses that result in a future memory operation, it could fault
12:57karolherbst[d]: soo.. I think any tex whose result doesn't cause visible side-effects (including faults, and loads through potentially faulting later) can get a .nodep
12:58notthatclippy[d]: karolherbst[d]: Sorry, I really don't think I know more here than you. I can ask around, maybe see if there's anyone willing to come by and explain.
12:58gfxstrand[d]: Hrm... furmark2 doesn't seem affected
12:59karolherbst[d]: there is furmark2?
12:59gfxstrand[d]: Yeah. It has Vulkan
12:59karolherbst[d]: ahh
13:00karolherbst[d]: anyway.. maybe my latest theory helps implementing it correctly and does improve perf in the original furmark
13:00gfxstrand[d]: karolherbst[d]: Ugh... Yeah, that's an issue.
13:01gfxstrand[d]: I mean, 0 should typically be a safe value but I suppose it's possible that it returns utter garbage, not 0.
13:01karolherbst[d]: killed threads continue executing instructions because they have to participate in tex instructions, because of derivatives and stuff
13:01karolherbst[d]: yeah...
13:01karolherbst[d]: might not overwrite the original value
13:03gfxstrand[d]: gfxstrand[d]: Yeah, 0 isn't always a safe value. Not when the shader is indexing arrays with it. 🤦🏻♀️
13:03karolherbst[d]: I only know that it doesn't load
13:03karolherbst[d]: yep
13:04karolherbst[d]: the trivial case it helps with is if you load from a texture to set the output color + some math, but yeah...
13:05karolherbst[d]: also need to check if it's used in subgroup ops and tons of other things
13:06karolherbst[d]: it's called .nodep probably because it means "it's not a dependency for anything"
13:16gfxstrand[d]: karolherbst[d]: I've got subgroup ops covered, I think.
13:16gfxstrand[d]: But the potential memory things are tricky
13:16karolherbst[d]: have you got branch conditions + predicates covered as well? 😄
13:16gfxstrand[d]: yes
13:16karolherbst[d]: nice
13:30gfxstrand[d]: Okay, I think I have most I/O addresses sorted. I'm just not sure how I'm going to do texture handles. If a texop is used to compute another texture's bindless handle, that's a side-effect.
13:30gfxstrand[d]: 😩
13:31karolherbst[d]: yeah...
13:31gfxstrand[d]: It's easier if the pass runs before nak_nir_lower_tex but it's also easier to set nodep if it runs after
13:31karolherbst[d]: might do it in reverse and figure out if it's safe rather than if it's unsafe to use 😄
13:32gfxstrand[d]: :shrug_anim:
13:33gfxstrand[d]: Annoyingly, the handle/index source moves all over everywhere in the texture instructions.
13:33gfxstrand[d]: I think I need to run it pre-lower somehow
13:35karolherbst[d]: I think simply because they are used in a tex op is already reason enough to consider the source to have a dep
13:35gfxstrand[d]: Yeah, maybe scorched earth is the right choice here
13:36karolherbst[d]: could make an exception for "is this a lod and the tex is not used for anything" because that's _kinda_ safe
13:36karolherbst[d]: I think...
13:36karolherbst[d]: not sure if arbitrary lods are legal
13:36karolherbst[d]: at least that won't load from arbitrary memory
13:37karolherbst[d]: the big gains you get are with "outputColor = load + some math" code here anyway
13:38marysaka[d]: karolherbst[d]: hmmm I think on Maxwell that was legal but my memory around that is quite rusty...
13:38karolherbst[d]: depth compare might also be legal
13:38karolherbst[d]: but those are so super special cases...
13:39marysaka[d]: one of the things I remember well is how painful the encoding for texture instructions was on Maxwell 😄
13:39karolherbst[d]: marysaka[d]: yeah... it's probably easier to allow arbitrary ones, because the hardware will have to do a max(scaled, 1) anyway
13:40karolherbst[d]: marysaka[d]: not only on maxwell
13:40karolherbst[d]: 😄
13:40karolherbst[d]: but yeah...
13:40gfxstrand[d]: karolherbst[d]: They're clamped
13:40karolherbst[d]: the scalar variants were all sorts of fun
13:40gfxstrand[d]: Generally, any parameters the user can pass to a tex op are sanitized
13:41gfxstrand[d]: But for the purposes of compiler analysis, it's hard to assume those things
13:41karolherbst[d]: if the dependent tex is .nodep, then the tex feeding into it can probably also be .nodep, and that's probably good enough
13:42karolherbst[d]: you want to do this analysis backwards anyway
13:42karolherbst[d]: probably
13:42gfxstrand[d]: Yes
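The backward analysis being discussed can be sketched roughly as follows. This is not the pass from the nak/tex-nodep branch; it is a toy model under loud assumptions: straight-line SSA code with no loops, a made-up `Instr` type, and a made-up split of sources into `data_srcs` (only matter if the result matters) and `hazard_srcs` (can fault or affect other threads: load/store addresses, branch conditions, subgroup operands, and, following the "scorched earth" idea, every source of a tex op).

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str                 # SSA name of the result, "" if none
    op: str                   # "tex", "alu", "load", "out_color", ...
    data_srcs: tuple = ()     # needed only if dest itself is needed
    hazard_srcs: tuple = ()   # always needed, even in killed threads


def find_nodep_tex(instrs):
    """Return dests of tex ops whose result never reaches a hazard source.

    Walks the (loop-free, SSA) instruction list backwards, so every use
    is visited before its definition.
    """
    needed = set()
    for ins in reversed(instrs):
        needed.update(ins.hazard_srcs)
        if ins.dest in needed:
            needed.update(ins.data_srcs)
    return [i.dest for i in instrs if i.op == "tex" and i.dest not in needed]


shader = [
    # t0 only feeds the output color: the easy "load + some math" case.
    Instr("t0", "tex", hazard_srcs=("uv",)),
    Instr("c", "alu", data_srcs=("t0", "k")),
    Instr("", "out_color", data_srcs=("c",)),
    # t1 is used to index memory, so a garbage value could fault.
    Instr("t1", "tex", hazard_srcs=("uv",)),
    Instr("idx", "alu", data_srcs=("t1",)),
    Instr("v", "load", hazard_srcs=("idx",)),
    Instr("", "out_color", data_srcs=("v",)),
]
print(find_nodep_tex(shader))
```

In this model a tex that feeds another tex's handle or coordinates is never marked, which is the conservative choice; the refinement suggested above (letting it through when the dependent tex is itself .nodep) is deliberately left out.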
13:43karolherbst[d]: are you doing indirect tex indexing?
13:43karolherbst[d]: not sure if that's safe
13:43karolherbst[d]: though if it doesn't load, why should it do any bound checks.. (or even fetching the header)
13:44gfxstrand[d]: nak/tex-nodep in my gitlab if you want to see what I'm doing
13:44gfxstrand[d]: I'm CTSing now. There are some fails but no faults so far
13:44karolherbst[d]: how is the furmark benching?
13:45zmike[d]: furry
13:45karolherbst[d]: gfxstrand[d]: ohh yeah.. soo. if you don't set .EF it's doing unpredictable eviction
13:45karolherbst[d]: just fyi
13:46karolherbst[d]: sounds like it's optimizing for traffic
13:46karolherbst[d]: like uhm.. the default value means that
14:00gfxstrand[d]: karolherbst[d]: So we should be using .EF? Codegen is all over the board here.
14:01karolherbst[d]: I think when codegen was written nobody had any clue what any of it means
14:01gfxstrand[d]: Well yeah
14:01gfxstrand[d]: At least I've got the bits labeled now. 🤷🏻♀️
14:01karolherbst[d]: I _think_ the default is best unless you know better
14:01gfxstrand[d]: I should look at what the blob does. Unfortunately, my blob box died.
14:02karolherbst[d]: like .NA is useful if you know the same cache line won't be accessed again
14:04karolherbst[d]: if you access the same cacheline a few instructions later, but then never again, prolly want to use .EL and then .EF
14:04gfxstrand[d]: That might be something we can reasonably do for SSBOs or something but not textures.
14:05karolherbst[d]: not sure it's based on cache lines tho
14:05karolherbst[d]: but global loads and surface ops have the same flags
14:05gfxstrand[d]: Sure
14:06gfxstrand[d]: I should probably add an enum to NAK and plumb it through but idk how much we want to mess with it.
14:06karolherbst[d]: anyway, the default value (1) is probably the safest bet
14:07gfxstrand[d]: Yup
14:07karolherbst[d]: but yeah.. I can see a benefit in using .EL if you operate on the same texture a bit later
14:08karolherbst[d]: or memory
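The rule of thumb from this exchange could be written down like this. It is only a sketch of the heuristic as stated in the chat (default unless you know better, .NA for data that won't be touched again, .EL then .EF for data reused once shortly after); extending it to more than two accesses is an assumption of this example, and `hint_sequence` is a hypothetical name, not something in NAK.

```python
DEFAULT = "default"  # placeholder for "leave the hardware default alone"


def hint_sequence(accesses):
    """Return one eviction hint per access to the same data.

    accesses: how many times the same data is known to be accessed,
    or None if the compiler cannot tell.
    """
    if accesses is None:
        return [DEFAULT]          # unknown reuse: the safest bet per the chat
    if accesses < 1:
        raise ValueError("accesses must be at least 1")
    if accesses == 1:
        return ["NA"]             # never touched again: don't allocate
    # Keep the data around until the final access, then let it go first.
    # (Assumption: generalizes the two-access ".EL and then .EF" case.)
    return ["EL"] * (accesses - 1) + ["EF"]


print(hint_sequence(None))
print(hint_sequence(1))
print(hint_sequence(2))
```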
14:58gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33402
14:58gfxstrand[d]: Seems pretty close to correct to me and I'm seeing .nodep in shaders
14:58gfxstrand[d]: IDK if it helps anything
14:58karolherbst[d]: does it help the old furmark?
14:59karolherbst[d]: but yeah.. some benching should help to figure out if it's helpful
15:00gfxstrand[d]: It's also possible that it just doesn't matter as much on Ada and mattered a lot on Maxwell or whatever you were benchmarking before.
15:00karolherbst[d]: might be
15:01karolherbst[d]: or the gl driver just wasting a lot of memory bw for no reason
15:01gfxstrand[d]: It could also be furmark 2 being different
15:02karolherbst[d]: yeah... you probably should test with the first one if you have that available
15:02gfxstrand[d]: They don't have Linux builds of furmark 1 on the website anymore
15:02karolherbst[d]: huh?
15:02karolherbst[d]: weird
15:02karolherbst[d]: http://www.phoronix-test-suite.com/benchmark-files/GpuTest_Linux_x64_0.7.0.zip
15:02karolherbst[d]: 😄
15:03karolherbst[d]: lol
15:03karolherbst[d]: and here I thought I could upload it
15:09gfxstrand[d]: No difference on Furmark 1 running under Zink, either.
15:10gfxstrand[d]: :frog_upside_down:
15:10gfxstrand[d]: Perf work is hard. 😢
15:11mohamexiety[d]: aside from the GSP HW utilization thing, was there talk about what else would be needed for nvk to be able to leverage NSight?
15:11mohamexiety[d]: feels like that would be massive help here at least
15:11gfxstrand[d]: I've been pondering. I think it would require me to have a chat with the NSight team.
15:12gfxstrand[d]: Which I'm happy to have
15:12gfxstrand[d]: I don't even know how NSight works or gathers metrics or anything.
15:12mohamexiety[d]: yeah I think that may be worthwhile for stuff like this
15:12gfxstrand[d]: AMD has this neat thing with hardware trace/replay that RADV is pretty easily able to wire up. IDK how any of that works on NV.
15:13gfxstrand[d]: mohamexiety[d]: Oh, absolutely! It's just a matter of whether or not it's practical.
15:14mohamexiety[d]: yeah I don't know how the metrics work/are gathered but the few times I used nsight on windows NV it was really nice; you get info on pretty much everything
15:22gfxstrand[d]: gfxstrand[d]: At least it passes CTS. :shrug_anim:
15:23gfxstrand[d]: And I'm enough convinced of the correctness of my pass that if we can find anything it helps, I'm happy to merge.
15:28gfxstrand[d]: mhenning[d]: Since I'm doing perf things that don't do anything... Might as well. 😂
15:56karolherbst[d]: gfxstrand[d]: should check on the gl driver with my broken pass again...
16:30gfxstrand[d]: With or without, NVK+Zink is like 5-10% faster than nouveau GL
16:34asdqueerfromeu[d]: gfxstrand[d]: And MMU fault-free I guess
16:44gfxstrand[d]: Furmark is, anyway.
16:52zmike[d]: 💪 :vulkan: :muscleright:
16:54tiredchiku[d]: :speedL: :vulkan: :speedR:
17:00gfxstrand[d]: Don't get too excited. Furmark isn't the world's best benchmark.
17:01gfxstrand[d]: But I think landing cbuf textures closed most of the gap with nouveau GL.
17:01gfxstrand[d]: That was the one big thing I was aware of that we couldn't do.
17:01gfxstrand[d]: The gap with the blob is going to be a lot harder.
17:13zmike[d]: gfxstrand[d]: sounds like excuses to me
17:14gfxstrand[d]: Hey, if you wanna write the "NVK performance is now complete" blog post, I'm not gonna stop you. 😉
17:15zmike[d]: ???????????
17:15zmike[d]: you were literally there
17:16zmike[d]: https://youtu.be/Z6XLwkyo6Nw?t=1869
17:24mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1336749332841627729/image.png?ex=67a4f054&is=67a39ed4&hm=4ae3c89b71464ff384589112f282ab444f3ea8664c9784f0f8b6a2b9e67f5619&
17:24mohamexiety[d]: this is amazing
17:25karolherbst[d]: good times
17:26karolherbst[d]: got rusticl on zink running at the same time or something
17:26zmike[d]: yeah that was an intense conference
17:27karolherbst[d]: carrying all those laptops around was good training tho
17:27zmike[d]: hahah
17:31ermine1716[d]: > Hit the gym!