IRC Logs of #nouveau on irc.freenode.net for 2025-02-05

00:00 gfxstrand[d]: Like I think WaW doesn't matter for FP64 ops because they're all on the same unit, just it might be slow compared to other stuff.
00:01 gfxstrand[d]: I'll try to pull that back out and dust it off tomorrow.
00:05 mhenning[d]: gfxstrand[d]: are you sure? on eg. x86 it's common for denorms to cost extra cycles, not sure if nv has that for fp64 or not
00:06 gfxstrand[d]: I'm not actually sure about the FP64 example.
00:06 gfxstrand[d]: I think there are units which are a FIFO with respect to themselves but not with respect to the integer/float unit.
00:06 mhenning[d]: also worth noting that some of what you're working on might conflict with https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33306
00:07 mhenning[d]: gfxstrand[d]: yeah, that sounds plausible
00:07 gfxstrand[d]: And I'm pretty sure some of the workstation cards have "full rate" FP64 where it's a fixed cycle count and I expect that consumer cards have the same hardware, just less of it.
00:07 gfxstrand[d]: mhenning[d]: Yeah, I'm well aware. Hopefully it's actually easier / better abstracted once I'm done with it, though.
00:10 mhenning[d]: gfxstrand[d]: btw, are you back to work enough now that I can start pestering you for code review again?
00:12 gfxstrand[d]: I am. But it's gonna take a bit to clock all the way up so be patient. 💜
00:14 mhenning[d]: sure! https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33105 is a pretty easy one, when you get a chance
10:47 gfxstrand[d]: karolherbst[d]: What do .EF/.EL/.LU etc. do on tex ops?
10:48 karolherbst[d]: cache eviction hints
10:48 karolherbst[d]: e == evict
10:48 karolherbst[d]: F == first
10:48 karolherbst[d]: l == last
10:49 karolherbst[d]: u == use
10:49 karolherbst[d]: there is a special .NA which means "no caching/streaming"
10:49 karolherbst[d]: ehh it's for streaming, so it does _not allocate_
10:50 karolherbst[d]: gfxstrand[d]: if you want to optimize tex, you need to deal with .NDV
10:50 karolherbst[d]: fyi
10:50 karolherbst[d]: ehh..
10:50 karolherbst[d]: and .NODEP
10:50 karolherbst[d]: I mean .NODEP, not .NDV
10:50 gfxstrand[d]: Yeah, looking at nodep now.
10:51 karolherbst[d]: it's for derivatives
10:51 gfxstrand[d]: Looks like codegen never sets it. :frog_upside_down:
10:51 karolherbst[d]: if there are no further ones to calculate, you can specify .NODEP
10:51 karolherbst[d]: well
10:51 karolherbst[d]: it can break stuff if you aren't careful
10:51 gfxstrand[d]: It has a bit in the tex op which is plumbed through but the bit is never set.
10:51 karolherbst[d]: I had a patch _almost_ working, it sped up a lot of things
10:51 karolherbst[d]: but got random issues here and there :blobcatnotlikethis:
10:52 karolherbst[d]: sooo
10:52 gfxstrand[d]: Wait, what do you mean by "no further ones to calculate"?
10:52 karolherbst[d]: .NODEP means that killed threads won't participate
10:52 gfxstrand[d]: Right
10:52 karolherbst[d]: so if future derivatives calculations don't depend on it, you can specify .NODEP
10:53 karolherbst[d]: or something like that
10:53 gfxstrand[d]: ugh
10:53 gfxstrand[d]: right
10:53 karolherbst[d]: yeah.....
10:53 karolherbst[d]: it's a pita optimization
10:53 karolherbst[d]: but apparently it matters
10:53 gfxstrand[d]: Actually, IDK that it's that bad in NIR.
10:53 gfxstrand[d]: Seems like a generally useful analysis pass
10:54 karolherbst[d]: yeah.. I looked at it before I moved nouveau to nir 🙃
10:55 karolherbst[d]: .NDV forces the op to be considered non-divergent, whatever that might help with
10:57 karolherbst[d]: but yeah.. .NODEP helps a lot, because it makes killed threads not do any loads saving memory bw
10:57 gfxstrand[d]: Ugh... Just not doing things in helpers is tricky
10:57 gfxstrand[d]: Because they might still `txf` a loop bound
10:59 gfxstrand[d]: But we could probably come up with something "good enough" that works in 95% of cases.
10:59 karolherbst[d]: mhhhh
11:00 karolherbst[d]: yeah... might be enough to only set it for now in cases where it's easy to determine
11:00 karolherbst[d]: to be safe to set
11:00 karolherbst[d]: obviously the big gain here is inside helper invocations
11:01 karolherbst[d]: I think...
11:01 karolherbst[d]: maybe it only matters for explicitly killed threads.. mhhh
11:02 karolherbst[d]: maybe I left some notes in my commits I've written..... 8? years ago
11:04 karolherbst[d]: I guessed the 8 years right without looking 😭, kinda weird how time flies by at some point
11:04 karolherbst[d]: ahh yes
11:04 karolherbst[d]: "improves performance in gputest furmark by roughly 12.5%" :3
11:04 karolherbst[d]: gfxstrand[d]: ^^
11:05 karolherbst[d]: no wonder I thought we were closer to nvidia there
11:05 karolherbst[d]: https://github.com/karolherbst/mesa/commit/20d79f1b982f4e4f664c377199f700845a12d8ba
11:05 karolherbst[d]: mhhhh...
11:06 karolherbst[d]: `if (getType() == TYPE_FRAGMENT)` _interesting_
11:06 karolherbst[d]: but checks out
11:07 karolherbst[d]: but sadly I haven't written down what my stuff broke.. but yeah.. it wasn't really finished at that point
11:27 gfxstrand[d]: Is it valid to set nodep on a tex that does an implicit derivative?
11:27 gfxstrand[d]: I would assume so as you'd still want to optimize the memory traffic for those
11:29 karolherbst[d]: as long as it's not used for anything and doesn't influence the calculations in other threads it should be fine
11:29 mohamexiety[d]: given the earlier subchannel switching talk, this may be interesting?
11:29 mohamexiety[d]: > When an application submits a sequence of different work types (e.g. Draw, then Dispatch) within a single queue, the hardware may insert an implicit barrier between them. This implicit barrier is called a Subchannel Switch; it involves a pipeline flush and wait-for-idle at the Front End, preventing parallelism across the barrier. To identify where these occurred on the timeline, under the
11:29 mohamexiety[d]: “Overlays” menu, enable the “Subchannel Switches” checkbox. This feature is available on NVIDIA Ampere and Ada Architecture GPUs. On NVIDIA Blackwell Architecture GPUs and newer; subchannel switches do not occur between 3D and compute workloads, the overlay is therefore unavailable for those architectures.
11:29 mohamexiety[d]: source: https://docs.nvidia.com/nsight-graphics/UserGuide/index.html#subchannel-switch-overlay
11:30 karolherbst[d]: ohhh the blackwell part is interesting but also...
11:30 karolherbst[d]: uhhhh
11:30 karolherbst[d]: sounds like work
11:30 karolherbst[d]: could imply that 3d+compute works quite differently, maybe it's one class now.. 😄
11:31 karolherbst[d]: or maybe they fixed the state sharing in such a way, that the state won't become invalid or something
11:32 mohamexiety[d]: yeah the wording is vague but I think the implication is that they're one class now given they say "subchannel switches do not occur"
11:32 karolherbst[d]: yeah...
11:32 karolherbst[d]: maybe 3D is all compute now, lol
11:33 karolherbst[d]: I should check some git
11:36 karolherbst[d]: mhh the headers in the open source kernel repo are all empty. sad
11:36 karolherbst[d]: though not sure the kernel would do compute stuff
11:36 karolherbst[d]: so at least it seems like 3d is identical
11:36 karolherbst[d]: or well..
11:36 karolherbst[d]: not changed much
11:37 karolherbst[d]: should ask nvidia to release class headers for blackwell
11:37 mohamexiety[d]: yup
12:02 gfxstrand[d]: karolherbst[d]: What's NDV?
12:03 karolherbst[d]: non divergent
12:03 gfxstrand[d]: Ah
12:04 gfxstrand[d]: Does that mean the destination is non-divergent or that the handle is non-divergent?
12:04 karolherbst[d]: unclear to me
12:04 gfxstrand[d]: Is it new on Turing?
12:05 karolherbst[d]: it exists on Volta, no idea about older gens
12:05 gfxstrand[d]: I've got it in my SM50 code but IDK if that's a lie
12:06 karolherbst[d]: looks like it's a maxwell+ thing yeah...
12:06 karolherbst[d]: envytools also has it
12:06 karolherbst[d]: maybe was called something different on older gens
12:06 gfxstrand[d]: Looks like it's mapped to deriveAll
12:07 karolherbst[d]: ahh
12:07 karolherbst[d]: I think it's more about control flow than anything else
12:08 karolherbst[d]: but not sure
12:08 gfxstrand[d]: Oh, that could be
12:08 karolherbst[d]: but yeah.. I think it matters if the quad itself is divergent and you want to force it not to be
12:28 gfxstrand[d]: furmark you say?
12:40 gfxstrand[d]: How on earth is `.nodep` causing faults?!?
12:40 gfxstrand[d]: It must not mean quite what you think it means
12:41 karolherbst[d]: gfxstrand[d]: well... yeah.. 🙂
12:42 karolherbst[d]: that's part of the details I haven't figured out
12:42 karolherbst[d]: though, if address calculations depend on derivatives, and .nodep messing with that, it can mess up stuff
12:43 karolherbst[d]: anyway, I know it only impacts killed threads
12:45 karolherbst[d]: maybe notthatclippy[d] can share something there? dunno... but given it's perf critical we might get better docs on it, because the things I have don't really tell me more than what we kinda used to know already
12:47 notthatclippy[d]: No, he can't, because my first thought was exactly Faith's quoted message above.
12:48 notthatclippy[d]: Maybe ahuillet or skeggsb9778[d] would know? Otherwise, start an email thread and we'll find someone.
12:48 karolherbst[d]: I think the fault is just calculations being different
12:48 karolherbst[d]: and then ending up with an invalid address or something
12:48 karolherbst[d]: who knows what furmark is doing
12:50 karolherbst[d]: I just know it's important for good perf 😄
12:52 karolherbst[d]: might also make sense to check how nvidia uses .NODEP in furmark shaders
12:55 karolherbst[d]: notthatclippy[d]: can you explain what "killed" threads continue doing? maybe that would help...
12:55 karolherbst[d]: it just stops outputting a color or more?
12:56 karolherbst[d]: well and no global memory writes
12:56 karolherbst[d]: and to other memory
12:56 karolherbst[d]: gfxstrand[d]: sooo.. I think the thing is, that killed threads continue to execute the shader, however they don't cause any side-effects. So if a killed thread executes tex.nodep, and uses that result in a future memory operation, it could fault
12:57 karolherbst[d]: soo.. I think any tex whose result doesn't cause visible side-effects (including faults, and loads through potentially faulting later) can get a .nodep
12:58 notthatclippy[d]: karolherbst[d]: Sorry, I really don't think I know more here than you. I can ask around, maybe see if there's anyone willing to come by and explain.
12:58 gfxstrand[d]: Hrm... furmark2 doesn't seem affected
12:59 karolherbst[d]: there is furmark2?
12:59 gfxstrand[d]: Yeah. It has Vulkan
12:59 karolherbst[d]: ahh
13:00 karolherbst[d]: anyway.. maybe my latest theory helps implementing it correctly and does improve perf in the original furmark
13:00 gfxstrand[d]: karolherbst[d]: Ugh... Yeah, that's an issue.
13:01 gfxstrand[d]: I mean, 0 should typically be a safe value but I suppose it's possible that it returns utter garbage, not 0.
13:01 karolherbst[d]: killed threads continue executing instructions because they have to participate in tex instructions because of derivatives and stuff
13:01 karolherbst[d]: yeah...
13:01 karolherbst[d]: might not overwrite the original value
13:03 gfxstrand[d]: gfxstrand[d]: Yeah, 0 isn't always a safe value. Not when the shader is indexing arrays with it. 🤦🏻‍♀️
13:03 karolherbst[d]: I only know that it doesn't load
13:03 karolherbst[d]: yep
13:04 karolherbst[d]: the trivial case it helps with is if you load from a texture to set the output color + some math, but yeah...
13:05 karolherbst[d]: also need to check if it's used in subgroup ops and tons of other things
13:06 karolherbst[d]: it's called .nodep probably because it means "it's not a dependency for anything"
13:16 gfxstrand[d]: karolherbst[d]: I've got subgroup ops covered, I think.
13:16 gfxstrand[d]: But potential memory things are tricky
13:16 karolherbst[d]: have you got branch conditions + predicates covered as well? 😄
13:16 gfxstrand[d]: yes
13:16 karolherbst[d]: nice
13:30 gfxstrand[d]: Okay, I think I have most I/O addresses sorted. I'm just not sure how I'm going to do texture handles. If a texop is used to compute another texture's bindless handle, that's a side-effect.
13:30 gfxstrand[d]: 😩
13:31 karolherbst[d]: yeah...
13:31 gfxstrand[d]: It's easier if the pass runs before nak_nir_lower_tex but it's also easier to set nodep if it runs after
13:31 karolherbst[d]: might do it in reverse and figure out if it's safe rather than if it's unsafe to use 😄
13:32 gfxstrand[d]: :shrug_anim:
13:33 gfxstrand[d]: Annoyingly, the handle/index source moves all over everywhere in the texture instructions.
13:33 gfxstrand[d]: I think I need to run it pre-lower somehow
13:35 karolherbst[d]: I think simply because they are used in a tex op is already reason enough to consider the source to have a dep
13:35 gfxstrand[d]: Yeah, maybe scorched earth is the right choice here
13:36 karolherbst[d]: could make an exception for "is this a lod and the tex is not used for anything" because that's _kinda_ safe
13:36 karolherbst[d]: I think...
13:36 karolherbst[d]: not sure if arbitrary lods are legal
13:36 karolherbst[d]: at least that won't load from arbitrary memory
13:37 karolherbst[d]: the big gains you get with "outputColor = load + some math" code here anyway
13:38 marysaka[d]: karolherbst[d]: hmmm I think on Maxwell that was legal but my memory around that is quite rusty...
13:38 karolherbst[d]: depth compare might also be legal
13:38 karolherbst[d]: but those are so super special cases...
13:39 marysaka[d]: one of the things I remember well is how painful the encoding for texture instructions was on Maxwell 😄
13:39 karolherbst[d]: marysaka[d]: yeah... it's probably easier to allow arbitrary ones, because the hardware will have to do a max(scaled, 1) anyway
13:40 karolherbst[d]: marysaka[d]: not only on maxwell
13:40 karolherbst[d]: 😄
13:40 karolherbst[d]: but yeah...
13:40 gfxstrand[d]: karolherbst[d]: They're clamped
13:40 karolherbst[d]: the scalar variants were all sorts of fun
13:40 gfxstrand[d]: Generally, any parameters the user can pass to a tex op are sanitized
13:41 gfxstrand[d]: But for the purposes of compiler analysis, it's hard to assume those things
13:41 karolherbst[d]: if the dependent tex is .nodep, then the tex feeding into it can probably also be .nodep, and that's probably good enough
13:42 karolherbst[d]: you want to do this analysis backwards anyway
13:42 karolherbst[d]: probably
13:42 gfxstrand[d]: Yes
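
The backwards analysis discussed above can be sketched roughly as follows. This is a minimal illustration in Python over a made-up straight-line SSA list, not the actual NAK/NIR pass: the `Instr` type, the op names, and `find_nodep_candidates` are all hypothetical. It takes the conservative "scorched earth" approach from the conversation, where anything feeding a memory address, subgroup op, branch condition, or another tex op is treated as having a dependency.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical op kinds whose sources must hold real values even in killed
# threads: addresses, subgroup inputs, branch conditions, and (conservatively)
# every source of another tex op.
UNSAFE_USE_OPS = {"store", "load", "subgroup", "branch", "tex"}


@dataclass
class Instr:
    dest: Optional[str]          # SSA name defined by this instruction, if any
    op: str                      # hypothetical op kind, e.g. "tex" or "alu"
    srcs: List[str] = field(default_factory=list)


def find_nodep_candidates(instrs: List[Instr]) -> List[str]:
    """Return dests of tex ops whose results never reach an unsafe use.

    Assumes straight-line SSA, so every def precedes its uses and a single
    reverse walk is enough to propagate the "unsafe" property.
    """
    unsafe = set()
    for instr in reversed(instrs):
        # Sources are unsafe if the op itself is an unsafe use, or if the
        # value it produces was already found to reach one.
        if instr.op in UNSAFE_USE_OPS or instr.dest in unsafe:
            unsafe.update(instr.srcs)
    return [i.dest for i in instrs if i.op == "tex" and i.dest not in unsafe]


shader = [
    Instr("t0", "tex", ["coord"]),
    Instr("c", "alu", ["t0", "k"]),
    Instr(None, "store_output", ["c"]),    # color output only
    Instr("t1", "tex", ["coord"]),
    Instr("addr", "alu", ["t1", "base"]),
    Instr(None, "store", ["addr", "val"]), # t1 feeds a memory address
]
print(find_nodep_candidates(shader))
```

Here `t0` only feeds the output color, so it is a candidate, while `t1` feeds an address and is not. A real pass would also need to handle control flow and phis, which this sketch ignores.
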
13:43 karolherbst[d]: are you doing indirect tex indexing?
13:43 karolherbst[d]: not sure if that's safe
13:43 karolherbst[d]: though if it doesn't load, why should it do any bound checks.. (or even fetching the header)
13:44 gfxstrand[d]: nak/tex-nodep in my gitlab if you want to see what I'm doing
13:44 gfxstrand[d]: I'm CTSing now. There are some fails but no faults so far
13:44 karolherbst[d]: how is the furmark benching?
13:45 zmike[d]: furry
13:45 karolherbst[d]: gfxstrand[d]: ohh yeah.. soo. if you don't set .EF it's doing unpredictable eviction
13:45 karolherbst[d]: just fyi
13:46 karolherbst[d]: sounds like it's optimizing for traffic
13:46 karolherbst[d]: like uhm.. the default value means that
14:00 gfxstrand[d]: karolherbst[d]: So we should be using .EF? Codegen is all over the board here.
14:01 karolherbst[d]: I think when codegen was written nobody had any clue what any of it means
14:01 gfxstrand[d]: Well yeah
14:01 gfxstrand[d]: At least I've got the bits labeled now. 🤷🏻‍♀️
14:01 karolherbst[d]: I _think_ the default is best unless you know better
14:01 gfxstrand[d]: I should look at what the blob does. Unfortunately, my blob box died.
14:02 karolherbst[d]: like .NA is useful if you know the same cache line won't be accessed again
14:04 karolherbst[d]: if you access the same cacheline a few instructions later, but then never again, prolly want to use .EL and then .EF
14:04 gfxstrand[d]: That might be something we can reasonably do for SSBOs or something but not textures.
14:05 karolherbst[d]: not sure it's based on cachelines tho
14:05 karolherbst[d]: but global loads and surface ops have the same flags
14:05 gfxstrand[d]: Sure
14:06 gfxstrand[d]: I should probably add an enum to NAK and plumb it through but idk how much we want to mess with it.
14:06 karolherbst[d]: anyway, the default value (1) is probably the safest bet
14:07 gfxstrand[d]: Yup
14:07 karolherbst[d]: but yeah.. I can see a benefit in using .EL if you operate on the same texture a bit later
14:08 karolherbst[d]: or memory
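
As a rough illustration of the enum idea and the usage guesses above, here is a small Python sketch. The `EvictionHint` enum and `pick_hints` helper are hypothetical; the enum values are descriptive labels only, not hardware encodings, and the selection rule just restates the guesses from the chat (it does not model how close together the accesses are).

```python
from enum import Enum
from typing import List, Optional


class EvictionHint(Enum):
    # Mnemonics as named in the discussion; the values are only labels.
    EF = "evict first"
    EL = "evict last"
    LU = "last use"
    NA = "no allocate"


def pick_hints(accesses_to_line: int) -> List[Optional[EvictionHint]]:
    """Suggest one hint per access to the same cache line.

    None means "leave the default", which the discussion considers the
    safest choice unless the access pattern is known.
    """
    if accesses_to_line == 1:
        # Streaming access: the line is never touched again.
        return [EvictionHint.NA]
    if accesses_to_line == 2:
        # Keep the line for the second access, then let it go.
        return [EvictionHint.EL, EvictionHint.EF]
    return [None] * max(accesses_to_line, 0)


print(pick_hints(2))
```
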
14:58 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33402
14:58 gfxstrand[d]: Seems pretty close to correct to me and I'm seeing .nodep in shaders
14:58 gfxstrand[d]: IDK if it helps anything
14:58 karolherbst[d]: does it help the old furmark?
14:59 karolherbst[d]: but yeah.. some benching should help to figure out if it's helpful
15:00 gfxstrand[d]: It's also possible that it just doesn't matter as much on Ada and mattered a lot on Maxwell or whatever you were benchmarking before.
15:00 karolherbst[d]: might be
15:01 karolherbst[d]: or the gl driver just wasting a lot of memory bw for no reason
15:01 gfxstrand[d]: It could also be furmark 2 being different
15:02 karolherbst[d]: yeah... you probably should test with the first one if you have that available
15:02 gfxstrand[d]: They don't have Linux builds of furmark 1 on the website anymore
15:02 karolherbst[d]: huh?
15:02 karolherbst[d]: weird
15:02 karolherbst[d]: http://www.phoronix-test-suite.com/benchmark-files/GpuTest_Linux_x64_0.7.0.zip
15:02 karolherbst[d]: 😄
15:03 karolherbst[d]: lol
15:03 karolherbst[d]: and here I thought I could upload it
15:09 gfxstrand[d]: No difference on Furmark 1 running under Zink, either.
15:10 gfxstrand[d]: :frog_upside_down:
15:10 gfxstrand[d]: Perf work is hard. 😢
15:11 mohamexiety[d]: aside from the GSP HW utilization thing, was there talk about what else would be needed for nvk to be able to leverage NSight?
15:11 mohamexiety[d]: feels like that would be massive help here at least
15:11 gfxstrand[d]: I've been pondering. I think it would require me to have a chat with the NSight team.
15:12 gfxstrand[d]: Which I'm happy to have
15:12 gfxstrand[d]: I don't even know how NSight works or gathers metrics or anything.
15:12 mohamexiety[d]: yeah I think that may be worthwhile for stuff like this
15:12 gfxstrand[d]: AMD has this neat thing with hardware trace/replay that RADV is pretty easily able to wire up. IDK how any of that works on NV.
15:13 gfxstrand[d]: mohamexiety[d]: Oh, absolutely! It's just a matter of whether or not it's practical.
15:14 mohamexiety[d]: yeah I don't know how the metrics work/are gathered but the few times I used nsight on windows NV it was really nice; you get info on pretty much everything
15:22 gfxstrand[d]: gfxstrand[d]: At least it passes CTS. :shrug_anim:
15:23 gfxstrand[d]: And I'm convinced enough of the correctness of my pass that if we can find anything it helps, I'm happy to merge.
15:28 gfxstrand[d]: mhenning[d]: Since I'm doing perf things that don't do anything... Might as well. 😂
15:56 karolherbst[d]: gfxstrand[d]: should check on the gl driver with my broken pass again...
16:30 gfxstrand[d]: With or without, NVK+Zink is like 5-10% faster than nouveau GL
16:34 asdqueerfromeu[d]: gfxstrand[d]: And MMU fault-free I guess
16:44 gfxstrand[d]: Furmark is, anyway.
16:52 zmike[d]: 💪 :vulkan: :muscleright:
16:54 tiredchiku[d]: :speedL: :vulkan: :speedR:
17:00 gfxstrand[d]: Don't get too excited. Furmark isn't the world's best benchmark.
17:01 gfxstrand[d]: But I think landing cbuf textures closed most of the gap with nouveau GL.
17:01 gfxstrand[d]: That was the one big thing I was aware of that we couldn't do.
17:01 gfxstrand[d]: The gap with the blob is going to be a lot harder.
17:13 zmike[d]: gfxstrand[d]: sounds like excuses to me
17:14 gfxstrand[d]: Hey, if you wanna write the "NVK performance is now complete" blog post, I'm not gonna stop you. 😉
17:15 zmike[d]: ???????????
17:15 zmike[d]: you were literally there
17:16 zmike[d]: https://youtu.be/Z6XLwkyo6Nw?t=1869
17:24 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1336749332841627729/image.png?ex=67a4f054&is=67a39ed4&hm=4ae3c89b71464ff384589112f282ab444f3ea8664c9784f0f8b6a2b9e67f5619&
17:24 mohamexiety[d]: this is amazing
17:25 karolherbst[d]: good times
17:26 karolherbst[d]: got rusticl on zink running at the same time or something
17:26 zmike[d]: yeah that was an intense conference
17:27 karolherbst[d]: carrying all those laptops around was good training tho
17:27 zmike[d]: hahah
17:31 ermine1716[d]: > Hit the gym!