00:50airlied[d]: NVIDIA optimise atomic adds and make it harder to work out
01:24HdkR: The optimization where it does a wave-local reduction first and then a single atomic global memory access?
01:24HdkR: If you could disassemble fragment shaders then they shouldn't do that optimization there.
01:25HdkR: Even though it would help the Dolphin shaders a lot :D
01:53airlied[d]: yes that one
01:54airlied[d]: though I've done a test that just does a direct store, and the isetp/flow control stuff seems fine there
01:54airlied[d]: so I wonder is there some change in how REDG works
01:57mhenning[d]: Could try emitting an atomg instead of a redg
01:58mhenning[d]: also I think nir can do that same optimization with nir_opt_uniform_atomics
01:58mhenning[d]: I didn't realize nvidia did that
01:59HdkR: It's a shame they only do it in compute shaders
01:59HdkR: Because the loadstore pipelines can only retire so many atomics per cycle.
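A minimal sketch of the optimization discussed above, as a plain Python model rather than anything the blob or `nir_opt_uniform_atomics` actually does. The wave size of 32 and the function names are assumptions for illustration. It only applies when the atomic's return value is unused (a reduction), and it counts how many global atomics each strategy issues.

```python
from collections import defaultdict

WAVE_SIZE = 32  # assumed subgroup width, for illustration only


def naive_atomic_adds(addresses, values):
    """Every invocation issues its own global atomic add."""
    memory = defaultdict(int)
    atomics = 0
    for addr, val in zip(addresses, values):
        memory[addr] += val
        atomics += 1
    return dict(memory), atomics


def wave_reduced_atomic_adds(addresses, values, wave_size=WAVE_SIZE):
    """Sum within each wave first, then issue one atomic per wave and address."""
    memory = defaultdict(int)
    atomics = 0
    for start in range(0, len(addresses), wave_size):
        wave_addrs = addresses[start:start + wave_size]
        wave_vals = values[start:start + wave_size]
        partial = defaultdict(int)  # wave-local partial sums
        for addr, val in zip(wave_addrs, wave_vals):
            partial[addr] += val
        for addr, total in partial.items():
            memory[addr] += total  # the single global atomic for this address
            atomics += 1
    return dict(memory), atomics


# 64 invocations (two waves) all incrementing the same counter.
addrs = [0] * 64
vals = [1] * 64
print(naive_atomic_adds(addrs, vals))         # ({0: 64}, 64)
print(wave_reduced_atomic_adds(addrs, vals))  # ({0: 64}, 2)
```

Both paths leave the same value in memory; the difference is only in how many atomics reach the load/store pipeline.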
02:42airlied[d]: ah imnmx problems by the looks of it
02:43HdkR: :D
03:15gfxstrand[d]: airlied[d]: Didn't Blackwell add some predicates or something? I seem to recall seeing something. Or maybe I'm confusing it with Kepler.
03:15airlied[d]: yeah new predicates, just not sure where they go properly, hacked it to pass the crucible test
03:47gfxstrand[d]: I wonder if they brought the Kepler ones back. We should do some unit testing.
03:52mangodev[d]: uh oh
03:52mangodev[d]: i encountered a small problem
03:52mangodev[d]: may be system configuration, may be latest nvk, not sure
03:53mangodev[d]: i don't think my driver can properly free video memory
04:02mangodev[d]: i don't know how to debug it, but i've spotted some signs that alarm me greatly
04:02mangodev[d]: once enough textures are loaded on a session, the whole system gets unbearably slow
04:02mangodev[d]: and opengl applications show gibberish texture data from bygone programs for a split second on window creation, which is the main sign i'm getting for the textures not being actually freed when they're supposed to
04:08airlied[d]: now I've got some sort of memory coherence issue, test writing to an ssbo, running normal it fails, if I poke in the debugger it seems to get further
06:39karolherbst[d]: airlied[d]: didn't mean what nvidia generates, but what yours looks like decoded by nvdisasm
06:50airlied[d]: oh looks like cx[] addressing changed
07:13karolherbst[d]: ohh.. in ldc?
07:14karolherbst[d]: airlied[d]: yeah...
07:14karolherbst[d]: they increased the VM....
07:14karolherbst[d]: apparently
07:15karolherbst[d]: in hopper already
07:16airlied[d]: yes cx[] is now 51/13 instead of 45/19
07:16karolherbst[d]: yeah
07:17karolherbst[d]: I suspect it might cause subtle issues elsewhere as well
07:17airlied[d]: it's pretty self-contained, https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35409
07:17karolherbst[d]: though if the heap manager still uses 45 bits, might be fine
07:17karolherbst[d]: nah.. I meant in other places in the class e.g.
07:17karolherbst[d]: or everywhere where addresses are consumed
07:17airlied[d]: though I wonder if I need to change some min alignment
07:18karolherbst[d]: but I think if `vm_heap` is restricted to 45 bits, it might be that nvk simply doesn't run into any issues
07:18karolherbst[d]: but I'm sure it _might_ mean in the classes there are changes that might mean more fixes
07:18karolherbst[d]: though maybe it's just LDC and it's 45 bits everywhere else
07:25karolherbst[d]: heh....
07:25karolherbst[d]: only _some_ things changed to 51 bit
07:25karolherbst[d]: e.g. the shader trap handler is still 45 bits
07:25karolherbst[d]: but semaphores can use 51 bit addresses
07:27karolherbst[d]: but anyway.. I think the upper bits of addresses are generally their own method with nothing else in it, though I do wonder if there have been exceptions to it
07:34airlied[d]: okay, that fixes a bunch of the weirdness I was seeing
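A rough sketch of why a changed cx[] split matters, assuming the two numbers are address bits and size bits in one 64-bit handle. The field order (address in the low bits), the shift values, and every name here are assumptions for illustration; none of it is taken from the MR.

```python
from collections import namedtuple

# Hypothetical layouts: (address bits, size bits, address shift, size shift).
Layout = namedtuple("Layout", "addr_bits size_bits addr_shift size_shift")
OLD_LAYOUT = Layout(addr_bits=45, size_bits=19, addr_shift=4, size_shift=4)
NEW_LAYOUT = Layout(addr_bits=51, size_bits=13, addr_shift=6, size_shift=4)


def pack_handle(addr, size, layout):
    """Pack an aligned address and size into one 64-bit handle."""
    if addr % (1 << layout.addr_shift) or size % (1 << layout.size_shift):
        raise ValueError("address or size is not sufficiently aligned")
    a = addr >> layout.addr_shift
    s = size >> layout.size_shift
    if a >> layout.addr_bits or s >> layout.size_bits:
        raise ValueError("address or size does not fit in its field")
    return a | (s << layout.addr_bits)


def unpack_handle(handle, layout):
    """Inverse of pack_handle for the same layout."""
    a = handle & ((1 << layout.addr_bits) - 1)
    s = (handle >> layout.addr_bits) & ((1 << layout.size_bits) - 1)
    return a << layout.addr_shift, s << layout.size_shift


handle = pack_handle(0x1000, 0x100, NEW_LAYOUT)
print(unpack_handle(handle, NEW_LAYOUT))  # (4096, 256)
print(unpack_handle(handle, OLD_LAYOUT))  # a different address and size
```

Decoding a handle with the wrong split yields a plausible but wrong address and size, which is the kind of failure that looks like a memory coherence problem from the outside.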
09:57airlied[d]: dEQP-VK.spirv-assembly* https://paste.centos.org/view/3fcacfaa
10:41airlied[d]: Imnmx hacks must still be wrong
10:46karolherbst[d]: what's different with imnmx tho?
10:47karolherbst[d]: don't really see any differences?
10:49snowycoder[d]: airlied[d]: There is a hw_test for imnmx here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34327
10:53karolherbst[d]: maybe reverting the hacks will fix it now that the constant pull stuff is fixed
12:41gfxstrand[d]: mangodev[d]: That's just the Kernel not zeroing memory
12:45gfxstrand[d]: airlied[d]: No, shift of 6 is already fine with the current alignments
13:26phomes_[d]: I am using gsp 570.144 + kernel 6.16.0-rc1 on an AD104. The game Rust has a new hang with the following in dmesg:
13:26phomes_[d]: `nouveau 0000:01:00.0: gsp: Xid:56 CMDre 00000001 00000200 00000001 00000005 00000011
13:26phomes_[d]: nouveau 0000:01:00.0: drm: wndw-0: timeout`
13:27karolherbst[d]: ohh
13:27phomes_[d]: I have not seen it with any other game
13:27karolherbst[d]: that's uhm.. the display stuff going wrong
13:43phomes_[d]: I am happy to do it if there is any kind of debugging or bisection that makes sense. I just don't know where to look 🙂
17:37zmike[d]: gfxstrand[d]: have you looked at that steam thing yet?
17:44gfxstrand[d]: No
17:46gfxstrand[d]: snowycoder[d]: Please rebase. Gitlab is showing conflicts
18:05mangodev[d]: gfxstrand[d]: i'm still confused on the fact that something snaps over enough time
18:05mangodev[d]: even just opening and closing the same program enough times can cause the whole system to chug :/
18:05mangodev[d]: it's to the point where booting one game after booting another makes my PC cry
18:05mangodev[d]: to the point where even half life 1 runs poorly
18:12gfxstrand[d]: Yeah, something's not right there.
18:12gfxstrand[d]: Like, there is something probably not freeing memory properly
18:12gfxstrand[d]: But it's not necessarily related to the temporary corruptions you're seeing.
18:23mangodev[d]: gfxstrand[d]: speaking of
18:23mangodev[d]: any idea on why kde visually corrupts when morphing windows (fullscreen/maximize)?
18:24mangodev[d]: it looks like classic "gpu crapping itself"
18:24mangodev[d]: tons of stripes and bright colors and repeating patterns
18:24mangodev[d]: with a lot of white mixed in there too
18:26mangodev[d]: gfxstrand[d]: some proton games just do it on boot too
18:26mangodev[d]: can't even test them properly because they instantly start a memory leak of some kind
18:26mangodev[d]: teardown causes instant issues on my system, both in opengl and dx12 modes
18:35gfxstrand[d]: airlied[d]: Thanks! I cleaned up the QMD patches and landed them as https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35418
18:35gfxstrand[d]: I'm looking at your cbuf patches now
19:05airlied[d]: gfxstrand[d]: let me know if you want me to do cleanup on the cbuf one or if you want to hold the lock on the branch
19:05gfxstrand[d]: Go ahead
19:05gfxstrand[d]: Take care of the cbuf comments on the MR, then both will land and we can rebase and level-set again.
19:22airlied[d]: okay dropped the alignment and cleaned up the nak arg ordering
19:53gfxstrand[d]: I pushed a super-nit style change to the NIR and assigned Marge.
20:04airlied[d]: okay I'll try and get imnmx and other hw_tests passing and I think there are still some issues around shared memory sizing
20:07gfxstrand[d]: Yeah, something's wrong with smem but I never got it quite sorted.
20:07gfxstrand[d]: I'm blindly trusting you on the +7 thing in your patch.
20:07gfxstrand[d]: But we can always fix that if it's wrong
20:07airlied[d]: it matched the couple of things I tried, but I was going to run some of the explicit tests against nvidia at some point and dump the qmd
20:08gfxstrand[d]: Yeah, now that we know where the QMD bits are, it should be easier to look at dumps
20:08gfxstrand[d]: Want a bunch of QMDs?
20:10gfxstrand[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1381727102457352302/qmds.tar.xz?ex=68489124&is=68473fa4&hm=7aa76a3cf4febd1108d11474b21977ad51134bf07efc414c6f147c119d3c7cc2&
20:10gfxstrand[d]: Go nuts!
20:10gfxstrand[d]: Some time ago I improved mohamexiety[d]'s QMD tests to be a little easier to generate a giant pile and dumped QMDs with various smem sizes. I haven't gotten around to digging through them all by hand, though.
20:16snowycoder[d]: Why do GPUs from Maxwell start using barriers in instruction dependencies? Doesn't the delay already remove all data hazards?
20:16snowycoder[d]: I could not find any mention of it in the whitepapers
20:17airlied[d]: there are two types of instructions some need delays, some need barriers
20:17airlied[d]: (okay 3 types, some need both)
20:17karolherbst[d]: well technically they only need one, but it depends on the actual hardware which one the hardware will care about
20:18mhenning[d]: barriers are for instructions with variable latencies, where the delay cannot handle all hazards
20:20gfxstrand[d]: Kepler has both but I assume the internal scoreboard triggers in the variable-latency case. Why `tex` is different from `ld`, I don't know.
20:20karolherbst[d]: well for memory operations it's not really predictable anyway
20:20karolherbst[d]: and I suspect some operations execution speed depends on the input values
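A toy model of the delay versus barrier split described above, not NAK's actual dependency tracker. Fixed-latency producers are covered by a stall count; variable-latency producers set a scoreboard barrier that the consumer waits on. The latency of 4 cycles, the barrier count of 6, and all names are assumptions, and only read-after-write hazards are tracked.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NUM_BARRIERS = 6  # assumed number of scoreboard barriers


@dataclass
class Instr:
    name: str
    dst: str
    srcs: Tuple[str, ...]
    fixed_latency: Optional[int]  # None means variable latency (e.g. memory)


def schedule(instrs):
    """Return (name, stall_cycles, barriers_waited, barrier_set) per instruction."""
    ready_at = {}      # register -> cycle its fixed-latency result is ready
    pending_bar = {}   # register -> barrier guarding a variable-latency result
    free_bars = list(range(NUM_BARRIERS))
    cycle = 0
    out = []
    for ins in instrs:
        # Variable-latency inputs: wait on the barrier, then release it.
        wait = sorted({pending_bar[r] for r in ins.srcs if r in pending_bar})
        for reg in [r for r, b in pending_bar.items() if b in wait]:
            del pending_bar[reg]
        free_bars.extend(wait)
        # Fixed-latency inputs: a known stall is enough.
        need = max([ready_at.get(r, 0) for r in ins.srcs], default=0)
        stall = max(0, need - cycle)
        cycle += stall
        set_bar = None
        if ins.fixed_latency is None:
            set_bar = free_bars.pop(0)
            pending_bar[ins.dst] = set_bar
        else:
            ready_at[ins.dst] = cycle + ins.fixed_latency
        out.append((ins.name, stall, wait, set_bar))
        cycle += 1
    return out


program = [
    Instr("ld", "r0", ("r4",), None),
    Instr("iadd", "r1", ("r2", "r3"), 4),
    Instr("imul", "r5", ("r1", "r1"), 4),
    Instr("fadd", "r6", ("r0", "r5"), 4),
]
for row in schedule(program):
    print(row)
```

In this model the `ld` result is never given a delay at all: its latency is unknown at compile time, so the consuming `fadd` waits on barrier 0 instead, while the ALU chain is handled purely with stalls.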
20:20gfxstrand[d]: airlied[d]: QMDs and cbufs are merged. Now would be a good time for a rebase.
20:21tiredchiku[d]: I've been hyping myself up to get back at taking my inflatable mallet to the codebase
20:21gfxstrand[d]: But I'll let you do that. You have the branch lock.
20:21airlied[d]: yup already have it rebased, going to push it out now
20:21tiredchiku[d]: last couple months have been hell, but I finally have a lot more free time on my hands
20:21gfxstrand[d]: \o/
20:21tiredchiku[d]: c:
20:21airlied[d]: probably need to extract the tex headers stuff next
20:22gfxstrand[d]: I'm going to head home while the Metro is still running. AFK for a little over an hour. Then I'll review snowycoder[d]'s texdepbar stuff.
20:32gfxstrand[d]: Trying to flush the queue a bit before I go head down again.
20:43gfxstrand[d]: That's not where I'm flushing people's patches! 😂
20:43airlied[d]: the blob doesn't want to produce imnmx for me, just vimnmx
20:44gfxstrand[d]: Fun
20:44gfxstrand[d]: What's the difference? Should we be using the vector ones?
20:44airlied[d]: I wonder should I take that as a hint 😛
20:49mhenning[d]: huh, yeah looks like VIMNMX has 32-bit signed/unsigned modes at least on hopper https://kuterdinel.com/nv_isa/VIMNMX.html
20:50karolherbst[d]: I think we never understood what's special about those V* instructions
20:51mhenning[d]: I mean, VIMNMX has a mode for u16x2 or s16x2 which makes sense to me
20:51mhenning[d]: I don't know why they have both VIMNMX and IMNMX if there's a 32-bit mode for the former
20:51airlied[d]: SIMD 🙂
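For readers unfamiliar with the packed modes mentioned above, here is a small sketch of what a 16x2 min/max means: one 32-bit register holds two independent 16-bit lanes. This models the general idea only; the function name and flags are hypothetical and it is not a verified description of VIMNMX behaviour.

```python
def _to_signed16(x):
    """Reinterpret a 16-bit lane as a signed value."""
    return x - 0x10000 if x & 0x8000 else x


def minmax_16x2(a, b, signed=False, take_max=False):
    """Per-lane min or max of two 32-bit values holding two 16-bit lanes each."""
    result = 0
    for lane in range(2):
        shift = 16 * lane
        la = (a >> shift) & 0xFFFF
        lb = (b >> shift) & 0xFFFF
        # Compare as signed or unsigned, but always keep the raw lane bits.
        ka, kb = (_to_signed16(la), _to_signed16(lb)) if signed else (la, lb)
        if take_max:
            pick = la if ka >= kb else lb
        else:
            pick = la if ka <= kb else lb
        result |= pick << shift
    return result


a, b = 0x0001FFFF, 0x00020003
print(hex(minmax_16x2(a, b)))               # 0x10003
print(hex(minmax_16x2(a, b, signed=True)))  # 0x1ffff
```

The low lane shows why the signed and unsigned modes differ: 0xFFFF is the largest unsigned value but is -1 when read as signed.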
20:52karolherbst[d]: some of the V instructions are there since forever, but never really seen them used
20:53mhenning[d]: The old v* ones are "video" instructions that are well documented in PTX
20:53mhenning[d]: but I think those have been gone since ~kepler
20:55karolherbst[d]: mhh
20:55karolherbst[d]: there was VABSDIFF at least
20:55mhenning[d]: oh yeah, I guess that one stayed around
20:56karolherbst[d]: but there is VIMNMX3 now
20:56karolherbst[d]: but yeah.. sounds like they were added for int16 stuff
21:08airlied[d]: guess I should just implement vimnmx and see if I can get things to work
21:09karolherbst[d]: looks trivial enough for int32
21:09karolherbst[d]: but it's still odd why IMNMX3 won't work..
21:09karolherbst[d]: maybe it's broken
21:09karolherbst[d]: which... would be kinda not like nvidia 🙃
21:10mhenning[d]: karolherbst[d]: There's no IMNMX3, only IMNMX, VIMNMX, VIMNMX3
21:10karolherbst[d]: ohh right, I meant IMNMX
21:34gfxstrand[d]: karolherbst[d]: I suspect we just put predicates in the wrong place somewhere.
21:35karolherbst[d]: yeah.. that's why I was asking how it looks like decoded with nvdisasm
21:36gfxstrand[d]: There's probably a spare `p0` in there
21:38airlied[d]: there is a spare predicate in there, but we are setting it to pt
21:39karolherbst[d]: what's the binary?
21:40airlied[d]: https://paste.centos.org/view/fabab52d
21:42airlied[d]: it's why I'm trying to get the blob to produce it, but it really doesn't want to
21:43karolherbst[d]: the binary, not the nak debug output
21:43karolherbst[d]: or is "00007217 00000001 007ee000 003fcc00" what nak returns?
21:43airlied[d]: yes that is the instruction that nak encodes
21:44karolherbst[d]: IMNMX.U32 PT, PT, R0, R0, R1, P0, PT ;
21:46karolherbst[d]: weird...
21:46airlied[d]: yeah 2 dst pred, two src preds, maybe there's a 3rd src pred hiding 😛
21:46karolherbst[d]: let me try to get PTX to emit it
21:49karolherbst[d]: huh....
21:49airlied[d]: oh I can just hack on the volta imnmx workaround for now 😛
21:51karolherbst[d]: okay...
21:51karolherbst[d]: so with SM86 they use IMNMX
21:52karolherbst[d]: 87 is a disaster
21:52karolherbst[d]: 89 is also IMNMX...
21:52karolherbst[d]: yeah so with 90 they use `VIMNMX`
21:53airlied[d]: I wonder did they break it when they added 64-bit support
21:54karolherbst[d]: where did they add 64 bit support?
21:57airlied[d]: blackwell
21:57airlied[d]: maybe hopper, let me check
21:57karolherbst[d]: aha!
21:57karolherbst[d]: `IMNMX.U64 PT, PT, R2, R2, UR6, PT, !PT ;` on SM120
21:57airlied[d]: looks like sm120 is the first place u64 works
21:58karolherbst[d]: mhhhhh
21:59karolherbst[d]: yeah.. on SM90 the 64 bit one gets lowered
22:00karolherbst[d]: soooooo
22:00karolherbst[d]: I think they probably broke it
22:01karolherbst[d]: but huh
22:01karolherbst[d]: there are two input predicates
22:05karolherbst[d]: okay
22:06karolherbst[d]: airlied[d]: mind swapping the two predicates?
22:06airlied[d]: tried it, didn't seem to help
22:06karolherbst[d]: mhh
22:07karolherbst[d]: soo
22:07karolherbst[d]: the second one is set to false tho
22:07karolherbst[d]: at least in 64 bit mode
22:08karolherbst[d]: so maybe ` e.set_pred_src(77..80, 80, &false.into());` fixes it?
22:08airlied[d]: in true ghostbusters fashion, I've reversed the polarity and crossed the streams 🙂
22:08gfxstrand[d]: https://tenor.com/view/cross-the-streams-ghostbusters-gif-12688913500088635370
22:08karolherbst[d]: or maybe it's indeed busted
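A sketch of what the suggested change does to the encoding pasted earlier, using plain Python integers rather than NAK's encoder. Two things are assumptions: the first printed word holds the lowest 32 bits, and a "false" predicate source is encoded as PT (index 7) with the negate bit set. The helper names only mimic the Rust call and are not NAK's API.

```python
WORDS = [0x00007217, 0x00000001, 0x007EE000, 0x003FCC00]  # as pasted in the log
PT = 7  # assumed index of the always-true predicate


def words_to_int(words):
    """Join 32-bit words into one integer, first word in the low bits."""
    value = 0
    for i, word in enumerate(words):
        value |= (word & 0xFFFFFFFF) << (32 * i)
    return value


def int_to_words(value, count=4):
    return [(value >> (32 * i)) & 0xFFFFFFFF for i in range(count)]


def get_field(inst, lo, hi):
    """Read bits lo..hi (half-open, like a Rust range)."""
    return (inst >> lo) & ((1 << (hi - lo)) - 1)


def set_field(inst, lo, hi, val):
    mask = (1 << (hi - lo)) - 1
    assert val <= mask, "value does not fit in the field"
    return (inst & ~(mask << lo)) | (val << lo)


def set_pred_src(inst, lo, hi, not_bit, pred_index, negate):
    """Write a predicate index plus its separate negate bit."""
    inst = set_field(inst, lo, hi, pred_index)
    return set_field(inst, not_bit, not_bit + 1, int(negate))


inst = words_to_int(WORDS)
print(get_field(inst, 77, 80), get_field(inst, 80, 81))  # 7 0

patched = set_pred_src(inst, 77, 80, 80, PT, negate=True)
print([f"{w:08x}" for w in int_to_words(patched)])
```

Under those assumptions the pasted instruction has PT in bits 77..80 with the negate bit clear, and the proposed change flips only bit 80, turning the third word from `007ee000` into `007fe000`.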
22:25gfxstrand[d]: If so, that's the first properly busted instruction I've encountered
22:28airlied[d]: well if a busted instruction isn't used by the compiler, is it really busted 🙂
22:30HdkR: https://en.wikipedia.org/wiki/List_of_discontinued_x86_instructions#Hardware_Lock_Elision Yes :P
22:30airlied: ah lock elision, it was destined for great things
22:32HdkR: I'm still hoping that ARM's equivalent TME extension will be implemented some day. Then I can implement split-locks without kernel help.
22:42airlied[d]: oh wait maybe false into the second pred is working, don't quote me on that yet
22:43airlied[d]: I must have reversed polarity at the same time as crossing streams