03:38 mupuf: gfxstrand[d]: well, seems like you were right. The 6.16-rc2 kernel with gsp 570 is quite stable. That being said, it is half as fast as the previous kernel. This is probably related to all these timeouts I am getting. Do you see these on your system?
03:41 gfxstrand[d]: Which timeouts?
03:42 gfxstrand[d]: Also, yes, it's a bit slower. I don't know if I've noticed 2x but I did notice. We really need to get our GSP lock overhead sorted.
03:47 airlied: mupuf: timeouts? in dmesg?
03:56 mupuf: airlied: no, timeouts in the deqp results
03:56 mupuf: See https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35546/diffs?commit_id=8a314130af8d0c9123429a55d6e2e80dd1f50b64#diff-content-a09aef94e4287cfff61e7de794a3192d83f46a1e
03:57 airlied: those just take a long time, not sure if they are gsp related or not
03:58 mupuf: on the same hardware, they are not a problem on radv
03:58 mupuf: (just FYI)
04:03 mupuf: is stress testing the jobs in their current shape. We'll land them for a week, then keep on adding runners until we can add it to pre-merge
04:04 mupuf: ideally, this won't require any fraction, but we may want to use fraction initially until we have enough capacity, if it looks like it will take more time to get either deqp-runner improved or all the runners online
04:31 airlied: anarsoul: okay just realised https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9802f0a63b641f4cddb2139c814c2e95cb825099 might fix it
04:31 airlied: let me know if you can test that
04:31 airlied: actually maybe not
04:32 anarsoul[m]: On top of the previous patch?
04:36 airlied: I was hoping it might help without it, but it didn't seem to once I ran it here
07:11 airlied[d]: I've repushed my blackwell branch and tried to cleanup the texture handle code as much as possible, mhenning[d] you might have some advice on the abuse of the pin intrinsic
09:30 karolherbst[d]: mhenning[d]: soo.. I've setup fossils and the `fossils/parallel-rdp/uber_subgroup` seems to have 40 shaders/pipelines? in it, which of those are hurting or rather, what's the proper tool to get per shader stats out of a fossil?
13:08 gfxstrand[d]: mupuf: Yeah, those tests just take forever to run. If you run them without deqp-runner, they pass eventually. I haven't looked into why.
13:37 mupuf: Ack! maybe I should just skip them
14:15 gfxstrand[d]: That's what I'd recommend for now.
14:33 ismael: karolherbst[d]: I've confirmed that when resuming from sleep the display is reset correctly by nouveau, so maybe the problem is related to just DPMS
14:56 karolherbst[d]: ismael: I think with system suspend/resume you aren't running into the same problem
14:56 karolherbst[d]: the problem is that once DPMS kicks in and the GPU gets powered off because of not being used, it won't get powered on for whatever reason
14:56 karolherbst[d]: like..
14:56 karolherbst[d]: what happens if the screen turns black, but 1 second later you use the system?
14:57 karolherbst[d]: I'd expect it to work in that case
15:16 karolherbst[d]: mhenning[d]: ... I looked into one of the shaders, and what I do see is that `iadd3` balancing sometimes doesn't work out properly. Like.. const buffers being moved to the first source instead of the second one. At least that messes with one of the "uber_subgroup" shaders
15:16 karolherbst[d]: isn't there an optimization somewhere that would turn iadd3 cb gpr gpr into iadd3 gpr cb gpr?
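The reordering being asked about (a conceptual sketch, not NAK code): canonicalize the operands of a commutative 3-input add so any constant-buffer source ends up in the second slot, the one the chat suggests the hardware `iadd3` encoding can take a cb operand in. The operand representation and slot constraint here are assumptions for illustration only.

```python
# Hypothetical sketch of iadd3 operand canonicalization.
# srcs is a list of three ("gpr", n) or ("cb", n) operands; the claim that
# only the second slot accepts a cb source is an assumption from the chat.

def canonicalize_iadd3(srcs):
    assert len(srcs) == 3
    # Stable-sort so GPR sources come first and a cb source drifts to the end.
    ordered = sorted(srcs, key=lambda s: s[0] == "cb")
    if ordered[-1][0] == "cb":
        # Place the single cb source in slot 1, GPRs in slots 0 and 2:
        # iadd3 cb gpr gpr  ->  iadd3 gpr cb gpr
        gprs = ordered[:-1]
        return [gprs[0], ordered[-1], gprs[1]]
    # All-GPR case: nothing to rebalance.
    return ordered

print(canonicalize_iadd3([("cb", 4), ("gpr", 1), ("gpr", 2)]))
# → [('gpr', 1), ('cb', 4), ('gpr', 2)]
```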
15:20 karolherbst[d]: but there are also weird things happening..
15:21 karolherbst[d]: like
15:21 karolherbst[d]: r134 = mov 0x2c // delay=5 wt=000010
15:21 karolherbst[d]: r134 = imad r132 0x38 r134 // delay=5
15:22 karolherbst[d]: ehh wait
15:22 karolherbst[d]: I thought that's an iadd3
15:24 karolherbst[d]: mhh anyway, should look at it on a nir level
15:24 karolherbst[d]: just the iadd3 stuff kinda stood out so far
16:25 mhenning[d]: karolherbst[d]: First you run two different `fossilize-replay --enable-pipeline-stats something.csv --num-threads 32 fossils/**/*.foz` to generate a 2 csvs for before and after the change, then you `./report-fossil.py baseline.csv change.csv --rel-changes "Spills to memory"` which will print out changed pipeline hashes
16:25 karolherbst[d]: ahh cool
16:26 karolherbst[d]: anyway, seems like `nir_lower_phis_to_scalar` causes the regression at least in `uber_subgroup`. Just my RA changes alone give me the same results as without any changes
16:26 karolherbst[d]: but that's kinda expected..
16:31 mhenning[d]: yeah, not a big surprise
16:32 mhenning[d]: airlied[d]: which patch is this? I don't see it at first glance
16:40 karolherbst[d]: anyway, I dropped `nir_opt_phi_precision` because making `nir_lower_bit_size` not lower phis to 32 does help even more 🙃
16:40 karolherbst[d]: and we can be more targeted, like only lower scalar 16/8 bits or something
17:47 karolherbst[d]: anyway... given it's nir_lower_phis_to_scalar causing the issue, I think just doing the 16x2 -> 32 conversion might just solve this regression
17:47 karolherbst[d]: I wonder if I can hack up nir_lower_phis_to_scalar to only lower certain phis...
18:08 gfxstrand[d]: Disabling for 16x2 would make sense
18:08 gfxstrand[d]: That's a pretty mean case
18:08 gfxstrand[d]: We can't just stitch those back together in RA
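The filtering idea from this exchange, as a conceptual sketch (not the real NIR pass): a predicate deciding which phis a hacked-up `nir_lower_phis_to_scalar` would split, leaving 16-bit two-component (16x2) phis as vectors since, once split into scalar 16-bit halves, the backend can't stitch them back into one 32-bit register during RA. The function name and parameters are invented for illustration.

```python
# Hypothetical predicate for selective phi scalarization.

def should_scalarize_phi(bit_size, num_components):
    # Scalar phis are already scalar; nothing to lower.
    if num_components == 1:
        return False
    # Keep 16x2 phis as vectors: the pair packs into a single
    # 32-bit register, which is lost if the phi is scalarized.
    if bit_size == 16 and num_components == 2:
        return False
    # Everything else can be scalarized as usual.
    return True

print(should_scalarize_phi(16, 2), should_scalarize_phi(32, 4))
# → False True
```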
18:29 karolherbst[d]: yeah...
18:29 karolherbst[d]: so far my attempt was to simply keep phi vectors, because from_nir can deal with them just fine
18:29 karolherbst[d]: however shader-db isn't happy
18:30 karolherbst[d]: and it's not really clear why
18:30 karolherbst[d]: I think it's just bad luck
18:30 karolherbst[d]: the shaders hurt are already spilling like crazy
18:30 karolherbst[d]: however.. I found some opportunities where iadd3 can be optimized better 😄
19:22 ismael: karolherbst[d]: yes, it seems if I immediately push some key the screen comes back up fine, but if I wait longer, it fails
19:22 ismael: karolherbst[d]: manual testing makes it less reliable than I would like
19:52 airlied[d]: mhenning[d]: https://gitlab.freedesktop.org/airlied/mesa/-/commit/15085feac1b19205d095365de8e9d56c74a437ae
20:37 mohamexiety[d]: airlied[d]: gfxstrand[d] took a bit longer than expected because swizzling swizzles my brain but..... <a:vibrate:1066802555981672650> <a:vibrate:1066802555981672650>
20:37 mohamexiety[d]: DONE!
20:37 mohamexiety[d]: Test run totals:
20:37 mohamexiety[d]: Passed: 51684/149360 (34.6%)
20:37 mohamexiety[d]: Failed: 0/149360 (0.0%)
20:37 mohamexiety[d]: Not supported: 97676/149360 (65.4%)
20:37 mohamexiety[d]: Warnings: 0/149360 (0.0%)
20:37 mohamexiety[d]: Waived: 0/149360 (0.0%)
20:37 mohamexiety[d]: this is host copy on blackwell
20:48 mhenning[d]: airlied[d]: I'm a little confused as to what the pin is doing there
20:50 mhenning[d]: but it's not really valid to use pin like that. It prevents spilling so if you use too many of them then register allocation fails
20:57 airlied[d]: mhenning[d]: if I take it out we spill to warp regs
20:58 airlied[d]: // In non-uniform control-flow, we can't collect uniform vectors so
20:58 airlied[d]: // we need to insert copies to warp regs which we can collect.
20:59 airlied[d]: that part in legalize kicks in
21:01 mhenning[d]: so the case is something like the handle load is in uniform cf, and the texture load is in nonuniform cf?
21:02 airlied[d]: yes
21:03 airlied[d]: but the handle has to be in ureg vec2, and if it is we have to stop taking it out of ureg and trying to use warp reg
21:07 mhenning[d]: Okay, I think the right way to handle this would be to treat the texture handles the same way we treat bindless cbufs handles. nak_nir_lower_non_uniform_ldcx.c could be extended to deal with both texture handles and nonuniform cbuf handles - that's what handles all of the pinning/unpinning right now and they'd compete for register space so we would need a single pass that knows about both
21:19 airlied[d]: I'll dig a bit into that pass then
22:03 airlied[d]: mhenning[d]: doesn't that pass suffer from the nir/nak flow control problem?
22:03 airlied[d]: the problem I've seen is that even in convergent flow control we end up dumping the uregs
22:05 mhenning[d]: no, that pass works well
22:06 airlied: from my reading of it, it won't pin handles in non-divergent flow control
22:06 mhenning[d]: if you're having trouble with both the handle load and the tex in uniform control flow then I'd guess it's a problem with legalize
22:06 mhenning[d]: right, you don't need to pin anything in non-divergent control flow
22:07 airlied[d]: except the legalize pass always spills it
22:07 airlied[d]: even in non-divergent flow control
22:07 mhenning[d]: it's not a spill if it's done by legalize
22:08 mhenning[d]: legalize might have special cases for the cx[] case
22:08 mhenning[d]: I don't have time to look carefully right now
22:09 airlied[d]: cool, but I tried removing the code in legalize, and it regressed other tests, so it definitely is there for a reason 🙂
22:09 mhenning[d]: but yeah, legalize should be changed so the "use in uniform control flow" case works
22:09 mhenning[d]: yes, you can't completely remove that code
22:09 airlied[d]: maybe I need to make texture handle a different type than SSA
22:10 airlied[d]: add a TexHandle to SrcRef
22:10 mhenning[d]: maybe
22:37 mhenning[d]: I think that would work. I'm less certain that we want to add another special case to the ir like that