IRC Logs of #nouveau on irc.freenode.net for 2025-03-19

00:00 Jasper[m]: Will need to check tomorrow
00:00 snowycoder[d]: mhenning[d]: Scheduling instructions are not filled yet, I'm still not sure about sched encoding so I put 0x00 everywhere (that should halt warps for 32 cycles for each instr)
00:00 magic_rb[d]: Add me on matrix btw @magic_rb:matrix.redalder.org
00:01 magic_rb[d]: Jasper[m]: ^
00:02 mhenning[d]: snowycoder[d]: it's possible that's not a valid way to schedule everything, but I haven't thought about it too carefully
00:03 snowycoder[d]: mhenning[d]: Some other tests pass, even other iadds with carry=0. I don't think that's the problem
00:03 gfxstrand[d]: I wouldn't be surprised if carry has scheduling implications
00:08 mhenning[d]: Does the failing IADD use any negate modifiers? Those can be a little subtle in their semantics
00:10 snowycoder[d]: no, it's just
00:10 snowycoder[d]: LOP.AND R2, R2, 0x1;
00:10 snowycoder[d]: IADD R2.CC, R2, 0x7ffff;
00:10 snowycoder[d]: /* sched data */
00:10 snowycoder[d]: IADD.X R0.CC, R0, R1;
00:10 snowycoder[d]: (nvdism output)
00:10 snowycoder[d]: it seems correct to me
00:12 snowycoder[d]: oh wait, that immediate is missing one bit isn't it -.-
00:13 snowycoder[d]: Sorry for the duck debugging
00:15 mhenning[d]: Looking at src/nouveau/codegen/nv50_ir_emit_gk110.cpp, it looks like it won't emit a carry-in/carry-out for adds with immediates (if I'm reading that right)
00:16 mhenning[d]: So you might need to change the legalize code to prevent that case
00:29 snowycoder[d]: Ok, now the test passes, I was resetting the negation bit.
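For readers following the carry discussion above: the `IADD ... .CC` / `IADD.X` pair in the pasted disassembly looks like the usual way to build a wide add out of 32-bit adds, with the first add writing a carry flag and the second consuming it. Below is a minimal Rust model of that idea. It assumes `.CC` means "write carry-out" and `.X` means "add carry-in"; the function names are hypothetical and are not taken from NAK or the codegen sources.

```rust
/// Model of a 32-bit add that writes a carry-out (assumed `.CC` behaviour).
fn iadd_cc(a: u32, b: u32) -> (u32, bool) {
    a.overflowing_add(b)
}

/// Model of a 32-bit add that consumes a carry-in and writes a carry-out
/// (assumed `.X` plus `.CC` behaviour).
fn iadd_x_cc(a: u32, b: u32, carry_in: bool) -> (u32, bool) {
    let (s1, c1) = a.overflowing_add(b);
    let (s2, c2) = s1.overflowing_add(carry_in as u32);
    // At most one of the two partial adds can overflow.
    (s2, c1 || c2)
}

/// 64-bit add built from the two 32-bit steps above.
fn add64_via_carry(a: u64, b: u64) -> u64 {
    let (lo, carry) = iadd_cc(a as u32, b as u32);
    let (hi, _) = iadd_x_cc((a >> 32) as u32, (b >> 32) as u32, carry);
    ((hi as u64) << 32) | (lo as u64)
}
```

With this model, `add64_via_carry(0xffff_ffff, 1)` carries out of the low half and yields `0x1_0000_0000`, which is the case a dropped carry bit would get wrong.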
00:31 snowycoder[d]: mhenning[d]: I'm encoding a couple more instructions that are not present in nouveau but seem valid for the disassembler.
00:31 snowycoder[d]: I think that they weren't encoded just because the IR could not represent them(?)
00:31 snowycoder[d]: (e.g. there's an imul with long immediate that could save us some cycles)
00:53 mhenning[d]: snowycoder[d]: Yeah, there are definitely instruction forms on the hardware that the gl driver doesn't make use of
00:57 airlied[d]: mohamexiety[d]: I think relax and work out explosions, it might be worth only relaxing on volta+ as I think the hw is more flexible
00:59 mohamexiety[d]: got it, will do that tomorrow then. thanks!
03:31 gfxstrand[d]: gfxstrand[d]: With all those hacks, I should have Maxwell B and Pascal in a few hours. I'll put the packages together and submit in the morning. Then I need to fix Volta fp64.
03:31 gfxstrand[d]: I would have had Maxwell B already but I kept accidentally running RADV by mistake. :blobcatnotlikethis:
03:34 mhenning[d]: gfxstrand[d]: while you have maxwell plugged in, I'd be curious if https://codeberg.org/mhenning/mesa/src/branch/watermark works at all (like, if you try vkcube, does it kill the context?)
03:35 mhenning[d]: exciting that we're getting closer to enumerating by default on maxwell though 🙂
03:43 gfxstrand[d]: Sure. I can try it in the morning before I switch to Volta.
03:59 airlied[d]: woo test merged mhenning[d] scheduler, 17->18 TFLOPS
04:09 gfxstrand[d]: Yeah, planning to merge that once GitLab is back to stable and I'm done CTSing. It's first on my list.
04:12 airlied[d]: it needs rework vs latencies stuff, since at least the prepass/postpass one adds another function that is just guessing
04:16 mhenning[d]: Yeah, there are cases where we don't currently know what the second instruction is because the pass is too naive (eg. across control flow edges), and there we call that extra function to get an upper bound on what the latency might be
04:17 mhenning[d]: It's still useful without fixing that case, but long term we either need to include an upper bound along with the latencies or handle control flow properly
04:17 airlied[d]: yeah some of those upper bounds are probably not upper enough 🙂
04:18 mhenning[d]: ah, that too
04:52 gfxstrand[d]: Control flow with barrier tokens is gonna be such a headache.
06:15 airlied[d]: mhenning[d]: we also seem to have OpCopy in use when you do some latency stuff, but that isn't really a hw instruction, so maybe should lower those first, not sure
07:29 airlied[d]: some more ld/st hacks and licm gets me to 19TF
07:40 airlied[d]: arrgh 19.95
07:47 airlied[d]: I think getting some 64-bit address offsets down to 32-bit address offsets might get me over the 20 line if I can work it out
08:11 Jasper[m]: Trying to compile mesa with nouveau (gallium and vk), zink and llvmpipe errors out it seems
08:13 Jasper[m]: https://paste.centos.org/view/9f09dcc1
08:14 Jasper[m]: Wrong channel since it's in llvmpipe, but maybe there's an obvious fix :p
08:15 Jasper[m]: Actually no, this is libcompiler
08:22 airlied: git clean -fdx maybe
08:23 Jasper[m]: Will try in a sec
08:55 Jasper[m]: @airlied I think that fixed it, thanks! Lemme check if I can see some spinny gears and/or cubes
08:59 Jasper[m]: Unaccelerated it seems, glxinfo will also say it's still using llvmpipe. I did NOUVEAU_USE_ZINK=1 meson devenv glxinfo in this case
08:59 Jasper[m]: Anything I can do to make it __not__ use llvmpipe?
09:05 Jasper[m]: Okay, "failed to detect any valid gpu's[...]" seems like something else amiss
09:07 Jasper[m]: sent a code block: https://matrix.org/oftc/media/v1/media/download/AaeWphIqMAsTUBofwyE02k8ox2N4tHzvGAhitLSufjCe8XDxDU0RbM_ZvxVYC2NLJ2PL7WXFf1Nmau3xT7Ba5rVCeV9cFpAwAG1hdHJpeC5vcmcvVk1YcE5Qa3V3UFpwVUJKSGNlTldhb3p6
09:07 Jasper[m]: lmao
12:19 Jasper[m]: @_oftc_gfxstrand[d]:matrix.org I have vulkaninfo output that nvk is used and is detecting the card, but I cannot get something like glxgears or glxinfo to work it seems. What cocktail of env variables do I need for that?
12:19 Jasper[m]: currently have the broken vulkan driver one and nouveau use zink
12:24 Jasper[m]: vkcube exits with:... (full message at <https://matrix.org/oftc/media/v1/media/download/AfcdtM5_ZEXJpYyErxcTLpTl9q7O9eTmWgYPwgB1YjAnloFPe5sBXlbHEyZrchQfABnmTmcXsEJmrlKpVJkRKxlCeV9nWPqAAG1hdHJpeC5vcmcvTUljV25FWXJRb0hvUHJHRHpxeU1MUk9m>)
12:31 Jasper[m]: I retried forcing glxgears to use zink (since it normally uses `tegra`, not nouveau) and got the following:... (full message at <https://matrix.org/oftc/media/v1/media/download/AT8MjYP_oar5ro-_L_Jbso74neFqKaUDWiOm5gCs6csZLvBdiXeZaKvtp_Z3eKZ5Nn2OQb4Y8vBWBbhbvS75ACZCeV9nwxjwAG1hdHJpeC5vcmcvRHlMeGVqd0ppUUFqTGRkRmNBbkZsQ0dP>)
12:32 Jasper[m]: lmk if there's anything else to try or test, just tell me what to run
13:05 gfxstrand[d]: airlied[d]: We're going to have to figure out how to at least estimate those if we want pre-RA scheduling.
13:41 gfxstrand[d]: Pascal done. Maxwell B had a timeout overnight but I got one of them so running just this one shard should pass.
13:43 karolherbst[d]: what's nvidia doing with OpCopy btw?
13:44 karolherbst[d]: ohhh btw... have you found any use for `REDUX`?
13:44 gfxstrand[d]: karolherbst[d]: OpCopy is a NAKism
13:44 gfxstrand[d]: IDK what they do inside their compiler
13:44 karolherbst[d]: I thought you meant the spirv opcopy stuff, but fair
13:45 gfxstrand[d]: We don't do anything with REDUX today.
13:47 gfxstrand[d]: We should probably use that for some of the subgroup ops if they're in uniform control-flow
13:47 gfxstrand[d]: But we'll have to figure out exact semantics first
13:51 gfxstrand[d]: ahuillet: Any chance we can get some Blackwell class headers?
13:52 karolherbst[d]: REDUX: "Reduction of a Vector Register into a Uniform Register"
13:52 karolherbst[d]: takes ops like atomics
13:52 karolherbst[d]: go figure what it does 😛
14:11 gfxstrand[d]: Well, yeah...
14:12 gfxstrand[d]: But it does min/max. Is it float or int? Is it signed?
14:12 gfxstrand[d]: I'm guessing unsigned integer but someone needs to write tests
14:13 gfxstrand[d]: If we can figure out the semantics, I'm sure we can find a use for it
14:13 gfxstrand[d]: Single-instruction reductions sound pretty rad, TBH
14:24 gfxstrand[d]: gfxstrand[d]: Or notthatclippy[d]
14:26 gfxstrand[d]: mohamexiety[d]: Looks like we have Hopper but not Blackwell
14:26 mohamexiety[d]: I see, interesting then
14:27 gfxstrand[d]: Well, we have hopper 3D but not hopper compute?!?
14:27 gfxstrand[d]: Something funky is going on
14:28 gfxstrand[d]: mohamexiety[d]: Blackwell is in OGK, just not the docs repo
14:28 mohamexiety[d]: ohh nice then. so we have the headers
14:28 gfxstrand[d]: Never mind. The only thing the headers in OGK have are the version #define
14:28 gfxstrand[d]: 😭
14:29 mohamexiety[d]: aw
14:30 gfxstrand[d]: So yeah, they need to release some stuff
14:30 gfxstrand[d]: Or we just assume backwards compatibility and see how far that gets us
14:30 gfxstrand[d]: But I strongly suspect we need new QMDs
14:33 notthatclippy[d]: I imagine they're on some desk awaiting legal approval. Will ask around tomorrow.
14:33 gfxstrand[d]: Thanks!
14:34 gfxstrand[d]: But also, I don't have a Blackwell to test on yet. 😛
14:36 notthatclippy[d]: I did bring that up a few times, but apparently no one bit hard enough.
14:39 karolherbst[d]: gfxstrand[d]: int
14:39 karolherbst[d]: signed or unsigned
14:39 karolherbst[d]: there is a flag
14:45 gfxstrand[d]: Ah. I see the flag now
14:45 gfxstrand[d]: karolherbst[d]: Do dsetp and hsetp have a different latency on Volta?
14:45 karolherbst[d]: no idea
14:49 gfxstrand[d]: ugh
14:53 gfxstrand[d]: Okay, looks like double/half predicates are a little slow
14:55 gfxstrand[d]: Of course this is only showing up in the clustered tests :blobcatnotlikethis:
15:09 gfxstrand[d]: Yeah, looks like 15 cycles for hsetp2 and dsetp
15:12 karolherbst[d]: yeah.. predicates are high latency unfortunately
15:15 gfxstrand[d]: Okay, Volta is running now and hopefully this Maxwell B shard will finish okay and not timeout this time.
15:15 gfxstrand[d]: Pascal is submitted
15:19 gfxstrand[d]: Doing CTS runs is a PITA
15:19 gfxstrand[d]: Especially on these crappy cards that won't reclock
15:20 gfxstrand[d]: Not that that makes THAT much difference. The CTS is almost always CPU limited anyway. It just means you're more likely to hit timeouts.
15:23 Jasper[m]: I'd say that might be the only thing the tegras are good for, but I'm pretty sure none of them have a driver for gpu frequency scaling atm
15:23 Jasper[m]: At least upstream
15:52 kwizart: Jasper[m], IIRC that might appear for tx1, some may have converted the table provided by the firmware to the upstream kernel format, but the patches are not yet upstream (only preliminary work was merged)
15:53 Jasper[m]: Ohhh interesting, hope that can be adapted
15:53 Jasper[m]: Especially if nvk gets somewhere
15:59 kwizart: I've sent an email to remind Diogo about this... Actually it was the other way, he mentioned to have reworked the emc driver to understand the emc-tables that are from an older format that upstream kernel supports
16:03 Jasper[m]: Ohh it's the -emc patchset
16:04 Jasper[m]: I see, when was that? I can poke him separately (I've been in contact with him over t210-smaug related stuff)
16:05 Jasper[m]: Jasper[m]: I thought those were only related to memory?
16:06 kwizart: I don't think he has published the "part-2" series (or this implies GPU, indeed)
16:13 Jasper[m]: Ahh, makes sense then
17:19 mhenning[d]: gfxstrand[d]: Yeah, using REDUX for nir_intrinsic_reduce is one of the things I have on my TODO list.
18:38 snowycoder[d]: Question for NAK and Maxwell: Why does the source code in builder say "we have to use a regular 32-bit shift here [...] we need to wrap manually", but then it still uses a `shf.l` with manual wrapping?
18:38 snowycoder[d]: Since we use `shf.l` and take the high part (instead of using a `shl`) we can remove the manual wrapping and let the hardware handle it, right?
18:51 snowycoder[d]: If I remove manual wrapping and swap the `shf.l` type from `U32` to `U64` (so that it wraps properly) all tests pass on Kepler (same .high behaviour of maxwell).
18:51 snowycoder[d]: (Am I introducing weird bugs?)
19:25 gfxstrand[d]: I'm not sure about Kepler but on Maxwell, the `.high` behavior is funky
19:26 gfxstrand[d]: It's all modeled in the `Foldable` impl
19:27 gfxstrand[d]: but basically, `shf.l` ignores `.high` and just assumes `.high`, I think?
19:27 gfxstrand[d]: Whatever it does the `Foldable` impl is correct
19:28 gfxstrand[d]: Oh, actually... Look at the `SM50Op` impl for `OpShf`
19:28 gfxstrand[d]: There's comments in there as well
19:28 snowycoder[d]: gfxstrand[d]: Then it's the same behaviour on kepler
19:29 gfxstrand[d]: If Kepler doesn't have that bit of jank and we can use it "normally", that'd be great.
19:29 snowycoder[d]: Nope, kepler crashes if .hi is set, but even if it isn't it takes the high part anyways 😦
19:29 gfxstrand[d]: Okay, so same story there
19:30 snowycoder[d]: But wrap works normally, we can hard-code the low part to 0 and only use the high part without any manual wrapping for shl64
19:30 gfxstrand[d]: So what we have to do in `shl64()` is we do a shift without `.wrap`, meaning that it will return 0 if shifted more than 31 places, and apply `& 0x3f` manually to do `% 64` and get the 64-bit wrap behavior.
19:32 gfxstrand[d]: snowycoder[d]: Hrm... Maybe that works?
19:32 snowycoder[d]: Using hw_test it seems to work (on Kepler)
19:32 snowycoder[d]: Btw, love hw_tests, they are helping a lot!
19:33 gfxstrand[d]: There's also something funky with U64 shifts on Maxwell. See the folding code
19:34 gfxstrand[d]: I'm happy to come up with something better than what we're doing right now for shl64 but I didn't come up with anything at the time and there's an annoying number of restrictions to work around
19:35 snowycoder[d]: gfxstrand[d]: Folding code seems to ignore sign unless it's for I64?
19:36 gfxstrand[d]: So maybe we can make 64-bit work?
19:36 gfxstrand[d]: I'm happy if you can find a better way
19:37 snowycoder[d]: I think I found something that works, I'll send a MR so you can test it on Maxwell too when gitlab returns
19:38 snowycoder[d]: (also, If what I'm thinking is correct we could unify shl64 even for sm70+)
19:51 gfxstrand[d]: I'm not too worried about unifying sm70+ with earlier. I'd rather sm70+ generate the natural thing. But if we can emit fewer instructions on earlier hardware, I'm all for it.
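The manual-wrapping scheme described above (a non-wrapping 32-bit shift that returns 0 past 31 places, plus an explicit `& 0x3f`) can be sketched in plain Rust. This is only an arithmetic model of the intended result, not the NAK builder code and not a model of what `shf.l` does with `.high` on Kepler or Maxwell; `shl32_clamp`, `shr32_clamp` and `shl64_model` are hypothetical names.

```rust
/// 32-bit left shift that does not wrap: shifting by 32 or more gives 0.
fn shl32_clamp(x: u32, shift: u32) -> u32 {
    if shift >= 32 { 0 } else { x << shift }
}

/// 32-bit right shift that does not wrap: shifting by 32 or more gives 0.
fn shr32_clamp(x: u32, shift: u32) -> u32 {
    if shift >= 32 { 0 } else { x >> shift }
}

/// 64-bit left shift on (lo, hi) halves with `shift % 64` semantics,
/// built only from the clamping 32-bit shifts above.
fn shl64_model(lo: u32, hi: u32, shift: u32) -> (u32, u32) {
    // Manual wrap: `& 0x3f` is `% 64`.
    let s = shift & 0x3f;
    // The low half simply falls off to 0 once s reaches 32.
    let out_lo = shl32_clamp(lo, s);
    let out_hi = if s < 32 {
        // High half keeps its own shifted bits plus the bits that
        // spill over from the low half (none when s == 0).
        shl32_clamp(hi, s) | shr32_clamp(lo, 32 - s)
    } else {
        // For 32..=63 only bits from the low half survive.
        shl32_clamp(lo, s - 32)
    };
    (out_lo, out_hi)
}
```

Comparing `shl64_model` against `u64::wrapping_shl`, which also masks the shift amount to 6 bits, is a cheap way to sanity-check a lowering like this before running it through hw_tests.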
20:42 gfxstrand[d]: Ugh... Volta keeps dying. Guess it's time to run it serial and see if I can get that to pass. 😭
20:43 gfxstrand[d]: At least it's still sharded so if it dies part-way through, I can restart it without losing everything.
21:10 dj-death: gfxstrand[d]: can you consider separate_shader=false for optimized pipelines?
21:10 dj-death: link-optimized
21:12 gfxstrand[d]: We don't link optimize at all right now
21:12 gfxstrand[d]: Probably should
22:45 _lyude[d]: btw gfxstrand[d] looking again at the issue you reported, I haven't gotten it to reproduce so I've been trying to see if I can figure out why igt-gpu-tools doesn't work anymore and then hopefully once that's figured out - see if I can run a test to generate a 32x32 cursor
22:45 _lyude[d]: times like this I wish we had a piglit equivalent but for hacking up modesetting scenarios
23:26 gfxstrand[d]: Yeah, IGT is very not great
23:47 TranquilIty[m]: What is broken with IGT?