04:46 airlied[d]: karolherbst[d]: the non-square flexible lowering really messes with things like transpose, might have to add a split intrinsic, and figure out a per-operation split
05:17 airlied[d]: I do wonder if we should just leave the NAK lowering in place, and just lower to square matrices
05:20 airlied[d]: maybe I can square up transposes and it'll work
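A minimal sketch of the "square up transposes" idea above, in plain Python with a flat row-major layout. The function name and layout are illustrative only, not the actual NAK/NIR lowering:

```python
# Hypothetical illustration: lower a non-square transpose by padding the
# matrix to a square shape first, doing a shape-preserving square
# transpose, then trimming back down. Zero padding is assumed harmless
# because the padded elements are discarded at the end.

def transpose_via_square(mat, rows, cols):
    """Transpose a rows x cols matrix (flat row-major list)."""
    n = max(rows, cols)
    # Pad to an n x n square (row-major), filling with zeros.
    sq = [[mat[r * cols + c] if r < rows and c < cols else 0
           for c in range(n)] for r in range(n)]
    # A square transpose keeps the same shape, which is what makes the
    # existing square-only lowering applicable.
    sq_t = [[sq[c][r] for c in range(n)] for r in range(n)]
    # Trim back to the cols x rows result.
    return [sq_t[r][c] for r in range(cols) for c in range(rows)]
```

For example, a 2x3 input comes back as its 3x2 transpose, with the padding never visible in the result.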
09:11 mohamexiety[d]: airlied[d]: do you know if Spark is going to be as cursed as Thor is with regards to running nouveau? AFAIK it’s normal Blackwell so I think it uses the normal GSP without any shenanigans but given how cursed the development timeline of this thing was I don’t actually know
09:15 airlied[d]: It will be cursed different
09:16 airlied[d]: But hoping NVIDIA will bring it up in nova at least
09:16 airlied[d]: When I played with grace hopper it had a lot of unique challenges
09:16 airlied[d]: Esp around memory coherency
09:17 airlied[d]: Not even NVIDIA's Vulkan driver exposed VRAM as mappable on Hopper, I think Blackwell fixed some of that
09:17 airlied[d]: So maybe it will be a bit simpler to get across the line
09:35 jja2000[d]: `Pass: 774578, Fail: 5412, Crash: 527, Warn: 4, Skip: 969749, Timeout: 17, Missing: 5451, Flake: 14762, Duration: 16:39:06, Remaining: 10:08:15` status update on gp10b
10:11 mohamexiety[d]: airlied[d]: Interesting :blobcatnotlikethis:
10:11 mohamexiety[d]: Thanks! You'd think unified memory stuff would be simpler, not more cursed :bleaker_kekw:
10:22 karolherbst[d]: airlied[d]: I think when I debugged it I concluded it calculates a wrong offset in certain cases
10:22 karolherbst[d]: my working theory is that if you split on multiple dimensions and the dimensions aren't the same, it does something wrong
11:29 airlied[d]: I fixed a bunch of cases, but some matmul results were still failing
11:29 airlied[d]: My lowering has some square assumptions in it
11:39 karolherbst[d]: airlied[d]: yeah.. that makes sense.. do you know where to fix it and have time? Otherwise if you give me a tldr I could look into it
11:41 airlied[d]: I haven't pushed out my hacks, going to sleep now, but will push out tomorrow; one main bug was that the split load col_offset/row_offset calcs were wrong for non-square
11:41 airlied[d]: It uses desc.cols for both
11:42 airlied[d]: It needs to multiply then swap for row major
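The offset bug described above can be sketched like this; the parameter names stand in for the tile descriptor fields and are made up for illustration, not the real structures:

```python
# Hypothetical sketch of the split-load offset calculation. The bug
# described above amounts to using the parent's column count as the
# stride for both layouts, which only happens to work when the matrix
# is square.

def subtile_offset(i, j, sub_rows, sub_cols,
                   parent_rows, parent_cols, row_major=True):
    """Flat element offset of sub-tile (i, j) inside the parent tile."""
    if row_major:
        # Stepping down one sub-tile of rows strides by parent_cols
        # elements per row.
        return (i * sub_rows) * parent_cols + (j * sub_cols)
    # Column-major: multiply then swap, so the column stride is
    # parent_rows instead.
    return (j * sub_cols) * parent_rows + (i * sub_rows)
```

For a square 16x16 parent the two branches coincide, which is why the bug only shows up on non-square shapes.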
11:47 karolherbst[d]: airlied[d]: ohhh....
11:47 karolherbst[d]: that's what I was missing...
12:36 snowycoder[d]: I've never seen "Illegal Instruction Parameter"; usually it errors out with "Illegal Instruction" when you mess something up in the encoder (or you pass the wrong flags in the SPH)
12:40 snowycoder[d]: What does RADV_DEBUG=hang do?
12:41 snowycoder[d]: I don't think we have something similar.
12:41 snowycoder[d]: Usually I just replace some weird instructions with nops or "easy" variants until I figure out what the SM is throwing up
12:42 marysaka[d]: misyltoad[d]: we really don't have much more, but I think it gives you the details of the address causing it. I can't remember if ESR values are helpful here :aki_thonk: karolherbst[d] do you remember?
12:43 karolherbst[d]: no idea
12:44 marysaka[d]: ah wait, those are perfcounter registers, isn't it
13:11 mohamexiety[d]: in theory it shouldn't be too hard to wire up a `RADV_DEBUG=hang` equivalent, but we currently don't have one, sadly
13:15 mohamexiety[d]: if it does it's not wired up from the kernel side currently :frog_upsidedown:
13:17 mohamexiety[d]: most of what `hang` does can be done from userspace though without GSP involvement, since userspace should at least narrow down which section of the command buffer is faulting
13:19 mohamexiety[d]: yeah that one would likely (really not sure) need the GSP similar to how we get faulting addresses for mmu faults from there
13:19 mohamexiety[d]: I am not even sure if any of the ESR values are helpful for us rn
13:21 mohamexiety[d]: misyltoad[d]: notthatclippy[d] would you know if we could use any of this info as is to be able to find the faulting PC or whether that would require some kernel changes?
13:43 marysaka[d]: yeah, the GSP reports the error
13:59 marysaka[d]: misyltoad[d]: wondering if 0x4c1ab72 wouldn't be the address of the instruction causing the issue
14:02 marysaka[d]: are you sure the upload address is right? that doesn't look aligned at all :aki_thonk:
14:03 marysaka[d]: I would expect it to be at least aligned to 64 bytes (not sure what is the requirement on compute again would need to dig)
14:03 gfxstrand[d]: Is that for a QMD? Those addresses are shifted by 8 so they never look aligned.
14:04 gfxstrand[d]: ah
14:10 marysaka[d]: right, so `0x516fb0` is probably the origin of the exception, `0x516fa8` the exception type, `0x516fac` seems to usually be 0x174 from random dumps online; `0x516fb4` is maybe the interesting one? but it would make no sense as an offset anyway... maybe there's nothing interesting for us there :painpeko:
14:11 marysaka[d]: we really should figure out and type the whole trap handler stuffs someday to report concrete errors
14:49 esdrastarsis[d]: marysaka[d]: GSP is reporting useful error messages since R570, right?
14:59 gfxstrand[d]: yes
15:42 notthatclippy[d]: Those Xid logs aren't going to give you the address, but you might be able to get it either from the GSP event (not sure if it gets generated here) or from the protobuf journal.
15:44 notthatclippy[d]: AFAIK the protobuf is still not actually exposed by nouveau, but I can totally see a usecase for a fwctl thing that just gets it and a userspace debugger that decodes it
15:53 chikuwad[d]: release 570
15:53 chikuwad[d]: i.e. nvidia 570 drivers
15:53 chikuwad[d]: or more specifically, support for the GSP from the 570 driver in nouveau
15:55 marysaka[d]: notthatclippy[d]: hmm is the protobuf def actually public? I suppose you also use the standard wire format? :aki_thonk:
15:56 notthatclippy[d]: marysaka[d]: Subset of it is, in openrm, although what gets published is the generated C code and not the protobuf spec. But it would be an easy sell to get spec approval publishing too.
15:57 notthatclippy[d]: misyltoad[d]: Can you trigger the same thing on the NV stack?
15:57 notthatclippy[d]: Don't go trying now if you're not sure, because I'm not sure what it would buy you exactly.
15:58 notthatclippy[d]: Okay, we _can_ run NVK on OpenRM.
15:58 marysaka[d]: notthatclippy[d]: yeah that would be awesome to have but we could work out of C code to parse things a bit at first I guess
15:58 notthatclippy[d]: notthatclippy[d]: I'll have to check if that actually buys you anything though.
16:00 notthatclippy[d]: marysaka[d]: So, I was the one that "decided" what went public and what didn't when we published OpenRM. How that happened was that we completely missed these bits until we got waaaay too close to the publish deadline. And so I wrote a script that first deleted everything that wasn't actually needed to build OpenRM and then I just read through the rest and rubber stamped it. I don't even remember
16:00 notthatclippy[d]: how much was deleted, but I imagine that anywhere between 95% and 100% of it is also entirely okay to publish and I just didn't have the time to read it.
16:00 notthatclippy[d]: And, you know, nothing more permanent than a temporary solution.
16:02 notthatclippy[d]: I dunno, haven't been following too closely. I only know what was announced here
16:04 marysaka[d]: yeah, that but with a specific openrm version; it also hasn't been rebased in a while, so idk how messy rebasing on that would be for you 😅
16:04 notthatclippy[d]: misyltoad[d]: This looks to be based on <https://github.com/X547/mesa/tree/mesa-nvk> which is the one I looked at, so I'm guessing that's the latest and greatest
16:05 chikuwad[d]: ~~I should get back to it eventually~~
16:21 mhenning[d]: Might not be directly useful for you for this, but NVK_DEBUG=push_sync will insert additional synchronization and print out the push buffer that fails
16:22 mhenning[d]: Otherwise, if I were you, I'd be trying to write more complex cuda c++ programs and getting them running before tackling dlss
16:24 mhenning[d]: One thing to note is that CUDA has an optimization along the lines of nir_opt_large_constants that it applies, and you'll need to bind that extra data into a cbuf in order for the program to access it.
16:26 mhenning[d]: In general, I think you'll probably need to have a bunch of code to bind things into cbufs in ways that match the proprietary driver.
16:27 mhenning[d]: There's also a few things that you need to put into the QMD (like whether the shader does global memory access) that you'll need to make sure line up
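The large-constants idea above can be sketched in plain Python to show the data flow; this is purely illustrative of the technique (moving big constant arrays out of the shader into a side buffer that gets bound as a cbuf), not driver code, and `pack_large_constants` is a made-up name:

```python
# Illustrative sketch: pack each large constant array into one blob and
# record the byte offset at which the shader would index it from the
# bound cbuf. Assumes all constants are 32-bit floats for simplicity.
import struct

def pack_large_constants(constants):
    """constants: dict of name -> list of floats. Returns (blob, offsets)."""
    blob, offsets = bytearray(), {}
    for name, values in constants.items():
        offsets[name] = len(blob)  # byte offset the shader would load from
        blob += struct.pack(f"<{len(values)}f", *values)
    return bytes(blob), offsets
```

The driver-side job is then just uploading the blob and binding it at whatever cbuf slot the compiled kernel expects.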
16:41 mhenning[d]: misyltoad[d]: I meant global mem, but now that I'm looking at the code, I was mistaken - there's a bit for that on the 3d engine but not in compute
16:54 mohamexiety[d]: yoooo <a:vibrate:1066802555981672650>
17:03 mhenning[d]: misyltoad[d]: oh, yeah. That's an odd one in that vulkan only uses up to one barrier, but cuda exposes more than one
17:04 pac85[d]: Wow nice!
17:04 pac85[d]: misyltoad[d]: ~~count the instructuons~~
17:07 mhenning[d]: pac85[d]: It wouldn't actually be an instruction count but rather the maxiumum barrier index + 1
17:07 pac85[d]: mhenning[d]: Oh
17:12 karolherbst[d]: misyltoad[d]: you see `BAR` instructions tho, right?
17:13 karolherbst[d]: though it should be part of the header.. but worst we should be able to decode it and see the highest number used with BAR 🙃
17:13 karolherbst[d]: let's see...
17:13 karolherbst[d]: nah.. depbar is something else
17:13 karolherbst[d]: ohh shf?
17:13 karolherbst[d]: mhhh
17:14 karolherbst[d]: if you don't have a `BAR` you don't need to set control barrier to anything afaik
17:16 karolherbst[d]: I'm sure the issue you are seeing is something else 😛
17:16 marysaka[d]: misyltoad[d]: you have barriers
17:16 marysaka[d]: see BMOV around :aki_thonk:
17:16 marysaka[d]: it's actually BX too
17:17 marysaka[d]: it's very inconsistent it seems :blobcatnotlikethis:
17:17 karolherbst[d]: that's a different barrier
17:17 karolherbst[d]: marysaka[d]: that's a different barrier
17:17 marysaka[d]: oh right :EstelleFacepalm:
17:17 karolherbst[d]: misyltoad[d]: yep, that needs control barrier
17:18 mhenning[d]: yeah, specifically since that's index zero, you'll need num_control_barriers=1
17:18 karolherbst[d]: _not_ sure if `SHF_BARRIERS` is the right thing, but if there is nothing else obvious going on.. maybe just use that?
17:19 mhenning[d]: you can also generate cuda kernels with different barrier indices and see what changes in the binary
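The "max barrier index + 1" rule mentioned above could be sanity-checked against a text disassembly with something like this; the `BAR.SYNC` mnemonic spelling and the regex are assumptions about the SASS dump format:

```python
# Rough sketch: derive the number of control barriers a kernel needs by
# scanning disassembly text for BAR.SYNC indices and applying
# max index + 1 (zero if the kernel has no BAR at all).
import re

def num_control_barriers(sass_text):
    idxs = [int(m.group(1), 0)
            for m in re.finditer(r"\bBAR\.SYNC\s+(0x[0-9a-fA-F]+|\d+)",
                                 sass_text)]
    return max(idxs) + 1 if idxs else 0
```

So a kernel whose only barrier is index 0 would need `num_control_barriers=1`, matching the point above.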
17:19 marysaka[d]: there is also some global init for FP16 1.0
17:22 karolherbst[d]: well that it only uses at most one barrier
17:22 karolherbst[d]: might want to correlate headers 🙃
17:22 karolherbst[d]: if they all have `@"SHF_BARRIERS=1"` and the others don't...
17:25 karolherbst[d]: well... assume it's the right thing and hope it works? 🙃
17:26 marysaka[d]: pretty sure it's fine... is there anything that we need to setup for call stack tho? :aki_thonk:
17:27 marysaka[d]: oh right we have `EIATTR_CRS_STACK_SIZE`
17:28 marysaka[d]: relocations maybe? idk tbh :aki_thonk:
17:32 mhenning[d]: marysaka[d]: The old CRS stack is gone starting with volta - there are bits in the program headers for it but they're always zero
17:32 mhenning[d]: There's probably still a stack for function calls, but we have no idea how that really works
17:34 pac85[d]: misyltoad[d]: Can you like dump all the state from the blob driver around the kernels and compare it
17:41 marysaka[d]: mhenning[d]: I see... I was still stuck on my Maxwellism then 😅 , I guess we should explore that someday
18:16 mohamexiety[d]: compression now passes all CTS \o/
18:16 mohamexiety[d]: just a few flakes (non reproducible?)
18:17 mohamexiety[d]: could merge after final review round but ideally would prefer if more people test out games
18:17 esdrastarsis[d]: mohamexiety[d]: finally 🅱️ erf
18:21 esdrastarsis[d]: mohamexiety[d]: What about the kernel patches? They will land in 6.19?
18:21 mohamexiety[d]: no response since last review
18:21 mohamexiety[d]: I sent a v2 addressing everything but no response to that
18:22 mohamexiety[d]: will wait a bit and then re send I guess
18:25 mohamexiety[d]: https://lkml.org/lkml/2025/10/10/48
18:27 mohamexiety[d]: https://gitlab.freedesktop.org/mohamexiety/nouveau/-/commits/compv222 this has the patches rebased on 6.18rc1 if someone wants something easier to access
18:36 mhenning[d]: mohamexiety[d]: I think we would typically wait for the kernel side to get review and then land them together.
18:37 mhenning[d]: misyltoad[d]: We can dump command buffers from the proprietary driver, but I'm not sure that's what you're looking for
18:38 mohamexiety[d]: game dependent but a lot faster across the board
18:38 mhenning[d]: For something like this I'd typically use renderdoc but I have no idea if that works for this
18:38 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1427727262253453393/image.png?ex=68efea21&is=68ee98a1&hm=157bdfcb44ed4a4231bfca9820753bae47880a520f8ab4ca72cbe1cb33f69f9b&
18:38 mohamexiety[d]: older testing by phomes_[d]
19:29 airlied[d]: misyltoad[d]: you writing your own PTX to NIR parser? or just executing SASS CUDA kernels?
19:29 mohamexiety[d]: SASS
19:43 pac85[d]: misyltoad[d]: No idea, but just throwing the idea around. Might be a good idea to make such a tool
19:54 airlied[d]: karolherbst[d]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/nvk-coopmat2-flex-hacks the fix is in that pile of hacks
19:56 karolherbst[d]: airlied[d]: cool. I can finish it up tomorrow, if you want to work on other things
19:58 airlied[d]: that would be good, I really can't see why the final sets of tests are failing, I'm just looking into how to possibly deal with conversion between different shapes and transposing between non-square shapes
19:58 airlied[d]: I do feel that apart from the MMA instruction itself there is value in keeping square shapes maybe, not sure
19:58 airlied[d]: the ability to convert at 16x16 float16 to 2 16x8 uint8 makes writing the intrinsics hard
19:59 airlied[d]: I hacked around that in those hacks by forcing all reductions to a common size if there was a convert, but that feels wrong
20:00 airlied[d]: I'm mostly just playing around to avoid working out workgroup scope matrix lowering, my brain is refusing to play that game
20:02 karolherbst[d]: right...
20:02 karolherbst[d]: airlied[d]: I think nvidia supports it natively on blackwell with the tensor instructions
20:02 karolherbst[d]: sooo.. might just not need it on nvk at least
20:03 airlied[d]: oh we need it on turing etc
20:03 karolherbst[d]: annoying
20:03 airlied[d]: I've dumped the nvidia shaders
20:03 karolherbst[d]: sure but we could also say it's only supported on blackwell+ 🙃
20:04 airlied[d]: no that would suck
20:04 airlied[d]: actually I dumped it on blackwell
20:04 airlied[d]: and it uses the same lowering as turing
20:04 karolherbst[d]: huh? weird
20:04 airlied[d]: maybe vulkan driver has to catch up
20:04 karolherbst[d]: well I haven't really looked into how the tensor ops are working, but I _thought_ they could do workgroup level things...
20:22 jja2000[d]: gfxstrand[d]: when the cts run finishes, is there anything else I can test? It's running the tests from before today's rebase
20:23 jja2000[d]: `Pass: 1127364, Fail: 8079, Crash: 772, Warn: 5, Skip: 1411325, Timeout: 24, Missing: 8045, Flake: 21386, Duration: 27:28:29, Remaining: 2:53:36`
20:23 gfxstrand[d]: Not really
20:23 gfxstrand[d]: Right now, I'm focused on getting the base cache flush MR landed.
20:24 jja2000[d]: Which MR is that? (so I don't bother you any further until that lands :P)
20:24 gfxstrand[d]: And trying to build Android
20:25 jja2000[d]: jja2000[d]: Also, should I save the logs and publish em somewhere? It's kind of a lot of stuff
20:31 gfxstrand[d]: IDK
20:31 gfxstrand[d]: Logs by themselves aren't that useful
20:32 gfxstrand[d]: steel01[d]: I think the problem I have at this point is bindgen. I copied over my distro bindgen and now it's pulling headers from the wrong place.
20:33 steel01[d]: gfxstrand[d]: I did a cargo install for bindgen, putting it in ~/.cargo. For the moment, that path is set at a higher priority than the aosp stuff in the mesa build rules for that fork.
20:35 gfxstrand[d]: let's try that
20:35 steel01[d]: What will probably have to happen in the mid-to-long term will be making a new rust prebuilts folder for whatever version mesa requires. If we can't go the other way and lower the build req.
20:36 gfxstrand[d]: Well, that's the weird thing. The build req is kinda fine
20:36 steel01[d]: Eh? I thought cbindgen was too old. And meson went 'go away'.
20:38 jja2000[d]: gfxstrand[d]: Log may be the wrong term sorry, I mean the .log and .qpa files generated by the cts
20:38 steel01[d]: Mmm, I do note that mesa_cross3d is currently setting rust 1.82. And 1.83 is available. Wonder if 1.83 would have new enough bindgen.
20:39 steel01[d]: Oh. bindgen is in clang-tools, not the rust folder. Meh.
20:40 gfxstrand[d]: /usr/include/features.h:435:4: warning: _FORTIFY_SOURCE requires compiling with optimization (-O) [-W#warnings]
20:40 gfxstrand[d]: /usr/include/bits/floatn.h:83:52: error: unsupported machine mode '__TC__'
20:40 gfxstrand[d]: /usr/include/bits/floatn.h:97:9: error: __float128 is not supported on this target
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 gfxstrand[d]: clang diag: /usr/include/features.h:435:4: warning: _FORTIFY_SOURCE requires compiling with optimization (-O) [-W#warnings]
20:40 gfxstrand[d]: Unable to generate bindings: clang diagnosed error: /usr/include/bits/floatn.h:83:52: error: unsupported machine mode '__TC__'
20:40 gfxstrand[d]: /usr/include/bits/floatn.h:97:9: error: __float128 is not supported on this target
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 gfxstrand[d]: /usr/include/bits/pthreadtypes-arch.h:52:50: error: 'regparm' is not valid on this platform
20:40 steel01[d]: Mmm, yeah that's bad. Using host headers instead of target headers.
20:40 gfxstrand[d]: This is after I fixed it to pass the right `-target`
20:41 gfxstrand[d]: Before that, we had issues with generating structs with the wrong layouts
20:43 steel01[d]: I wonder if the aosp prebuilt sets any different default flags.
20:44 steel01[d]: So, that prebuilt is 0.69.5. And the build script wants... 0.7x. I'm assuming that min was raised for a reason and not just 'bigger number better'. Do you know if there's any required features in the newer version?
20:47 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/commit/1a698c75ae5f47b4cc875170e6c3d30aac0a8a24
20:47 gfxstrand[d]: Doesn't seem super critical
20:48 marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1427760003019964516/snapshot.png?ex=68f0089f&is=68eeb71f&hm=c6aac0fb87500b080a3410af439bda8bf381c2008f43293b8ce2a5ef5f472cc4&
20:48 marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1427760003783331953/image.png?ex=68f0089f&is=68eeb71f&hm=aa0ea5ada561dcaa168525872d456ccb0a22a8ae8e0570eb75b0411fc117ab13&
20:48 marysaka[d]: 🎉
20:48 steel01[d]: Mmm. So maybe worth a try dropping that version in meson.build, then nuking the .cargo version if you pulled that.
20:50 gfxstrand[d]: Yeah, let's give this a go
20:50 gfxstrand[d]: meson.build claims 0.69 is buggy which is bad because that's the version Android ships
20:50 gfxstrand[d]: Well, it's 0.69.5 which is hopefully good
20:51 gfxstrand[d]: Looks like 0.69.0 is the one that's busted
20:54 steel01[d]: Yeah, that's the way I read it. The point release is new enough to be fine.
20:56 mohamexiety[d]: marysaka[d]: YOOOO very nice!!! ❤️
21:13 huntercz122[d]: marysaka[d]: DLSS on NVK before GTA 6? O.O
21:34 esdrastarsis[d]: huntercz122[d]: Next goal: Raytracing on NVK before GTA 6
21:38 rhed0x[d]: or nvk on the nvidia kernel driver :>
21:41 gfxstrand[d]: Oops. I think I just OOM'd my build box.
21:41 gfxstrand[d]: Building android 2x at the same time will do that. 😂
21:42 marysaka[d]: oh no :blobcatnotlikethis:
21:42 steel01[d]: Uhhhhh...
21:43 steel01[d]: Yeah, I'd say so.
21:43 jja2000[d]: NOOO OOM on the vk test
21:43 jja2000[d]: dangit
21:43 steel01[d]: 64GB ram isn't really enough for a single build when using 32 threads (like a ryzen 5950x). You'd have to have quite the beast to get away with that.
21:44 steel01[d]: That poor cpu scheduler. 😛
21:47 jja2000[d]: I'm guessing you can't continue these cts's after crashing like that? 🙁
21:55 gfxstrand[d]: jja2000[d]: I think I might have just fixed Zink and vkcube if you want to try those again
21:55 jja2000[d]: Ye will do, good timing too
21:56 gfxstrand[d]: And then I think my tegra branch is in good shape to merge once the prerequisites land.
21:56 gfxstrand[d]: It's not going to get us tegra by default but it's got the basics.
21:57 steel01[d]: To enable it by default, is anything needed beyond changing the conformant function to only filter out igpu, allowing soc through? I'll probably want to do that for android builds. 'Cause fighting with the env vars via props... I've not won that battle yet.
21:58 gfxstrand[d]: We have bugs to fix
21:59 mhenning[d]: sigh. I pulled vulkan-cts-1.4.4.0 and now my cts runs take an estimated 6 hours instead of 2
21:59 mhenning[d]: also I have tens of failures. need to figure out what that's about
22:00 gfxstrand[d]: 😩
22:07 gfxstrand[d]: gfxstrand[d]: Okay, finally OOMd for real. Now I can re-start just one of my Android builds and this time with `-j24` instead of `-j36`.
22:13 gfxstrand[d]: mhenning[d]: I'm kicking off a 1.4.4.0 run on Blackwell now just to see how bad it is.
22:16 gfxstrand[d]: I'm seeing a bunch of untyped pointers coop matrix crashes
22:17 gfxstrand[d]: That'll slow things down a bit
22:18 gfxstrand[d]: But 2 min in, I'm seeing an estimate of 52 min. That's only a little slower than before. We'll see how long it actually takes.
22:18 mhenning[d]: gfxstrand[d]: yeah filed https://gitlab.freedesktop.org/mesa/mesa/-/issues/14100
22:18 mhenning[d]: I'm at Pass: 756733, Fail: 36, Crash: 58, Warn: 2, Skip: 768670, Flake: 1, Duration: 2:51:48, Remaining: 2:27:57
22:18 mhenning[d]: normally would have been done by now
22:18 gfxstrand[d]: Ouch
22:19 gfxstrand[d]: `Pass: 103861, Fail: 1, Crash: 9, Skip: 106129, Duration: 3:04, Remaining: 38:34`
22:19 gfxstrand[d]: Maybe something later in the run will slow it way down?
22:19 mhenning[d]: maybe
22:24 gfxstrand[d]: Okay, I've got two CTS runs and an Android build going and it's already dark outside. Time to go home.
22:25 jja2000[d]: Mesa compile is almost done, will go to bed after testing vkcube
22:26 mhenning[d]: misyltoad[d]: Out of curiosity, did you just need to implement the vulkan extension to get that going, or is there more glue code to get it to turn on?
22:28 jja2000[d]: gfxstrand[d]: can confirm, seeing spinny cube
22:29 jja2000[d]: It'll lag from time to time, but it's definitely spinning
22:29 jja2000[d]: at speed too
22:29 gfxstrand[d]: \o/
22:30 gfxstrand[d]: I just forgot to flag the memory types as DEVICE_LOCAL. Easy fix once I saw it.
22:30 jja2000[d]: gfxstrand[d]: Spinny gears too, thank you very much so far!
22:36 gfxstrand[d]: steel01[d]: Okay, once my machine stopped OOMing and I could look at it again, we're still failing with the bindgen bundled in Android. Same thing. 😭
22:37 gfxstrand[d]: But I'm not fixing that tonight
22:38 gfxstrand[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1427787608188260515/mesa.diff?ex=68f02254&is=68eed0d4&hm=c73e60f72067f73b298b4e7b4094725a83fdb2ae04d046ea5702d908d05c495a&
22:38 gfxstrand[d]: This is the diff of my external/mesa right now
22:51 jja2000[d]: Hmmm seems like it will still often fall back to llvmpipe from zink, but I guess there's still all of the other stuff that needs improving hahaha
23:02 jja2000[d]: It's the hardware, dw zmike[d]
23:07 gfxstrand[d]: Did the Blackwell modifier kernel patches ever land?
23:07 gfxstrand[d]: cubanismo[d]: ^^
23:08 mhenning[d]: I think they never got reviewed
23:12 gfxstrand[d]: I thought I did but maybe not?
23:12 gfxstrand[d]: I'll look tomorrow
23:13 gfxstrand[d]: James reviewed the Mesa patches
23:13 gfxstrand[d]: If there was anything wrong, IIRC it was pretty minor
23:25 cubanismo[d]: Let me go look.
23:25 mhenning[d]: Ah, right looks like you reviewed
23:25 cubanismo[d]: Danilo said they were good to go in a private message
23:25 cubanismo[d]: But someone made me go work on something else and I stopped paying attention.
23:26 mhenning[d]: fourcc change hasn't landed as of 6.18-rc1 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/include/uapi/drm/drm_fourcc.h?h=v6.18-rc1
23:30 cubanismo[d]: The minor fix the actual modifier addition patches relied on had to be applied first and filter down to all the LTS kernels. That should have happened by now, but it's possible Danilo forgot to follow up.
23:30 cubanismo[d]: misyltoad[d]: Cool. I'll have to send that around internally.
23:32 mhenning[d]: cubanismo[d]: He did ask a question that you didn't respond to https://lkml.org/lkml/2025/9/2/1237
23:33 cubanismo[d]: Ah, never saw that. Thanks.
23:41 gfxstrand[d]: Oh, right. So I did review it and it didn't land. I'm not going crazy. 😅
23:42 karolherbst[d]: misyltoad[d]: the issue with smem on cuda is that there are two ways of specifying shared memory: 1. as part of the cubin and 2. as a variable amount, and you have to launch with the sum of both. The CUDA API only needs the dynamic amount, as it will add the kernel-internal amount automatically
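The launch rule above boils down to a one-line sum; this trivial sketch just names the two pieces (the field names are illustrative, not actual QMD or CUDA API fields):

```python
# Minimal sketch: the hardware launch descriptor needs the *total*
# shared memory, while the CUDA launch API only takes the dynamic part
# and the driver adds the static (cubin-declared) amount itself.

def total_shared_mem(static_smem_from_cubin, dynamic_smem_from_launch):
    """Shared memory size to program into the launch descriptor."""
    return static_smem_from_cubin + dynamic_smem_from_launch
```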
23:43 gfxstrand[d]: Yeah, that's the way CL works, too.
23:43 karolherbst[d]: ahh, nice
23:43 karolherbst[d]: I suspect it's the initialization value
23:44 karolherbst[d]: earlier gens (like.. 8000 series) had commands to prefill the value
23:44 karolherbst[d]: yes, those are like 15(?) year old gpus
23:44 gfxstrand[d]: Probably
23:44 karolherbst[d]: in theory you can lower it by appending something to the kernel, but that's just pain
23:45 karolherbst[d]: doable with ptx tho
23:45 karolherbst[d]: I think you can specify initializers to shared memory in ptx
23:45 karolherbst[d]: anyway.. tesla has commands to prefill it, and I suspect it comes from there and they never bothered to remove it
23:46 karolherbst[d]: I think they also still allow bound const pulls, but the compiler generates invalid code 🙃
23:47 karolherbst[d]: anyway.. you probably want to assert on it being 0
23:47 karolherbst[d]: who knows when something uses it for crazy shit
23:47 karolherbst[d]: but then also unsure how it's used?
23:48 karolherbst[d]: could be a memcpy from a buffer passed through an internal kernel arg
23:48 mhenning[d]: misyltoad[d]: yeah, that's good. The cuda compiler lets you generate cubins without ptx (although that's discouraged) so maybe they have stuff in place for that
23:49 karolherbst[d]: though given shared memory is just L1 cache, maybe there is some other magic way to prefill it...