00:00 redsheep[d]: I'm sure even with nvidia's insane pricing and insufficient supply amazon makes a killing on this
00:07 redsheep[d]: mohamexiety[d]: I know we keep talking about when this craze will end but I am not sure it will. Sure, the day will probably come when nvidia's stock price corrects, and it will probably coincide with some kind of event that causes more severe widespread rejection of LLMs and so forth, but I think we're at the point where enough people will want all of the tensor processing they can get for
00:07 redsheep[d]: *something* for a very long time to come
00:07 snowycoder[d]: pavlo_kozlenko[d]: I have a 710 (yep, kepler b)
00:08 redsheep[d]: Most of the investors and customers could walk away and there would still be more than enough demand for nvidia to continue selling everything they can get TSMC to make, and still at a crazy price
00:09 airlied[d]: started adding bound handle support on my branch
00:09 airlied[d]: but will finish properly later
00:12 mhenning[d]: My ampere gpu only seems to support SET_QMD_VERSION with version "1"
00:12 mhenning[d]: int qmd_v = 1;
00:12 mhenning[d]: P_IMMD(p, NVA0C0, SET_QMD_VERSION, {.current = qmd_v, .oldest_supported = qmd_v});
00:12 mhenning[d]: P_IMMD(p, NVA0C0, CHECK_QMD_VERSION, {.current = qmd_v, .oldest_supported = qmd_v});
00:12 mhenning[d]: So I suppose those calls might not be too useful
00:12 pavlo_kozlenko[d]: snowycoder[d]: Me too. How can I help?
00:13 airlied[d]: I think some GPUs have multiple QMDs defined in the class headers and those can deal with multiple versions; otherwise you have to use the one
00:16 snowycoder[d]: pavlo_kozlenko[d]: If you want to check out my MR there's some stuff that needs work.
00:16 snowycoder[d]: I'm trying to figure out surfaces, but there are still textures (they don't work but I don't know why) and shared atomics (we need to lower them to load-lock and store-unlock loops)
00:17 mhenning[d]: airlied[d]: yeah, and if I'm reading this right my gpu supports both "QMDV02_04" and "QMDV03_00". But apparently you don't select between them with those methods
00:17 snowycoder[d]: Just tell me if you want to work on any so we don't rewrite the same things.
00:17 snowycoder[d]: (If you want to work on textures I already did the encoding and txq, for some reason txq works but tex doesn't)
00:21 pavlo_kozlenko[d]: shared atomics
00:21 pavlo_kozlenko[d]: mb
00:21 pavlo_kozlenko[d]: snowycoder[d]
00:21 pavlo_kozlenko[d]: snowycoder[d]: Kepler B doesn’t support all Vulkan atomic ops natively on shared memory. So, we need to lower them to a software fallback: basically a spinlock loop using load, op, CAS, yes?
00:24 pavlo_kozlenko[d]: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
00:24 pavlo_kozlenko[d]: 12 page
00:24 pavlo_kozlenko[d]: hmm
00:35 gfxstrand[d]: No, it's load-lock/store
00:35 gfxstrand[d]: It's the same scheme that ARM uses for their CPUs
00:37 gfxstrand[d]: So you do a load-lock, check to see if you got the lock. If you did, do the ALU op and store. If you didn't, loop and try again.
00:39 gfxstrand[d]: ```rust
00:39 gfxstrand[d]: loop {
00:39 gfxstrand[d]:     val, locked = load_lock(addr);
00:39 gfxstrand[d]:     if (locked) {
00:39 gfxstrand[d]:         new_val = op(val, data);
00:39 gfxstrand[d]:         store_release(addr, new_val);
00:39 gfxstrand[d]:         break;
00:39 gfxstrand[d]:     }
00:39 gfxstrand[d]: }
00:39 gfxstrand[d]: ```
00:40 gfxstrand[d]: We'll need new NIR intrinsics for `load_lock_shared` and `store_release_shared`
00:50 gfxstrand[d]: snowycoder[d]: Plugged in a GK208. IDK how much time I'll have but I may play with textures.
00:50 gfxstrand[d]: Is your draft MR up-to-date?
01:53 pavlo_kozlenko[d]: gfxstrand[d]: I want to add ```
01:53 pavlo_kozlenko[d]: INTRINSIC("load_shared", srcs=1, dests=1, flags=[NIR_INTRINSIC_CAN_ELIMINATE])
01:53 pavlo_kozlenko[d]: INTRINSIC("store_shared", srcs=2, dests=0)```
01:53 pavlo_kozlenko[d]: for declarations of new NIR instructions
01:53 pavlo_kozlenko[d]: Where do I insert them correctly?
01:53 pavlo_kozlenko[d]: I can't find a fragment like `INTRINSICS = [` where I can write instructions
02:01 mhenning[d]: pavlo_kozlenko[d]: Take a look at src/compiler/nir/nir_intrinsics.py
03:13 gfxstrand[d]: Also, I don't think we want CAN_ELIMINATE on load_lock_shared.
03:15 gfxstrand[d]: I mean, if either the load or the store gets eliminated, we're kinda screwed. <a:shrug_anim:1096500513106841673>
03:32 pavlo_kozlenko[d]: mhenning[d]: I'm confused :)
03:44 mhenning[d]: pavlo_kozlenko[d]: That python file determines the nir intrinsics that are available. It's used to generate c code
03:45 mhenning[d]: so if you want to add an intrinsic, you add it there
04:22 pavlo_kozlenko[d]: mhenning[d]: I forgot to say that's where I put it, I just don't know which function to add it to.
05:02 mhenning[d]: pavlo_kozlenko[d]: It doesn't need to go in a specific place, but since it's currently nvidia specific you might want to add an `_nv` at the end of the name and put it near the other `_nv` intrinsics
05:03 mhenning[d]: or if you don't use the `_nv` suffix it can go near the other atomics
05:03 mangodev[d]: i'm curious
05:03 mangodev[d]: is there a way nvk could get monitoring support in the future (gpu usage, power usage, et cetera)? is there a standard mesa interface where that info is transmitted, or is it specific per-driver?
05:05 tiredchiku[d]: mesa doesn't deal with hw monitoring, that's the kernel's job
05:05 tiredchiku[d]: and the kernel has a standard interface called...
05:05 tiredchiku[d]: 🥁
05:05 tiredchiku[d]: hwmon
05:05 tiredchiku[d]: https://docs.kernel.org/hwmon/hwmon-kernel-api.html
05:05 tiredchiku[d]: the GSP provides all the info, but no one has wired it up to hwmon yet
05:06 mhenning[d]: yeah, it's very doable just nobody's gotten around to it yet
05:08 airlied[d]: mhenning[d]: do we know if uaoffi mode is there before hopper?
05:40 HdkR: NVIDIA shared atomics check the cacheline lock on the load side? ARM tests it on the store side so you burn a bit of ALU each iteration
07:50 asdqueerfromeu[d]: mhenning[d]: And this is one of the reasons why a NVRM backend is important 🐸
07:58 airlied[d]: Ah ulc means clamp value needs to be in a ur, my brain is slow
08:18 snowycoder[d]: gfxstrand[d]: not yet sorry, went to bed, I'll update it with
09:20 snowycoder[d]: Ok merge request updated gfxstrand[d] pavlo_kozlenko[d]
09:20 snowycoder[d]: I added encodings for tex/txq and some initial work for suld/sust (I'm still working on it when I have time)
09:31 x512[m]: asdqueerfromeu[d]: I suspect Linux developers will implement that in Nouveau kernel driver calling GSP instead of using NVRM KMD.
09:31 x512[m]: Out of tree modules is a big big taboo.
09:36 asdqueerfromeu[d]: x512[m]: The question is when that will be done (because the initial GSP support was part of kernel 6.7, released in January 2024, and there has been no progress on that since)
09:53 HdkR: What a shame, my RTX 4000 SFF Ada doesn't work on the Radxa Orion so I can't throw NVK at anything right now
10:00 mohamexiety[d]: HdkR: Wait why not? That board supports normal PCIe afaik
10:05 x512[m]: Getting load/thermal GPU stats is easy with NVRM API.
10:06 x512[m]: https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/157#discussioncomment-10381610
10:06 x512[m]: Also per-PID APIs exist.
10:12 HdkR: mohamexiety[d]: It doesn't appear on the bus, I think the amount of BAR space isn't enough or something silly.
10:13 mohamexiety[d]: I see, interesting
10:15 HdkR: Luckily I can just wait two months for Thor :D
11:39 zmike[d]: is anyone here able to repro https://gitlab.freedesktop.org/mesa/mesa/-/issues/12300
13:54 gfxstrand[d]: snowycoder[d]: How old a CUDA SDK do I need?
13:57 gfxstrand[d]: Looks like CUDA 7 if I want to go all the way back to Fermi
14:57 gfxstrand[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1359905091812982944/image.png?ex=67f92dcf&is=67f7dc4f&hm=0276fd4033a8a0d2b301531d8bb4206a5eb37d90442a3ab33c0ef49b81a425bc&
14:57 gfxstrand[d]: That's not nothing!
14:58 snowycoder[d]: gfxstrand[d]: I'm using CUDA 10 since it's the last version that supports sm30 and sm32
15:01 snowycoder[d]: Oh wow, I've tried `dEQP-VK.glsl.texture_functions.texture.sampler2d_fixed_fragment` and it returned `DeviceLost`
15:03 gfxstrand[d]: That's because we're faulting because the hardware thinks the image is compressed when it's not
15:19 gfxstrand[d]: I hope the kernel isn't messing with my PTE kinds....
15:34 gfxstrand[d]: Okay, looks like I'm getting roughly the same behavior as codegen now
15:40 gfxstrand[d]: I wonder if images can't live in GART
15:50 gfxstrand[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1359918380659769505/image.png?ex=67f93a30&is=67f7e8b0&hm=98b730b8fa3680433da92bda57f357d45e9290e16109aef22a7db89008aab1b1&
15:50 gfxstrand[d]: If I fix our Kepler GART hack I get
15:51 gfxstrand[d]: snowycoder[d]: I think we need to figure out this texture memory issue before it makes sense to spend much more time on tex/surf ops in the compiler.
15:52 gfxstrand[d]: At this point, I'm pretty darn sure that tex and tld more or less work on my nak/kepler-tex branch
15:52 gfxstrand[d]: We're getting garbage because either the image upload is bogus (quite possible) or the hardware is trying to compress things
15:54 gfxstrand[d]: Actually... It could be texdepbar
15:54 gfxstrand[d]: Unlikely. I'm getting the same issue with codegen
15:55 snowycoder[d]: Yep, that's why I was trying (and failing) to debug TICs and QMDs
15:55 gfxstrand[d]: Actually, I think we have an issue with texdepbar AND TICs
15:56 snowycoder[d]: Oh no, why?
15:56 gfxstrand[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1359920021647921312/image.png?ex=67f93bb7&is=67f7ea37&hm=c879c089e548a5ffc0b60ffc19fccc8c28f6e6c2b9bc0e551f9487a6fdb81162&
15:56 gfxstrand[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1359920021911900230/image.png?ex=67f93bb7&is=67f7ea37&hm=da1e435134f9e4087112b23b5e2dada970b1236b03e61af0139c9620dfbe53c7&
15:56 gfxstrand[d]: First is codegen, second is NAK
15:57 snowycoder[d]: codegen uses a different texdepbar?
15:57 gfxstrand[d]: Codegen is clearly garbled but it's half garbled and half fine. NAK is more garbled
15:57 gfxstrand[d]: The additional NAK garble looks very much like we're grabbing data before it's ready
15:58 gfxstrand[d]: All the codegen garble looks like the image is set up wrong
15:58 snowycoder[d]: I can also push some code to write out the shader binary into a file if it can be useful
15:59 snowycoder[d]: (to check encoding with nvdisasm)
15:59 gfxstrand[d]: There's a flag that will dump it
16:05 gfxstrand[d]: gfxstrand[d]: This scrambling makes me think it's not actually a compression issue. I think we're just getting the wrong data somehow
16:08 gfxstrand[d]: Ugh... Ran with `zero_memory` once and now I'm getting all zeros
16:17 gfxstrand[d]: Things I know:
16:17 gfxstrand[d]: 1. I know bindless is working
16:17 gfxstrand[d]: 2. I know we're putting the right address in the texture header
16:19 gfxstrand[d]: I kinda suspect VM_BIND
16:20 gfxstrand[d]: But if we have bugs there... cursed.
16:20 gfxstrand[d]: Well, no. That should be fine, too. Render works and that's even pickier about PTE kinds than texturing
16:20 mhenning[d]: airlied[d]: Yeah, the new ulc/ulb/lb.ulc are why I've been saying the bias/clamp ordering probably changed in hopper. At the very least there's no `.lb.lc` any more so we need to be able to turn the clamp into a uniform. The old code has tricks to combine the lb + lc into a single value, so those are probably gone too (and we maybe never need some of the offset code any more)
16:21 mhenning[d]: airlied[d]: I haven't checked that specifically, but considering there's no ureg source on those instructions before hopper, I doubt it
16:22 gfxstrand[d]: Okay, I think I've spent too much time on Kepler already. But I think progress has been made. I've got a place I have to be in an hour and a half.
16:23 snowycoder[d]: gfxstrand[d]: Thanks for your help!
16:23 snowycoder[d]: Now I'm even more confused on where the bug is though:blobcatnotlikethis:
16:25 gfxstrand[d]: There are funky things going on with memory and kepler.
16:26 gfxstrand[d]: Last time I poked at kepler, I got it sorted to the point of no longer "everything crashes like mad" but there are still issues
16:27 gfxstrand[d]: But also, GM107 and GK208 should have roughly the same memory stuff going on, or so it looks like in the kernel.
16:27 gfxstrand[d]: <a:shrug_anim:1096500513106841673>
16:32 redsheep[d]: Wasn't Maxwell A the very first GPU you got working with nvk?
18:33 mohamexiety[d]: hm, is there a pattern to how local size is passed through the QMDs in Ada/older?
18:34 mohamexiety[d]: trying to figure out a pattern but it's a bit weird
18:55 mohamexiety[d]: nevermind, I am dumb. it was the most obvious stuff but I missed it :blobcatnotlikethis:
18:55 mohamexiety[d]: ook I think we got local size too now, just verifying just in case
19:00 snowycoder[d]: gfxstrand[d]: what testcase are you using for this?
19:27 snowycoder[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1359973051600732357/image.png?ex=67f96d1a&is=67f81b9a&hm=20bbe4aedfb3ea2711565f9a82f918c8410df9694ac7a51ed332ee348d3efcef&
19:27 snowycoder[d]: WAIT,
19:27 snowycoder[d]: most `texture.samplercube_fixed_*` are almost sensible!
19:29 tiredchiku[d]: seem to be flipped along the y axis
19:43 mhenning[d]: or it could be flipped on the x axis
20:17 mhenning[d]: I'm not sure what you mean by "unused bytes"
20:17 mhenning[d]: For comparison, the v4 fields are:
20:17 mhenning[d]: #define NVCBC0_QMDV04_00_CTA_THREAD_DIMENSION0 MW(1167:1152)
20:17 mhenning[d]: #define NVCBC0_QMDV04_00_CTA_THREAD_DIMENSION1 MW(1183:1168)
20:17 mhenning[d]: #define NVCBC0_QMDV04_00_CTA_THREAD_DIMENSION2 MW(1199:1184)
20:17 mhenning[d]: #define NVCBC0_QMDV04_00_REGISTER_COUNT MW(1208:1200)
20:18 mhenning[d]: which is to say v4 makes each of them a 16-bit number, so technically the upper bits of some of them will always be zero
20:18 mhenning[d]: and then it's followed by an unrelated field
20:39 mohamexiety[d]: yeah in that case it all checks out, thanks
20:56 gfxstrand[d]: snowycoder[d]: glsl.texture_functions
21:20 mohamexiety[d]: ok found shader memory
21:20 mohamexiety[d]: er, sorry. shared memory
21:21 mohamexiety[d]: buuut it's a bit problematic. it's entirely different. this is what the shared memory entry looks like:
21:21 mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
21:21 mohamexiety[d]: .VALUE = 0x1b41802
21:21 mohamexiety[d]: and this is the one without:
21:21 mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
21:21 mohamexiety[d]: .VALUE = 0xb40800
21:21 mohamexiety[d]: everything else is 1:1 identical
21:21 mohamexiety[d]: the shaders are also identical. the only difference really is one is `shared bool` and one is `bool`
21:32 mohamexiety[d]: gfxstrand[d]: with this done I am just missing 2 and 4 out of these. for 2 I am not sure how to have 2 different shaders at the same time tbh. for 4 I am not sure where to start with the register manipulation
21:32 mohamexiety[d]: (and also if there'll be more playing around with shared memory)
21:35 mohamexiety[d]: also unrelated but cant seem to push anything at all today :thonk:
21:35 mohamexiety[d]: `git push` just... doesn't work. no error or anything, it just freezes like that
21:38 gfxstrand[d]: SSH changed. You need to fix all your SSH remotes from gitlab.freedesktop.org to ssh.gitlab.freedesktop.org.
21:38 gfxstrand[d]: I'm a little annoyed by that change. It's gonna break a lot of people's setups and it's frustratingly non-obvious why.
21:42 gfxstrand[d]: mohamexiety[d]: So I'm guessing the bottom bits are an amount of shared memory. It's probably similar to what we program for other gens but the encoding might have changed. I suspect the other are control bits of some sort. For the control bits, we can run the same experiment on Ampere and see what bits change there. They're probably roughly the same bits, just moved around a little.
21:43 tdaven[d]: The ssh change is mentioned here with a fairly nice way handle it:
21:43 tdaven[d]: https://gitlab.freedesktop.org/freedesktop/freedesktop/-/issues/2076#note_2831847
21:51 gfxstrand[d]: mohamexiety[d]: For registers, the big thing is to do something that prevents the scheduler from moving things around to reduce register pressure. So something like
21:51 gfxstrand[d]: ```glsl
21:51 gfxstrand[d]: int x[3] = { 0, 0, 0 };
21:51 gfxstrand[d]: for (int i = 0; i < push.bound; i++) {
21:51 gfxstrand[d]: x[0] += i;
21:51 gfxstrand[d]: x[1] += i * 3;
21:51 gfxstrand[d]: x[2] += i * 5;
21:51 gfxstrand[d]: }
21:51 gfxstrand[d]: int y = x[0];
21:51 gfxstrand[d]: y += x[1];
21:51 gfxstrand[d]: y += x[2];
21:51 gfxstrand[d]: use(y);
21:51 gfxstrand[d]: ```
21:51 gfxstrand[d]: Because the loop bound is a push constant, the compiler can't unroll it and all of `x[]` has to remain live in registers at the end of the loop. Then we use all of them to ensure the compiler doesn't delete anything. You can make `x[]` as big as you'd like and burn all the registers.
22:05 mohamexiety[d]: gfxstrand[d]: Ohhh I see. Thanks a lot, I completely missed that this would change
22:06 mohamexiety[d]: gfxstrand[d]: Got it, will do that then
22:07 mohamexiety[d]: gfxstrand[d]: Aha that’s brilliant. Understood, thanks so much