00:01 gfxstrand[d]: We'd have to actually enable the hardware, though.
00:02 gfxstrand[d]: But it's theoretically possible we could bring it up on top of a branch that actually works and then forward-port the patches.
00:06 banger: is nouveau still under active development?
00:07 gfxstrand[d]: Depends on what part of nouveau you're talking about
00:08 banger: the linux driver for nvidia cards in particular the GT 1030
00:08 gfxstrand[d]: Yeah, it's still in development. But 1030 is pretty cursed so it's unlikely to get massively better
00:09 banger: cursed how?
00:09 gfxstrand[d]: Pascal is stuck at boot clocks and there's not much we can do about that.
00:09 gfxstrand[d]: Turing (20xx) and above is fine
00:10 gfxstrand[d]: But 10xx not so much
00:10 banger: yeah i read that. i'm not a gamer, just looking for it to support my linux desktop (Solus Plasma)
00:12 banger: as nvidia has announced support for the 10xx drivers is ending
00:12 gfxstrand[d]: Yup
00:13 banger: just trying to keep my old rigs going with the 1030 cards
00:16 gfxstrand[d]: Yeah, 1030 should work on nouveau
00:16 gfxstrand[d]: it'll just be slow
00:16 banger: great thats all i need to know thanks
00:24 cubanismo[d]: FWIW, we'll keep doing "security updates" that usually pull in support for new kernels (And new X servers, if that ever happens again) for quite a while. You'll just be stuck on an older branch, so there won't be new features or larger bug fixes. These days the main downside of that for older cards is you won't get any/many more Wayland support improvements.
00:24 cubanismo[d]: But of course, nouveau is an option as well if it works for you.
00:39 banger: interesting yeah wayland is my main desktop
00:44 banger: i plan on using nouveau for some time so it's good to know it's still being worked on
03:17 x512[m]: <marysaka[d]> "gfxstrand[d]: I was explicitly..." <- But will it apply to standard Vulkan opaque FD buffer sharing?
03:28 gfxstrand[d]: x512[m]: Uh... Good question. I need to go read up on the rules there.
03:31 x512[m]: Note that I am interested in efficient opaque FD buffer sharing so it will not always fall back to linear.
03:31 gfxstrand[d]: I think if you create a memory object with a dedicated allocation and export it as OPAQUE_FD then you had better import it with an identical dedicated allocation. But I'm not sure if there's a VU that says that.
03:32 gfxstrand[d]: But hey, we've got cubanismo[d] here now. He knows. 😅
03:33 x512[m]: As long as image creation and import parameters match and the same driver is used, opaque FD import should work.
03:33 gfxstrand[d]: Yes but this is about memory object creation, not just image creation. Because we're currently depending on dedicated allocation for compression to work.
03:34 gfxstrand[d]: But also we're gonna have to figure this out all over again on openrm because it's probably smarter about page tables than nouveau.
03:35 x512[m]: Do Nvidia support RLE image import? It can greatly speed up loading rasterized alpha masks generated by CPU.
03:37 x512[m]: NVRM may use private FD parameters so it will import everything correctly.
05:57 cubanismo[d]: gfxstrand[d]: That is accurate.
05:57 cubanismo[d]: Though I also don't have the exact VUID handy
05:58 cubanismo[d]: That is the #1 most common interop error: Create as dedicated, import as non-dedicated.
05:59 cubanismo[d]: x512[m]: OpenRM knows almost nothing about image layout.
06:02 cubanismo[d]: The premise of opaque FD import/export was always that, given identical inputs, both sides arrive at the same layout. That's why it had the driverUUID and deviceUUID matching requirement. Dev is effectively another input to the layout equation, and the drivers had better actually be the same to ensure they both have the same logic. Then everyone went and required dedicated allocation, used
06:02 cubanismo[d]: kernel-side driver-private metadata instead, and complained whenever that was hard.
06:04 cubanismo[d]: Also, we made a last-minute rework to how dedicated allocation worked to appease slippery-slope concerns, and in retrospect, it sort of broke dedicated allocation.
06:06 cubanismo[d]: Still not super happy with how all that turned out, but I think the overall external handle mechanism/abstraction holds up pretty well, and opaque handles are implementable, if a bit tricky to get right.
07:39 mohamexiety[d]: HdkR: I am still not sure if I want Thor or Spark for nvk bringup tbh. On one hand, Spark should have a better software experience (not tied to Jetpack, modern kernels, etc) but Thor should be more flexible later on
07:40 airlied[d]: I think nvk bringup on spark will only be about 20 lines of code 🙂
07:41 airlied[d]: though I've no idea if the memory bus is coherent on that thing, so it might be a bit more work
07:41 mohamexiety[d]: It is I think
07:41 mohamexiety[d]: It’s fully unified, only a single memory pool, etc
07:42 mohamexiety[d]: None of the cursedness of the big DC SoCs leaning on nvlink-c2c
07:42 airlied[d]: at least on GH, even NVIDIA's vulkan driver was a bit nerfed
07:42 airlied[d]: exposed VRAM as non-mappable
07:43 airlied[d]: though I think GB might have fixed that
07:43 mohamexiety[d]: Yeah that’s because GH isn’t truly unified. You have two distinct pools (GPU HBM and CPU’s LP5X) with nvlink-c2c essentially pretending that it’s only one by doing some heroics in the background
07:44 mohamexiety[d]: GB _should_ have the same limits but I haven’t seen a system
07:45 airlied[d]: the open driver has an option to expose it as non-coherent, but I couldn't get it to work
07:47 mohamexiety[d]: That’s annoying
07:54 jannau: galaxus has a product listing for a pny dgx spark with delivery date in ~3 weeks, €6900
07:54 jannau: https://www.galaxus.de/en/s1/product/pny-workstation-nvidia-dgx-spark-prozessorfamilie-nvidia-128-gb-1000-gb-pc-59656752?supplier=2705624
07:56 mohamexiety[d]: That’s one huge upcharge damn
07:56 mohamexiety[d]: It should be in range of $3000-4000
08:02 jannau: includes 19% VAT but €5800 is still a huge upcharge
08:02 HdkR: mohamexiety[d]: Yea, I'm having a hard time justifying either machine to myself currently.
08:32 karolherbst[d]: airlied[d]: if you are bored, could do more benchmarks on main, because a membar opt landed that should give you +30% more perf 🙃
08:41 airlied[d]: I'll see if I can kick it on Monday, got stuck trying to add coopmat2 features
08:51 karolherbst[d]: fair
09:24 karolherbst[d]: okay.. so Faith's loop accounting thing does help.. but not sure if in the right direction 😄 now it pointed out a massive regression in a shader having like 5 nested loops and I'm like :blobcatnotlikethis:
09:24 karolherbst[d]: and it's like massive loops
09:24 karolherbst[d]: the entire shader is nuts
09:25 karolherbst[d]: like even the most inner loops are massive
10:04 karolherbst[d]: uhhh.. I think I might have found an annoying NAK bug. I looked into some of the nir_opt_offsets regressions and an input that is _obviously_ better ends up with a worse shader down the line 🙃
10:05 karolherbst[d]: like the only difference is really before I have a bunch of leas, and now I have ishl instead (and the constant offset folded into the IO op)
10:05 karolherbst[d]: and somehow the shader is bigger ..
10:05 karolherbst[d]: ohh...
10:07 karolherbst[d]: ahhhhhh
10:07 karolherbst[d]: it's RA
10:08 karolherbst[d]: so that opt, while able to remove 3 regs, also caused 1420 -> 1517 instructions
10:08 karolherbst[d]: if I fake +3 gprs, I get 1420 -> 1411 instead
10:09 karolherbst[d]: mhhhhhh
10:09 karolherbst[d]: maybe I should figure out a good way of dealing with this..
10:09 karolherbst[d]: could always do multiples of 4...
10:09 karolherbst[d]: that should mitigate this issue at least somewhat
10:14 karolherbst[d]: though it also feels like RA makes bad choices somewhere
10:33 karolherbst[d]: okay. I have a patch
10:55 karolherbst[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36514
11:04 karolherbst[d]: it's so much better with this..
12:35 karolherbst[d]: okay.. I think the loop aware cycle counter is overflowing 🙃
12:36 karolherbst[d]: I have a shader, the only change is (5 + 2 + 5 + 2) in a nested loop turns into (1 + 4 + 1 + 1) and 45856466 static cycle count turns into 1018934994 🙃
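The overflow is easy to reproduce in isolation. A minimal sketch (not NAK's actual code): with an exponential `10^depth` block weight, a `u32` total wraps as soon as the depth approaches 10, since `10^10 = 10_000_000_000 > u32::MAX`:

```rust
// Hypothetical sketch of the overflow: an exponential 10^depth block weight
// stops fitting in u32 at depth 10, since 10^10 > u32::MAX (4_294_967_295).
fn block_weight_u32(depth: u32) -> Option<u32> {
    // checked_pow returns None on overflow instead of silently wrapping
    10u32.checked_pow(depth)
}

fn main() {
    assert_eq!(block_weight_u32(9), Some(1_000_000_000)); // still fits
    assert_eq!(block_weight_u32(10), None);               // overflows u32
    // In u64 the same weight is fine up to depth 19:
    assert_eq!(10u64.checked_pow(19), Some(10_000_000_000_000_000_000));
    assert_eq!(10u64.checked_pow(20), None);
}
```

So with loop depths of 28 showing up (as below), even a `u64` total is not safe under a pure `10^depth` weighting.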
12:40 gfxstrand[d]: mohamexiety[d]: At the price they're asking, what's even the point of Thor? Some MMIO pins? You know it's going to be a less powerful CPU/GPU and have a jankier software stack. But at the same price?!? Yeah, doesn't make a lot of sense.
12:41 karolherbst[d]: maybe we should limit how deep of a loop we'll consider...
12:42 mohamexiety[d]: gfxstrand[d]: the GPU looks to be more powerful but details are sparse. the only numbers we have are that Spark does 1PF FP4 Sparse while Thor does 2PF FP4 Sparse. the CPU should be the same I think. the dealbreaker really is the software stack
12:43 mohamexiety[d]: what really hurts Thor for me is the PCIE downgrade tbh cuz you only get 8 lanes now compared to Orin
12:43 karolherbst[d]: yep yep.. it overflows
13:06 karolherbst[d]: okay, actual cycle count is 133272881051346
13:12 karolherbst[d]: something doesn't make sense...
13:12 karolherbst[d]: I see `loop_depth` returning values like 30
13:16 karolherbst[d]: do I want to debug a shader with 4000 blocks 🙃 ?
13:19 karolherbst[d]: anyway.. something is wrong with the loop detection
13:20 karolherbst[d]: mhhhhh
13:20 karolherbst[d]: ohhh
13:20 karolherbst[d]: sooo
13:20 karolherbst[d]: this shader has multiple nested loops after each other
13:20 karolherbst[d]: what if it's not just counting nested loops, but all previous loops as well
13:21 karolherbst[d]: so if you have: "loop { loop { loop { loop {} }}} ... loop { ... }" the last one gets a depth of 4 (or 5?)
13:23 karolherbst[d]: ohh that even checks out mathematically
13:24 karolherbst[d]: so now the shader stats are skewed towards the last loop
13:26 karolherbst[d]: okay, that shader has like 65 loops in total
13:27 karolherbst[d]: the biggest loop depth I'm seeing seems to be 27?
13:27 karolherbst[d]: or 28
13:30 karolherbst[d]: ohh okay
13:30 karolherbst[d]: it counts all previous outer loops + all loop headers the current block is inside
13:31 karolherbst[d]: that gets me perfectly to 28
13:31 karolherbst[d]: when counting
13:33 karolherbst[d]: https://gist.githubusercontent.com/karolherbst/ffb64690beb8ed987f4beba92c6263dd/raw/e9e0f4104e30de0fb775074d447061ed10b2bc50/gistfile1.txt
13:33 karolherbst[d]: that's how nak calculates the loop depth atm
13:35 karolherbst[d]: or something like that
13:35 karolherbst[d]: but it matches the numbers I'm seeing out of `cfg.loop_depth`
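The miscount described above can be modeled with a toy example (not NAK's real data structures): flatten the CFG to a stream of loop open/close tokens. The correct nesting depth is opens minus closes at the current point; the buggy version effectively counts every loop header seen so far and never decrements:

```rust
// Toy model of the loop_depth miscount. '{' opens a loop, '}' closes one.
fn depths(tokens: &str) -> Vec<(u32 /* correct */, u32 /* buggy */)> {
    let (mut correct, mut buggy) = (0u32, 0u32);
    let mut out = Vec::new();
    for t in tokens.chars() {
        match t {
            '{' => { correct += 1; buggy += 1; }
            '}' => { correct -= 1; } // the buggy count never decrements
            _ => {}
        }
        out.push((correct, buggy));
    }
    out
}

fn main() {
    // "loop { loop { loop { loop {} }}} ... loop { }" from the log:
    let d = depths("{{{{}}}}{}");
    // Inside the last loop the true depth is 1, but the buggy count is 5.
    assert_eq!(d[8], (1, 5));
}
```

With 65 loops in one shader, this cumulative counting is exactly how a block ends up with a "depth" of 28 while actually sitting only a few loops deep.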
13:51 karolherbst[d]: uhh.. how to fix that one...
13:52 gfxstrand[d]: karolherbst[d]: Do cycle counts need to be `u64` now?
13:52 gfxstrand[d]: 😭
13:52 karolherbst[d]: gfxstrand[d]: no, the issue is loop detection is busted
13:52 karolherbst[d]: loop_depth specifically
13:52 gfxstrand[d]: Uh... wat?
13:52 karolherbst[d]: karolherbst[d]: gfxstrand[d] ^^
13:52 karolherbst[d]: this is how it's counted 🙂
13:53 gfxstrand[d]: Uh...
13:53 gfxstrand[d]: Damn
13:53 karolherbst[d]: yeah...
13:53 gfxstrand[d]: Okay
13:53 karolherbst[d]: so I see a loop depth of 28 🙂
13:53 gfxstrand[d]: That's actually a bigger problem. That means loop detection is busted and that affects spilling
13:53 karolherbst[d]: yeah...
13:56 karolherbst[d]: I do have patches to convert the cycle counter to u64, but... I think fixing the loop depth part is more important ... I think in theory it could still overflow, but my other idea was to not do a 10 ^ n thing but something that kinda flattens.. maybe 10 ^ ((log n) + 1) or something
13:57 karolherbst[d]: so depth of 10 still gives 2007 as a factor
13:57 karolherbst[d]: or something
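The flattening idea can be sketched numerically. One assumption here: "log" is the natural log, which is what reproduces the "depth of 10 gives 2007" figure (with log2 the factor would be ~21000 instead):

```rust
// Sketch of the proposed flattened weighting:
//   weight(n) = 10 ^ (ln(n) + 1)   instead of   10 ^ n
// (assumption: natural log, inferred from the 2007 figure in the log)
fn flattened_weight(depth: u32) -> f64 {
    10f64.powf((depth as f64).ln() + 1.0)
}

fn main() {
    // depth 1: 10^(0 + 1) = 10, same as the old scheme
    assert!((flattened_weight(1) - 10.0).abs() < 1e-9);
    // depth 10: 10^(ln(10) + 1) ~= 2007, vs 10^10 = 10_000_000_000
    let w10 = flattened_weight(10);
    assert!(w10 > 2000.0 && w10 < 2015.0);
    // deep nesting no longer explodes: depth 28 stays around 2.1e4
    assert!(flattened_weight(28) < 1.0e5);
}
```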
13:59 karolherbst[d]: anyway, I hope fixing the loop_depth thing improves the shader stats 😄
14:39 karolherbst[d]: gfxstrand[d]: anyway, are you thinking about how to fix it, because I don't have the brains for CFG stuff like that
14:40 gfxstrand[d]: yes
14:40 gfxstrand[d]: I'm working on it right now
14:40 gfxstrand[d]: I don't like how reachability tests are going to blow up the algorithm runtime but here we are
14:41 karolherbst[d]: okay, cool
14:42 karolherbst[d]: if you want a great test, the shader with all those loops is part of resident_evil_village
14:43 karolherbst[d]: hash 00788d6db5f4fed6 if you have access to those fossils
16:44 karolherbst[d]: okay, nir_opt_offset is cool and all, but sadly by default it's only useful for load/store_shared. I'm wiring it up for ldc, but the benefits are.. questionable.. Maybe I should focus on folding the left shift/imul into shared first...
16:46 karolherbst[d]: yeah...
16:46 karolherbst[d]: that makes more sense
17:22 karolherbst[d]: gfxstrand[d]: any thoughts on how to model address space specific modifiers on OpSt and OpLd? Like the STS and LDS instructions have a multiplier on the base address (not the offset) of 1, 4, 8 or 16. So a (base << 3) + 0x1000 can be encoded as: `LDS s[base.X8 + 0x1000]`. I could just add it to the OpLd/OpSt instructions directly or have an enum for specializing address spaces, or make it part of
17:22 karolherbst[d]: `MemSpace`? Any opinions? Otherwise I'll just write something up
17:24 gfxstrand[d]: Don't make it part of the memory space
17:24 gfxstrand[d]: Just put a `u8` or an enum somewhere on the op.
17:25 karolherbst[d]: okay
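One possible shape for the enum-on-the-op approach, as a hedged sketch. The names (`AddrScale`, `fold_shift`) are made up for illustration; the hardware multipliers 1/4/8/16 and the `(base << 3) + 0x1000` → `LDS s[base.X8 + 0x1000]` example are from the log:

```rust
// Hypothetical scale modifier carried on OpLd/OpSt rather than on MemSpace.
#[derive(Copy, Clone, Debug, PartialEq)]
enum AddrScale {
    X1 = 1,
    X4 = 4,
    X8 = 8,
    X16 = 16,
}

impl AddrScale {
    /// Effective address: (base * scale) + offset, as in `LDS s[base.X8 + 0x1000]`.
    fn apply(self, base: u32, offset: u32) -> u32 {
        base * self as u32 + offset
    }
}

/// Try to fold a left shift on the base address into the scale modifier.
fn fold_shift(shift: u32) -> Option<AddrScale> {
    match shift {
        0 => Some(AddrScale::X1),
        2 => Some(AddrScale::X4),
        3 => Some(AddrScale::X8),
        4 => Some(AddrScale::X16),
        _ => None, // shift has to stay a separate instruction
    }
}

fn main() {
    // (base << 3) + 0x1000 becomes LDS s[base.X8 + 0x1000]:
    let scale = fold_shift(3).unwrap();
    assert_eq!(scale, AddrScale::X8);
    assert_eq!(scale.apply(0x20, 0x1000), 0x20 * 8 + 0x1000);
}
```

An enum like this also sidesteps encoding invalid multipliers, which a raw `u8` on the op would allow.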
18:00 mohamexiety[d]: gfxstrand[d]: modifier queries call into `nil_drm_format_mods_for_format()` to fill in the list of modifiers depending on format. I guess with compression this should have the compressed modifiers as well. question is, how does this interact with max_modifier_count? does it still remain as 7?
18:02 mohamexiety[d]: or is that side of things completely unrelated to comp?
18:28 gfxstrand[d]: Yeah, max modifier count will have to grow
18:28 gfxstrand[d]: Also, there's a bit of 😩 in there. I don't know how we advertise a different set of modifiers for dedicated allocations vs. not.
18:28 gfxstrand[d]: I don't think there's a good way to do that
18:30 mohamexiety[d]: There's quite a bit of stuff actually
18:30 mohamexiety[d]: Firstly, we don't even have enough info at the query stage to know whether we can compress or not
18:32 mohamexiety[d]: Since the only thing we really get is the format which, well, ok, they're all compressible. I guess that's easy, we will just always return the compressible modifiers in the list.
18:32 mohamexiety[d]: Secondly due to all the weirdness we have to do with memory types/dedicated allocs we don’t even know if the image will _actually_ be compressed until bind time
18:32 mohamexiety[d]: Since that’s when we know whether it’s a DEVICE_LOCAL image or not
18:32 mohamexiety[d]: That stuff really needs another look tbh since I don’t have other ideas
18:33 gfxstrand[d]: yup
18:33 gfxstrand[d]: It's all a mess. 😩
18:34 mohamexiety[d]: I am tired
18:34 karolherbst[d]: `st.shared.strong.cta.b128 [r63.x16] r12..16` mhh
18:35 karolherbst[d]: I kinda don't like the `.x16` thing 😄
18:35 mohamexiety[d]: I think I will wait for your review first before modifiers tbh
18:35 karolherbst[d]: `st.shared.strong.cta.b128 [r63.x16+0x2800] r24..28`
18:35 karolherbst[d]: mhh
18:35 mohamexiety[d]: mohamexiety[d]: Since this will hinge on how we clean up things anyways, might as well :thonk:
18:36 karolherbst[d]: nvidia prints it like this: `STS.128 [R10.X8+0x1400], R24 ;`
18:36 karolherbst[d]: (different byte values)
18:36 karolherbst[d]: I don't encode it yet
18:37 karolherbst[d]: I kinda like the `X` more than the `.x` 😄
18:37 karolherbst[d]: maybe I should do `r63*16` instead?
18:37 karolherbst[d]: maybe a shift?
18:37 karolherbst[d]: any preferences here?
18:42 cubanismo[d]: gfxstrand[d]: See my comment about subtly breaking dedicated allocations with last minute tweaks 😉
18:42 cubanismo[d]: But IIRC, you can still report dedicated required for a specific modifier after image creation.
18:42 gfxstrand[d]: Oh, good point!
18:42 mohamexiety[d]: We do report dedicated required for all modifiers anyways, no?
18:43 gfxstrand[d]: Yeah, we can totally say that if there's a compressed modifier that dedicated allocation is required
18:43 gfxstrand[d]: mohamexiety[d]: Maybe? I don't remember.
18:44 cubanismo[d]: We do currently on proprietary
18:44 mohamexiety[d]: We do on nvk too. https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/vulkan/nvk_image.c?ref_type=heads#L1242
18:44 cubanismo[d]: But it's essentially a bug that we have to. Lina and I were both pretty unhappy about that, since part of the point of modifiers is not needing dedicated allocations
18:45 cubanismo[d]: That "We don't know if it's device local" thing is mostly why the proprietary drivers don't report compressed modifiers from Vulkan at the moment.
18:46 mohamexiety[d]: I don’t mind disabling modifiers for compression tbh; I don’t think it even makes a difference there
18:46 mohamexiety[d]: Err
18:46 mohamexiety[d]: Other way around
18:46 mohamexiety[d]: Compression for modifiers*
18:46 cubanismo[d]: It makes a big difference if the producer and consumer can both access the compression
18:47 cubanismo[d]: E.g., rendering to a window from a Vulkan client and compositing using GL+EGL in a wayland compositor
18:47 cubanismo[d]: It makes ~zero difference for a full-screen game in a dedicated plane.
18:48 cubanismo[d]: Because the display hardware can't read compressed surfaces.
18:48 cubanismo[d]: Well I guess you don't use compressed modifiers in the latter case anyway
18:50 cubanismo[d]: But how much real-world difference depends on various factors. I did a talk at XDC a while ago that showed that in bandwidth-constrained situations you can get some huge speedup (>=50% IIRC) by using compression from client->compositor if you're compositing with the 3D engine, and simultaneously save ~10% by using uncompressed memory when you're going direct to display, because it avoids the
18:50 cubanismo[d]: decompression step needed during handoff
18:51 mohamexiety[d]: Oh wow ok that’s surprising :blobcatnotlikethis:
18:51 cubanismo[d]: Those were certainly cherry-picked numbers
18:52 cubanismo[d]: But the effects can be dramatic.
18:52 mohamexiety[d]: Yeah I see
18:53 cubanismo[d]: I also got to use that talk as an excuse to document the decompress-in-place method, back when that sort of thing was a lot harder to do here. It was narrowly approved in time for XDC.
18:53 cubanismo[d]: Good times.
18:54 mohamexiety[d]: Hehe nice!
18:54 mohamexiety[d]: I should look that talk up, it sounds really interesting
18:54 cubanismo[d]: Unfortunately, like most of my XDC talks, I promptly did nothing to follow up on the findings since.
18:54 mohamexiety[d]: Aww
18:54 cubanismo[d]: But the code is all out there somewhere if you want to reproduce the results.
18:55 cubanismo[d]: Probably just have to sync your mesa tree back like 5 years
18:55 mohamexiety[d]: :nervous:
18:55 karolherbst[d]: ~~backport NVK~~
18:56 karolherbst[d]: shoot.. my niche address calculation optimizations are still having almost no impact across the board 🙃
18:57 cubanismo[d]: The code wasn't production ready at all. It was to make the point that compression matters, and so does the ability to seamlessly transition into and out of it.
18:57 cubanismo[d]: I think Intel had demonstrated the first point a few years before.
18:57 karolherbst[d]: though at least for 0.6% of the shaders it's great
18:57 cubanismo[d]: So I was trying to show the next part.
19:02 mohamexiety[d]: Well, first I guess I have to figure out some loophole to get modifiers working with compression. But figuring this out will depend on how we figure out enabling/disabling compression, so I guess I should focus on _that_ first. Say, you wouldn't have advice for trying to handle that, would you :KEKW:
19:02 mohamexiety[d]: We can only enable it for things that are DEVICE_LOCAL and only that. Thing is, the Vulkan API is weirdly restrictive here and images with equal parameters can't have different memory types returned (so I can't, for example, have a memory type dedicated to certain usages that's TILING_OPTIMAL and also DEVICE_LOCAL only). So we worked around that by making it so that we only enable compression for
19:02 mohamexiety[d]: things that we know can be pinned in VRAM and are write heavy (color, z/s, and storage images), and that those need a dedicated allocation.
19:02 mohamexiety[d]: Issue is, what if an app wants to create those on host visible/system memory? Compression should be disabled, but by the point I know the app is doing that it's kinda too late, as we only find out at AllocateMemory/BindImage time
19:04 mohamexiety[d]: I noticed the prop driver has a lot of heaps/types which makes me think maybe there's a way to have a different memory type for it but I am not sure tbh
19:04 mohamexiety[d]: https://cdn.discordapp.com/attachments/1246430913588494346/1398388311516512327/image.png?ex=688dbfda&is=688c6e5a&hm=c601f8fdcb9b49361c91bb9383576dcf33a2e9c519bbc3c80ca7d57bfe6052d7& this thing caused so much headache
19:10 cubanismo[d]: I can't really get into how we solve things in the proprietary driver.
19:10 gfxstrand[d]: karolherbst[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36524
19:10 cubanismo[d]: Just the problems we haven't solved yet 😄
19:10 gfxstrand[d]: I'm CTSing now
19:10 cubanismo[d]: I'm relatively certain it's all solvable, just with varying degrees of complexity.
19:10 karolherbst[d]: gfxstrand[d]: sorry for ruining your day 🙃
19:11 gfxstrand[d]: Nah, you're good. I got to do compiler theory today. Any day I write 4 pages of notes trying to prove a graph property is a good day. 😄
19:11 karolherbst[d]: 😄
19:12 gfxstrand[d]: The number of days when I actually get to use my PhD is disappointingly small.
19:12 karolherbst[d]: but you still run into overflow issues with this? Well, I can fix that up, already got the patch somewhere
19:12 gfxstrand[d]: Yeah but now spilling isn't weirdly borked.
19:12 karolherbst[d]: good good
19:12 karolherbst[d]: let me try that patch
19:12 gfxstrand[d]: Well, one hopes
19:13 gfxstrand[d]: I'm CTSing now because while I think the spilling algorithm should be correct with the more conservative concept of loop containment, this shit is subtle and I don't want to break anything.
19:14 karolherbst[d]: yeah...
19:14 gfxstrand[d]: This one's gonna need mhenning[d] review
19:16 mohamexiety[d]: cubanismo[d]: It was worth a shot I guess :KEKW:
19:16 mohamexiety[d]: But yeah same. I am trying to think of the options these days
19:16 gfxstrand[d]: In other news... https://www.khronos.org/conformance/adopters/conformant-products#submission_935
19:16 chikuwad[d]: :PogDuck:
19:16 mohamexiety[d]: Now to enable it by default on Blackwell
19:17 mohamexiety[d]: Should be smooth to get it ported to .2 too
19:17 cubanismo[d]: congrats to you guys.
19:17 gfxstrand[d]: Yeah, that's going to wait until Monday because our marketing people don't like it when I blog on Fridays. :frog_upsidedown:
19:17 karolherbst[d]: `Static cycle count: 2965267781` 🥲
19:17 mohamexiety[d]: Hahahaa
19:18 cubanismo[d]: mohamexiety[d]: are you a Khronos member, or at a company that is rather?
19:18 mohamexiety[d]: cubanismo[d]: Yeah
19:18 gfxstrand[d]: And I want to at least get the blog out before we land the MR so there's a chance we'll get the blog linked in the Phoronix article. 😂
19:18 mohamexiety[d]: Mohamed Ahmed (Valve)
19:18 cubanismo[d]: Let me send you a relevant link then
19:19 gfxstrand[d]: cubanismo[d]: Do you want to be invited to the secret group chat?
19:19 mangodev[d]: weird question
19:19 mangodev[d]: is it a known issue that NVK explodes when seeing stereoscopic framebuffer types?
19:19 cubanismo[d]: Ah, yeah, then I can paste it to everyone there.
19:19 mohamexiety[d]: cubanismo[d]: Ooh thanks!
19:21 mangodev[d]: mangodev[d]: idk if it's with NVK in specific, or just zink
19:21 mangodev[d]: just asking here in case it's with how NVK handles zink's outputs
19:21 gfxstrand[d]: Not sure what a "stereoscopic framebuffer type" is.
19:21 mangodev[d]: a program called aseprite causes a hard MMU fault
19:21 mangodev[d]: so hard the kernel panics 🫠
19:21 gfxstrand[d]: But if NVK is exploding, that's probably bad.
19:22 gfxstrand[d]: Do you have an easy reproducer?
19:22 mangodev[d]: gfxstrand[d]: `GL_LEFT_BACK` seems to crash the driver hard
19:22 mangodev[d]: it's not exactly stereoscopic (as `LEFT` also applies to monoscopic rendering), but i don't think it's normally specified
19:23 mangodev[d]: gfxstrand[d]: yes
19:23 mangodev[d]: just wiggle your mouse around in aseprite's dropdowns until the kernel panics
19:23 mangodev[d]: i could try taking a video, but it wouldn't exactly do much because then it'd just freeze
19:24 chikuwad[d]: phone recording
19:24 chikuwad[d]: :wolfFIRE:
19:24 mangodev[d]: it used to more clearly point to `GL_BACK_LEFT` being the cause, though in more recent versions, it just says it's an MMU fault, all keypresses do nothing, and everything stops logging afterward
19:26 karolherbst[d]: gfxstrand[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/ba930a673716f551358de14850af9dd483e67aa8
19:27 karolherbst[d]: though I'll play around with my logarithmic scaling idea...
19:27 karolherbst[d]: like with that fix that one shader has `Static cycle count: 15850169669`
19:28 karolherbst[d]: and I don't like that one shader is like half the db 😄
19:30 mangodev[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1400923557416144988/image.png?ex=688e673b&is=688d15bb&hm=ecf2b9b4056f91d063141930c0fe0340a89a4b0c565adc9534ed8c6ffe9934a7&
19:30 mangodev[d]: gfxstrand[d]: it explodes a little hard :blobcatnotlikethis:
19:31 mangodev[d]: if the window's open for a few seconds, the whole system's locked up
19:31 mangodev[d]: if you manage to pkill it fast enough, new windows can't open properly
19:31 mangodev[d]: it murders it from the inside
19:31 mangodev[d]: i could try attaching a debugger, but idk what i'd get out of it
19:31 karolherbst[d]: karolherbst[d]: I think it's more like 99%
19:34 gfxstrand[d]: karolherbst[d]: Basically every time you do `u64::from()`, the thing you're converting should probably be a u64, too.
19:34 karolherbst[d]: yeah... but I don't want to convert the calc delay stuff to u64
19:35 karolherbst[d]: and it gets multiplied by like 100,000
19:35 karolherbst[d]: so it's not like it matters much
19:35 karolherbst[d]: but...
19:35 karolherbst[d]: could also make it all u64, just feels a bit pointless
19:37 gfxstrand[d]: karolherbst[d]: Why not? `cycle` can stay `u32` but we do `estimate_block_weight()` inside that function so we really should go u64 for the total.
19:39 gfxstrand[d]: Yeah...
19:39 karolherbst[d]: I mean.. that's what happens in that function, just that neither estimate_block_weight nor calc_delay are u64. I mean I can convert both to u64
19:39 gfxstrand[d]: both, please
19:39 karolherbst[d]: okay
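The agreed fix can be sketched like this (function names and the clamp are hypothetical, not NAK's actual signatures): per-block cycles stay `u32`, but the weight and the weighted total go `u64` so `weight * cycles` can't wrap:

```rust
// Sketch: accumulate the weighted static cycle count in u64.
fn estimate_block_weight(depth: u32) -> u64 {
    // arbitrary clamp for the sketch, so 10^depth itself can't overflow u64
    10u64.pow(depth.min(12))
}

fn static_cycle_count(blocks: &[(u32 /* cycles */, u32 /* loop depth */)]) -> u64 {
    blocks
        .iter()
        .map(|&(cycles, depth)| u64::from(cycles) * estimate_block_weight(depth))
        .sum()
}

fn main() {
    let total = static_cycle_count(&[(100, 5), (3, 0)]);
    assert_eq!(total, 100 * 10u64.pow(5) + 3);
    // A single moderately nested block already exceeds u32::MAX:
    assert!(static_cycle_count(&[(1000, 9)]) > u64::from(u32::MAX));
}
```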
19:42 gfxstrand[d]: karolherbst[d]: Ugh... Maybe 10x was a little much on the cycle counts? I upped it from 5 to use the Intel standard of 10.
19:42 gfxstrand[d]: It's all BS, of course so <a:shrug_anim:1096500513106841673>
19:42 karolherbst[d]: the factor isn't the problem
19:42 gfxstrand[d]: 😅
19:42 karolherbst[d]: it just shouldn't be x ^ something
19:42 karolherbst[d]: just do a log
19:42 karolherbst[d]: x ^ ((log n) + 1)
19:42 karolherbst[d]: or something
19:43 karolherbst[d]: at least that's what I want to play around with
19:43 gfxstrand[d]: gfxstrand[d]: CTS is happy. CTS with `NAK_DEBUG=spill` is less happy. We'll see why after this run finishes.
19:43 karolherbst[d]: I'm sure that gives better numbers
19:43 karolherbst[d]: anyway, I'll check how all of this affects shader stats and will just choose something which doesn't look horrible cross the board
19:43 gfxstrand[d]: Looks like the cmat tests don't like `NAK_DEBUG=spill`. :silvy_sweat:
19:44 gfxstrand[d]: That feels like it's probably a preexisting problem.
19:45 karolherbst[d]: mhhhh
19:45 karolherbst[d]: it does require a nop
19:45 karolherbst[d]: like because it waits for many cycles sometimes
19:45 karolherbst[d]: ehh wait
19:45 karolherbst[d]: spill, not serial 🙃
19:46 gfxstrand[d]: mangodev[d]: If it's a GL app, any trace you could get a GL apitrace of it on some other driver? That would be something I can reproduce
19:50 karolherbst[d]: should `calc_delays` even consider the block weight? It kinda feels like the static cycle counting for the stats should be its own thing
19:51 karolherbst[d]: well not that it matters much I guess
19:51 gfxstrand[d]: Yes. `calc_delays()` is what computes the one we actually report.
19:52 karolherbst[d]: I thought that's before `opt_instr_sched_postpass`?
19:52 karolherbst[d]: ohh it's after
19:52 gfxstrand[d]: Yup
19:53 karolherbst[d]: some shaders cycle count: `3101 -> 5085187775375+5085187772274+163985416713.12%` 🥲
19:53 karolherbst[d]: let me try my log idea
19:54 karolherbst[d]: it's funny how it impacts spilling
19:54 gfxstrand[d]: How what impacts spilling?
19:54 karolherbst[d]: ehh wait.. there are multiple commits in betweem nvm
20:07 karolherbst[d]: mhhhhh... I hate my idea.... not sure I'm a fan of introducing float maths just to do a 10^(log2(n) +1) just because I don't want a loop_depth of 4 and 7 to be treated equally...
20:07 karolherbst[d]: but the stats are nice tho
20:07 gfxstrand[d]: <a:shrug_anim:1096500513106841673>
20:07 gfxstrand[d]: This is userspace so meh
20:08 karolherbst[d]: `+99192.49%` for the biggest impact (ignoring loop_depth, vs 10^log2(depth) +1)
20:08 karolherbst[d]: mhh that's still the same count for a single shader as the entire db before that 😄
20:09 karolherbst[d]: the entire set of shaders: `Static cycle count: 217343251 -> 3875748736 (+1683.24%)`
20:09 gfxstrand[d]: Yeah, `10^(log2(n) + 1)` doesn't seem crazy
20:10 karolherbst[d]: which is way more reasonable than.. `Static cycle count: 217345431 -> 15298124189850 (+7038522.40%); split: -0.00%, +7038522.40%` 😄
20:10 karolherbst[d]: I think x16 across the entire set is acceptable
20:11 karolherbst[d]: well.. x18, but..
20:11 karolherbst[d]: good enough
20:18 karolherbst[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36528
20:25 gfxstrand[d]: Aw, crap... Our latencies are wrong. 😭
20:28 karolherbst[d]: 🥲
20:28 karolherbst[d]: which ones?
20:28 karolherbst[d]: also.. blackwell or ampere?
20:28 gfxstrand[d]: `dEQP-VK.compute.pipeline.cooperative_matrix.khr_a.subgroupscope.matrixmuladd_cross.sint8_sint32.buffer.colmajor.linear` fails with `NAK_DEBUG=spill` but only on Blackwell and it passes with `NAK_DEBUG=spill,serial`
20:29 gfxstrand[d]: Looks like it's only imma
20:30 karolherbst[d]: ahh right
20:31 karolherbst[d]: well the latencies we've gotten don't work on desktop GPUs
20:31 gfxstrand[d]: Looks like +2 on imma instead of +1 fixes it
20:31 karolherbst[d]: so it was Dave guessing
20:31 karolherbst[d]: mhhh
20:31 karolherbst[d]: what was the original value?
20:32 gfxstrand[d]: <a:shrug_anim:1096500513106841673>
20:32 gfxstrand[d]: It's in the tables somewhere
20:32 karolherbst[d]: mhh 20...
20:32 karolherbst[d]: I think I suggest 27
20:32 gfxstrand[d]: Looks like +4 fixes it
20:33 karolherbst[d]: mhhh
20:33 karolherbst[d]: yeah whatever works
20:35 gfxstrand[d]: Okay, I'm gonna run the CTS again with that
20:44 airlied[d]: karolherbst[d]: ggml_vulkan: 0 = NVIDIA GeForce RTX 5080 (NVK GB203) (NVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
20:44 airlied[d]: | model | size | params | backend | ngl | test | t/s |
20:44 airlied[d]: | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
20:44 airlied[d]: | qwen3 8B Q4_K - Medium | 4.78 GiB | 8.19 B | Vulkan | 99 | pp512 | 209.27 ± 0.06 |
20:44 airlied[d]: | qwen3 8B Q4_K - Medium | 4.78 GiB | 8.19 B | Vulkan | 99 | tg128 | 47.32 ± 1.28 |
20:44 airlied[d]: ggml_vulkan: 0 = NVIDIA GeForce RTX 5080 (NVK GB203) (NVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
20:44 airlied[d]: | model | size | params | backend | ngl | test | t/s |
20:44 airlied[d]: | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
20:44 airlied[d]: | qwen3 8B Q4_K - Medium | 4.78 GiB | 8.19 B | Vulkan | 99 | pp512 | 378.92 ± 0.05 |
20:44 airlied[d]: | qwen3 8B Q4_K - Medium | 4.78 GiB | 8.19 B | Vulkan | 99 | tg128 | 50.39 ± 1.43 |
20:44 airlied[d]: old main vs new main
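(Reading the two llama-bench tables: the new main roughly doubles prompt processing while token generation barely moves; the numbers below are taken directly from the output above.)

```python
# t/s figures from the llama-bench runs above (old main vs new main)
old_pp, new_pp = 209.27, 378.92  # pp512
old_tg, new_tg = 47.32, 50.39    # tg128
print(f"pp512: {new_pp / old_pp:.2f}x, tg128: {new_tg / old_tg:.2f}x")
# -> pp512: 1.81x, tg128: 1.06x
```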
20:45 airlied[d]: yeah I just did the increment thing until CTS passed, hadn't considered spill would stress it a bit more
20:45 karolherbst[d]: mhhh
20:45 mangodev[d]: airlied[d]: -# monospace? 🥺
20:45 karolherbst[d]: pp512 looks great, but tg128 doesn't 😄
20:46 karolherbst[d]: airlied[d]: that's on main, right?
20:46 airlied[d]: yup main of right now
20:46 karolherbst[d]: but yeah it has the membar opt, that should do a lot if you have membars
20:46 airlied[d]: the nvidia does pp 4871, tg 106
20:46 airlied[d]: with vulkan
20:46 karolherbst[d]: mhhh
20:46 karolherbst[d]: 4800... oof
20:47 karolherbst[d]: I can give you a branch if you want to spice it up 😄
20:47 karolherbst[d]: but I have to clean it up
20:47 karolherbst[d]: ohh
20:47 karolherbst[d]: this is blackwell, right?
20:47 airlied[d]: yes
20:47 karolherbst[d]: yeah.. we uhm..
20:47 karolherbst[d]: we'll have to wire up the new uniform tensor ops probably to be able to compete at all
20:49 airlied[d]: I can plug in the ada next week
20:49 karolherbst[d]: yeah that might be more fair 😄
20:49 karolherbst[d]: should get ldsm in before that
20:52 airlied[d]: ldsm + bra.u goes 390/50
20:52 airlied[d]: just had that branch here and rebased it
20:54 karolherbst[d]: nice
20:55 karolherbst[d]: I have a lot more alu ops in the pipeline
20:55 karolherbst[d]: gets me like 75% close to nvidia on vk_cooperative_matrix in some tests
20:55 karolherbst[d]: most of it is just making the inner loop really small
20:57 karolherbst[d]: airlied[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commits/nak/offsets?ref_type=heads that branch should help a bit
21:00 karolherbst[d]: but maybe I should run `ggml_vulkan` and see what there is to get there...
21:00 karolherbst[d]: though I still have like ~40 patches in the pipeline 😄
21:03 airlied[d]: rebased possibly badly onto ldsm, 391/51
21:04 airlied[d]: I do wonder at some point if we will run into command submission being slow
21:04 karolherbst[d]: I'm sure this is the problem I'm facing
21:04 karolherbst[d]: the shaders inner loop is almost perfect
21:04 karolherbst[d]: tho...
21:04 karolherbst[d]: aggressive loop unrolling still gives me a bit more perf 😄
21:05 karolherbst[d]: but it's just back to back ldsm + hmma + ldst and almost no address calc alu anymore
21:05 karolherbst[d]: so I don't think there is much to gain
21:05 airlied[d]: like the coop matrix perf benchmark should be easier to get closer on since I don't think it does as many submissions
21:05 karolherbst[d]: yeah...
21:05 karolherbst[d]: I'm like 75% close
21:06 karolherbst[d]: but your ggml_vulkan blackwell run probably needs a lot of the new instructions
21:06 karolherbst[d]: nvidia added tons of new stuff
21:06 karolherbst[d]: airlied[d]: but yeah.. that branch maybe removes one or two instructions 🙃 and makes RA less annoying
21:08 airlied[d]: I should probably have hacked up an nvidia coopmat-only run, since they also get coopmat2
21:09 karolherbst[d]: ohh yeah maybe
21:10 karolherbst[d]: but dunno if they use the new instructions under the hood anyway
21:10 karolherbst[d]: I don't know how useful they are, but they do look useful
21:11 karolherbst[d]: depending on the thing, `nir_opt_licm` might also help...
21:11 karolherbst[d]: but...
21:11 karolherbst[d]: I need reliable shader stats before I want to upstream using that
21:11 karolherbst[d]: it moves things out of loops, so depending on the shader it helps a bit
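(Loop-invariant code motion in a nutshell, in Python for illustration; the NIR pass does the same transformation on the IR.)

```python
# LICM: hoist computations whose inputs don't change across iterations,
# shrinking the loop body.
def before(xs, a, b):
    out = []
    for x in xs:
        k = a * b + 4  # invariant, recomputed every iteration
        out.append(x * k)
    return out

def after(xs, a, b):
    k = a * b + 4      # hoisted: computed once, before the loop
    return [x * k for x in xs]
```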
21:18 airlied[d]: I'd stay focused on closing the gap on the coop mat perf benchmark, I'm sure we'll get to wtf llama.cpp is doing in time 🙂
21:18 airlied[d]: I've gotten about 75% of coopmat2 flexible dimensions written for radv, but the lowering is pretty generic
21:19 karolherbst[d]: ahh, cool
21:20 karolherbst[d]: but yeah.. I'll just upstream whatever I have, because there is good general stuff also useful for other things
21:25 airlied[d]: indeed the generally useful stuff is probably more valuable than the matrix stuff (except when anyone asks 🙂)
21:31 gfxstrand[d]: lol
21:35 karolherbst[d]: yep...
21:35 karolherbst[d]: I mean, it's time I can spend on improving performance on NVK without anybody asking questions 🙃
21:41 mohamexiety[d]: we should abuse this more. AI runs on NV gpus, by supporting an open source driver for NV gpus you're investing in AI. now give me a billion dollars
21:43 HdkR: Thinking too small, Trillions are necessary spread out over multiple years :)
21:44 mohamexiety[d]: hmm, true. what's another couple of zeroes to add. trillions it is
21:52 cubanismo[d]: mohamexiety[d]: Is Jens Owen still founding companies? Someone should get him on this.
21:53 mangodev[d]: mohamexiety[d]: beating nvidia at their own game >:)
21:53 airlied[d]: mohamexiety[d]: "AI runs on NV gpus, by supporting an open source driver for NV gpus you're investing in AI" is what I said to RH, I got told Karol and I could work on it, instead of a billion dollars
21:53 airlied[d]: I should probably have sent it to zuck first
21:54 mohamexiety[d]: LMAO
21:54 mangodev[d]: was about to say "maybe that could get you a pay raise" but i guess not 😔
21:54 cubanismo[d]: I'm still here because I just want to run Doom on Linux.
21:55 HdkR: Doom on ARM on Linux? :thonk:
21:55 mohamexiety[d]: that's big dedication :kneel:
21:55 cubanismo[d]: I don't care as much about which processor architecture.
21:55 airlied[d]: I think I'm here because I wanted to get NWN to run on my radeon 9200
21:56 airlied[d]: not sure I've ever gone back to finish the game
21:56 HdkR: :D
21:57 cubanismo[d]: I had a graphics internship at my college, saved up to get an AMD64 system very early on after running my K6-2 into the ground, had to patch the kernel to support AMD64 systems with <2 CPUs (Off by one errors are hard) so my name was in the changelog, sent out a terrible cut-and-paste patch to have both 3D accel and XV accel on the mach64 chip on my terrible laptop, and leveraged that into a job
21:57 cubanismo[d]: writing drivers at NVIDIA.
22:00 airlied[d]: I think that radeon 9200 was the last GPU I bought myself
22:01 cubanismo[d]: Interestingly, everyone in the graphics research department ran Linux... except my mentor, who insisted we work on Windows.
22:05 HdkR: I think I originally got involved in Mesa, complaining about xmoto not running on my Asus EEE PC 7 with its GMA 900. Moving on to Pandaboard with PowerVR. Only have one real patch in mesa "implementing" GL_EXT_buffer_storage for ES 3.x. I'm just a chronic complainer turns out.
22:06 cubanismo[d]: Competent testing & feedback is an undervalued/under-recognized thing.
22:06 HdkR: Now I just complain loudly when ARM stuff breaks :P
22:09 HdkR: What do you mean you aren't testing on a ARM platform with broken PCIe fabric, with emulated x86 games, with a skewed CPU time domain?
22:11 HdkR: I like to think my project is doing a bit of torch carrying for the old Lightspeed SHIELD games ports.
22:16 mohamexiety[d]: going to get a Spark to complain about FEX things then >:3
22:17 HdkR: Perfect until proven otherwise :P
22:17 mohamexiety[d]: hahahaha. confident!
22:23 HdkR: It is definitely why I'm always interested in the Tegra support though.
22:25 mohamexiety[d]: yep
22:29 mohamexiety[d]: but yeah I am fresh here, started a bit over 2y ago. in uni I was originally looking at a career in HW design, ideally something related to GPUs. the issue is I am from a country where none of the IHVs operate, and at the same time I was a bit wary of betting on other HW careers in general, since a lot of people locally were looking at design roles (lots of companies like synopsys, siemens, etc have
22:29 mohamexiety[d]: subsidiaries/branches here and the field is kinda booming), so I felt my chances were bad. alongside that I was teaching myself software stuff and playing around with graphics.. someone (Bas) suggested I try GSoC for more experience and I found a listing for nvk (with mentorship from Faith -- YCbCr support for nvk), which felt like a golden opportunity: it was in drivers, which was
22:29 mohamexiety[d]: exciting, and also a new driver, and related to NV HW, which is probably the most mysterious/interesting HW aside from Apple's stuff. that led to getting hired to work on it afterwards, thankfully
22:30 HdkR: Woo! GSoC works
22:33 notthatclippy[d]: airlied[d]: Hah! Me too! That bloody game ruined my life several times over.
22:33 HdkR: It's amazing to me that NwN Enhanced edition is still getting content and patches.
22:34 mohamexiety[d]: yeah. I still want to play it some time :KEKW:
22:34 gfxstrand[d]: I started with some ill-fated attempt at hacking on Android. I stayed for the pain.
22:34 HdkR: "How soon until GL on Android?"
22:34 mohamexiety[d]: I got it a while back alongside other retro games. played planescape torment from that bundle and really loved it but didnt play the others yet
22:35 notthatclippy[d]: HdkR: that's actually me btw. And few other guys from NV.
22:35 gfxstrand[d]: HdkR: Unlike desktop Linux WSI, I've managed to leave Android behind, thankfully.
22:36 HdkR: notthatclippy[d]: Oh! You do the NwN stuff?
22:38 HdkR: gfxstrand[d]: I'm also thankful for being able to escape Android things :D
22:39 notthatclippy[d]: HdkR: Yes, they let us (few modders from the last 20 years) do basically whatever we want.
22:40 mohamexiety[d]: oh _wow_ that's really, really, really cool
22:40 HdkR: notthatclippy[d]: That's pretty awesome! So many QoL improvements. My SO is even thinking about replaying it soon.
22:42 notthatclippy[d]: The game is total crap. It's really a tech demo for how you can build your own little RPG (MMO or not) which is where all the really cool stuff is
22:43 HdkR: haha, I get it
22:58 notthatclippy[d]: One of the leads behind GSP retired a few years ago, so as a parting gift we made a nwn mod where he was inside the GSP chip and had to beat up bugs (like, beetles) that had their names pulled from our database, from all the bugs he worked on in the ~20 years. As you punch them they move through states like "get repro" and "verify fix" and such. And you have special abilities like "summon intern".
23:00 HdkR: :D
23:01 mohamexiety[d]: That’s awesome and creative :KEKW:
23:13 HdkR: https://cdn.discordapp.com/attachments/765304672579092511/1400975366029578240/Screenshot_2025-08-01_15-54-23.png?ex=688e977c&is=688d45fc&hm=c52f22bcd02f92d81640804459acdad001a89654baf9f126554ba07915e21543&
23:14 HdkR: Sadly the white whale eludes me. Lightspeed got that to 60 on SHIELD TV, but I don't have source access :D