00:00 TranquilIty[m]: <gfxstrand[d]> "x512: Yup. And passing damage..." <- I hate that so much
00:00 x512[m]: https://www.haiku-os.org/legacy-docs/bebook/BDirectWindow_Overview.html
00:01 karolherbst[d]: I wish frontbuffer rendering would finally die
00:15 i509vcb[d]: DRI1: compositor at home
00:16 gfxstrand[d]: TranquilIty[m]: So say we all
00:19 mangodev[d]: GPU ACCEL WORKS IN STEAM :D
00:19 mangodev[d]: that's a one-up over proprietary
00:20 gfxstrand[d]: woo
00:20 gfxstrand[d]: (Though I'm not sure what's working now and/or didn't work before.)
00:23 mhenning[d]: Didn't you say it was the PRIME case that was problematic?
00:23 mhenning[d]: so maybe mango isn't on prime
00:26 mangodev[d]: mhenning[d]: tbf i'm not
00:26 mangodev[d]: although it didn't work for me when i first tried nvk
00:27 mangodev[d]: and didn't work for a couple of months
00:27 mangodev[d]: i think the recent patches with zink fixed it for systems with one graphics device
00:32 gfxstrand[d]: Plausible
00:33 esdrastarsis[d]: mangodev[d]: Have you tested big picture?
00:45 airlied[d]: karolherbst[d]: ldsm seems to pass CTS on blackwell once I added the latency and rebased
02:07 mangodev[d]: esdrastarsis[d]: uhhhh good idea
02:07 mangodev[d]: ran about 1fps on proprietary, so i'm worried as to what it's gonna be like on nvk
02:07 gfxstrand[d]: What? Big picture?
02:10 matt_schwartz[d]: hardware acceleration is disabled by default on nv prop, even if the toggle says it's enabled at first
02:10 matt_schwartz[d]: in big picture mode
02:12 mangodev[d]: would explain why it ran so poorly
02:12 mangodev[d]: -# and why every gaming handheld uses AMD instead
02:12 mangodev[d]: *every real gaming handheld
02:43 gfxstrand[d]: Are you saying the Switch and Switch 2 aren't real? 😛
02:44 illwieckz[d]: Well, AMD did a strategic move with its Fusion project.
02:44 illwieckz[d]: Nvidia only exists in gaming,
02:44 illwieckz[d]: - on Windows PC with discrete cards,
02:44 illwieckz[d]: - on Switch because Nintendo needed a tablet hardware.
02:44 illwieckz[d]: AMD is just dominating the gaming market, we're just fooled by the Winvidia bias.
02:45 illwieckz[d]: Nvidia would not have existed in the console market since the OG xbox in 2001 if they had not made that strategic tablet move that got them the Nintendo switch.
02:46 gfxstrand[d]: PS3 was nvidia
02:46 illwieckz[d]: Ah yes, this one.
02:46 gfxstrand[d]: But that barely counts
02:47 illwieckz[d]: Actually Nvidia got the OG xbox market by luck because ATI did a bold bluff move to milk Microsoft and Microsoft did not fall for the trick.
02:49 gfxstrand[d]: The Switch was kinda good for both parties. Nintendo was looking for a gaming-capable tablet and NVIDIA had been building Tegra for years with no one really wanting it for anything. They kept having to make their own set-top boxes to prove that they could do Android.
02:49 airlied[d]: a lot of people swear by their shields alright
02:52 matt_schwartz[d]: mangodev[d]: Don’t bash my MSI Claw 8 AI+ A2VM (yes, that’s the full product name in the DMI) 😡
02:55 gfxstrand[d]: Nah. It's Intel-based. I feel I have a right to bash that. 😛
02:56 sonicadvance1[d]: airlied[d]: https://nypost.com/wp-content/uploads/sites/2/2025/02/nvidia-ceo-jensen-huang-holds-96350996.jpg The new NVIDIA SHIELD doesn't really fit in front of the TV.
02:56 gfxstrand[d]: sonicadvance1[d]: I want one of those shields...
02:57 sonicadvance1[d]: Oh no 😄
02:58 gfxstrand[d]: I don't think it'll run off a standard 20A circuit, though.
02:58 matt_schwartz[d]: looks like a necessary prop for xdc
02:59 gfxstrand[d]: I'm sure I can find a craft store near the venue which will sell me poster board, glue, and a scissors.
03:00 illwieckz[d]: I wonder how many Intel things exist in the gaming market today, is there something outside the MSI Claw?
03:01 gfxstrand[d]: I've got an OG Intel steam box kicking around my office. It never really went to market, though.
03:01 illwieckz[d]: Nice.
03:05 x512[m]: illwieckz[d]: > AMD is just dominating the gaming market, we're just fooled by the Winvidia bias.
03:05 x512[m]: This sounds very false.
03:09 x512[m]: Steam Deck sales are ~30 times lower than Nintendo Switch.
03:21 illwieckz[d]: Nvidia has a very good client, but only one client, which is very unsafe for a business.
03:21 HdkR: ML is a pretty good client.
03:21 illwieckz[d]: Not gaming.
03:21 illwieckz[d]: I said gaming market.
03:22 HdkR: eh, if their gaming market implodes today then they'll still be fine as a business :)
03:22 illwieckz[d]: More precisely, console market.
03:22 illwieckz[d]: Of course, they are very fine with their other sources of income.
03:22 illwieckz[d]: But on the console market they are an exception.
03:24 illwieckz[d]: If you were a maker and you wanted to build a console, Nvidia would not be the one you would consider at first.
03:29 airlied[d]: if you're making anything with an embedded GPU, you should only consider NVIDIA if all else fails 😛
03:34 HdkR: Ouya v2 incoming.
03:47 illwieckz[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1399599264295944284/20250729-053954-000.console-gpu-market.png?ex=688995e3&is=68884463&hm=1a477bc3ecb7e400c70b0f2f3df5a1c163d6fb1b11f4a849fd0f950a814ee69b&
03:47 illwieckz[d]: Some table I maintain on my end, hoping I haven't made obvious mistakes in it.
03:47 illwieckz[d]: Basically, if you're a console user, there is a high chance you buy an Nvidia system because of the Nintendo Switch, otherwise what you buy is an AMD system.
03:47 illwieckz[d]: But if you're a console maker, you are actually selling an AMD system, unless you are the only outsider.
03:47 illwieckz[d]: And that has been true for 15 years.
04:33 redsheep[d]: karolherbst[d]: Wait, I've been gone for months, you know how hot your gpu is on nouveau now?
04:49 gfxstrand[d]: redsheep[d]: Welcome back!
04:52 mangodev[d]: illwieckz[d]: …the sega saturn had integrated graphics?
06:10 x512[m]: illwieckz[d]: PC gaming is a big market.
08:51 airlied[d]: karolherbst[d]: the sm120 ldsm is the exact same as the sm80 one, if it compiles it will work
08:51 airlied[d]: Can't comment on gl as my phone don't Anubis so good
08:52 snowycoder[d]: redsheep[d]: Welcome back :3
08:52 karolherbst[d]: ahh, okay
08:55 mohamexiety[d]: airlied[d]: found the secret AI
09:03 airlied[d]: Just ask Gemini a question about mesa git main and watch the money pile burn!
09:04 mohamexiety[d]: there was someone that used to do just that with radv and even opened vibe coded MRs for "improving" Vega perf :KEKW:
09:04 mohamexiety[d]: (it was all crap ofc but it was funny seeing it unfold)
09:06 mohamexiety[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13285
09:06 mohamexiety[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13195#note_2917703
09:06 mohamexiety[d]: there's more but these were the ones I could find quickly
09:38 karolherbst[d]: okay, anybody else wants to review the ldsm patches?
09:56 phomes_[d]: I am probably not capable of doing review but I can run my usual game tests with the MR, but it fails to build
10:03 x512[m]: gfxstrand[d]: Is it possible to create Nouveau GPU buffer knowing its PCI bus address?
10:04 x512[m]: From userland with root permissions.
10:19 TheHypervisor[m]: <x512[m]> "gfxstrand[d]: Is it possible..." <- Wouldn't that just be a dumb buffer from libdrm?
10:20 airlied[d]: No, you need to setup pagetables to map stuff into pci bar
10:20 x512[m]: As long as it can be imported to NVK and copied into it, it is fine. It has linear layout.
10:21 kar1m0[d]: redsheep[d]: from what I remember showing gpu temp on nouveau isn't implemented yet
10:21 x512[m]: Render to some buffer, copy contents to another buffer, specified by PCI bus physical address.
10:21 kar1m0[d]: and I am still learning about gsp and how it interacts with the driver kernel
10:21 kar1m0[d]: but I focus more on gpu and vram usage in real time for now
10:22 kar1m0[d]: probably after that I will focus on wattage usage of the gpu in real time
10:24 x512[m]: kar1m0[d]: https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/157#discussioncomment-10381610
10:25 x512[m]: Memory mapped structure should be used to get GPU statistics. Calling GSP methods may cause global GPU lock and slow down things.
10:26 x512[m]: It seems that GSP is internally implemented single-threaded.
10:26 kar1m0[d]: I'll look into this
10:27 kar1m0[d]: I did see some gsp functions but the ones that I found return static values (like total vram capacity)
10:40 notthatclippy[d]: I think mohamexiety[d] is already looking into this for nouveau
10:41 mohamexiety[d]: kind of. got side tracked with other nvk stuff so not really much done on that front yet
10:41 mohamexiety[d]: kar1m0[d]: https://discord.com/channels/1033216351990456371/1223647774575431771/1384860194638659635 for this, check this message. it outlines basically everything you need to do
11:12 kar1m0[d]: mohamexiety[d]: alright, let me know if you do get back to it
11:13 kar1m0[d]: mohamexiety[d]: thanks!
11:20 karolherbst[d]: `TILE_M=128 TILE_N=256, TILE_K=64 BColMajor=1 workgroupSize=256 99.607880 TFlops` will I break the 100 TFLOPS mark?!?! 🙃
11:24 mohamexiety[d]: kar1m0[d]: also if you have any questions or so just ask here. I didn’t do much but I did look at things so could help if something looks odd or such
11:47 ristovski[d]: I wonder if nvidia's GSP internally supports userspace-provided "hints" similar to amdgpu's "performance levels"
11:49 ristovski[d]: actually, "power profiles"*, that has stuff like clock hysteresis which is what I meant
11:49 ristovski[d]: i.e. something like `POWER_SAVING` doesn't make the GPU clocks boost instantly
11:49 mohamexiety[d]: Should be this stuff https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/common/sdk/nvidia/inc/ctrl/ctrl2080/ctrl2080perf.h#L254-L289
11:50 ristovski[d]: Hmm, so basically a runtime version of `nvidia-smi -pl` in a way
11:51 ristovski[d]: seems rather.. coarse in comparison with the above amdgpu power profile example
11:56 kar1m0[d]: ristovski[d]: it doesn't work with my gpu 🫤
12:02 karolherbst[d]: I hate everything about this: https://gist.github.com/karolherbst/003525df9c4459cd1623fdd2f6431fad
12:04 karolherbst[d]: so there is a constant `0x60000` that could be moved into the load_global, but it's just very hard to prove in this case that it won't affect the address calculation if doing so
14:48 redsheep[d]: Ok so it wasn't just me, getting GPU temp still isn't implemented yet on nouveau. Karol, were you testing on nvrm, or do you have a branch somewhere to see the temps you mentioned?
14:53 notthatclippy[d]: He just infers temperature based on how long it takes the cat to settle onto the desktop.
14:54 notthatclippy[d]: (i.e. there may be external sensors available too)
14:54 notthatclippy[d]: In fact, the CPU temperature is correlated with GPU temperature
14:55 redsheep[d]: Yeah adding some thermocouples to your test GPU is always an option, just an annoying one if that GPU is also your daily
14:55 karolherbst[d]: the fans were spinning much faster and the air was much hotter
14:56 karolherbst[d]: totally scientific method of using my hand
14:56 karolherbst[d]: mhenning[d]: sooo.. I tried your "if uub says it's 0, let's make it 0" trick and some shaders just by doing that are getting bigger 🙃
14:57 karolherbst[d]: sometimes I hate compilers
15:03 karolherbst[d]: it's just RA being silly *sigh*
15:03 karolherbst[d]: we kinda have to land the instruction scheduler thing or whatever makes RA do less pointless movs 😄
15:04 karolherbst[d]: might also be something something vector something
15:10 karolherbst[d]: it was causing all the bad stats in my testing 🙃
15:12 mohamexiety[d]: Ok I cleaned things up finally in a sane way
15:12 mohamexiety[d]: Xonotic on this AD104 was a 2x perf gain :evil_gears:
15:12 karolherbst[d]: nice
15:12 karolherbst[d]: what did you do?
15:12 mohamexiety[d]: It’s very heavily membw starved tho so probably a bit of an outlier in terms of gains (160bit bus)
15:13 mohamexiety[d]: karolherbst[d]: Compression + huge pages
15:13 karolherbst[d]: ahhh
15:13 karolherbst[d]: how much does huge pages help?
15:13 mohamexiety[d]: Didn’t try without compression tbh but Ben did here https://discord.com/channels/1033216351990456371/1034184951790305330/1398129424242577408
15:14 karolherbst[d]: 😮
15:14 karolherbst[d]: I wonder if that also helps with coop matrices 😄
15:14 mohamexiety[d]: Hahaa
15:16 redsheep[d]: It might, if the lower latency from huge pages is impacting your perf. I'd doubt it but it's possible
15:19 karolherbst[d]: I'm sure setting up page tables and that stuff isn't free either
15:20 karolherbst[d]: anyway, what patches do I need to run lol 😄
15:20 phomes_[d]: my MR to fix the anti-lag layer on nvk in !36402 is being merged now. !36346 makes it easy to test in devenv if anyone wants to give it a try
15:20 phomes_[d]: it should help the most on slower systems
15:43 gfxstrand[d]: phomes_[d]: Thanks for finding/fixing that!
15:54 mhenning[d]: karolherbst[d]: Yeah, I still have some hacking to do on that one.
15:54 karolherbst[d]: mhenning[d]: well.. might want to wait until I'm done with my range analysis stuff, because big gains
15:54 karolherbst[d]: RELATIVE IMPROVEMENTS - CodeSize Before After Delta Percentage
15:54 karolherbst[d]: radv_fossils/rdr2/92f742e2ad7054e6/cs/0 40144 25648 -14496 -36.11%
15:54 karolherbst[d]: it's real
15:55 mhenning[d]: sure
15:55 karolherbst[d]: some shaders are doing silly shifts, it's impressive
15:55 karolherbst[d]: though most of the massive gains are just DCEed stuff
15:56 mhenning[d]: Yeah, the zero stuff is a small enough gain that it might not be worth the compile time on its own but if you've got a pass that's already doing the range analysis then we throw it in as an extra check
15:56 karolherbst[d]: yeah
15:56 karolherbst[d]: that's the idea
15:56 karolherbst[d]: we just need to deal with instruction scheduling, because it's becoming a real pain
15:58 karolherbst[d]: I also want to hook up nir_opt_offsets, but that's kinda a pain thing, because we already have manual handling for it and it kinda potentially conflicts in weird ways
15:58 karolherbst[d]: the range analysis based shift ops really help with address calculation specifically and extracting all those constant offsets
15:58 karolherbst[d]: *opts
16:02 mangodev[d]: karolherbst[d]: i was thinking that NVK was getting rapidly closer to the performance of proprietary as of late
16:02 mangodev[d]: didn't think the progress was *that* rapid
16:02 karolherbst[d]: 🙃
16:03 karolherbst[d]: it's just a silly shader
16:03 karolherbst[d]: overall it doesn't matter _that_ much
16:03 mangodev[d]: the kernel-side stuff must help a lot too
16:03 karolherbst[d]: yeah.. compression and huge pages will matter a lot
16:03 karolherbst[d]: especially compression
16:04 mangodev[d]: if zeta is getting plugged up, does that mean delta can be hooked up too? or is that a different process
16:04 mangodev[d]: compressing the depth buffer should help a ton with 3d games though
16:04 mangodev[d]: although i'd think DCC would help even more for deferred renderers
16:05 mangodev[d]: i've only been able to push forward renderers currently
16:05 mangodev[d]: which are few and far between
16:14 mohamexiety[d]: “Compression” in NV language is what’s called DCC on other vendors’ HW
16:14 chikuwad[d]: 👀
16:25 gfxstrand[d]: mohamexiety[d]: DCC, CCS, HiZ, there are lots of words
16:26 mohamexiety[d]: Oh yeah I forgot that Intel has a different name too :blobcatnotlikethis:
16:30 kar1m0[d]: forgot intel also makes gpus
16:30 karolherbst[d]: they aren't terrible anymore
16:31 karolherbst[d]: like you can actually play games with them now
16:47 karolherbst[d]: soo.. now how to integrate nir_opt_offsets so it's not potentially broken 🙃
17:02 gfxstrand[d]: kar1m0[d]: We would all like to
17:09 snowycoder[d]: Wait, didn't I open a Merge Request to fix Kepler scheduling issues? Is gitlab gaslighting me?
17:11 gfxstrand[d]: Gitlab might be gaslighting me
17:22 redsheep[d]: gfxstrand[d]: Insert haswell meme here
17:23 gfxstrand[d]: What we really want to forget is Sandy Bridge
17:24 snowycoder[d]: Why did you even build a bridge with just sand (/s)
17:24 gfxstrand[d]: Sandy Bridge was Intel's Fermi
17:25 gfxstrand[d]: I even have a laptop in my house that's a SNB+Fermi optimus setup for maximum garbage!
17:25 marysaka[d]: gfxstrand[d]: fun fact: Ryujinx was initially written targeting the windows opengl driver on sandy bridge
17:25 gfxstrand[d]: My condolences
17:25 marysaka[d]: yeaaah it was surely /something/
17:27 karolherbst[d]: Intel's ISA is.... weird
17:27 karolherbst[d]: like looking at any other, I know what's going on. With Intel? no idea
17:27 karolherbst[d]: like...
17:27 karolherbst[d]: not even remotely
17:27 karolherbst[d]: couldn't tell what the shader does in principle
17:28 karolherbst[d]: doing some quick maths on some buffer, or doing weirdo graphics awesome effect? looks all the same 😛
17:29 mohamexiety[d]: unless I screwed up the rebase, this should be it: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36450 gfxstrand[d] marysaka[d] chikuwad[d] redsheep[d] skeggsb9778[d]
17:29 mohamexiety[d]: kernel patches are linked in the desc too
17:30 chikuwad[d]: :shocked:
17:32 gfxstrand[d]: mohamexiety[d]: I'll push that to the top of my review stack tomorrow. I can't test today because a dumpster fire took out Collabora's internet. (I'm not joking. 😂)
17:32 mohamexiety[d]: LOL
17:32 marysaka[d]: :blobcatnotlikethis:
17:32 karolherbst[d]: nobody is gonna believe it
17:33 gfxstrand[d]: https://tenor.com/view/dumpsterfire-flooding-fail-gif-13031064
17:34 redsheep[d]: Oh jeez, fire taking down internet can take days or weeks to fix
17:34 gfxstrand[d]: Should be back in an hour according to building management
17:35 gfxstrand[d]: They're doing a nasty patch job to try and avoid re-running everything and get the internet back ASAP.
17:37 redsheep[d]: mohamexiety[d]: Good song you linked there. Are the remaining MMU faults possibly due to unknown hazards? Have you found anything that consistently faults?
17:40 mohamexiety[d]: it's completely random
17:40 mohamexiety[d]: the closest thing to a repro I could tell you is "resizing windows sometimes leads to this" but it doesn't always happen either
17:41 mohamexiety[d]: and to make matters more confusing I actually used to have this on stock 6.16-rc1
17:41 mohamexiety[d]: so I am not even sure it's related
17:43 mohamexiety[d]: I am also running a bit of a cursed setup so this could be adding to it (one monitor connected to GPU, other to iGPU. this is cuz the GPU doesn't have HDMI while the monitor needs it)
18:27 skeggsb9778[d]: mohamexiety[d]: does it pass CTS?
18:27 mohamexiety[d]: I actually didn’t try the clean branch with CTS yet :KEKW:
18:28 mohamexiety[d]: The hacky branch would run for a while and mmu fault though
18:28 skeggsb9778[d]: as for the faults, i was also worried about potential interactions with any merging of unmaps etc (ie. if they are of different page sizes)
18:28 mohamexiety[d]: mohamexiety[d]: But it was really annoying because the faults are random and they don’t happen on particular tests but after you run a bunch of tests together
18:29 mohamexiety[d]: Full run of dEQP-VK.binding_model.* would trigger a mmu fault in the very last test on the hacky branch
18:30 mohamexiety[d]: Downside is it takes around half an hour to get there
18:30 mohamexiety[d]: skeggsb9778[d]: Yeah I am not sure we handle this stuff
18:31 mohamexiety[d]: Idk if it’s helpful but the faults are always of fault type 0
18:31 mohamexiety[d]: So PTE not found
18:34 karolherbst[d]: anybody here having red dead redemption 2 installed/available and want to try out a mesa branch?
18:36 mohamexiety[d]: mohamexiety[d]: Type 2* sorry
18:36 mohamexiety[d]: Older log:
18:36 mohamexiety[d]: [ 2449.250588] nouveau 0000:07:00.0: gsp: mmu fault queued
18:36 mohamexiety[d]: [ 2449.420673] nouveau 0000:07:00.0: gsp: rc engn:00000001 chid:12 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003ffcf50000 fault_type:00000002
18:36 mohamexiety[d]: [ 2449.420690] nouveau 0000:07:00.0: fifo:c00000:000c:000c:[deqp-vk[14736]] errored - disabling channel
18:36 mohamexiety[d]: [ 2449.420701] nouveau 0000:07:00.0: deqp-vk[14736]: channel 12 killed!
18:36 mohamexiety[d]: [ 2449.478513] deqp-vk[14736]: segfault at 7fc80db47e80 ip 00007fc80db47e80 sp 00007ffca9d39588 error 14 likely on CPU 9 (core 16, socket 0)
18:36 mohamexiety[d]: [ 2449.478521] Code: Unable to access opcode bytes at 0x7fc80db47e56.
19:19 snowycoder[d]: mohamexiety[d]: If the tests are deterministic (i.e. you have a full list that can generate the faults every time you run it).
19:19 snowycoder[d]: I have a program that might help you.
19:19 snowycoder[d]: It does both a bisect search and a one-by-one search to minimize the reproduction list as much as possible
19:21 snowycoder[d]: It's still experimental but it does work, I've used it for Kepler scheduling issues, it reduced the reproduction list from 500 tests down to 10 tests.
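
The bisect-plus-one-by-one minimization described above can be sketched roughly as follows. This is not the code of the tool mentioned in the chat: `minimize`, `fails`, and the fake test names are hypothetical, and `fails` stands in for actually running the test list and checking for the fault. It assumes the failure is deterministic for a given ordered list.

```python
from typing import Callable, List, Sequence


def minimize(tests: Sequence[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `tests` (keeping order) while `fails(subset)` stays True."""
    current = list(tests)
    if not fails(current):
        raise ValueError("the full list does not reproduce the failure")

    # Phase 1: bisect-style search. Try dropping chunks, halving the chunk size.
    chunk = max(len(current) // 2, 1)
    while True:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if candidate and fails(candidate):
                current = candidate  # chunk was not needed, retry at the same index
            else:
                i += chunk           # chunk is needed, step over it
        if chunk == 1:
            break
        chunk //= 2

    # Phase 2: one-by-one removal, repeated until nothing more can be dropped.
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if candidate and fails(candidate):
                current = candidate
                changed = True
                break
    return current


# Hypothetical co-failure: the fault only shows up if both tests are in the run.
ALL_TESTS = [f"fake.group.test_{i}" for i in range(500)]


def fake_fails(subset: List[str]) -> bool:
    return "fake.group.test_37" in subset and "fake.group.test_412" in subset


print(minimize(ALL_TESTS, fake_fails))
```

With a real CTS run each `fails` call is expensive, so the chunked phase matters: it throws away large unrelated blocks before the slow one-by-one pass starts.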
20:03 mohamexiety[d]: Yeah kinda have a full list I think with the binding model list
20:14 karolherbst[d]: `Static cycle count: 222856984 -> 194889427 (-12.55%); split: -12.65%, +0.11%` I'm cooking again
20:25 karolherbst[d]: though I just stole work
20:27 snowycoder[d]: mohamexiety[d]: https://gitlab.freedesktop.org/SnowyCoder/deqp-cofailure-finder
20:27 snowycoder[d]: Here you go, it's experimental but it should work, point it at a file containing a failing testcase list and it should do the rest.
20:27 snowycoder[d]: It's not fast yet, and I can't work on it in the coming week since I'm on holidays, but tell me if it could help
20:27 mohamexiety[d]: Oooh thanks!
20:28 snowycoder[d]: Oh... I did kinda forget an absolute path in `main.rs` :3
20:28 snowycoder[d]: Edit it with your VK-GL-CTS clone path
20:29 snowycoder[d]: karolherbst[d]: OwO what's this?
20:29 mohamexiety[d]: Yup, no worries
20:29 mohamexiety[d]: Coop matrix optimizations
20:29 karolherbst[d]: nah
20:29 karolherbst[d]: those are Mel's instruction scheduling patches
20:29 karolherbst[d]: but
20:30 karolherbst[d]: with them my opts also look a lot better 😄
20:31 mohamexiety[d]: Ahhhh
20:31 karolherbst[d]: but I have patches which cut some shaders of some games by like 30%
20:31 karolherbst[d]: they.. just cause a bit of a scheduling disaster
20:31 mohamexiety[d]: It’s ok, correctness optional :evil_gears:
20:32 karolherbst[d]: I mean it's correct
20:32 karolherbst[d]: just uhm..
20:32 karolherbst[d]: might need more gprs for silly reasons
20:32 mohamexiety[d]: Oh
20:35 karolherbst[d]: well.. I also threw in opt_sink and opt_move 😄
20:35 karolherbst[d]: my range analysis still causes this: `Max warps/SM: 26792 -> 26620 (-0.64%); split: +0.03%, -0.67%`
20:36 karolherbst[d]: range analysis + nir_opt_offsets: https://gist.githubusercontent.com/karolherbst/011b58eafb1a855eb0943600639a9111/raw/fad6e84135095b828578a8c3abb19b2a0b783374/gistfile1.txt
20:37 karolherbst[d]: this looks a lot better, still...
20:37 karolherbst[d]: some shaders go from 21 to 40 gprs...
20:39 karolherbst[d]: let's run more coop matrix stuff putting everything together, lol
20:40 karolherbst[d]: ohh it's faster
20:44 karolherbst[d]: mhhhhh
20:45 karolherbst[d]: I think I'm CPU bound 🙃
20:48 karolherbst[d]: maybe not
20:51 karolherbst[d]: mhhhh
21:10 karolherbst[d]: get great performance with this one small little trick: `op.max_unroll_iterations = 1024;`
21:10 karolherbst[d]: ez: `TILE_M=256 TILE_N=128, TILE_K=64 BColMajor=1 workgroupSize=256 107.846009 TFlops`
21:11 karolherbst[d]: so it does give me like 7% perf increase, impressive
21:11 karolherbst[d]: the shaders are massive
21:11 karolherbst[d]: but who cares
21:12 karolherbst[d]: it's funny, I use less gprs
21:12 karolherbst[d]: like 12 of them less
21:12 karolherbst[d]: mhhhhh
21:12 karolherbst[d]: I wonder if I can hook this up with range analysis...
21:12 karolherbst[d]: like this shader does offset an address by 0x20 each iteration
21:13 karolherbst[d]: but range_analysis doesn't see it
21:14 karolherbst[d]: I kinda want this `div 32 %211 = iadd %202, %210 (0x40000)` to be folded into the load globals there...
21:14 karolherbst[d]: but: `%202 = iadd3 %197, %201, %38`
21:14 karolherbst[d]: so that add _might_ overflow
21:15 karolherbst[d]: `%197` is workgroup_id + the counter.. only low bits set really
21:16 karolherbst[d]: %201 is subgroupid >> 2 << 0xc, so kinda middle bits
21:17 karolherbst[d]: %38 is (lane_id & 0x3) << 3
21:17 karolherbst[d]: so _I_ can see that this `iadd3` won't overflow
21:18 karolherbst[d]: which means the 211 iadd won't overflow
21:18 karolherbst[d]: which means I can extract the constant
21:18 karolherbst[d]: it gets u2u64(c >> 3) << 4 + const buffer address
21:19 karolherbst[d]: I think I need to make range analysis be able to look through loops for that one 🙃
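
The "this `iadd3` won't overflow, so the constant can be extracted" argument above can be sketched with plain unsigned upper bounds. This is only an illustration, not Mesa's range analysis: the helper names are made up, the bound for `%197` is a hypothetical placeholder, and `subgroup_id <= 7` is an assumption (256 invocations with 32-wide subgroups).

```python
U32_MAX = (1 << 32) - 1


def uub_iand_imm(mask: int) -> int:
    """x & mask can never exceed mask."""
    return mask


def uub_ushr_imm(uub: int, n: int) -> int:
    """A logical right shift only makes the bound smaller."""
    return uub >> n


def uub_ishl_imm(uub: int, n: int) -> int:
    """Saturate: if the shift could drop bits, all we know is 'fits in 32 bits'."""
    return min(uub << n, U32_MAX)


def iadd_is_nuw(*uubs: int) -> bool:
    """An unsigned add cannot wrap if even the worst-case sum fits in 32 bits."""
    return sum(uubs) <= U32_MAX


def iadd32(a: int, b: int) -> int:
    """32-bit wrapping add."""
    return (a + b) & U32_MAX


# Assumed bounds, for illustration only.
UUB_197 = 0xFFFF                                   # workgroup_id + counter (hypothetical bound)
UUB_201 = uub_ishl_imm(uub_ushr_imm(7, 2), 0xC)    # (subgroup_id >> 2) << 0xc, subgroup_id <= 7 assumed
UUB_38 = uub_ishl_imm(uub_iand_imm(0x3), 3)        # (lane_id & 0x3) << 3

IADD3_NUW = iadd_is_nuw(UUB_197, UUB_201, UUB_38)  # the iadd3 producing %202
UUB_202 = UUB_197 + UUB_201 + UUB_38
IADD_211_NUW = iadd_is_nuw(UUB_202, 0x40000)       # the iadd producing %211

print(IADD3_NUW, IADD_211_NUW, hex(UUB_202))
```

When the add is known not to wrap, `iadd32(x, c)` equals the mathematical `x + c`, which is what makes it safe to pull `c` out past a later 64-bit extension. A value defined through a loop phi has no such static bound unless the analysis can reason about the loop.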
21:20 karolherbst[d]: or I search for more low hanging fruits
21:20 karolherbst[d]: though I kinda started around 850 instructions with that one, and now I'm down below 700...
21:20 karolherbst[d]: there aren't many low hanging fruits left 😄
21:21 mhenning[d]: A while ago I was working on a known bits analysis that might be able to prove some of those kinds of things
21:21 mhenning[d]: maybe I should go back and write the other half of it
21:22 karolherbst[d]: yeah.....
21:22 karolherbst[d]: I do some of it with uub, but it's limited
21:22 karolherbst[d]: like looking at the uub of an iadd, you can see if it overflows or not. Don't need known bits for it, though it could increase the precision a bit
21:23 karolherbst[d]: there is `nir_def_bits_used`
21:23 karolherbst[d]: but that only tells what bits are used
21:23 mhenning[d]: yeah, I'm working on something more general than that
21:24 karolherbst[d]: cool
21:25 karolherbst[d]: my range analysis looks like this atm: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/cf90b6fa161c1a2545d5cb8c69be33a4a91f9c79
21:25 karolherbst[d]: pretty basic
21:25 karolherbst[d]: but with nuw it allows for pretty cool opt_algebraic opts
21:26 karolherbst[d]: the two first expressions are enough to get rid of most address calcs in the coop matrix shaders
21:26 karolherbst[d]: and fold everything into ldsm
21:26 karolherbst[d]: e.g. https://gist.github.com/karolherbst/9e41eb70b3f5cf7f3644e595e846126f
21:26 karolherbst[d]: nir_opt_offsets moves the constants into `base`
21:27 karolherbst[d]: and I'm thinking about making nak rely on that one more
21:28 karolherbst[d]: you also see the load_global constants there...
21:28 mhenning[d]: Yeah, I haven't really looked at nir_opt_offsets in much detail
21:29 karolherbst[d]: it makes use of uub internally as well
21:29 karolherbst[d]: but very limited
21:29 karolherbst[d]: it does need some help to get things going like I want them to
21:30 karolherbst[d]: another idea I had was that `nir_opt_range_analysis` would simply tag alu instructions, like nuw, but with more tags
21:30 karolherbst[d]: and then write opt_algebraic patterns using those
21:30 karolherbst[d]: there are some unused bits in nir_alu_instr
21:31 karolherbst[d]: with your bit pattern matching, could have a tag like "sources impact bits in the dest independently"
21:31 karolherbst[d]: or have a tag for ushr meaning "won't discard bits"
21:31 karolherbst[d]: but not really doable with uub
21:36 mhenning[d]: ohh, wait ssa_def_bits_used is actually the reverse direction of what I'm working on
21:37 karolherbst[d]: yeah
21:37 karolherbst[d]: it's not _that_ useful.. I thought I wanted to use it, and then I noticed it's like.. not useful 😄
21:37 karolherbst[d]: though it might help to reduce bit sizes of instructions
21:37 karolherbst[d]: like 64 -> 32 if you know only the lower 32 bits are used
21:38 karolherbst[d]: which given the shader I'm looking at is 75% address calculation and 25% matrix ops doesn't help me 😄
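
The 64 -> 32 narrowing idea mentioned above rests on a simple property: the low 32 bits of an add or a left shift depend only on the low 32 bits of the inputs, while a right shift does not have that property. A minimal sketch, with made-up helper names and sample values, not tied to any actual NIR pass:

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def low32(x: int) -> int:
    return x & MASK32


def iadd64(a: int, b: int) -> int:
    return (a + b) & MASK64


def iadd32(a: int, b: int) -> int:
    return (a + b) & MASK32


def ishl64(a: int, n: int) -> int:
    return (a << n) & MASK64


def ishl32(a: int, n: int) -> int:
    return (a << n) & MASK32


SAMPLES = [0, 1, 0xFFFFFFFF, 0x1_0000_0000, 0xDEADBEEF_CAFEF00D, MASK64]


def add_narrows() -> bool:
    """If only the low 32 bits of a 64-bit add are used, a 32-bit add gives the same bits."""
    return all(low32(iadd64(a, b)) == iadd32(low32(a), low32(b))
               for a in SAMPLES for b in SAMPLES)


def shl_narrows() -> bool:
    """Same for left shifts by less than 32."""
    return all(low32(ishl64(a, n)) == ishl32(low32(a), n)
               for a in SAMPLES for n in range(32))


def ushr_narrows() -> bool:
    """Right shifts pull high bits down, so they cannot be narrowed this way."""
    return all(low32(a >> 4) == low32(a) >> 4 for a in SAMPLES)


print(add_narrows(), shl_narrows(), ushr_narrows())
```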
21:39 mhenning[d]: oh, it's really that much address calc? that's wild
21:39 karolherbst[d]: but it's full of shifts
21:39 karolherbst[d]: yeah......
21:39 karolherbst[d]: mhenning[d]: https://gist.githubusercontent.com/karolherbst/cc1e1f45b65a35c52bf980755b49df9c/raw/e98f9f8e919fb18ae192dccb01ebff510cb09b53/gistfile1.txt
21:39 karolherbst[d]: and that's _with_ all the opts already 😄
21:40 karolherbst[d]: like it loads, stores in shared, loads from shared, does MMA and stores it again
21:40 karolherbst[d]: and loops
21:40 karolherbst[d]: and then stores the result
21:40 karolherbst[d]: that's the entire thing 😄
21:40 karolherbst[d]: well it does some token fp32 math
21:41 mhenning[d]: yeah, that's a lot
21:41 karolherbst[d]: I'm considering adding a ld_global_nv
21:41 karolherbst[d]: just to have a BASE index on it
21:42 karolherbst[d]: anyway... range analysis helps a lot there tagging things as nuw and then getting rid of a lot of pointless shifts, because the shader is constantly shifting the same values left and right and left and right and..
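
Why a no-unsigned-wrap tag lets "shift left, then shift right" pairs disappear can be shown in a few lines. This is a sketch under assumptions: the function names are invented for illustration and the upper bound is passed in by hand rather than computed by an analysis.

```python
MASK32 = (1 << 32) - 1


def ishl32(x: int, n: int) -> int:
    """32-bit left shift, dropping bits shifted out of the top."""
    return (x << n) & MASK32


def ushr32(x: int, n: int) -> int:
    """32-bit logical right shift."""
    return x >> n


def ishl_is_nuw(uub_x: int, n: int) -> bool:
    """The shift loses no bits if even the largest possible x still fits after shifting."""
    return (uub_x << n) <= MASK32


def simplify_shl_then_ushr(x: int, uub_x: int, n: int) -> int:
    """Evaluate ushr(ishl(x, n), n), skipping both shifts when the ishl is nuw."""
    if ishl_is_nuw(uub_x, n):
        return x  # (x << n) >> n == x when no bits were shifted out
    return ushr32(ishl32(x, n), n)


# With a known bound of 0xfff, the pair folds away for every possible x.
print(all(simplify_shl_then_ushr(x, 0xFFF, 4) == ushr32(ishl32(x, 4), 4)
          for x in range(0x1000)))
```

Without a bound the fold is wrong: for `x = 0xF0000001` and a shift of 4, the top nibble is lost and the pair yields `1`, not `x`.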
21:43 karolherbst[d]: it was a disaster when the offsets weren't folded into cmat_load_shared_nv's base index
21:43 karolherbst[d]: so block 7 is perfect, I need the other blocks to also be perfect lol
21:44 karolherbst[d]: though not sure it all matters _that_ much...
21:44 karolherbst[d]: there are some scheduling problems inside the loop: waits on block trailing instructions
21:44 karolherbst[d]: the bra UP stuff (but I have a patch for that, and that helps)
21:45 karolherbst[d]: hey... wait a second...
21:45 karolherbst[d]: the second if is dead isn't it?
21:46 karolherbst[d]: ohh wait, it's not
21:47 karolherbst[d]: right.. the `i2i64` is also leaving the prmt [bbbb] around... wanted to do something actually correct about those
21:47 karolherbst[d]: but uub won't help with that, because it doesn't really handle signed stuff correctly...
21:48 mhenning[d]: oh, does it give up on signed stuff?
21:48 karolherbst[d]: nah.. it just implements them wrongly
21:48 karolherbst[d]: it only gives you an upper bound considering you operate on unsigned values
21:49 karolherbst[d]: the `imin` handling is a bit odd
21:49 karolherbst[d]: like.. it treats imin and imax like umax
21:49 karolherbst[d]: which.. is correct if you assume the result is an unsigned value
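
The point about `imin`/`imax` being bounded like `umax` can be checked exhaustively on a small bit width. This sketch uses 8-bit values and invented helper names; it illustrates the reasoning in the chat and has not been checked against the Mesa source.

```python
BITS = 8
SIGN = 1 << (BITS - 1)


def to_signed(x: int) -> int:
    """Reinterpret an unsigned BITS-wide value as two's complement."""
    return x - (1 << BITS) if x & SIGN else x


def imin(a: int, b: int) -> int:
    """Signed minimum, returned as the unsigned bit pattern."""
    return a if to_signed(a) <= to_signed(b) else b


def imax(a: int, b: int) -> int:
    """Signed maximum, returned as the unsigned bit pattern."""
    return a if to_signed(a) >= to_signed(b) else b


def bound_holds(op, bound, ua: int, ub: int) -> bool:
    """Check op(a, b) <= bound(ua, ub) for every a <= ua and b <= ub."""
    limit = bound(ua, ub)
    return all(op(a, b) <= limit for a in range(ua + 1) for b in range(ub + 1))


BOUNDS = [0x00, 0x01, 0x7F, 0x80, 0xFF]

# The result is always one of the inputs, so max(ua, ub) is a safe unsigned bound.
MAX_IS_SOUND = all(bound_holds(op, max, ua, ub)
                   for op in (imin, imax) for ua in BOUNDS for ub in BOUNDS)

# The tighter umin-style bound min(ua, ub) is NOT safe for imin.
MIN_IS_SOUND_FOR_IMIN = all(bound_holds(imin, min, ua, ub)
                            for ua in BOUNDS for ub in BOUNDS)

print(MAX_IS_SOUND, MIN_IS_SOUND_FOR_IMIN)
```

The counterexample for the tighter bound is `imin(0x01, 0xFF)`: `0xFF` is -1 when read as signed, so it wins, and as an unsigned value it is far above `min(0x01, 0xFF)`.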
21:50 karolherbst[d]: I planned to use it also to write nsw tags, but...
21:51 karolherbst[d]: I'd rather not open that can of worms yet
21:51 mhenning[d]: yeah that sounds odd
21:51 karolherbst[d]: I mean.. it says unsigned upper bound 😄
21:53 mhenning[d]: well, right but I thought that was just the abstract representation that's unsigned
21:53 mhenning[d]: since nir is untyped
21:53 karolherbst[d]: yeah.. I did so too at first
21:55 karolherbst[d]: anyway.. it's still useful, just for different things
22:03 anholt: karolherbst[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33926 is our wip for globals with a base.
22:04 karolherbst[d]: ohh nice
22:04 karolherbst[d]: we only have 24 bit offsets, but good enough
22:04 anholt: if your intrinsics have any implicit shifts of addresses away from bytes, then https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35092 may be interesting.
22:05 karolherbst[d]: it doesn't
22:05 karolherbst[d]: anholt: anyway.. we do have nir_opt_offsets, so I think you can just ignore the problem of lower_io already setting the base itself. With enough uub and range analysis you can make it optimize even more
22:06 anholt: I think that one may still help with nir_opt_offsets-ing because we pull constant parts out better. iirc. I need to page it back in and finish reviewing.
22:06 karolherbst[d]: I have patches
22:06 karolherbst[d]: but yeah...
22:06 karolherbst[d]: it's not just about pulling them out, but also dealing with shifts and wrapping to know what opts you can do
22:07 karolherbst[d]: I have some range_analysis based stuff that basically just reasons about shifts enough that it allows for a lot of constant folding to happen
22:07 anholt: cool. I bet we'll like that too.
22:11 karolherbst[d]: looked into it for coop matrix stuff.. turned https://gist.github.com/karolherbst/aaf21997562260677606ef30602fb95a#file-gistfile1-txt-L302-L448 into https://gist.github.com/karolherbst/cc1e1f45b65a35c52bf980755b49df9c#file-gistfile1-txt-L302-L374
22:12 karolherbst[d]: lea_nv is a + b >> c
22:13 karolherbst[d]: or b << c?
22:13 karolherbst[d]: I think it's a left shift
22:14 mhenning[d]: yeah, it's typically a left shift
22:15 karolherbst[d]: anyway, got rid of two more instructions by using ushr_imm instead of ishr_imm inside nak's cmat lowering 🙃
22:16 karolherbst[d]: once I'm done, address calculations won't be any perf concern anymore, lol
22:17 karolherbst[d]: mhhhhh
22:17 karolherbst[d]: what I hate about those global address calculations is, that they do the exact same offset calc over and over again
22:17 karolherbst[d]: but with a different base
22:17 karolherbst[d]: we just add the base first
22:20 karolherbst[d]: who is adding those i2i64 tho...
22:21 karolherbst[d]: like seriously..
22:25 karolherbst[d]: like vtn adds them, but I don't see them in the spirv?
22:28 esdrastarsis[d]: mohamexiety[d]: You should put rick roll instead of daft punk 🐸
22:30 karolherbst[d]: ohh it's array indexing 🙃
22:36 karolherbst[d]: mhhhhh
22:36 karolherbst[d]: is the i2i even correct there?
22:36 karolherbst[d]: like what if the index is an unsigned value
22:37 karolherbst[d]: ohh they are always treated as signed...
22:37 karolherbst[d]: *sigh*
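
What the signed-versus-unsigned index question above comes down to is how a 32-bit value is widened to 64 bits. A small sketch of the two conversions, written as plain Python helpers named after the NIR opcodes (the helper bodies are illustrative, not Mesa code):

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def i2i64(x: int) -> int:
    """Sign-extend a 32-bit value; the result is returned as a 64-bit bit pattern."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32
    return x & MASK64


def u2u64(x: int) -> int:
    """Zero-extend a 32-bit value."""
    return x & MASK32


def extensions_agree(uub: int) -> bool:
    """If the index provably stays below 2^31, both conversions give the same bits."""
    return uub <= 0x7FFFFFFF


print(hex(i2i64(0x80000000)), hex(u2u64(0x80000000)))
print(i2i64(0x7FFFFFFF) == u2u64(0x7FFFFFFF))
```

For an index with the top bit set the two results differ in the upper 32 bits, so the choice changes the final 64-bit address; for anything below 2^31 they are identical.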
22:39 karolherbst[d]: anyway enough for today
23:04 gfxstrand[d]: anholt: Looking at those offset patches, yeah, we need something kinda similar. Unfortunately, it's not just one offset. It's something like `<GPR> + <UGPR> + imm24` where the UGPR might be `u32`.
23:04 gfxstrand[d]: It's all pretty cursed