00:00gfxstrand[d]: cubanismo[d]: I meant `INVALIDATE_SHADER_CACHES` with `FLUSH_DATA_TRUE`.
00:58cubanismo[d]: I think that probably works, but can't say for sure.
03:51mangodev[d]: i hope this doesn't turn out to be bad
03:51mangodev[d]: i hope this is just the product of an outdated git driver
03:51mangodev[d]: #
03:51mangodev[d]: # A fatal error has been detected by the Java Runtime Environment:
03:51mangodev[d]: #
03:51mangodev[d]: # SIGSEGV (0xb) at pc=0x00007f13cb73eb85, pid=11294, tid=11305
03:51mangodev[d]: #
03:51mangodev[d]: # JRE version: OpenJDK Runtime Environment (21.0.8+9) (build 21.0.8+9)
03:51mangodev[d]: # Java VM: OpenJDK 64-Bit Server VM (21.0.8+9, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64)
03:51mangodev[d]: # Problematic frame:
03:51mangodev[d]: # C [libgallium-25.3.0-devel.so+0x193eb85]
03:51mangodev[d]: #
03:51mangodev[d]: # Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %d %F" (or dumping to /home/mango/.local/share/PrismLauncher/instances/figua smp/minecraft/core.11294)
03:51mangodev[d]: #
03:51mangodev[d]: # An error report file with more information is saved as:
03:51mangodev[d]: # /home/mango/.local/share/PrismLauncher/instances/figua smp/minecraft/hs_err_pid11294.log
03:51mangodev[d]: [52.881s][warning][os] Loading hsdis library failed
03:51mangodev[d]: #
03:51mangodev[d]: # If you would like to submit a bug report, please visit:
03:51mangodev[d]: # https://bugreport.java.com/bugreport/crash.jsp
03:51mangodev[d]: # The crash happened outside the Java Virtual Machine in native code.
03:51mangodev[d]: # See problematic frame for where to report the bug.
03:51mangodev[d]: #
03:51mangodev[d]: Process crashed with exitcode 6.
03:53mangodev[d]: i'll continue in misc category since those aren't bridge channels
04:19gfxstrand[d]: cubanismo[d]: We actually fought with a similar thing when I was working at Intel where sometimes interrupts would arrive before other state updates. Except there it was written directly by fixed function hardware, not a programmable command streamer so there was no way to add barriers. Lots of code was written. It was sad.
04:20mangodev[d]: gfxstrand[d]: how many places *have* you worked? i'm honestly curious
04:21gfxstrand[d]: That are remotely relevant to what I do now? Just Intel and Collabora.
04:21mangodev[d]: i mean
04:21mangodev[d]: your (inanimate) blog goes back to 2007 iirc
04:21mangodev[d]: so that's already quite a lot of industry time
04:21mangodev[d]: gfxstrand[d]: ah
04:22mangodev[d]: mangodev[d]: although on that subject
04:22mangodev[d]: -# what happened to the devblog 🥺
04:22mangodev[d]: it seemed to be replaced with the collabora public blog, i guess, kinda, maybe
04:22gfxstrand[d]: I also worked for a small NASA contractor between undergrad and graduate school but besides being kinda cool, it's not really applicable.
04:23gfxstrand[d]: mangodev[d]: Yeah, most of my stuff goes on the Collabora blog. That was kinda the deal when I joined. A lot of why I get a paycheck is because I make Collabora look good.
04:23gfxstrand[d]: Blogging is part of that.
04:23gfxstrand[d]: And I need to do more of it
04:24mangodev[d]: fair point
04:24mangodev[d]: ✨brand image✨
04:24mangodev[d]: maybe if you talk more about AI, then maybe that'll get you a pay raise (jk)
04:43gfxstrand[d]: You joke...
04:45sonicadvance1[d]: NVK now accelerates AI workloads so your model can thrive in an open-source environment! It’s basically a free selling point right now.
06:14asdqueerfromeu[d]: gfxstrand[d]: Did you work on anything orthoimagery-related by chance?
06:56airlied[d]: sonicadvance1[d]: It's what is funding Karol's and my coopmat work
06:57sonicadvance1[d]: It's definitely the buzzword to use to get funding.
06:58sonicadvance1[d]: FEX-Emu, now with ML!...Wait no, maybe that's better left out.
09:21karolherbst[d]: sonicadvance1[d]: yeah.. though I _tried_ to argue "maybe we should make it work with dxvk" but apparently it's less interesting from a business perspective
09:21karolherbst[d]: but
09:21karolherbst[d]: I doubt it's much work
09:21karolherbst[d]: so if somebody wants to make FSR 4 work....
09:24karolherbst[d]: well at least "make it go fast" also helps with random other stuff if you focus on the right things
09:26karolherbst[d]: huh funny...
09:26karolherbst[d]: nvidia does indeed merge compute jobs together
09:31karolherbst[d]: but the numbers don't add up...
09:32karolherbst[d]: oh no
09:33karolherbst[d]: lol
09:33karolherbst[d]: I know what they are doing
09:33karolherbst[d]: sneaky
09:33karolherbst[d]: so our shader uses 104 registers, theirs... 16
09:34karolherbst[d]: and also just 1/5 of the shared memory
09:34karolherbst[d]: haven't looked at the shader _but_ I think they rework the shader in a way that it's smaller and they just run it with more threads
09:34karolherbst[d]: and because they use less registers, they might even get more parallelism out of it
09:36marysaka[d]: interesting :aki_thonk:
09:36karolherbst[d]: they might reuse command buffers
09:36karolherbst[d]: but we have 60 * 0x20 * 0x20
09:36karolherbst[d]: and they have 4 * 0x20000 * 0x1
09:37karolherbst[d]: it's a factor of `8.533333333333333` so not really sure how it all adds up
09:37karolherbst[d]: but I can see how launching only 4 dispatches can help and also how only using 16 registers improves things
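To make the arithmetic behind that factor explicit, here is a minimal sketch. It treats each triple purely as factors of a product; reading the first factor as a dispatch count is an assumption based on the "only 4 dispatches" remark, and the variable names are made up for illustration.

```python
# The two products quoted in the chat, kept as plain factors.
# Assumption: the first factor is the number of dispatches.
nvk_factors = (60, 0x20, 0x20)
nvidia_factors = (4, 0x20000, 0x1)


def product(factors):
    """Multiply all factors of a launch configuration together."""
    total = 1
    for factor in factors:
        total *= factor
    return total


nvk_total = product(nvk_factors)
nvidia_total = product(nvidia_factors)
ratio = nvidia_total / nvk_total

print(nvk_total, nvidia_total, ratio)
```

The ratio works out to 128/15, which is where the `8.533333333333333` comes from. It is not a clean power of two, which fits the "not really sure how it all adds up" comment.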
09:37karolherbst[d]: maybe they split the loop?
09:38karolherbst[d]: it has like 1000 iterations?
09:38karolherbst[d]: give or take
09:38karolherbst[d]: but not sure how that would reduce register pressure ...
09:39marysaka[d]: karolherbst[d]: do you see any REPLAY being used
09:39marysaka[d]: or track with filter
09:40karolherbst[d]: how would I be able to tell?
09:40marysaka[d]: let me find the actual method
09:41marysaka[d]: karolherbst[d]: SET_MME_SHADOW_RAM_CONTROL
09:41karolherbst[d]: nope
09:41marysaka[d]: hmmm
09:41marysaka[d]: well I guess it's not that useful on compute then
09:41marysaka[d]: still have to research those more they are funky
09:42karolherbst[d]: what's interesting is, that there is another push buffer with a single inline QMD
09:42karolherbst[d]: and one with 4
09:42karolherbst[d]: let me diff them just in case
09:43karolherbst[d]: but nvidia using `NVC7C0_LOAD_INLINE_QMD_DATA` is so nice, it lets me easily compare the QMDs
09:43marysaka[d]: if you feel like it, could you type something to auto extract them on nv_push?
09:44marysaka[d]: kind of want to move MME dumping to that too later on
09:44marysaka[d]: for shaders of course envyhooks will have to do the work tho so will still need to parse qmd anyway..
09:44karolherbst[d]: okay.. they are all identical except for cb0 and cb7
10:21karolherbst[d]: I can't get over them using almost no shared memory or registers
10:21karolherbst[d]: but that kinda explains the perf gap tbh
10:27mohamexiety[d]: the question is if this is a general opt they do
10:27mohamexiety[d]: or this is just a known shader replacement
10:36karolherbst[d]: mhhhh
10:36karolherbst[d]: you think if I mess with the spirv they might not do it?
10:37karolherbst[d]: it's from an open source benchmark... not sure...
10:37karolherbst[d]: shader got updated 2 months ago
10:37karolherbst[d]: I wouldn't be surprised if they have some specific pattern matching
10:40karolherbst[d]: but the shader is also not _that_ complex
10:40karolherbst[d]: I'm actually surprised we burn that many registers on it
11:05karolherbst[d]: mhhh
11:05karolherbst[d]: okay
11:05karolherbst[d]: so LAUNCH_DMA seems to make performance more consistent
11:05karolherbst[d]: but might need to do a bit more testing on it
14:05snowycoder[d]: Testing cross-block scheduling on shader-db.
14:05snowycoder[d]: Apparently it only minimally changes CodeSize, and nothing else?
14:05snowycoder[d]: Static cycle count remains the same 0_o
14:19gfxstrand[d]: karolherbst[d]: Uh... What?
14:20karolherbst[d]: they spawn more threads than us
14:20karolherbst[d]: I haven't checked out their shader yet, but I have ..... an idea what they are doing
14:20karolherbst[d]: they might just split the shader into independent threads
14:21karolherbst[d]: but I really should check out their shaders
14:21gfxstrand[d]: Yeah.
14:21gfxstrand[d]: Because 104 vs 16 seems impossible
14:21karolherbst[d]: but if you loop over something and it's like a vec4 or _something_ and they are all independent, you might just go and say: well.. let's just move it into 4 threads instead of 1
14:22gfxstrand[d]: NAK isn't perfect but it should use close to the theoretical minimum number of registers.
14:22karolherbst[d]: and the shader is just a big loop
14:22karolherbst[d]: over multiple matrices
14:22gfxstrand[d]: Right
14:22karolherbst[d]: but I know they are doing crazy CFG level optimizations
14:22karolherbst[d]: loop merging is another one
14:23gfxstrand[d]: Also, they might turn looping over matrices into looping over vec4s or something
14:23karolherbst[d]: I saw them merging 3 loops into one and using a state machine with predication
14:23karolherbst[d]: like nested loops
14:23gfxstrand[d]: Crazy
14:23karolherbst[d]: yeah...
14:23karolherbst[d]: so splitting loops into threads isn't even the most crazy idea
14:24karolherbst[d]: but loop merging makes sense because then you can just keep all threads converged, they just do different things
14:25karolherbst[d]: helps a lot with unbalanced loops, where some threads iterate less often than others
14:25karolherbst[d]: but might iterate more over outer loops
14:26karolherbst[d]: in case you ever wondered why there are soo many predicates
14:29karolherbst[d]: Anyway.. it does make sense actually.. keep in mind they also have CUDA and some devs use silly C code doing loops which should be GPU threads instead
14:29karolherbst[d]: I'm sure they do optimize those things
17:13karolherbst[d]: marysaka[d]: ohh right.. I couldn't compile those spirvs because of the dumper not supporting like spirv things or some dep or something
17:13karolherbst[d]: do we have another way of dumping shaders from the blob?
17:14marysaka[d]: karolherbst[d]: hijack pipeline cache binaries
17:14karolherbst[d]: how?
17:14karolherbst[d]: the dumper we have in tree doesn't seem to work anymore
17:14marysaka[d]: I think gfxstrand[d] had some old layer buut yeah
17:14marysaka[d]: I think I had something able to consume those just fine
17:14karolherbst[d]: mhhh
17:15marysaka[d]: tho they changed compression so that's annoying would need to dig into that again at some point
17:15karolherbst[d]: what's the issue with usami here tho? Like why does it need to parse the spirv anyway
17:15marysaka[d]: to generate layouts
17:16karolherbst[d]: ahh.. pain
17:17marysaka[d]: it would be easier to update the library anyway
17:17karolherbst[d]: heh
17:17karolherbst[d]: what do I need to do
17:19marysaka[d]: karolherbst[d]: In <https://github.com/gwihlidal/spirv-reflect-rs>, you need to
17:19marysaka[d]: - Update SPIRV-Reflect submodule
17:19marysaka[d]: - Build with generate_bindings feature to generate the bindings again
17:19marysaka[d]: - Fix whatever is broken
17:23karolherbst[d]: okay
17:23karolherbst[d]: how can I make usami use my local thing?
17:24marysaka[d]: karolherbst[d]: <https://github.com/marysaka/usami/blob/master/Cargo.toml#L30> replaced with `spirv-reflect = { path = "<whatever>" }`
17:25karolherbst[d]: `value is 5349` mhh
17:26karolherbst[d]: `PhysicalStorageBuffer`
17:29karolherbst[d]: it crashes 🥲
17:30karolherbst[d]: guess I need to wire up support for `PhysicalStorageBuffer` then...
17:31marysaka[d]: hmmm
17:33karolherbst[d]: `VALIDATION [VUID-VkShaderCreateInfoEXT-pCode-08740 (927469294)] : vkCreateShadersEXT(): pCreateInfos[0] SPIR-V Capability CooperativeMatrixKHR was declared, but one of the following requirements is required (VkPhysicalDeviceCooperativeMatrixFeaturesKHR::cooperativeMatrix).`
17:33karolherbst[d]: heh
17:33karolherbst[d]: what does it want from me
17:34karolherbst[d]: do I have to declare the extension somewhere?
17:37karolherbst[d]: ahhhh
17:38karolherbst[d]: I need to select an entry point
17:38karolherbst[d]: or uhm...
17:38karolherbst[d]: spec constant
17:39karolherbst[d]: ooof
17:40karolherbst[d]: ahhhh..
17:40karolherbst[d]: it has like 15 spec constants and who knows what they are all doing
17:41karolherbst[d]: maybe getting the layer working again is less work after all
17:43mhenning[d]: snowycoder[d]: Static cycle count is estimated during opt_instr_sched_postpass. it doesn't reflect any changes to calc_instr_deps
17:44mhenning[d]: we could possibly add another statistic that would show improvements to calc_instr_deps, but it's expected that you wouldn't see a stats change right now
17:46snowycoder[d]: Thanks, I was just reading that code
17:49snowycoder[d]: Let's see if pixmark_piano benefits anything
18:04snowycoder[d]: pixmark piano: 8996 -> 9114 (~1%)
18:04snowycoder[d]: furmark: 35418 -> 35594 (~2%)
18:04snowycoder[d]: Ain't much but it's honest work
18:15karolherbst[d]: nice
18:59karolherbst[d]: yeah... those matrix shaders can be split cleanly in 4
18:59karolherbst[d]: mhh 8 actually
19:00karolherbst[d]: mhh let's go with 4
19:01karolherbst[d]: marysaka[d]: I have a very very cursed idea on how to extract shaders.... Like.. if we see a QMD, nothing prevents us from inserting additional push buffers, does it?
19:02karolherbst[d]: like...
19:02karolherbst[d]: could allocate memory, copy the shader and map it?
19:02marysaka[d]: we could but that's quite a bit to handle in envyhooks
19:03karolherbst[d]: mhh
19:04karolherbst[d]: we also don't know the size of the shader, right..
20:53marysaka[d]: karolherbst[d]: one easy thing we can likely do: track down all DMA copies src/dst with size and match against the dst address when a shader is bound/qmd is sched
20:54marysaka[d]: that would give the proper size that was copied at least problem could be on reuse tho
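A rough sketch of that tracking idea, assuming the hook can observe every DMA copy's source, destination and size. The class and method names are hypothetical and are not part of envyhooks.

```python
class DmaCopyTracker:
    """Remember DMA copies so a shader address can be matched to a copy size."""

    def __init__(self):
        # Maps destination address -> (source address, size in bytes).
        # A later copy to the same destination overwrites the earlier entry,
        # which is exactly the reuse problem mentioned above.
        self.copies = {}

    def record_copy(self, src, dst, size):
        """Call this whenever a DMA copy is seen in the push buffer."""
        self.copies[dst] = (src, size)

    def find_copy_covering(self, addr):
        """Return (dst, size) of the recorded copy containing addr, else None."""
        for dst, (_src, size) in self.copies.items():
            if dst <= addr < dst + size:
                return dst, size
        return None


tracker = DmaCopyTracker()
tracker.record_copy(src=0x1000, dst=0x8000, size=0x400)

# When a QMD is scheduled, look up its shader address (made-up value here).
print(tracker.find_copy_covering(0x8000))
```

With only the latest copy kept per destination, a buffer that gets reused for a different shader would report the newer size, so a real implementation would probably need to snapshot at bind time.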
21:01karolherbst[d]: ohhhhh
21:01karolherbst[d]: let me check something
21:05karolherbst[d]: why is nvidia copying the shader around?
21:06karolherbst[d]: heh....
21:21karolherbst[d]: hahahahhaa
21:23karolherbst[d]: I need a better hexdump tool tho
21:23airlied[d]: I've been extracting shaders with nvcachedec tool
21:24karolherbst[d]: airlied[d]: how does that one work tho?
21:24airlied[d]: https://github.com/therontarigo/nvcachetools
21:25airlied[d]: Set the GL cache path, run the app, then use that to pull the objs out, then nvdisasm the objs
21:25karolherbst[d]: I created a core file with gdb and found the shader, but oh boi is hexdump a pain to use
21:25airlied[d]: There is an env var to set cache path
21:27karolherbst[d]: that env var isn't doing anything
21:27karolherbst[d]: or well if the vk driver uses a different one I don't know that one
21:54karolherbst[d]: https://gist.githubusercontent.com/karolherbst/8388d9a9915b8872229208101be72194/raw/89b7d10aa519bca34ac9ae3aaa6f9f3364afcfee/gistfile1.txt
21:54karolherbst[d]: does this look like a valid shader?
21:56karolherbst[d]: hey. kinda looks like mine
21:56karolherbst[d]: this is prolly the funniest way to extract shaders
21:59karolherbst[d]: full thing: https://gist.githubusercontent.com/karolherbst/c803c9e20cf645fb5673d92615dbbb8e/raw/9a28be54da4d2a8bd0024f9109edb21b6e81c716/gistfile1.txt
21:59karolherbst[d]: sooo
21:59karolherbst[d]: open gdb
21:59karolherbst[d]: break on vkCreateComputePipelines
21:59karolherbst[d]: step over it
21:59karolherbst[d]: generate-core-file
21:59karolherbst[d]: search for some well known instructions
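For the last step, a small script can stand in for hexdump. This is only a sketch: `find_pattern` is a made-up helper, and the byte pattern to search for (the encoding of some instruction you expect in the shader) has to come from your own disassembly, it is not given in the chat.

```python
import mmap


def find_pattern(path, pattern):
    """Return every file offset at which the byte string `pattern` occurs."""
    offsets = []
    with open(path, "rb") as f:
        # Map the file read-only so large core files are not loaded eagerly.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mem:
            pos = mem.find(pattern)
            while pos != -1:
                offsets.append(pos)
                # Continue one byte later so overlapping hits are found too.
                pos = mem.find(pattern, pos + 1)
    return offsets
```

Calling it with the core file written by `generate-core-file` and a known instruction encoding gives candidate offsets to dump and disassemble around.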
22:00karolherbst[d]: their address math is less of a disaster, I give them that
22:01karolherbst[d]: okay
22:01karolherbst[d]: sooo
22:02karolherbst[d]: there are a few interesting bits
22:02karolherbst[d]: they pull the same cb over and over again
22:03karolherbst[d]: mhhh
22:03karolherbst[d]: dunno why their QMD only has 16 registers tho...
22:06karolherbst[d]: that's my shader in comparison so far: https://gist.githubusercontent.com/karolherbst/0cdc7be967cc1301e4609846686f66ff/raw/08516180a658016aaa65f85097d34be7ba303c55/gistfile1.txt
22:09karolherbst[d]: but yeah.. our address calculation is a disaster in comparison
22:12karolherbst[d]: karolherbst[d]: gfxstrand[d] here you can see how nvidia messes around with CFG and interleaves blocks through predication with... well.. normal blocks. The `@!P0` stuff is usually all before the LDSM/HMMA bits, but they merge those blocks and do funky stuff
22:17karolherbst[d]: I want those GPR+UGPR load/stores
22:18karolherbst[d]: it's really annoying, because we need to cast a 32 bit offset to 64 bit, and then do a 64 bit addition, meanwhile it can be done inside the load/stores
22:20karolherbst[d]: what's interesting is that nvidia prefers to use IMAD over IADD3
22:20karolherbst[d]: for a 2 operand add
22:21karolherbst[d]: and also IMAD over shifts
22:24karolherbst[d]: but yeah... turning conditional branches into predicates and then do really aggressive scheduling seems one of the biggest differences here
22:25karolherbst[d]: anyway.. next weeks project: UGPR+GPR load/stores
22:26karolherbst[d]: the address math after the loop is kinda not very important, but might as well figure it out, because it's way easier than the predication stuff
22:28karolherbst[d]: kinda surprised they don't do the `.X16` on the `STS` even though they clearly could
22:29karolherbst[d]: maybe because it's used in two STS and they never bothered to check if all uses would get eliminated or something
22:35karolherbst[d]: I'm still confused about the low register count in the QMD...
22:36karolherbst[d]: maybe they shader variant the thing...
22:46karolherbst[d]: okay there are indeed multiple shaders, mhhh
22:49karolherbst[d]: mhh it's always the same
22:50karolherbst[d]: ` mthd 0370 NVC7C0_LOAD_INLINE_QMD_DATA(20)
22:50karolherbst[d]: .V = (0x21001)`
22:51karolherbst[d]: 20 * 32 +8 == 648: `NVC7C0_QMDV03_00_REGISTER_COUNT_V MW(656:648)`
22:51karolherbst[d]: and it's clearly `0x10` there...
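The bit math above can be checked with a few lines. This is a sketch that assumes the inline QMD is addressed as 32-bit dwords indexed by the number in `LOAD_INLINE_QMD_DATA(n)`, and that `MW(656:648)` means bits 648 through 656 inclusive; `qmd_field` is a made-up helper name.

```python
def qmd_field(dwords, hi, lo):
    """Extract bits lo..hi (inclusive) from a QMD given as {dword_index: value}."""
    packed = 0
    for index, value in dwords.items():
        # Place each 32-bit dword at its bit position in the whole QMD.
        packed |= (value & 0xFFFFFFFF) << (32 * index)
    width = hi - lo + 1
    return (packed >> lo) & ((1 << width) - 1)


# Dword 20 as seen in the push buffer dump.
qmd = {20: 0x21001}

# NVC7C0_QMDV03_00_REGISTER_COUNT_V MW(656:648)
register_count = qmd_field(qmd, 656, 648)
print(hex(register_count))
```

For `0x21001` this yields `0x10`, matching the 16 registers read off the QMD by hand.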
22:52karolherbst[d]: I wonder if the occupancy settings override this somehow?
22:52karolherbst[d]: ` mthd 0400 NVC7C0_LOAD_INLINE_QMD_DATA(56)
22:52karolherbst[d]: .V = (0xff40ff10)`
22:53karolherbst[d]: threshold_register == 0x40 here
22:53karolherbst[d]: that they still use more in the shader
22:56karolherbst[d]: I wonder if there is some magic going on here where you can launch things with fewer registers and once it needs to use more, the hw checks if resources are available and if not, just pauses the shader until they are?
22:57karolherbst[d]: I can see it mattering if the second instruction wouldn't be `/*0010*/ S2R R68, SR_TID.X ;`
22:59karolherbst[d]: anyway...
22:59karolherbst[d]: there is also nice .reuse stuff
23:33snowycoder[d]: karolherbst[d]: Wasn't .reuse removed from recent graphic cards?