01:23 phomes_[d]: I am not having any luck with DLSS games tonight. First I bought the wrong Star Wars game. When I got the right one (Jedi Survivor) it won't launch at all. Then I tried Payday 3, but it won't let me choose DLSS in settings. Same happens with Cyberpunk, Arcadegeddon, and Predecessor
01:25 phomes_[d]: They don't attempt to call the new API or anything. Then I tried faking our driver version number and the driverID to match the prop driver. Still not showing DLSS in settings.
08:06 marysaka[d]: phomes_[d]: Are you sure the DLLs are copied into the system32 directory of the prefix? Maybe something to check just in case
12:05 karolherbst[d]: Mhh.. looking more at RA, I'm convinced that the picked registers only lead to a not-that-terrible allocation through mere chance. This is the situation I was running into yesterday, where being smarter leads to a worse outcome: "pick the first free register" is often a good choice because we don't schedule instructions pre-RA yet:
12:05 karolherbst[d]: https://gist.github.com/karolherbst/dd149022bf40bdd94513655a0c9a0405
13:08 phomes_[d]: marysaka[d]: Thanks. This prompted me to actually learn where the DLSS DLLs go and how to print the version of the DLL in use. The two games with issues don't use the same DLSS version. So probably just a coincidence that they are both UE4
13:11 phomes_[d]: I then grabbed the quite new DLSS DLL from Arc Raiders and copied it over the ones in Atomic Heart and Deep Rock Galactic. With that the issues are no longer there. So not our bug after all I guess 🙂
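A quick way to check which DLSS build a given game ships is to scan its install directory (or the wine prefix) for nvngx_dlss.dll. A rough, std-only Rust sketch, with a placeholder path:

```rust
// Rough sketch for spotting which DLSS DLL a game actually ships: walk the
// install/prefix directory and report every nvngx_dlss.dll with its size,
// so different builds are easy to tell apart. The root path is a placeholder.
use std::fs;
use std::path::{Path, PathBuf};

fn find_dlss_dlls(dir: &Path, hits: &mut Vec<(PathBuf, u64)>) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            find_dlss_dlls(&path, hits);
        } else if path
            .file_name()
            .and_then(|n| n.to_str())
            .is_some_and(|n| n.eq_ignore_ascii_case("nvngx_dlss.dll"))
        {
            let size = fs::metadata(&path).map(|m| m.len()).unwrap_or(0);
            hits.push((path, size));
        }
    }
}

fn main() {
    // Point this at the game install dir or the wine prefix.
    let root = Path::new("/path/to/game-or-prefix");
    let mut hits = Vec::new();
    find_dlss_dlls(root, &mut hits);
    for (path, size) in hits {
        println!("{} ({} bytes)", path.display(), size);
    }
}
```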
16:30 mhenning[d]: phomes_[d]: It might still be our bug. The thing I'm concerned about with that MR is that it's a little bit fragile and I'm worried that the approach might break between different versions of DLSS. So if those DLSS versions work on the proprietary driver I'd still guess that it's our bug.
17:40 karolherbst[d]: yo benchmarks: https://www.phoronix.com/review/nvidia-nvk-linux-618-mesa-26/2
17:40 karolherbst[d]: 25.2 vs git vs nvidia. Quite nice to see those improvements 🙂
17:46 karolherbst[d]: there are even benchmarks we are winning at 😄
17:46 karolherbst[d]: vkpeak fp16 vec4...
17:46 karolherbst[d]: but that's kinda what I've seen also on the compute side. It seems mesa is pretty good at vectorizing alu
17:48 x512[m]: What makes the most difference? Shader compiler? Command buffer generation?
17:49 karolherbst[d]: compiler most of the time
17:49 karolherbst[d]: depends on the benchmark tho
17:49 karolherbst[d]: the more micro it gets, the more the compiler matters
17:49 karolherbst[d]: with games, where a lot of things are going on, you can do a lot of cross-command optimizations and pipeline-reordering stuff
17:50 karolherbst[d]: and also optimize more against memory bandwidth
17:50 karolherbst[d]: but the raw number-crunching benchmarks depend almost entirely on the compiler
17:50 x512[m]: Could it be some problem with inefficient synchronization or subchannel switches?
17:51 phomes_[d]: it is a good comparison right before compression lands
17:51 karolherbst[d]: yeah, but we already landed some stuff there
17:51 karolherbst[d]: but that's only reducing stalls or idle times
17:53 karolherbst[d]: phomes_[d]: then we'll have more big jumps next time 😄
18:02 karolherbst[d]: oh man.. he tested before I landed the coop matrix vector patches 🙃
18:02 karolherbst[d]: that's like +10% on top
18:06 karolherbst[d]: I have to check out why some of the CL benchmarks are so slow... but probably some weird thing going on
18:10 phomes_[d]: That ProjectPhysX FP32 Compute benchmark is brutal 🙂
18:12 snowycoder[d]: phomes_[d]: Seems 70x slower, wow
18:12 karolherbst[d]: I'm sure it's something super silly
18:13 karolherbst[d]: but probably also easy to fix
18:13 karolherbst[d]: could also be const buffer stuff
18:19 mhenning[d]: Yeah, it's interesting that our performance varies so wildly from one benchmark to the next. Might point out areas that we can focus on.
18:19 karolherbst[d]: yeah... some of the CL stuff will also be rusticl internal things. I know of some stuff, but ENOTIME to take care of it 🙃
18:20 karolherbst[d]: for example, I want to use UBOs for constant* kernel arguments
18:20 karolherbst[d]: and atm it's a plain global buffer, which should map to BDA with zink
18:20 karolherbst[d]: so that's gonna hurt perf
18:21 mhenning[d]: yeah, that could make a big difference
18:21 karolherbst[d]: others wanted me to look into that as well, because even on hardware without native UBOs, it drops from 64-bit to 32-bit address calculations
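The address-calculation difference being referred to, as a conceptual Rust sketch (not actual generated shader code): with a plain global buffer the constant argument is a raw 64-bit device address, while a bound UBO only needs a 32-bit offset from the binding.

```rust
// Conceptual sketch of the address math a shader ends up doing for a
// constant* kernel argument; not actual NAK/zink output.

// Global-buffer (BDA) lowering: the argument is a raw 64-bit device address,
// so every access is a 64-bit multiply/add.
fn bda_address(base: u64, index: u32, elem_size: u32) -> u64 {
    base + (index as u64) * (elem_size as u64)
}

// UBO lowering: the base is implicit in the descriptor binding, so the shader
// only computes a 32-bit byte offset into the bound buffer.
fn ubo_offset(index: u32, elem_size: u32) -> u32 {
    index * elem_size
}

fn main() {
    assert_eq!(bda_address(0x7f00_0000_0000, 4, 16), 0x7f00_0000_0040);
    assert_eq!(ubo_offset(4, 16), 0x40);
}
```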
18:24 karolherbst[d]: given that it's Ada/Blackwell tested there, there are two things I'm aware of off the top of my head: for llama.cpp, wiring up STSM (shared memory matrix store) and 256-bit load/stores could help
18:25 karolherbst[d]: but not sure how much would actually benefit from 256-bit load/stores...
18:26 asdqueerfromeu[d]: karolherbst[d]: Hitman game benchmark results are quite painful though
18:27 karolherbst[d]: looks like a bug
18:27 karolherbst[d]: DIRT uses the same engine, no?
18:28 karolherbst[d]: anyway.. need to make compute even faster <a:ferrisBongo:498944916286603265>
18:30 asdqueerfromeu[d]: karolherbst[d]: So D3D12 games actually use that heavily though?
18:30 karolherbst[d]: well.. upscaling and frame gen is compute these days
18:31 karolherbst[d]: but they use compute for all sorts of stuff
18:31 karolherbst[d]: asdqueerfromeu[d]: some benchmarks: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37135#note_3082221
18:32 karolherbst[d]: anyway "making compute faster" also aligns with what I'm supposed to do for work 😄