00:14fdobridge_: <gfxstrand> Basically, yeah.
01:48fdobridge_: <gfxstrand> @airlied It's still failing.
01:48fdobridge_: <gfxstrand> @airlied I think what I'm hitting are random test timeouts. Is there any way we can adjust the timeouts for GSP?
01:49fdobridge_: <gfxstrand> Though if memory allocation is failing, maybe it's something else?
01:49fdobridge_: <gfxstrand> I'm tempted to attempt a run without GSP and see what happens.
01:53fdobridge_: <airlied> You get the backtraces even with that patch?
02:47fdobridge_: <airlied> not sure if there is anything to configure in that area
02:53fdobridge_: <airlied> openrm src/nvidia/src/kernel/gpu/rc/kernel_rc_watchdog.c suggests there might be something more we can do, but it seems like a lot of work to figure out
03:46fdobridge_: <gfxstrand> No backtrace. Just a timeout
03:47fdobridge_: <gfxstrand> Either something is legitimately getting stuck or we're just timing out
03:47fdobridge_: <gfxstrand> No channel errors. Just a timeout
03:47fdobridge_: <gfxstrand> I'm going to reboot without GSP and see what happens
04:05fdobridge_: <gfxstrand> How do I force disable GSP on Ampere?
04:12fdobridge_: <gfxstrand> Or does Ampere require GSP?
04:12fdobridge_: <gfxstrand> I guess I can plug in my Turing
04:14fdobridge_: <airlied> ampere doesn't require it, but lots of ampere doesn't work without it
04:15fdobridge_: <airlied> just don't pass the config option to disable it
04:15fdobridge_: <airlied> but I think you had one that didn't work pre-gsp
04:24fdobridge_: <gfxstrand> I plugged in my Turing
04:28fdobridge_: <gfxstrand> Okay, running sans GSP now.
04:28fdobridge_: <gfxstrand> We'll see if this fares any better.
04:28fdobridge_: <gfxstrand> I did my original CTS runs without GSP and those worked okay
04:31fdobridge_: <airlied> would be good to see if turing gsp has same problem or not
04:32fdobridge_: <gfxstrand> Well reproducing the problem is a lot faster than getting good results. 😅
04:34fdobridge_: <airlied> in the fail list I posted earlier does anything look like the same problem?
04:44fdobridge_: <Sid> hang on I have some info for this
04:45fdobridge_: <gfxstrand> Nope
04:45fdobridge_: <Sid> this is the exact function that's throwing the error, verified by placing a _debug_printf before the return https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/vulkan/nvk_queue_drm_nouveau.c#L241-#L274
04:45fdobridge_: <Sid> that's where we run into device lost
04:46fdobridge_: <Sid> basically there's something we do with drmCommandWriteRead that non-GSP is ok with but GSP does not like
04:47fdobridge_: <gfxstrand> Unfortunately, that's not very helpful. That's usually where device lost happens, regardless of the actual bug.
04:47fdobridge_: <Sid> ah
04:51fdobridge_: <Sid> are we trying to run the full test suite? I've got my turing system w/ gsp ready atm
04:53fdobridge_: <gfxstrand> Damn!
04:53fdobridge_: <gfxstrand> `fifo: fault 00 [VIRT_READ] at 0000003ffff65000 engine 20 [HOST0] client 06 [HUB/HOST] reason 02 [PTE] on channel 2 [02ffe45000 deqp-vk[3919]]`
04:54fdobridge_: <gfxstrand> Of course it doesn't fault when run individually. 😭
04:54fdobridge_: <Sid> that's without gsp?
04:55fdobridge_: <gfxstrand> Yeah
04:55fdobridge_: <Sid> and yeah, running into issues on the full test with GSP too
04:55fdobridge_: <Sid> ```
04:55fdobridge_: <Sid> [Thu Jan 18 10:21:48 2024] nouveau 0000:01:00.0: deqp-vk[6095]: job timeout, channel 24 killed!
04:55fdobridge_: <Sid> [Thu Jan 18 10:21:59 2024] nouveau 0000:01:00.0: deqp-vk[6095]: job timeout, channel 24 killed!
04:55fdobridge_: <Sid> [Thu Jan 18 10:24:22 2024] __vm_enough_memory: pid: 6095, comm: deqp-vk, not enough memory for the allocation
04:55fdobridge_: <Sid> [Thu Jan 18 10:24:22 2024] __vm_enough_memory: pid: 6095, comm: deqp-vk, not enough memory for the allocation
04:55fdobridge_: <Sid> [Thu Jan 18 10:24:22 2024] __vm_enough_memory: pid: 6095, comm: deqp-vk, not enough memory for the allocation
04:55fdobridge_: <Sid> [Thu Jan 18 10:24:22 2024] __vm_enough_memory: pid: 6095, comm: deqp-vk, not enough memory for the allocation
04:55fdobridge_: <Sid> ```
04:59fdobridge_: <redsheep> So, now might be the time to voice a crazy theory I've had that probably makes no sense. The performance I get seems a whole lot like what I would expect if my GPU could only use one of its 11 GPCs. Is it even remotely possible that there's some issue where work is only being scheduled to one GPC, and that something breaks when things run in parallel because that causes the work to spread over more than one GPC?
05:00fdobridge_: <Sid> huh
05:00fdobridge_: <Sid> hm
05:01fdobridge_: <gfxstrand> It's probably actually possible to figure that out.
05:01fdobridge_: <airlied> this is one of the reasons I've wanted to port nvk to openrm for a while
05:02fdobridge_: <airlied> the other option would be tracing a few vulkan app openrm calls and seeing if we missed anything too obvious
05:02fdobridge_: <gfxstrand> We could implement VK_NV_shader_sm_builtins and fire off a really big compute shader which records SM numbers and sees if we hit them all.
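A minimal sketch of the kind of shader being described here, assuming the GL_NV_shader_sm_builtins GLSL extension; the buffer layout and workgroup size are illustrative, not NVK specifics:
```glsl
#version 450
#extension GL_NV_shader_sm_builtins : require

layout(local_size_x = 64) in;

// One counter per SM; the host sizes this from the advertised shaderSMCount.
layout(set = 0, binding = 0) buffer Hits {
    uint hits[];
};

void main() {
    // Record which SM this invocation landed on. A large enough dispatch
    // should eventually touch every SM the hardware claims to have.
    atomicAdd(hits[gl_SMIDNV], 1u);
}
```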
05:02fdobridge_: <Sid> I could do that if I'm told how to, I've got openrm vulkan beta installed and ready to go on my machine
05:04fdobridge_: <airlied> it's not a thing I know how to do until I do it
05:04fdobridge_: <Sid> fair
05:04fdobridge_: <Sid> wasn't there an MR that made nvk work with openrm?
05:05fdobridge_: <Sid> let me find it...
05:05fdobridge_: <gfxstrand> There was some attempt at prep-work but I don't remember there being anything close to an actual implementation.
05:05fdobridge_: <airlied> tracing might be over describing "sticking printfs all over the place"
05:06fdobridge_: <Sid> https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/65 yeah, seems like prep but no impl
05:07fdobridge_: <Sid> further down the test suite...
05:07fdobridge_: <Sid> ```
05:07fdobridge_: <Sid> [Thu Jan 18 10:35:32 2024] nouveau 0000:01:00.0: gsp: mmu fault queued
05:07fdobridge_: <Sid> [Thu Jan 18 10:35:32 2024] nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:16 type:31 scope:1 part:233
05:07fdobridge_: <Sid> [Thu Jan 18 10:35:32 2024] nouveau 0000:01:00.0: fifo:001001:0002:0010:[deqp-vk[6095]] errored - disabling channel
05:07fdobridge_: <Sid> ```
05:12fdobridge_: <Sid> tbh this kinda makes sense to me, since the loglines in the dmesg do correspond to sched-related functions in the kernel module
05:14fdobridge_: <redsheep> It also goes a really long way towards explaining why small gpus seem to retain so much more of their performance, even with simple stuff like raw pixel throughput that really shouldn't be failing to get good utilization.
05:14fdobridge_: <Sid> yeah
05:17fdobridge_: <gfxstrand> Yeah, it definitely would.
05:17fdobridge_: <gfxstrand> I can probably wire up VK_NV_shader_sm_builtins in an hour or so if you wanted to play around.
05:18fdobridge_: <gfxstrand> Of course someone would have to write a test
05:18fdobridge_: <Sid> I'm up for it
05:18fdobridge_: <Sid> I've got a few classes in the next 5h, but after that I'm free all day
05:19fdobridge_: <gfxstrand> It's past bedtime here. I'll try to wire it up and post an MR tomorrow.
05:19fdobridge_: <Sid> okie, have a good night!
05:19fdobridge_: <gfxstrand> Thanks!
05:19fdobridge_: <gfxstrand> It would probably actually be a pretty easy first compiler project
05:20fdobridge_: <gfxstrand> But if we're more interested in the experiment, I can plumb it all through in under an hour while it would take a newbie like a week of asking questions.
05:20fdobridge_: <!DodoNVK (she) 🇱🇹> I thought people were opposed to this
05:20fdobridge_: <Sid> wiring up the ext?
05:20fdobridge_: <gfxstrand> Yeah
05:21fdobridge_: <gfxstrand> We're generally opposed to supporting it officially. Wiring something up for the sake of testing is a totally different kettle of fish.
05:21fdobridge_: <gfxstrand> I was also opposed to the approach taken in the original MR. It basically tried to replicate the RADV winsys stuff which I really don't like.
05:21fdobridge_: <gfxstrand> But I've already structured various driver bits to make running on other kernels possible.
05:22fdobridge_: <gfxstrand> Some refactoring would be needed but some of it is already ready.
05:23fdobridge_: <Sid> yeah I think it'd be better if we stuck to you wiring it up for this case, since solving this issue will *greatly* help testing efforts and real-world use
05:23fdobridge_: <redsheep> Much as I would love to I think it would take me more than a week to get my Rust and C up to scratch. The experiment would be interesting though.
05:23fdobridge_: <gfxstrand> Okay, good news! I ran it again and it stopped on the same test. This means I can probably bisect through the test list and come up with a minimal reproducer case. It'll take me most of tomorrow to do it but I should be able to.
05:25fdobridge_: <gfxstrand> Of course, the fact that it takes nearly an hour isn't great but I should be able to trim that down after a few iterations.
05:25fdobridge_: <airlied> seems strange that there could be a race around that, so it sounds like the kernel is failing to allocate the PTEs
05:25fdobridge_: <gfxstrand> Kernel failing to allocate PTEs feels like the likely culprit, but it's also possible that we have a userspace bug that requires just the right set of tests.
05:25fdobridge_: <gfxstrand> We've had that before
05:26fdobridge_: <gfxstrand> In either case (kernel or userspace) getting it down to a smaller reproducer is key
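One way to run that bisection from the shell (a sketch: caselist.txt stands in for the tests from the failing run, and --deqp-caselist-file is a stock dEQP option):
```
split -n l/2 caselist.txt half.          # split the list into half.aa / half.ab
./deqp-vk --deqp-caselist-file=half.aa   # does the fault still reproduce?
./deqp-vk --deqp-caselist-file=half.ab   # keep whichever half faults, repeat
```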
05:27fdobridge_: <gfxstrand> I don't think it will actually involve writing any Rust.
05:29fdobridge_: <redsheep> My only programming experience is C++ and y'all probably would not enjoy the C code I would end up writing.
05:30fdobridge_: <gfxstrand> Eh, in this case it's all adding stuff that's very much like stuff that's already there. Just make it look like the surrounding code and you'll be fine.
05:30fdobridge_: <gfxstrand> The lack of any published tests for the feature makes it tricky, though.
05:32fdobridge_: <gfxstrand> But like I said, I can type it in under an hour so I'm happy to do that.
05:35fdobridge_: <redsheep> I am looking at the doc for the extension and yeah I am not really sure where I would begin.
05:35fdobridge_: <gfxstrand> that's okay
05:50fdobridge_: <gfxstrand> I'll have to throw together a test to validate it, too, but that shouldn't be too hard. Just fire off a massive compute shader dispatch and stash SM numbers in a buffer.
05:58fdobridge_: <airlied> uggh, forcing the cond render copying path fixes the cond render/xfb crossover tests
06:01fdobridge_: <gfxstrand> Ugh...
06:02fdobridge_: <redsheep> Is the information for WarpsPerSMNV and SMCountNV something you have to get by making a call to the GPU (like the GSP just knows it, or it's burned into fuses or whatever), or is it buried somewhere in the million defines of the nvidia headers? There's so many layers I don't really understand.
06:03fdobridge_: <gfxstrand> It's in the shaders as system values.
06:04fdobridge_: <gfxstrand> Well, the counts I think we have to get from the hardware or firmware and IDK if I trust our current reporting.
06:04fdobridge_: <gfxstrand> The warp ID and SM ID you get in the shader.
06:05fdobridge_: <gfxstrand> So if we flood the GPU with a big enough compute dispatch, we should be able to verify that we have the counts correct and are actually using them all.
06:05fdobridge_: <redsheep> Right, that part (mostly) makes sense, make sure that you see enough distinct IDs to total what you expect the gpu to have, yeah?
06:05fdobridge_: <gfxstrand> My other theory about big GPUs is that we're just stalling too much. The bigger the GPU the more those WFIs hurt.
06:06fdobridge_: <gfxstrand> Yup. I did something similar to R/E the image layouts.
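For reference, the host side of that check can read the advertised counts through the standard extension structs (a sketch; obtaining the VkPhysicalDevice and checking extension support are left out):
```c
#include <stdio.h>
#include <vulkan/vulkan.h>

void print_sm_props(VkPhysicalDevice pdev)
{
    VkPhysicalDeviceShaderSMBuiltinsPropertiesNV sm_props = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_SM_BUILTINS_PROPERTIES_NV,
    };
    VkPhysicalDeviceProperties2 props2 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
        .pNext = &sm_props,
    };
    vkGetPhysicalDeviceProperties2(pdev, &props2);

    /* After the flood dispatch, every SM index < shaderSMCount should have hits. */
    printf("SMs: %u, warps/SM: %u\n",
           sm_props.shaderSMCount, sm_props.shaderWarpsPerSM);
}
```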
06:46fdobridge_: <Sid> what's a WFI
06:49fdobridge_: <airlied> Wait for idle
08:09fdobridge_: <airlied> @gfxstrand still semi-comfortable with landing 24.0 nvk in f40?
12:09fdobridge_: <karolherbst🐧🦀> random thought of the day: do I have to zero initialize shared mem in OpenCL ....
12:11fdobridge_: <karolherbst🐧🦀> ahh, seems to be a vulkan only thing
12:11fdobridge_: <karolherbst🐧🦀> or at least the initializers
14:48rdrg109: [Q] Is there a tool that lets me monitor the usage of my GPU when using nouveau (similar to nvidia-htop.py)?
14:52karolherbst: no
15:08rdrg109: ^ I am asking because I'm using nouveau in Guix SD and I suspect that I'm not using my GPU because showing 30000 fishes in http://webglsamples.org/aquarium/aquarium.html is sluggish. I thought that if I had a monitoring tool, I could determine it by looking at the spikes. If anyone knows how to find out whether GPU is being used, please let me know.
15:09karolherbst: rdrg109: what gpu are you on btw?
15:14rdrg109: karolherbst: GeForce RTX 2070 Super
15:14karolherbst: rdrg109: if you are on linux-6.7 you can boot with "nouveau.config=NvGspRm=1" to get more performance
15:15karolherbst: like 10x or something
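Concretely, that means booting with the option on the kernel command line, or setting it persistently via modprobe config (the file path below is just the conventional location):
```
# kernel command line:
nouveau.config=NvGspRm=1

# or e.g. in /etc/modprobe.d/nouveau.conf:
options nouveau config=NvGspRm=1
```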
15:15RSpliet: Also, 30000 fishes is a lot of fishes. Firefox, AMD RX 6600 gets about 8fps with that. Not sure what the expected perf is supposed to be.
15:15karolherbst: 8 seems low
15:15karolherbst: prolly CPU bound
15:15karolherbst: I get 21 fps here on intel
15:15RSpliet: heh, oh yeah I forgot about that, there is that
15:15RSpliet: it's only an FX6300
15:16karolherbst: fair
15:16RSpliet: And it's a 4k monitor. Lots of variables.
15:16karolherbst: but yeah, one CPU thread at 100%
15:17fdobridge_: <gfxstrand> 24.0, no. 24.1, probably.
15:17bencoh: ~20fps with intel/uhd630 as well
15:17RSpliet: rdrg: in Firefox you can get some info about how it renders from about:support
15:18karolherbst: but anyway
15:18karolherbst: stock nouveau will be slow due to no power management on turing GPUs
15:18karolherbst: so you kinda want to run 6.7 and nouveau.config=NvGspRm=1 anyway and report back if you run into any issues or enjoy life with a faster GPU if not :D
15:19karolherbst: but anyway
15:19karolherbst: that demo is CPU bottlenecked
15:25rdrg109: karolherbst: Ok, I'll try that and report back. I think I'll be using glmark2 for benchmarking because I feel that chromium and firefox have a higher level of complexity than glmark2, and issues regarding using the GPU might arise.
16:40rdrg109: karolherbst: "so you kinda want to run 6.7 and nouveau.config=NvGspRm=1 anyway and report" Do you happen to know if that would work with linux-6.6? linux-6.7 is not available for Guix (in the nonguix channel) and I lack knowledge for creating the definition for linux-6.7.
16:42Sid127: rdrg109: it won't work unless you patch GSP support in by yourself
16:42Sid127: on 6.6, I mean
16:42Sid127: GSP support got merged only in 6.7, so either you have to patch it in or wait for your distro to ship 6.7
16:49rdrg109: Ok, thanks for the help! By the way, I just learned that glmark2 can show a given number of entities. Here's the command that I'm using for benchmarking: $ glmark2 --size 1920x1080 --benchmark desktop:windows=1000:show-fps=true --run-forever
16:51rdrg109: I had previously used http://webglsamples.org/aquarium/aquarium.html and set the number of fishes to 30000 for benchmarking, but I now prefer glmark2 because it is simpler and doesn't require a browser to get it running.
16:51rdrg109: I thought it might be useful to share it here. If anyone is aware of other tools for benchmarking GPUs when using nouveau drivers, please let me know.
16:57fdobridge_: <Sid> is there any reason we're not marking nvk as `all DONE` for vulkan 1.1 and 1.3 in features.txt?
17:03fdobridge_: <redsheep> If it doesn't pass CTS yet that's a blocker right?
17:04fdobridge_: <Sid> but the missing extension for 1.3 is an optional one
17:05fdobridge_: <Sid> and we're only missing one extension out of 1.2, which is VK_KHR_shader_float16_int8
17:05fdobridge_: <Sid> also, does anyone know how many GPCs the 3060 has?
17:07fdobridge_: <Sid> because there's more "evidence" supporting the only-one-GPC-used theory: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10383
17:08fdobridge_: <Sid> that ext is also optional, apparently
17:08fdobridge_: <tom3026> ". The full GA102 GPU contains seven GPCs" https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
17:08fdobridge_: <tom3026> whatever that means
17:09fdobridge_: <Sid> thank you <3
17:09fdobridge_: <Sid> GPC is graphics processing cluster
17:10fdobridge_: <tom3026> yeah but i meant more with "the full ..", is there a half gpu too?
17:10fdobridge_: <tom3026> 😄
17:10fdobridge_: <redsheep> GA102 is 3080 and 3090, not 3060
17:10fdobridge_: <Sid> ah, heh
17:10fdobridge_: <redsheep> Yes
17:10fdobridge_: <Sid> maybe laptop variants are nerfed?
17:10fdobridge_: <redsheep> That's what the cut down gpus are, the 3080 disables a gpc
17:11fdobridge_: <Sid> sooo, let's assume 3060 has 6?
17:13fdobridge_: <redsheep> GA106, which is the 3060, has 3
17:13fdobridge_: <tom3026> https://www.techpowerup.com/gpu-specs/nvidia-ga106.g966 "GPCs: 3"
17:13fdobridge_: <tom3026> ah redsheep was faster
17:14fdobridge_: <Sid> oh gpu-z has the count? that makes it easier
17:15fdobridge_: <tom3026> there seems to be a https://www.techpowerup.com/gpu-specs/geforce-rtx-3060-8-gb-ga104.c4132 ga104 3060 with 6 gpcs?
17:15fdobridge_: <tom3026> too
17:15fdobridge_: <tom3026> *shrug*
17:17fdobridge_: <redsheep> Nvidia loves to do lots of creative cuts, but generally in low volume just to move some bad dies. The vast majority of the units sold are GA106.
17:18fdobridge_: <tom3026> ah true
17:22fdobridge_: <redsheep> In this case it might actually be that they just need to get any revenue at all out of the huge oversupply of GA104, but if that's the case they did a poor job of compelling people to buy it; the 3060 8GB is awful
17:23fdobridge_: <tom3026> would be interesting to figure out how it's locked, maybe you can make a find like old amd cpus where cores were unlockable 😄
17:23fdobridge_: <tom3026> like old athlon II triple cores had an unlockable core or similar
17:23fdobridge_: <Sid> I know a few older nv gpus could be modded into workstation equivalents
17:24fdobridge_: <redsheep> They've all gotten wise, with few exceptions they laser off the disabled areas.
17:24fdobridge_: <redsheep> Intel, amd, Nvidia, they don't want you doing that.
17:24fdobridge_: <tom3026> meh 😦
17:26fdobridge_: <redsheep> When the supply chain is healthy it's a good thing too, ideally they're fusing off areas with defects that you wouldn't want to try to use anyway.
17:26fdobridge_: <Sid> I'm really looking forward to this perf/job timeout thing being fixed
17:27fdobridge_: <Sid> it's been living in my head rent free for a while now, job timeout at least
18:08fdobridge_: <gfxstrand> Oh, right. I said I'd wire up an extension. Let me get hacking on that.
18:10fdobridge_: <Sid> heh, no worries~
18:22fdobridge_: <redsheep> Oh, it occurred to me that there's probably a much easier way to test my theory, but I can't reboot to try it myself right now. If you run vkpeak and get an fp32 number that's too high for one GPC to theoretically achieve then that idea is bunk
18:22fdobridge_: <redsheep> Still, that extension is probably useful either way
18:25fdobridge_: <Sid> on it, gimme 10 mins
18:28fdobridge_: <Sid> oof vkpeak aur package won't build right
18:29fdobridge_: <Sid> something funny with the ncnn submodule it has
18:31fdobridge_: <redsheep> Does the binary from the releases page work? https://github.com/nihui/vkpeak/releases
18:31fdobridge_: <Sid> ok sorted
18:31fdobridge_: <Sid> manual compile moment
18:32fdobridge_: <redsheep> It says "ubuntu" but nothing about that download looks to have anything specific to ubuntu lol
18:32fdobridge_: <redsheep> Guess Ubuntu is the layman's term for Linux
18:32fdobridge_: <Sid> yeah
18:33fdobridge_: <Sid> now just to wait for this to finish running
18:37fdobridge_: <Sid> ```[sidpr@strogg build]$ marigold -k ./vkpeak 0
18:37fdobridge_: <Sid> WARNING: NVK is not a conformant Vulkan implementation, testing use only.
18:37fdobridge_: <Sid> device = TU116
18:37fdobridge_: <Sid>
18:37fdobridge_: <Sid> fp32-scalar = 841.23 GFLOPS
18:37fdobridge_: <Sid> fp32-vec4 = 978.56 GFLOPS
18:37fdobridge_: <Sid>
18:37fdobridge_: <Sid> fp16-scalar = 0.00 GFLOPS
18:37fdobridge_: <Sid> fp16-vec4 = 0.00 GFLOPS
18:37fdobridge_: <Sid> fp16-matrix = 0.00 GFLOPS
18:37fdobridge_: <Sid>
18:37fdobridge_: <Sid> fp64-scalar = 26.07 GFLOPS
18:37fdobridge_: <Sid> fp64-vec4 = 26.07 GFLOPS
18:37fdobridge_: <Sid>
18:37fdobridge_: <Sid> int32-scalar = 981.64 GIOPS
18:37fdobridge_: <Sid> int32-vec4 = 1673.76 GIOPS
18:37fdobridge_: <Sid>
18:37fdobridge_: <Sid> int16-scalar = 0.00 GIOPS
18:37fdobridge_: <Sid> int16-vec4 = 0.00 GIOPS
18:37fdobridge_: <Sid> ```
18:37fdobridge_: <Sid> time to try the same thing on proprietary :D
18:37fdobridge_: <!DodoNVK (she) 🇱🇹> This is with GSP, right?
18:38fdobridge_: <redsheep> It would have to be, that's far outside of possible perf on a 1660ti that isn't clocked up at least a little
18:40fdobridge_: <Sid> yes, with GSP
18:40fdobridge_: <Sid> nothing output in the dmesg during the run
18:40fdobridge_: <Sid> well, I *think* it was with GSP 🙃
18:41fdobridge_: <redsheep> This test doesn't seem to use FMA to get 2 ops per clock per alu. Given that, that aligns almost perfectly with 1/3rd of the gpu functioning and you do have 3 GPCs.
18:41fdobridge_: <Sid> I know I had to boot to main kernel for a kernel module I didn't have in mine, I dunno if I rebooted after
18:41fdobridge_: <Sid> ```
18:41fdobridge_: <Sid> [sidpr@strogg build]$ marigold -n ./vkpeak 0
18:41fdobridge_: <Sid> device = NVIDIA GeForce GTX 1660 Ti
18:41fdobridge_: <Sid>
18:41fdobridge_: <Sid> fp32-scalar = 5871.09 GFLOPS
18:41fdobridge_: <Sid> fp32-vec4 = 5833.69 GFLOPS
18:41fdobridge_: <Sid>
18:41fdobridge_: <Sid> fp16-scalar = 5770.31 GFLOPS
18:41fdobridge_: <Sid> fp16-vec4 = 11433.25 GFLOPS
18:41fdobridge_: <Sid> fp16-matrix = 0.00 GFLOPS
18:41fdobridge_: <Sid>
18:41fdobridge_: <Sid> fp64-scalar = 180.85 GFLOPS
18:41fdobridge_: <Sid> fp64-vec4 = 181.35 GFLOPS
18:41fdobridge_: <Sid>
18:41fdobridge_: <Sid> int32-scalar = 5755.84 GIOPS
18:41fdobridge_: <Sid> int32-vec4 = 5685.32 GIOPS
18:42fdobridge_: <Sid>
18:42fdobridge_: <Sid> int16-scalar = 3724.67 GIOPS
18:42fdobridge_: <Sid> int16-vec4 = 4648.77 GIOPS
18:42fdobridge_: <Sid> ```
18:42fdobridge_: <Sid> proprietary driver
18:43fdobridge_: <redsheep> Hmm. Ok that is unexpected, it does get what appears to be 2 ops per clock for you, but it doesn't on my GA102 on windows right now. That doesn't make much sense, so I am not sure that is conclusive.
18:44fdobridge_: <Sid> I BRING NEWS
18:44fdobridge_: <Sid> that first run was without GSP
18:44fdobridge_: <Sid> GSP enabled run is already going much faster
18:45fdobridge_: <redsheep> Well that's surprising, guess maybe it's getting 2 ops per clock on nvk as well then if that wasn't using gsp
18:46fdobridge_: <Sid> ```
18:46fdobridge_: <Sid> [sidpr@strogg build]$ marigold -k ./vkpeak 0
18:46fdobridge_: <Sid> WARNING: NVK is not a conformant Vulkan implementation, testing use only.
18:46fdobridge_: <Sid> device = TU116
18:46fdobridge_: <Sid>
18:46fdobridge_: <Sid> fp32-scalar = 2603.89 GFLOPS
18:46fdobridge_: <Sid> fp32-vec4 = 2882.69 GFLOPS
18:46fdobridge_: <Sid>
18:46fdobridge_: <Sid> fp16-scalar = 0.00 GFLOPS
18:46fdobridge_: <Sid> fp16-vec4 = 0.00 GFLOPS
18:46fdobridge_: <Sid> fp16-matrix = 0.00 GFLOPS
18:46fdobridge_: <Sid>
18:46fdobridge_: <Sid> fp64-scalar = 76.38 GFLOPS
18:46fdobridge_: <Sid> fp64-vec4 = 76.39 GFLOPS
18:46fdobridge_: <Sid>
18:46fdobridge_: <Sid> int32-scalar = 2790.15 GIOPS
18:46fdobridge_: <Sid> int32-vec4 = 4578.12 GIOPS
18:46fdobridge_: <Sid>
18:46fdobridge_: <Sid> int16-scalar = 0.00 GIOPS
18:46fdobridge_: <Sid> int16-vec4 = 0.00 GIOPS
18:46fdobridge_: <Sid> ```
18:46fdobridge_: <Sid> and that corroborates our theory
18:46fdobridge_: <Sid> *kinda*
18:46fdobridge_: <redsheep> That's for sure more than 1 GPC.
18:46fdobridge_: <Sid> fp32 with gsp is half the perf of fp32 on prop
18:46fdobridge_: <redsheep> That's almost half, when it should be just over a third at best
18:46fdobridge_: <Sid> nothing in the dmesg either
18:47fdobridge_: <redsheep> If I was right, which... I was not.
18:48fdobridge_: <redsheep> This aligns much more with things stalling, I think. How in the world was non-gsp getting a third of the gsp perf?
18:49fdobridge_: <Sid> I suppose
18:52fdobridge_: <Sid> @gfxstrand you might wanna see this
18:53fdobridge_: <redsheep> With FMA your chip would have to be running 2.5 ghz with perfect utilization for that to be a possible number on one GPC, which... unless you're dumping liquid nitrogen on your laptop, is not what is happening
18:53fdobridge_: <Sid> nope, just cooled by the stock fans
18:54fdobridge_: <Sid> in an air conditioned room with the aircon set to 20C
18:54fdobridge_: <Sid> on a hard surface yes because we're not savages
18:55fdobridge_: <redsheep> Yeah, the prop driver test corresponds to about 1.9 ghz, which is very normal for turing. If NVK with gsp is getting it clocked up that high then occupancy is only about 45%
18:56fdobridge_: <Sid> 1.9ghz sounds about right, I think I've seen mangohud report 1855 as the peak for gpu clock
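For the arithmetic behind that estimate, assuming the desktop 1660 Ti's 1536 FP32 ALUs: 1536 ALUs × 2 ops/clock (FMA) × ~1.9 GHz ≈ 5.84 TFLOPS, which lines up with the 5871 GFLOPS from the proprietary run above, and makes the NVK+GSP run's 2604 GFLOPS roughly 45% of peak.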
18:57fdobridge_: <Sid> if only we had a way to see stats on nouveau+gsp..
18:58fdobridge_: <redsheep> Might not be as useful for this as you would expect. Utilization is weird, at least most hardware reporting on windows will often say 100% without achieving the peak teraflops.
18:59fdobridge_: <Sid> I guess
18:59fdobridge_: <!DodoNVK (she) 🇱🇹> How does Windows Task Manager measure disk utilization? 💽
18:59fdobridge_: <gfxstrand> Yeah, so likely we're only getting about 1/2 ALU because we don't do coissue yet
19:00fdobridge_: <Sid> but hey, at least I now know Acer clocked my GPU on par with the desktop variant of the 1660Ti
19:01fdobridge_: <Sid> since the results I saw on the proprietary driver are much closer to GPU-Z's theoretical performance for the 1660Ti than to the 1660Ti Mobile
19:01fdobridge_: <Sid> 5.4 TFLOPS vs 4.8
19:01fdobridge_: <Sid> against the 5.8 I'm getting
19:02fdobridge_: <redsheep> It reads out throughput and active time, which is really how it should be IMO. Storage is weird.
19:04fdobridge_: <Sid> ..what's coissue
19:04fdobridge_: <redsheep> The spec'd theoretical numbers should be taken with a grain of salt. Nvidia pretty much just clocks as high as it can, if your chip is good or it's running cooler and has more power available it will just keep ratcheting it up.
19:04fdobridge_: <Sid> mhm
19:04fdobridge_: <redsheep> My 4090 is spec'd for 82 but easily achieves 95
19:05fdobridge_: <gfxstrand> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27154
19:05fdobridge_: <gfxstrand> Now I just need a test to test it
19:05fdobridge_: <Sid> I'm just happy funny number bigger than theoretical funny number
19:05fdobridge_: <Sid> small wins
19:06fdobridge_: <!DodoNVK (she) 🇱🇹> So will LN2 automatically give extreme performance numbers? 🔥
19:07fdobridge_: <Sid> it'll clock as high as your GPU can go within power limits, yeah
19:08fdobridge_: <redsheep> You do get more out of it by actually overclocking, but yes.
19:08fdobridge_: <Sid> that's also how they got Doom Eternal running at 1000fps
19:09fdobridge_: <Sid> https://www.youtube.com/watch?v=DzOiw9yP1rc
19:09fdobridge_: <Sid> fun stuff
19:11fdobridge_: <Sid> bethesda poland channel documented it a bit better
19:11fdobridge_: <Sid> https://www.youtube.com/watch?v=8Je3ryqUoz8
19:12fdobridge_: <redsheep> That was fast, yeah looking at the changes it would have taken me a good while to hunt all that down but now I know what it would have looked like if I did
19:13fdobridge_: <redsheep> Also, if lack of coissuing is cutting those numbers in half then that's actually pretty encouraging. When that's turned on, that could go from 45% occupancy to 90%.
19:14fdobridge_: <redsheep> So at least for simpler compute NVK might be pretty close to being in good shape for performance.
19:16fdobridge_: <Sid> I'm trying to hack together a shitty test with the help of ChatGPT 🐸
19:16fdobridge_: <Sid> reusing some vulkan instance and device selection code from another project I'd started and never got anywhere
19:18fdobridge_: <Sid> I could do it 100% by myself but I'd be much slower since graphics programming is not really my field of expertise e-e
19:21fdobridge_: <gfxstrand> Yeah, that's why it's a good newbie project. It's basically just crawling all over the tree and learning where everything is.
19:25fdobridge_: <Sid> ok, I think this *should* work
19:25fdobridge_: <Sid> maybe
19:25fdobridge_: <Sid> probably
19:26fdobridge_: <Sid> still need a compute shader to run it with e-e
19:39fdobridge_: <Sid> ok yeah my head hurts
19:39fdobridge_: <Sid> who'd have thought 0109 isn't the best time to write code for something you have no experience with 🐸
19:45fdobridge_: <gfxstrand> https://gitlab.freedesktop.org/mesa/crucible/-/merge_requests/145
19:46fdobridge_: <Sid> :o
19:46fdobridge_: <gfxstrand> According to that test, I'm hitting all the SMs on my Ada GPU
19:47fdobridge_: <gfxstrand> @redsheep Feel like giving that a run on the 4090?
19:47fdobridge_: <gfxstrand> 😄
19:48fdobridge_: <redsheep> Sure, I can in a few hours
19:52fdobridge_: <!DodoNVK (she) 🇱🇹> Don't accidentally put that 4090 into a crucible though 😅
19:53fdobridge_: <Sid> compiling, will give it a go on my 1660Ti too
19:53fdobridge_: <redsheep> That would be some awfully expensive slag
19:55fdobridge_: <airlied> @gfxstrand is coissue a lot of work in nak?
19:57fdobridge_: <Sid> ```
19:57fdobridge_: <Sid> [sidpr@strogg bin]$ marigold -k ./crucible run func.nv.shader-sm-builtins
19:57fdobridge_: <Sid> WARNING: NVK is not a conformant Vulkan implementation, testing use only.
19:57fdobridge_: <Sid> crucible: info : running 1 tests
19:57fdobridge_: <Sid> crucible: info : ================================
19:57fdobridge_: <Sid> crucible: start : func.nv.shader-sm-builtins.q0
19:57fdobridge_: <Sid> crucible: info : func.nv.shader-sm-builtins.q0: shaderSMCount = 24
19:57fdobridge_: <Sid> crucible: info : func.nv.shader-sm-builtins.q0: shaderWarpsPerSM = 32
19:57fdobridge_: <Sid> crucible: info : func.nv.shader-sm-builtins.q0: Saw all advertised SMs in the results
19:57fdobridge_: <Sid> crucible: pass : func.nv.shader-sm-builtins.q0
19:57fdobridge_: <Sid> crucible: info : ================================
19:57fdobridge_: <Sid> crucible: info : ran 1 tests
19:57fdobridge_: <Sid> crucible: info : pass 1
19:57fdobridge_: <Sid> crucible: info : fail 0
19:57fdobridge_: <Sid> crucible: info : skip 0
19:57fdobridge_: <Sid> crucible: info : lost 0
19:57fdobridge_: <Sid> ```
19:57fdobridge_: <Sid> seems about right on non-RTX turing as well
19:57fdobridge_: <Sid> sm count is 24 according to the specs too
20:01fdobridge_: <gfxstrand> It's a big unknown.
20:01fdobridge_: <gfxstrand> There's a lot of "we need to figure out how the HW works"
20:02fdobridge_: <gfxstrand> Maybe @karolherbst has something in those magic docs of his but I suspect there's a lot of "you're on your own" going on.
20:02fdobridge_: <gfxstrand> I just pushed an update to the test
20:02fdobridge_: <Sid> on it
20:04fdobridge_: <Sid> `Never saw warps 20-31`
20:04fdobridge_: <gfxstrand> Yeah, that's expected
20:04fdobridge_: <karolherbst🐧🦀> what's coissue?
20:04fdobridge_: <karolherbst🐧🦀> or uhm...
20:04fdobridge_: <Sid> not pasting the whole output because it does get a bit spammy on the IRC side of the bridge
20:04fdobridge_: <karolherbst🐧🦀> shaders enqueuing shaders?
20:04HdkR: Put two ALU ops next to each other, bam, coissue
20:04fdobridge_: <gfxstrand> Hitting all the warps depends on residency
20:05fdobridge_: <karolherbst🐧🦀> ohh
20:05fdobridge_: <karolherbst🐧🦀> instruction issuing
20:05fdobridge_: <karolherbst🐧🦀> right
20:05fdobridge_: <karolherbst🐧🦀> well
20:05fdobridge_: <karolherbst🐧🦀> hw can't do it
20:05fdobridge_: <karolherbst🐧🦀> kepler was the only gen who was able to dual issue
20:05fdobridge_: <karolherbst🐧🦀> now you can just issue back to back with a delay of one
20:05fdobridge_: <gfxstrand> Hrm..
20:05fdobridge_: <gfxstrand> So maybe that's not it then
20:06fdobridge_: <Sid> :blobcatnotlikethis:
20:06fdobridge_: <gfxstrand> As long as you see about half of them, I think it's okay.
20:06fdobridge_: <karolherbst🐧🦀> what's the problem btw?
20:07fdobridge_: <gfxstrand> Trying to figure out why we're not getting full ALU throughput on some tests
20:07fdobridge_: <gfxstrand> I haven't even looked at the shaders yet, though.
20:07fdobridge_: <karolherbst🐧🦀> do you use `IMAD`s and stuff for movs?
20:07fdobridge_: <karolherbst🐧🦀> and other things
20:07fdobridge_: <Sid> even vkpeak says we're getting only half the throughput against the prop driver
20:07fdobridge_: <gfxstrand> The first step would be to actually look at the NAK output from vkpeak
20:07fdobridge_: <gfxstrand> Until then we're shooting blind
20:08fdobridge_: <karolherbst🐧🦀> we only have a limited amount of alu units
20:08fdobridge_: <karolherbst🐧🦀> and nvidia does interleave `IMAD` with `MOV`s for a reason
20:08fdobridge_: <Sid> how do I do this :D
20:08fdobridge_: <karolherbst🐧🦀> or uses IMAD for cb pulls
20:08fdobridge_: <karolherbst🐧🦀> and other things
20:08fdobridge_: <karolherbst🐧🦀> I'm sure this all matters here
20:08fdobridge_: <karolherbst🐧🦀> also
20:09fdobridge_: <karolherbst🐧🦀> we probably need to schedule instructions to reduce cycles
20:09fdobridge_: <karolherbst🐧🦀> and other things
20:09fdobridge_: <gfxstrand> NAK_DEBUG=print
20:09fdobridge_: <!DodoNVK (she) 🇱🇹> `NAK_DEBUG=print` 🤷♀️
20:10fdobridge_: <Sid> on it
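For anyone following along, that environment variable makes NAK dump the final shader assembly, so a run like this captures it (assuming the output goes to stderr, as Mesa debug output usually does; the file name is just an example):
```
NAK_DEBUG=print ./vkpeak 0 2> nak.log
```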
20:10fdobridge_: <gfxstrand> @karolherbst if you wanted to review now that I've got a test, it'd be nice. 🙂
20:11fdobridge_: <gfxstrand> The SPIR-V bits need an actual RB
20:11fdobridge_: <Sid> oh, speaking of review
20:11fdobridge_: <Sid> technically we've got all the required extensions for advertising 1.3
20:11fdobridge_: <Sid> we're all done on vulkan 1.1
20:12fdobridge_: <Sid> on vulkan 1.3 we're only missing an optional ext (VK_EXT_texture_compression_astc_hdr)
20:12fdobridge_: <Sid> on vulkan 1.2 we're missing VK_KHR_shader_float16_int8, and I have conflicting info on whether or not that's optional
20:12fdobridge_: <Sid> but the astc one would be a no-op if we did it anyway, since the hw apparently doesn't support it
20:17fdobridge_: <karolherbst🐧🦀> done
20:17fdobridge_: <karolherbst🐧🦀> can also be a kernel issue
20:17fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1197635931990130768/nak.log?ex=65bbfc61&is=65a98761&hm=8c7744c1563f69ccff8b8e35c85666d1138c83f03cb3d2bada46e227005b0d98&
20:18fdobridge_: <Sid> nothing in my dmesg re: nouveau for the past 1.5 hours
20:18fdobridge_: <Sid> that's also system uptime atm
20:18fdobridge_: <gfxstrand> I'm seeing about 2/3 of SMs on the prop driver
20:18fdobridge_: <karolherbst🐧🦀> yeah.. you need to schedule instructions 🙂
20:19fdobridge_: <Sid> that went right over my head but I'm sure it's helpful to either faith or dave 😅
20:20fdobridge_: <karolherbst🐧🦀> but it's also weird...
20:20fdobridge_: <karolherbst🐧🦀> ` r4 = imad r8 r4 rZ`
20:20fdobridge_: <karolherbst🐧🦀> `r4 = iadd3 rZ r4 r12`
20:20fdobridge_: <karolherbst🐧🦀> why not `r4 = imad r8 r4 r12`
20:21fdobridge_: <gfxstrand> Yeah, we need to actually use `imad`
20:21fdobridge_: <pac85> What is rZ?
20:21fdobridge_: <karolherbst🐧🦀> zero
20:21fdobridge_: <pac85> I see thx
20:21fdobridge_: <gfxstrand> For the float test, it's all dependent instructions which sucks
20:21fdobridge_: <karolherbst🐧🦀> oh right.. nir doesn't have it?
20:21fdobridge_: <gfxstrand> No, not yet but we should add it
20:21fdobridge_: <karolherbst🐧🦀> yes we should 😄
20:22fdobridge_: <Sid> imad, ineeds to go to bed e-e
20:22fdobridge_: <gfxstrand> Other HW has it but typically with dumb restrictions like 24-bit or something
20:22fdobridge_: <karolherbst🐧🦀> right...
20:22fdobridge_: <gfxstrand> We have actual 32-bit imad
20:22fdobridge_: <karolherbst🐧🦀> the issue here is, that it will blow up opt_algebraic as frontends can also generate imad
20:22fdobridge_: <Sid> well, if there's anything to test in the next ~20-ish mins let me know, am gonna start winding down until then :)
20:22fdobridge_: <karolherbst🐧🦀> and it's prolly fun
20:22fdobridge_: <karolherbst🐧🦀> huh actually...
20:23fdobridge_: <karolherbst🐧🦀> spirv doesn't have it...
20:23fdobridge_: <karolherbst🐧🦀> I thought CL C has it.. but apparently not
20:23fdobridge_: <karolherbst🐧🦀> nvm then
20:23fdobridge_: <karolherbst🐧🦀> guess algebraic_late would be good enough then
20:24fdobridge_: <karolherbst🐧🦀> but anyway.. looking at that shader, not using `imad` properly can explain a 50% perf gap 😄
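The fusion being discussed would look roughly like this in Mesa's nir_opt_algebraic Python DSL; purely a sketch, since as noted above NIR has no imad opcode today, so the opcode on the right-hand side is hypothetical:
```python
# Hypothetical late-algebraic rule: fuse an integer multiply feeding an add
# into a single mad, assuming an 'imad' opcode were added to NIR.
a, b, c = 'a', 'b', 'c'

late_optimizations = [
    (('iadd', ('imul', a, b), c), ('imad', a, b, c)),
]
```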
20:25fdobridge_: <gfxstrand> We also need `ffma` fusion
20:25fdobridge_: <gfxstrand> I should just steal the Intel pass
20:25fdobridge_: <karolherbst🐧🦀> ~~or finally add it to nir~~
20:26fdobridge_: <karolherbst🐧🦀> maybe I get bored enough to finally finish cleaning up that ffma vs fmad mess
20:27fdobridge_: <gfxstrand> Yeah....
20:28fdobridge_: <gfxstrand> I'm gonna do large constants first
20:33fdobridge_: <Sid> ~~all I'm hearing is nvk go brr when this is sorted out~~
20:42fdobridge_: <redsheep> There's quite a few other things preventing brr, but I hope it's a nice jump.
20:44fdobridge_: <redsheep> Even though I came to the conclusion that tiling isn't usually a huge deal on Ada, that doesn't mean other GPUs won't see a huge bump there.
20:44fdobridge_: <redsheep> And zcull does look like it could be huge
20:45fdobridge_: <redsheep> How would one go about actually testing whether render compression is working? That one sounds interesting to me
20:48fdobridge_: <redsheep> That's the same thing as what Nvidia brands as delta color compression, right?
21:01fdobridge_: <redsheep> Iirc that feature is meant to be transparent, but I remember Nvidia showing visualizations of it, so clearly they have some way to get feedback on what is compressing and how much.
21:22Lyude: airlied, dakr btw - https://gitlab.freedesktop.org/lyudess/linux/-/commits/rvkms I -think- I've got all of the deps I need so far, hopefully I won't find anymore as I go through and finish writing up the skeleton
21:32fdobridge_: <gfxstrand> This should help perf in some apps. IDK which ones, though: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27156
21:52Lyude: (looks like I might still need a few more lol)
22:06fdobridge_: <mohamexiety> can't ampere/ada do it?
22:06fdobridge_: <mohamexiety> just can't have integer while doing that
22:06fdobridge_: <karolherbst🐧🦀> nah, that's something else
22:11fdobridge_: <gfxstrand> Yeah, there's a multiply by a power of two thing somewhere.
22:11fdobridge_: <gfxstrand> I don't remember where off-hand
22:13fdobridge_: <mohamexiety> yeah, with ampere they doubled the number of FP32 ALUs, and while it wasn't exactly dual issue, it allowed you to double the throughput in some cases, but I don't exactly remember the catch with that
22:14fdobridge_: <gfxstrand> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27157
22:27fdobridge_: <redsheep> The patches just keep coming today. I am a little sad my idea didn't end up helping with the crashes but I'll take more performance any day.
22:28fdobridge_: <redsheep> I'll be testing the crucible for the sm IDs here in just a bit, and I'll try out these perf patches too 🙂
22:50fdobridge_: <gfxstrand> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27159
22:52fdobridge_: <gfxstrand> I'm typing patches faster than I can CTS them. 😂
22:57fdobridge_: <gfxstrand> That one adds integer mul+add fusion, again increasing vkpeak by about 40-50%, this time for integer stuff.
23:00fdobridge_: <dadschoorse> AMD doesn't have 32bit imad, but I think on some hw i/umad with 32bit mul operands and 64bit acc is as fast as 32bit imul 🐸
23:10fdobridge_: <gfxstrand> NV has something similar but I'm not sure how it works.
23:11fdobridge_: <gfxstrand> Like, `imad` has a 64-bit version that has a 64-bit destination. I'm not sure if the add source is 64-bit or not.
23:12fdobridge_: <gfxstrand> If so, we should figure out how to encode that in NIR and make `nir_lower_int64()` use it for `imul`