01:24mmarchini: ahh ok, I saw something about hwmon on the logs earlier but wasn't sure if that included perf metrics or just https://gitlab.freedesktop.org/drm/nouveau/-/issues/335
02:24karolherbst[d]: mhhhh
02:24karolherbst[d]: I have a shader that mostly moves the address from the non uniform to the uniform source, so that's great.. but.. code size +3, cycle count -150, num GPRs *2...
02:25karolherbst[d]: and it's I think due to opt_instr_sched_prepass .. mhh
02:27karolherbst[d]: mhenning[d]: does prepass ignore UGPRs? It's not really obvious to me looking at the code
02:27karolherbst[d]: mhh guess it factors it in due to spilling...
02:29karolherbst[d]: but it feels like I need to change the heuristics there because I've seen wild regressions
02:29karolherbst[d]: mhh yeah...
02:30karolherbst[d]: making UGPRs super expensive to spill I get -42 instructions, -11 cycles and no change in GPRs...
02:34karolherbst[d]: ... RA is doing something weird...
02:35karolherbst[d]: like super weird
02:37karolherbst[d]: yeah.. for whatever reason the prepass rather does tons of loads using UGPRs rather than consuming the GPRs
02:37karolherbst[d]: so it does tons of loads, and then moves the values away to do even more loads
02:37karolherbst[d]: and then uses the gprs after doing all the loads and that makes the gpr usage explode
02:53karolherbst[d]: okay.. something is super odd there...
02:54karolherbst[d]: okay okay... so the "better version" is just prepass failing and falling back
02:55karolherbst[d]: which is a bit funny that it ends up using less registers
02:55karolherbst[d]: and way fewer instructions
03:01mhenning[d]: karolherbst[d]: yes, prepass factors in ugprs
03:04karolherbst[d]: mhh right soo.. moving the UGPR sources into the uniform_register src does make the prepass succeed more often.. or at least in one of the shaders I'm looking at and it makes them wildly different, it's kinda interesting
03:04karolherbst[d]: but I wished it wouldn't reduce warps/SM 🙃
03:05mhenning[d]: prepass should never reduce warps/sm (compared to prepass disabled)
03:05karolherbst[d]: well it does here
03:06karolherbst[d]: https://gist.github.com/karolherbst/42620867e1d8787a1260e01f67d08752
03:06mhenning[d]: If you comment out prepass it improves warps/SM?
03:06karolherbst[d]: reglimit is 37
03:06karolherbst[d]: yes
03:07karolherbst[d]: I think it was 37...
03:07karolherbst[d]: `[RegLimit(37)]` yep
03:08mhenning[d]: alright, that's a bug in prepass then
03:08karolherbst[d]: I had a different shader where with prepass cycles more than doubled and I wonder if it's a similar bug there..
03:08mhenning[d]: If you're spilling a lot of ugprs to gprs it's possible prepass models that wrong
03:08karolherbst[d]: mhhh
03:08karolherbst[d]: yeah so that's the odd thing
03:09karolherbst[d]: this barely uses UGPRs
03:09karolherbst[d]: maybe 7?
03:09karolherbst[d]: just using the same UGPR really really often
03:09karolherbst[d]: 8 UGPRs actually
03:10karolherbst[d]: also funny:
03:10karolherbst[d]: r5 = mov ur5 // delay=4
03:10karolherbst[d]: r5 = imnmx.u32 r5 0xf423f pT // delay=4
03:10karolherbst[d]: ur5 = r2ur r5 // delay=1 wr:0
03:10karolherbst[d]: ohh I thought UIMNMX existed..
03:11mhenning[d]: alright, well it's supposed to discard the new schedule if it's worse than the original in terms of warps/sm
03:11karolherbst[d]: yeah...
03:11mhenning[d]: karolherbst[d]: I think it exists on blackwell
03:11karolherbst[d]: maybe I dig into it tomorrow and figure out why it does
03:11karolherbst[d]: I just checked and it also doesn't...
03:12mhenning[d]: I'm seeing a uimnmx both in my table and in https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#ampere-and-ada-instruction-set for blackwell
03:12karolherbst[d]: mhhh
03:13karolherbst[d]: well that's new to me then 🙂
03:13mhenning[d]: but also that's one we would be better off lowering to a few instructions than converting to gpr and back
03:14karolherbst[d]: which ones tho
03:14mhenning[d]: what do you mean?
03:14karolherbst[d]: don't really see how you can efficiently implement it with U* instructions
03:15karolherbst[d]: mhh maybe UISETP + USEL...
03:15karolherbst[d]: yeah actually...
03:16mhenning[d]: yeah compare + bcsel
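The compare + select lowering mentioned above can be sketched in plain Rust. This is only a model of the idea, not NAK code: `uisetp_lt_u32` and `usel` are hypothetical stand-ins for the UISETP and USEL instructions, and the lowering function names are made up for illustration.

```rust
/// Hypothetical model of UISETP with an unsigned less-than compare:
/// produces a predicate from two 32-bit values.
fn uisetp_lt_u32(a: u32, b: u32) -> bool {
    a < b
}

/// Hypothetical model of USEL: picks `a` when the predicate is set,
/// `b` otherwise.
fn usel(p: bool, a: u32, b: u32) -> u32 {
    if p { a } else { b }
}

/// Unsigned min built from one compare and one select.
fn umin_lowered(a: u32, b: u32) -> u32 {
    usel(uisetp_lt_u32(a, b), a, b)
}

/// Unsigned max uses the same compare with the select operands swapped.
fn umax_lowered(a: u32, b: u32) -> u32 {
    usel(uisetp_lt_u32(a, b), b, a)
}
```

Both variants stay in the uniform register file, which is the point of avoiding the `mov` / `imnmx` / `r2ur` round trip quoted earlier.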
03:17karolherbst[d]: anyway.. I should check why prepass is doing what it's doing
03:18mhenning[d]: Do you have assertions on?
03:19mhenning[d]: there's a debug_assert!() in there that may be useful
03:19karolherbst[d]: yeah it's on
03:19karolherbst[d]: limit is 32 for GPRs e.g.
03:19karolherbst[d]: eh 37
03:19karolherbst[d]: `max_live[RegFile::GPR]` is 32
03:21mhenning[d]: oh, if you set SW_RESERVED_GPRS to 0 does that help?
03:21karolherbst[d]: it doesn't
03:22karolherbst[d]: I do wonder why it spills to regs tho...
03:22karolherbst[d]: like why at all
03:22karolherbst[d]: the shader really looks super weird
03:23karolherbst[d]: https://gist.githubusercontent.com/karolherbst/d148491204924ecbf8e7dfa842f1f0cd/raw/f61b67e1a7cda8866f981bd41bd340df4e7e69ed/gistfile1.txt
03:23mhenning[d]: I mean, that could just be RA being silly
03:24karolherbst[d]: mhhh
03:24karolherbst[d]: yeah but also where are those spills..
03:26karolherbst[d]: there are supposed to be 11 spills to/from regs in there
03:27mhenning[d]: `r5 = sel !p6 rZ 0xffffffff` is a spill from pred -> gpr
03:28karolherbst[d]: ahhh
03:28karolherbst[d]: ohh it spills predicates...
03:28karolherbst[d]: mhhh
03:28karolherbst[d]: ohh yeah it is using tons of predicates..
03:28mhenning[d]: and then `p4 = isetp.ne.u32 rZ r5` is the corresponding unspill
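The two instructions quoted above form a round trip that can be modeled in a few lines of Rust. This is a sketch of the semantics as read from the disassembly (`sel p a b` taken to mean "a if p, else b", `rZ` being the zero register); the function names are hypothetical.

```rust
/// Models `r = sel !p rZ 0xffffffff`: the predicate is stored in a GPR
/// as all-zeros (false) or all-ones (true).
fn spill_pred_to_gpr(p: bool) -> u32 {
    if !p { 0 } else { 0xffff_ffff }
}

/// Models `p = isetp.ne.u32 rZ r`: any non-zero GPR value restores
/// the predicate as true.
fn fill_pred_from_gpr(r: u32) -> bool {
    0 != r
}
```

Each spilled predicate occupies a full 32-bit GPR for as long as it is live, which is why heavy predicate pressure can show up as extra GPR usage.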
03:28karolherbst[d]: was focusing too much on the ugprs
03:29karolherbst[d]: let's see...
03:32karolherbst[d]: soo
03:33karolherbst[d]: `get_schedule_types` gets called with `min_gpr_target: 1`, `max_gpr_target: 21`
03:33karolherbst[d]: and the result is `[RegLimit(37)]`
03:33karolherbst[d]: is that expected?
03:34karolherbst[d]: though I guess with 40 regs it would allow 48 warps/SM...
03:34karolherbst[d]: maybe..
03:34karolherbst[d]: let me do maths
03:34karolherbst[d]: ohhhh...
03:34mhenning[d]: Yeah, that might be right
03:35karolherbst[d]: I think I know what's going on
03:35karolherbst[d]: it doesn't account for the fact that preds get spilled
03:35karolherbst[d]: sooo.
03:35karolherbst[d]: it does find a solution with 40 regs (37 + 3 reserved)
03:35karolherbst[d]: but then RA just spills preds to regs
03:35karolherbst[d]: and then the warps/SM gets reduced
03:36karolherbst[d]: because it goes above 40 regs
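The cliff at 40 registers can be illustrated with a small occupancy calculation. Everything numeric here is an assumption for the sake of the sketch and is not taken from the log: a 64K-entry register file per SM, 32 threads per warp, GPRs allocated in multiples of 8, warps counted in multiples of 4, and a cap of 48 warps/SM. The function name is hypothetical as well.

```rust
/// Sketch of a warps/SM estimate from the per-thread GPR count.
/// All constants are assumptions, see the paragraph above.
fn max_warps_per_sm(gprs: u32) -> u32 {
    const TOTAL_REGS: u32 = 65536; // assumed register file size per SM
    const THREADS_PER_WARP: u32 = 32;
    const MAX_WARPS: u32 = 48; // assumed hardware cap

    // Round the GPR count up to the assumed allocation granularity of 8.
    let gprs = ((gprs.max(1) + 7) / 8) * 8;
    // Warps that fit in the register file, rounded down to a multiple of 4.
    let warps = (TOTAL_REGS / THREADS_PER_WARP) / gprs;
    let warps = warps - (warps % 4);
    warps.min(MAX_WARPS)
}
```

Under these assumptions 40 GPRs still gives 48 warps/SM, while 41 GPRs rounds up to 48 registers per thread and drops to 40 warps/SM, which matches the effect described here when spilled predicates push the count past 40.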
03:36mhenning[d]: It's supposed to take into account spilling
03:36mhenning[d]: but yeah it could be incorrect
03:37karolherbst[d]: well max_live[GPR] is 32
03:38mhenning[d]: what is max_live[Pred]?
03:38karolherbst[d]: 7
03:38karolherbst[d]: ehh wait
03:38karolherbst[d]: that's UGPR
03:38karolherbst[d]: 13
03:38karolherbst[d]: for preds
03:40karolherbst[d]: okay so _actual_ GPR usage seems to be 41 (if I remove my "use free GPRs" opt)
03:41karolherbst[d]: okay...
03:41karolherbst[d]: yeah soo...
03:41karolherbst[d]: it's funny
03:42karolherbst[d]: `const SW_RESERVED_GPRS: i32 = 2;` seems to fix it
03:42karolherbst[d]: Instruction count: 273
03:42karolherbst[d]: Static cycle count: 908
03:42karolherbst[d]: Max warps/SM: 48
03:42karolherbst[d]: Spills to mem: 0
03:42karolherbst[d]: Spills to reg: 9
03:42karolherbst[d]: Fills from mem: 0
03:42karolherbst[d]: Fills from reg: 9
03:42karolherbst[d]: Num GPRs: 40
03:42karolherbst[d]: with that
03:42karolherbst[d]: while that still uses like 60 instructions more, it's less cycles 😄
03:43karolherbst[d]: maybe something doesn't account for something and there is a weirdo off by one somewhere
03:47karolherbst[d]: but that does seem to hurt a lot of other things a lot more
03:49mhenning[d]: yeah, maybe we do reserve an extra register in that case?
03:50karolherbst[d]: let me try something else..
03:50karolherbst[d]: well RA reserves two gprs for par_copy lowering
03:51karolherbst[d]: ehh that's only with DEBUG.spill..
03:52mhenning[d]: It might reserve two for cycle lowering
03:52karolherbst[d]: if I give RA one reg less:
03:52karolherbst[d]: Instruction count: 272
03:52karolherbst[d]: Static cycle count: 892
03:52karolherbst[d]: Max warps/SM: 48
03:52karolherbst[d]: Spills to mem: 0
03:52karolherbst[d]: Spills to reg: 10
03:52karolherbst[d]: Fills from mem: 0
03:52karolherbst[d]: Fills from reg: 10
03:52karolherbst[d]: Num GPRs: 40
03:52karolherbst[d]: SLM size: 0
03:53mhenning[d]: I'd really like to make the reserved regs only live for as long as we use them rather than reserving them for the full shader
03:53karolherbst[d]: so the gpr_limit in assign_regs is 38
03:53karolherbst[d]: and total_gprs is 39
03:53mhenning[d]: I think that might get rid of that problem
03:53karolherbst[d]: and then you have 2 hw reserved regs
03:54karolherbst[d]: so it's actually 41
03:56karolherbst[d]: mhh probably
03:56karolherbst[d]: but which reserved regs do you mean?
03:57karolherbst[d]: because the hw ones are always in use
03:58karolherbst[d]: though if you mean the temp gprs then yeah...
03:58karolherbst[d]: tmp_gprs is 1 here
03:59karolherbst[d]: and setting that to 0 does fix the issue as well
03:59mhenning[d]: Yeah, I mean the ones reserved for RA
04:00karolherbst[d]: mhh right
04:00karolherbst[d]: yeah that probably could help with a couple of things
04:03karolherbst[d]: ohhh yeah
04:03karolherbst[d]: that's big
04:03karolherbst[d]: only partial run, but:
04:03karolherbst[d]: Totals:
04:03karolherbst[d]: CodeSize: 268805104 -> 269371088 (+0.21%); split: -0.24%, +0.45%
04:03karolherbst[d]: Number of GPRs: 1386156 -> 1372716 (-0.97%)
04:03karolherbst[d]: Static cycle count: 155621632 -> 156336784 (+0.46%); split: -0.13%, +0.59%
04:03karolherbst[d]: Max warps/SM: 1468088 -> 1470916 (+0.19%)
04:03karolherbst[d]: tho not sure if that's better or worse 😄
04:04karolherbst[d]: and it might fail RA in other cases because my patch is dirty
04:05karolherbst[d]: anyway, I should go to sleep 😪
11:36mohamexiety[d]: mmarchini: They’re both p much the same. It’s just there’s more info than what hwinfo can express so those would need an extra method to export
13:36phomes_[d]: mohamexiety[d]: I pushed the perfetto layer to https://gitlab.freedesktop.org/phomes/mesa/-/commits/nvk-perfetto-layer-hacks
13:36phomes_[d]: It adds an internal layer that does perfetto tracing to most vulkan api calls. It reuses some of the layering infra from the debug hud so it sits on top of that. Perfetto is just the top commit
13:39phomes_[d]: you can take a look at that and see if that approach fits into your bigger perfetto/u_trace picture. I am fine with you taking on the work. I just wanted to put my WIP things out in case it is useful
14:41mohamexiety[d]: phomes_[d]: Awesome, thanks a ton!
18:51apkumar: Hi y’all, I’m trying to dump the vbios for 8 Nvidia h100 sxms and I’m having some trouble. I’ve tried 3 approaches: 1. Bind them with vfio-pci and try to use vfio-ioctl to read BAR0 regions for vbios kind of how nvagetbios.c does it. This doesn’t work because either the pramin register for steering the window doesn’t move it, or everything in BAR0 is 0xAA…
18:51apkumar: 2. In a vm with nouveau and gsp firmware, try mounting debugfs and looking in `/sys/kernel/debug/dri/*/vbios.rom`. I gave up on this because vbios.rom is always empty
18:51apkumar: 3. I just used nvflash, which worked
18:52apkumar: Any idea why the first two things didn’t work and what I might have been doing wrong?
18:53karolherbst: apkumar: nouveau might not have succeeded to load on the GPU?
18:54karolherbst: the vbios.rom is only available if it could bring up the device properly
18:55apkumar: nouveau recognized and initialized the GH100s. (sees 80GB VRAM and DRM registered)
18:56karolherbst: mhhh
18:57karolherbst: have you actually tried to read the vbios.rom file?
18:58karolherbst: ohh mhh but it's also empty here with cat...
18:58karolherbst: huh...
18:58karolherbst: maybe it's broken now...
18:58karolherbst: or maybe with GSP it's always empty...
18:58karolherbst: okay 2 sounds like a nouveau bug
19:02karolherbst: apkumar: also.. the prom method is generally what you want to use on modern GPUs, and that worked on my turing. But the vbios is in its original raw form
19:06apkumar: prom as in BAR0 + 0x300000? I think that was all 0xAA for me
19:07apkumar: I tried reading vbios.rom with cat, and that was empty :(
19:09apkumar: as a general question: do y'all know what nvflash is doing different from nouveau? AI is telling me that it's using SPI flash interface, but I have no idea if that's accurate
19:09karolherbst: apkumar: prom access needs to be enabled first, see https://github.com/envytools/envytools/blob/master/nva/nvagetbios.c#L121
19:10karolherbst: worked on my TU102 at least
19:15karolherbst[d]: okay.. we have a couple of `float_controls2` tests failing in the CTS now
19:15karolherbst[d]: `dEQP-VK.spirv_assembly.instruction.compute.float_controls2.fp16.input_args.cross_testedWithout_NotNaN_arg1_nan_arg2_one_res_nan_deco` e.g.
19:18mhenning[d]: That test passes here on main from a few days ago ( 4a654aee7cd5b99e74ebcc53755a1f3690bccf65 )
19:24karolherbst: oh yeah.. I think it broke like yesterday or so. Some float_controls2 stuff went in
19:25karolherbst: maybe it was fbc056220381504296b1bff68181b000fe200359 ? checking here
19:32karolherbst[d]: fastest git bisect in the world
19:32karolherbst[d]: it is fbc056220381504296b1bff68181b000fe200359
19:40apkumar: karolherbst: for some reason, zeroing 0x88050 doesn't work, but if I scan the whole prom region, I do find the vbios in there at 0x41200. Is that a magic number, or should I have been pointed there by something else? thanks, btw
19:41karolherbst: no idea tbh
19:42karolherbst: maybe it's also something new with hopper or so..
19:42karolherbst: or new with older gens
19:44karolherbst: nvidia started to do multi part vbios, so I wouldn't be surprised if it's related to that somehow
19:50apkumar: ok gotcha. thanks so much!
19:52airlied[d]: they might have moved some of the regions on hopper
19:52airlied[d]: I think the pci mirror moved
19:54karolherbst[d]: ohh so it's not at 0x88000 anymore?
19:58airlied[d]: 0x92000 on gh
20:00karolherbst[d]: ohhh
20:00karolherbst[d]: yeah so wouldn't be `0x88050` then but `0x92050`
20:00karolherbst[d]: but the prom region could have also moved soo..
20:00karolherbst[d]: or it is `0x41200`
20:02karolherbst[d]: okay.. ULDC looking promising now:
20:02karolherbst[d]: Totals from 16769 (5.70% of 294397) affected shaders:
20:02karolherbst[d]: CodeSize: 193807568 -> 193100528 (-0.36%); split: -0.41%, +0.04%
20:02karolherbst[d]: Number of GPRs: 796088 -> 791944 (-0.52%); split: -0.53%, +0.01%
20:02karolherbst[d]: Static cycle count: 150269722 -> 149578613 (-0.46%); split: -0.52%, +0.06%
20:02karolherbst[d]: Max warps/SM: 676768 -> 678876 (+0.31%); split: +0.31%, -0.00%
20:02karolherbst[d]: still running and the stats seem to go up
20:03apkumar: airlied: do you happen to know if it's also that on blackwells?
20:03airlied[d]: yes everything after hopper moved
20:05apkumar: so golden path would be to zero 0x92050 to enable prom access and then scan for 0x55AA till we find a valid vbios
20:05apkumar: though it may not be necessary to enable since I was able to do this while failing to use 0x88050
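The scan step described above can be sketched over an already-dumped copy of the region. This is only an illustration, not what nvagetbios.c or nouveau does: it looks for the standard PCI expansion ROM signature bytes 0x55 0xAA and, to cut false positives, checks that the 16-bit little-endian pointer at offset 0x18 leads to a "PCIR" data structure. The function name and the `step` alignment parameter are hypothetical, and NVIDIA images may use other signatures that this sketch does not cover.

```rust
/// Scans `buf` (a raw dump of the PROM window) at `step`-byte intervals
/// and returns the offsets that look like PCI expansion ROM images.
fn find_rom_images(buf: &[u8], step: usize) -> Vec<usize> {
    assert!(step > 0, "step must be non-zero");
    let mut hits = Vec::new();
    let mut off = 0;
    // The header must be long enough to hold the pointer at 0x18..0x1a.
    while off + 0x1a <= buf.len() {
        if buf[off] == 0x55 && buf[off + 1] == 0xaa {
            // Offset of the PCI data structure, relative to the image start.
            let pcir = u16::from_le_bytes([buf[off + 0x18], buf[off + 0x19]]) as usize;
            let start = off + pcir;
            if start + 4 <= buf.len() && &buf[start..start + 4] == b"PCIR" {
                hits.push(off);
            }
        }
        off += step;
    }
    hits
}
```

Any hit is relative to the start of the dumped buffer, so the base of the PROM window has to be added back to get a BAR0 offset.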
20:07airlied[d]: we don't need the bios on hopper+ so probably why nobody has cared
20:22apkumar: correction: hopper vbios was found at 0x341200 (I forgot I was using the offset from nvagetbios.c)
20:58apkumar: on a GB100, zeroing 0x92050 doesn't work, nor does 0x55AA show up in a full scan BAR0
20:59airlied[d]: is that a spark? I wonder if they even have a vbios
21:06apkumar: no, it's a supermicro hgx b200
22:55mohamexiety[d]: airlied[d]: spark is gb20b (same lineage as gb20x)
22:55mohamexiety[d]: thor is gb10b, same lineage as gb10x
23:04karolherbst[d]: some shaders with ULDC: Num GPRs: 80 -> 24 🙃
23:05karolherbst[d]: Totals:
23:05karolherbst[d]: CodeSize: 9316341632 -> 9328340640 (+0.13%); split: -0.17%, +0.29%
23:05karolherbst[d]: Number of GPRs: 47269299 -> 46631726 (-1.35%); split: -1.36%, +0.01%
23:05karolherbst[d]: SLM Size: 5409064 -> 5408136 (-0.02%); split: -0.02%, +0.00%
23:05karolherbst[d]: Static cycle count: 6107729207 -> 6053739912 (-0.88%); split: -0.92%, +0.03%
23:05karolherbst[d]: Spills to memory: 44514 -> 43465 (-2.36%); split: -2.77%, +0.42%
23:05karolherbst[d]: Fills from memory: 44514 -> 43465 (-2.36%); split: -2.77%, +0.42%
23:05karolherbst[d]: Spills to reg: 187271 -> 201674 (+7.69%); split: -0.35%, +8.04%
23:05karolherbst[d]: Fills from reg: 227424 -> 234604 (+3.16%); split: -0.28%, +3.44%
23:05karolherbst[d]: Max warps/SM: 50674224 -> 50914112 (+0.47%); split: +0.47%, -0.00%
23:08karolherbst[d]: but I think there are more places to make use of it but not sure...
23:09karolherbst[d]: not quite sure how to implement legalize for ULDC...