01:07 airlied[d]: _lyude[d]: it does appear my ad102 rtx6000 fails to resume; can you access the system at all after resume?
01:08 airlied[d]: oh I got ssh back in
01:36 _lyude[d]: airlied[d]: yep - it's just gsp that times out
03:51 mangodev[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1460481432966267006/image.png?ex=696712d4&is=6965c154&hm=364fa25b8fa2eb87cd0bb9fc0b70fc153d6678fb83a665bab6761db09e996e7f&
03:51 mangodev[d]: this is getting kinda tiring
03:53 mangodev[d]: chrome and derivatives seem to have a random chance of throwing this error when loading images, and it's specifically image loading that triggers it
04:44 mhenning[d]: does it happen with specific images?
04:46 mhenning[d]: oh wait, I just managed to reproduce an assert failure on chromium. that's worth looking into
04:48 mhenning[d]: okay, filed https://gitlab.freedesktop.org/mesa/mesa/-/issues/14647
06:02 airlied[d]: _lyude[d]: doesn't seem to be failing because of the size calcs from what I can see, at least it seems like we use close to the same values as 570 does
07:06 mangodev[d]: mhenning[d]: :D
10:24 snowycoder[d]: airlied[d]: firefox and thunderbird are the only apps that freeze sometimes, and the freeze also persists across app restarts.
10:24 snowycoder[d]: I have no idea how to debug that though
13:54 karolherbst[d]: fyi, I'll focus more on the load/store ugpr+gpr+const stuff and try to get that ready and merged. Though for that I also have to start thinking about how to solve the issue with ugprs in non-uniform control flow situations... though _maybe_ we should just make sure that base addresses are loaded in uniform control flow and then the offset calcs are non-uniform...
16:01 _lyude[d]: airlied[d]: I'm going off what was supposedly said in the gsp log
16:30 _lyude[d]: airlied[d]: btw - https://discordapp.com/channels/1033216351990456371/1034184951790305330/1450800268026314947 this is what I was referring to
17:00 mhenning[d]: karolherbst[d]: I have ideas for the "ugprs in non-uniform control flow" situation but it needs a bit more reverse engineering before I can figure out anything concrete
17:02 mhenning[d]: although does a first iteration actually need that? You can check `nak_block_is_divergent` to figure out if you can write ugprs in nir passes
17:04 karolherbst[d]: atm I'm checking on the nir side if I can rely on the base address to be within an ugpr and it seems reliable enough
17:04 karolherbst[d]: not using `nak_block_is_divergent` tho, so maybe I should
17:04 karolherbst[d]: but yeah, I had to take that into account otherwise legalizing would undo the opt and the stats weren't really great 🙂
17:06 mhenning[d]: yeah, if you have any checks on nir_block.divergent they should be changed to use nak_block_is_divergent instead
17:06 karolherbst[d]: I'm basically doing what `nak_block_is_divergent` is doing 🙃
17:07 karolherbst[d]: wasn't aware we have a helper for it
17:07 mhenning[d]: oh, yeah the helper is pretty recent
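(A minimal sketch of the check described above, in C against the NIR API: it assumes `nak_block_is_divergent()` as the divergence query and uses a hypothetical `rewrite_with_uniform_base()` helper; the intrinsic matched here is only an example, not necessarily what the real opt operates on.)
```c
/* Sketch only: gate the rewrite on the convergence of the block that defines
 * the base address, so the base can safely end up in a ugpr.  Intended as a
 * callback for nir_shader_intrinsics_pass(); rewrite_with_uniform_base() is
 * hypothetical. */
static bool
opt_uniform_base_cb(nir_builder *b, nir_intrinsic_instr *intrin, void *data)
{
   if (intrin->intrinsic != nir_intrinsic_load_global)
      return false;

   nir_def *addr = intrin->src[0].ssa;
   nir_block *def_block = addr->parent_instr->block;

   /* If the address is computed under divergent control flow we can't rely
    * on it being written to a ugpr, so leave the load alone. */
   if (nak_block_is_divergent(def_block))
      return false;

   return rewrite_with_uniform_base(b, intrin); /* hypothetical helper */
}
```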
17:11 karolherbst[d]: I wonder if I want to write a pass that moves instructions producing uniform values from `nak_block_is_divergent` blocks to uniform blocks
17:12 karolherbst[d]: or do we have something like that already?
17:12 karolherbst[d]: though it's also something that opt_licm would do...
17:12 karolherbst[d]: probably
17:12 karolherbst[d]: partially at least
17:13 karolherbst[d]: I'd need to play around with it...
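(Roughly what such a hoisting pass could look like, again assuming `nak_block_is_divergent()`; this is illustrative only, and a real pass would additionally have to verify that every source of a moved instruction dominates the target block.)
```c
/* Sketch of the idea above: hoist instructions that produce uniform values
 * out of divergent blocks into the nearest convergent dominator, so their
 * results can be kept in ugprs.  Divergence info is assumed to be up to
 * date; source-dominance checks are omitted for brevity. */
static bool
hoist_uniform_defs(nir_function_impl *impl)
{
   bool progress = false;

   nir_metadata_require(impl, nir_metadata_dominance);

   nir_foreach_block(block, impl) {
      if (!nak_block_is_divergent(block))
         continue;

      /* Walk up the dominator tree until we leave divergent control flow. */
      nir_block *target = block->imm_dom;
      while (target && nak_block_is_divergent(target))
         target = target->imm_dom;
      if (!target)
         continue;

      nir_foreach_instr_safe(instr, block) {
         /* Only pure ALU ops whose result is uniform; anything with side
          * effects or a divergent result has to stay put. */
         if (instr->type != nir_instr_type_alu)
            continue;
         if (nir_instr_as_alu(instr)->def.divergent)
            continue;

         progress |= nir_instr_move(nir_after_block_before_jump(target), instr);
      }
   }

   return progress;
}
```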
17:14 mhenning[d]: I think we really just want to allow writes to ugprs in non-uniform control flow, and I'm pretty sure we can if the hardware works the way I think it works
17:14 karolherbst[d]: mhh yeah...
17:14 mhenning[d]: but that's a pretty big project
17:14 karolherbst[d]: it _should_ be predictable enough
17:16 karolherbst[d]: but yeah, I can ignore the issue for now and just land something good enough (tm). Most of the work is encoding and not messing up anyway...
18:29 _lyude[d]: Does anyone know how to get OpenRM to actually print LEVEL_INFO messages in the driver? Setting `DEBUG=1` when building the kernel module doesn't seem to make a difference, manually changing the default `NVLOG_LEVEL` in src/nvidia/inc/kernel/core/printf.h doesn't seem to help either, and defining NVLOG_ENABLE to 1 just makes the build fail
19:42 airlied[d]: _lyude[d]: yup I saw the earlier log stuff, but I'm not seeing a fail on the suspend driver unload at all; we have a WARN in nouveau for that. But I also dumped sizes from openrm, and we use the same ones from what I can see
19:42 _lyude[d]: interesting
19:43 _lyude[d]: at the very least I now know that R570 of openrm is able to suspend the GPU properly, and I think I understand how to turn on openrm's debug messaging so I can hopefully follow along with what's happening better
19:44 _lyude[d]: though I can't spend too much longer on this today since I've got other stuff at work to attend to
19:46 _lyude[d]: woo! yep i got it, it's finally spitting out all of the debug messages now 🙂
19:59 marysaka[d]: _lyude[d]: I think it's NVreg_RmMsg that you want to set up? (https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/197)
19:59 _lyude[d]: yeah I got it!
19:59 _lyude[d]: thank you though
21:06 _lyude[d]: Something else weird that's been happening around this issue that somehow I still haven't managed to figure out: https://lyude.net/~lyudess/tmp/goldenwind-vbl-wait-fail.txt
21:06 _lyude[d]: I often end up seeing this warning right before suspend, but I actually have no idea where this could possibly be coming from. The obvious assumption would be that we're not stopping fbdev before suspending, except that the first thing we do in `nouveau_display_suspend()` is call `drm_client_dev_suspend()` which should be stopping fbdev
21:42 airlied: _lyude[d]: does that sync the submitted vbl waits?
21:48 _lyude[d]: wait. actually - we might be skipping this call because apparently we have never filled out the atomic_check/atomic_commit callback fields...
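(For context, an illustrative sketch of how those mode_config callback fields are normally wired up in a DRM driver using the generic atomic helpers; this is not the actual nouveau change, just what the plumbing typically looks like.)
```c
#include <drm/drm_atomic_helper.h>
#include <drm/drm_gem_framebuffer_helper.h>

/* Illustrative only: the mode_config funcs a DRM driver normally fills in so
 * the generic atomic helpers (and in-kernel clients such as the fbdev
 * emulation) can check and commit atomic state. */
static const struct drm_mode_config_funcs demo_mode_config_funcs = {
	.fb_create     = drm_gem_fb_create,
	.atomic_check  = drm_atomic_helper_check,
	.atomic_commit = drm_atomic_helper_commit,
};

static void demo_mode_config_init(struct drm_device *dev)
{
	dev->mode_config.funcs = &demo_mode_config_funcs;
}
```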
21:52 _lyude[d]: OK - that bug is fixed now at least
22:17 _lyude[d]: btw airlied[d] at some point could you review https://patchwork.freedesktop.org/series/159325/ ? it should be very simple. Tl;dr: we're not grabbing the lock we need to write to the core nvdisplay channel during some cursor updates; the series makes us grab it and adds WARN_ONs so we catch it in the future if it happens again
22:42 karolherbst[d]: _lyude[d]: if you want to spend some time on a cursed issue, here is one instance of "if the GPU is faster, disp doesn't cause issues" https://gitlab.freedesktop.org/drm/nouveau/-/issues/74#note_3279149 and I'm sometimes wondering if we have a bunch of issues we never figured out that are basically "VRAM too slow"
22:43 _lyude[d]: karolherbst[d]: I wish I could but unfortunately this suspend/resume issue is cursed enough 🙁
22:44 karolherbst[d]: ohh yeah, not asking you to do a deep debug, I just don't know how to parse the disp error dumps 🙃
23:16 _lyude[d]: karolherbst[d]: I see one incomplete display state dump in https://gitlab.freedesktop.org/drm/nouveau/-/issues/74#note_3279149 but I don't see one anywhere else, did I miss something?
23:36 karolherbst[d]: it's incomplete? well, if so we probably should ask for a complete one
23:48 matt_schwartz[d]: confirmed my blackwell gpu is working again on -rc5 🥳