16:50 karolherbst[d]: gfxstrand[d]: I wonder if I should add a `nir_lower_mem_access_bit_sizes_options.uniform_addr_mode` enum field with `BYTE`, `INT`, `VEC4` as the values and just special treat `load_uniform` like that in the pass...
16:52 karolherbst[d]: but I totally don't want to add the real code for anything besides `BYTE` :ferrisUpsideDown:
16:52 karolherbst[d]: maybe I'll just assert on the value and say "if it's not byte, please don't use this"
16:55 asdqueerfromeu[d]: karolherbst[d]: Is `BYTE` an `unsigned char`?
16:56 karolherbst[d]: though I think we really only have `BYTE` and `INT`, `BYTE` == 1 slot is a full int, and `INT` means a slot is a full vec4
17:00 gfxstrand[d]: 🤷🏻‍♀️
17:01 gfxstrand[d]: You could also just make it handle wide loads in the backend
17:12 karolherbst[d]: that doesn't solve the problems with unaligned loads, which I think are theoretically possible as well
17:12 karolherbst[d]: though `load_uniform` isn't fit for dealing with that anyway
17:12 gfxstrand[d]: It really isn't
17:13 gfxstrand[d]: If we want a thing that can handle those, it may be time to use a different variable mode
17:13 karolherbst[d]: yeah... maybe
17:14 karolherbst[d]: though most drivers just use ubos and that's totally fine
17:15 karolherbst[d]: maybe I'm just using cb1 instead and not having to deal with this :ferrisUpsideDown:
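(A minimal sketch of the "assert on the value" idea from 16:50 to 16:52, not actual Mesa code. The `uniform_addr_mode` field is only proposed in the discussion above; the enum names, the `_sketch` struct and both helper functions are hypothetical stand-ins, and no real NIR API is called.)

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: the addressing modes named in the discussion. This enum
 * does not exist in NIR; it only mirrors the proposed BYTE/INT/VEC4 values. */
enum uniform_addr_mode {
   UNIFORM_ADDR_MODE_BYTE,
   UNIFORM_ADDR_MODE_INT,
   UNIFORM_ADDR_MODE_VEC4,
};

/* Hypothetical stand-in for the part of the pass options that would carry
 * the proposed field. */
struct lower_mem_access_options_sketch {
   enum uniform_addr_mode uniform_addr_mode;
};

/* Only the byte mode is implemented in this sketch. */
static bool
uniform_addr_mode_supported(enum uniform_addr_mode mode)
{
   return mode == UNIFORM_ADDR_MODE_BYTE;
}

/* Entry point of the hypothetical load_uniform handling: refuse anything
 * that is not byte addressed instead of silently miscompiling it. */
static void
lower_load_uniform_sketch(const struct lower_mem_access_options_sketch *opts)
{
   assert(uniform_addr_mode_supported(opts->uniform_addr_mode) &&
          "only byte-addressed load_uniform is handled, please don't use this otherwise");
   /* The actual lowering of load_uniform would go here. */
}
```

(With this shape, a driver that sets any mode other than the byte one trips the assert in debug builds rather than getting wrong offsets.)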
21:17 airlied[d]: gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29746 aligning with nvidia advice
21:18 gfxstrand[d]: Sure. It very much doesn't matter but sure.
21:29 skeggsb9778[d]: airlied[d]: does it get rid of the random mmu faults you were asking about?
21:29 airlied[d]: it seems to get rid of the f40 gnome + emacs + firefox random deaths I was seeing
21:30 airlied[d]: the original problem I was chasing was a VIRT_WRITE from a customer, so probably not that one, but I've no reproducer on that
21:30 karolherbst[d]: what's the bug number anyway?
21:30 karolherbst[d]: maybe I'd have an idea
21:31 airlied[d]: going to test it some more today
21:31 karolherbst[d]: did you get a push buffer dump on it already?
21:31 karolherbst[d]: anyway, wouldn't be surprised if it's a shader doing something weird
21:32 karolherbst[d]: maybe we should just wire up nak in gl and....
21:34 airlied[d]: the customer bug is like we run the desktop for months we get one of these
21:34 airlied[d]: I sent the jira internally to you
21:34 karolherbst[d]: uhh...
21:34 airlied[d]: I was just hunting around for things that might be similar and someone pointed out the VIRT_READ one
21:34 karolherbst[d]: `[565337.063551] nouveau 0000:01:00.0: fifo: fault 01 [VIRT_WRITE] at 0000000002802000 engine 40 [gr] client 13 [GPC0/PROP_0] reason 00 [PDE] on channel 2 [00ff5ad000 systemd-logind[1169]]` 🥲
21:35 karolherbst[d]: sooo
21:35 karolherbst[d]: in the past
21:35 karolherbst[d]: when trying to pass the CTS
21:35 karolherbst[d]: I got some of this after hours as well
21:35 karolherbst[d]: something is wrong
21:35 karolherbst[d]: I have no idea what
21:35 airlied[d]: but it's also running old rhel8 mesa, which might not have all the multithread fixes, but they aren't willing to move to rhel 9
21:35 karolherbst[d]: but something is
21:35 karolherbst[d]: ahh
21:35 airlied[d]: yeah we talked about it on the call, skeggsb9778[d] still suspects some fencing race somewhere
21:36 karolherbst[d]: though chrome doesn't use multithreading with nouveau
21:36 karolherbst[d]: (but CEF/electron based apps do pick nouveau by default, because it's a chromium/chrome private workaround)
21:37 karolherbst[d]: airlied[d]: yeah.. would be my bet as well
21:38 karolherbst[d]: maybe we overflow some value somewhere as well...
21:38 asdqueerfromeu[d]: karolherbst[d]: How does systemd crash a graphics driver? 🤔
21:38 karolherbst[d]: asdqueerfromeu[d]: it created the FD
21:38 karolherbst[d]: and then passed to the compositor
21:38 karolherbst[d]: `channel 2` is most likely the context of the compositor
21:49 karolherbst[d]: airlied[d], skeggsb9778[d]: sooo here is a thought: what if userspace accesses a bo via shaders/commands, but doesn't _always_ add it to the bos in the submission? That could fail in a similar way, no? I wonder if we have some silly buffer tracking bug that almost never hits
21:50 karolherbst[d]: can we add a mode to the kernel driver, which pages out _all_ memory not referenced?
21:50 karolherbst[d]: I wonder if that would help us spot such bugs
21:51 skeggsb9778[d]: i've actually hacked that in before (many many many years ago) to debug such things - perhaps it'd be a good idea for debugging
21:51 skeggsb9778[d]: such issues*
21:51 karolherbst[d]: maybe it should be a flag on channel creation so it's opt in
21:51 karolherbst[d]: is only useful for the old UAPI anyway
21:52 karolherbst[d]: or maybe even useful for the new one
21:52 skeggsb9778[d]: but yeah, i have looked a number of times to try and find missed bos in submission etc, but never had any luck
21:52 karolherbst[d]: not sure how the VM_BIND uapi manages the page tables
21:52 skeggsb9778[d]: which is why i started suspecting fences
21:52 karolherbst[d]: mhhh
21:52 skeggsb9778[d]: but, it is *very* rare, so could definitely still be a missed bo
21:52 karolherbst[d]: yeah..
21:53 karolherbst[d]: so my idea would be to add such a mode to the driver, then run the CTS until you hit such a bug or so :ferrisUpsideDown:
21:53 karolherbst[d]: but I know absolutely nothing of that area of the driver
21:54 skeggsb9778[d]: ttm has changed heaps since i last hacked it in, but it used to mostly just be a ttm_bo_evict_all() or something like that before validating buffers back in
21:54 karolherbst[d]: mhhh
21:55 skeggsb9778[d]: perhaps locking considerations etc to take into account too, if one were to try and upstream a patch doing it
21:55 karolherbst[d]: right
21:56 skeggsb9778[d]: you actually don't really want a full eviction either (where vram gets moved to system memory)
21:56 skeggsb9778[d]: just the vmas unmapped from the channel
21:56 karolherbst[d]: yeah
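(A toy userspace model of the debug mode discussed from 21:50 to 21:56: before each submission, unmap every buffer the submission did not list, so an access to a missed bo faults immediately instead of once in months. Everything here is hypothetical, including `toy_channel` and both functions; it uses no nouveau or TTM interfaces and only flips mapping flags, matching the point that a full eviction to system memory is not wanted.)

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BOS 8

/* Toy model: one channel with a fixed table of buffer objects. */
struct toy_channel {
   bool mapped[MAX_BOS];   /* is this bo's VMA currently mapped in the channel? */
};

/* Hypothetical debug step run before a submission executes: drop every
 * mapping, then map only the bos the submission explicitly references.
 * Out-of-range indices in the list are ignored. */
static void
toy_prepare_submit(struct toy_channel *ch, const int *bo_list, size_t count)
{
   for (size_t i = 0; i < MAX_BOS; i++)
      ch->mapped[i] = false;

   for (size_t i = 0; i < count; i++) {
      if (bo_list[i] >= 0 && bo_list[i] < MAX_BOS)
         ch->mapped[bo_list[i]] = true;
   }
}

/* Returns true if an access to `bo` by a shader or command would fault. */
static bool
toy_access_faults(const struct toy_channel *ch, int bo)
{
   return bo < 0 || bo >= MAX_BOS || !ch->mapped[bo];
}
```

(In this model a submission listing bos 1 and 3 that also touches bo 2 faults on bo 2 every time, which is the deterministic behaviour such a mode would be after.)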
22:58 airlied[d]: I also did a missed bo ref check last week and didn't spot any obvious ones
22:59 airlied[d]: I think the "fence released before the GPU was finished" explanation is probably still more viable, it's just a matter of getting it reproducible
22:59 airlied[d]: should probably merge the busy wait fence removal patch since I think myself and danilo considered a race in the current code
23:07 skeggsb9778[d]: yeah, probably no good reason for that to exist still