14:42marysaka[d]: phomes_[d]: phomes_[d] can you try to run `dEQP-VK.renderpasses.renderpass1.dedicated_allocation.formats.r16g16b16a16_unorm.input.clear.store.*` (no deqp-runner, just a regular deqp-vk run) with current main?
14:43marysaka[d]: I reproduce the MMU fault about 3 times out of 4 with large pages + comp pages on the kernel side
14:43marysaka[d]: it seems related to 64KiB pages, as 2MiB pages seem fine
20:38phomes_[d]: yes I can repro that
20:41phomes_[d]: I have CONFIG_PREEMPT_DYNAMIC=y so this was with `preempt=full` on the kernel command line
20:52phomes_[d]: it is not always the same test in that group that fails. But the fault log is always:
20:52phomes_[d]: `nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:15 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003ffdf00000 fault_type:00000002`
21:08mhenning[d]: So I guess the current best guess is that https://gitlab.freedesktop.org/mesa/mesa/-/issues/14610 is some sort of kernel locking issue?
21:09mhenning[d]: tbh I'm spooked enough by it at this point that I think we should turn compression off for 6.19 and try again for 6.20
21:18airlied[d]: fun, time to start modelling preempt races in my brain
21:29marysaka[d]: mhenning[d]: The test sequence that results in a fault (at address `0x3ffdf08000` for me) does the following with the VA it allocates:
21:29marysaka[d]: - The previous test allocates a VA at `0x3ffdf00000` with a size of `0x10000` (64K), then frees it when the test finishes
21:29marysaka[d]: - The current test allocates a VA at `0x3ffdf00000` with a size of `0x20000` (128K), queues work, and faults
21:29marysaka[d]: - [repeat sequence for X tests]
21:29marysaka[d]: It seems to not happen when 64KiB pages aren't exposed and I force alignment to 2MiB
21:29marysaka[d]: So maybe it's not locking; maybe some TLB invalidation goes wrong when it's done in quick succession on the same address space? (rough sketch below)
21:29marysaka[d]: ... or it's locking
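A rough sketch of that sequence, using hypothetical helpers (`gpu_va_map`, `gpu_va_unmap`, `submit_work`) standing in for the real nouveau VM_BIND/submit paths rather than any actual API. Note that the faulting address `0x3ffdf08000` is `VA_BASE + 0x8000`, i.e. inside the freshly re-bound 128 KiB range:

```c
#include <stdint.h>

#define VA_BASE 0x3ffdf00000ULL

/* Hypothetical helpers, not the actual nouveau/NVK interfaces. */
extern void gpu_va_map(uint64_t va, uint64_t size);
extern void gpu_va_unmap(uint64_t va, uint64_t size);
extern void submit_work(uint64_t va);

void repro_sequence(void)
{
	/* Previous test: bind 64 KiB at VA_BASE, use it, then unbind. */
	gpu_va_map(VA_BASE, 0x10000);
	submit_work(VA_BASE);
	gpu_va_unmap(VA_BASE, 0x10000);	/* old translation has to go away here */

	/* Current test: immediately re-bind 128 KiB at the same VA. */
	gpu_va_map(VA_BASE, 0x20000);
	submit_work(VA_BASE);		/* faults at VA_BASE + 0x8000 */
}
```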
21:30marysaka[d]: I have been trying to get some logs out of all the VMM_SPAM macros defined around the codebase, without success so far...
21:30airlied[d]: locking around a tlb flush 😛
21:30mhenning[d]: yeah, it could be something with page table updates
21:33marysaka[d]: considering the fault_type is 2, yeah it must be PTE
21:33marysaka[d]: just the fact that it affects Blackwell might mean it's not related to the actual page update codepaths, since the MMU format changed etc.
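If the TLB-invalidation theory is right, the failure mode would look roughly like the sketch below; everything here uses made-up names (`pte_clear`, `pte_write`, `tlb_invalidate`), not the actual nvkm_vmm code:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the page-table/TLB operations discussed
 * above; none of these are real nouveau symbols. */
extern void pte_clear(uint64_t va, uint64_t size);
extern void pte_write(uint64_t va, uint64_t size, unsigned page_shift);
extern void tlb_invalidate(uint64_t va, uint64_t size);

/* Safe ordering when a VA range is torn down and immediately reused. */
void rebind(uint64_t va)
{
	pte_clear(va, 0x10000);       /* tear down the old 64 KiB mapping */
	tlb_invalidate(va, 0x10000);  /* must complete before the range is reused */
	pte_write(va, 0x20000, 16);   /* new 128 KiB mapping, 64 KiB pages */
}

/*
 * Suspected failure mode: if the invalidate is skipped, still in flight,
 * or raced by a concurrent update of the same span, the GPU can translate
 * through a stale entry and land on an invalid PTE, which would show up
 * as the fault_type:00000002 seen above.
 */
```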
21:37marysaka[d]: airlied[d]: we do not lock in certain cases that look a bit suspicious to me, like `nvkm_vmm_map` or any of the vmm "raw" variants it seems? so that means all of nouveau_uvmm isn't locking anything, from what I understand?
21:37airlied[d]: the locking for uvmm is higher level, using gpuvm
21:38marysaka[d]: okay so not related I guess
21:39airlied[d]: well there could be races in gpuvm 🙂
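To make the layering airlied describes concrete: the "raw" lower-level helpers assume the caller serializes access, and the lock lives one level up (gpuvm/uvmm). A generic sketch of that pattern, with invented names rather than the real drm_gpuvm or nvkm interfaces:

```c
#include <pthread.h>
#include <stdint.h>

/* Invented names illustrating the two-layer split; none of these are
 * real nouveau or drm_gpuvm symbols. */
static pthread_mutex_t va_space_lock = PTHREAD_MUTEX_INITIALIZER;

/* "Raw" lower-level helper: writes page tables, takes no lock itself
 * and relies on the caller holding va_space_lock. */
static void vmm_raw_map(uint64_t va, uint64_t size)
{
	(void)va;
	(void)size;
	/* ... write PTEs ... */
}

/* Upper-level entry point (the uvmm/gpuvm analogue) owns the locking. */
void uvmm_bind(uint64_t va, uint64_t size)
{
	pthread_mutex_lock(&va_space_lock);
	vmm_raw_map(va, size);
	pthread_mutex_unlock(&va_space_lock);
}
```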
22:07mohamexiety[d]: the weird thing is how hyper specific it is
22:07mohamexiety[d]: doesn't happen with 4K, doesn't happen with 2M, but happens specifically with 64K, and only in some cases
22:07mhenning[d]: mhenning[d]: submitted as https://lore.freedesktop.org/nouveau/20260116-disable_large_page-v1-1-fdbf85603353@darkrefraction.com/T/#u
22:07mohamexiety[d]: a locking issue would be a lot more noticeable I'd have guessed
22:09mhenning[d]: well, these kinds of bugs can be very sensitive to things like timing. It's possible 2M is broken too, it's just that we don't have an easy repro case
22:09airlied[d]: yup, 64k is probably just a sweet spot of timing
22:09mohamexiety[d]: but 2M and 64K should have basically identical handling, right?
22:12mhenning[d]: marysaka[d]: Right, but e.g. in Mary's message earlier the tests are allocating 64K, 128K, etc. regions. That might mean the specific test we're looking at doesn't exercise the 2M case in the same way.