IRC Logs of #radeon on irc.freenode.net for 2023-04-18

02:04 mareko: amdgpu.vm_fault_stop will stop the GPU on a VM fault, so you can investigate it with umr or other tools
02:19 kchibisov: Is amdgpu.vm_debug useful?
03:52 mareko: kchibisov: I only know that Christian Koenig knows how to use the options, agd5f?
04:04 airlied: don't think those are useful for user debug
07:52 mareko: AMD_DEBUG=check_vm should dump VM fault info into ~/ddebug_dumps, but I don't know if that option still works
11:43 kchibisov: mareko: I've tried with it and nothing appeared for me in the logs.
11:43 kchibisov: Though, maybe the fault should be fatal for this option to work.
17:17 agd5f: kchibisov, vm_fault_stop makes it fatal. That causes the memory controller to halt on a page fault
17:18 kchibisov: Yeah, that part I've understood.
17:18 kchibisov: So to provide anything useful to act on, I should run with vm_fault_stop enabled and then collect something with urm?
17:19 kchibisov: It's not clear what data could help from urm, to understand the cause of crash.
17:20 kchibisov: s/urm/umr/
17:20 agd5f: It just keeps the state of the GPU as is for further debug. Most of the useful information is in the page fault info printed in the log
17:20 agd5f: That will give you the faulting address, the IP block that caused the fault, whether it was a read or a write, and the type of fault
17:21 kchibisov: Yeah, I'm "well aware" how such log looks :p
17:21 agd5f: vm_debug invalidates all BOs in the CS ioctl. It's helpful for verifying whether you've properly mapped everything or set the proper BO list
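
A minimal Python sketch of pulling the fields agd5f lists at 17:20 (faulting address, faulting block, read/write flag, fault type bits) out of kernel log text. The SAMPLE text is an assumption modeled on typical amdgpu fault messages, not output from this conversation; the exact wording and register names vary by kernel version and GPU generation, so the regular expressions may need adjusting. The function name parse_fault is hypothetical.

```python
import re

# Illustrative sample only (assumption): real messages differ between
# kernel versions and GPU generations.
SAMPLE = """\
amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32770)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800100000000 from client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00301031
amdgpu 0000:03:00.0: amdgpu:      Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu:      WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu:      MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu:      RW: 0x0
"""

ADDR_RE = re.compile(r"address (0x[0-9a-fA-F]+)")
CLIENT_RE = re.compile(r"client ID:\s*(\w+)")
FIELD_RE = re.compile(
    r"\b(WALKER_ERROR|PERMISSION_FAULTS|MAPPING_ERROR|RW):\s*(0x[0-9a-fA-F]+)"
)


def parse_fault(text):
    """Collect the fault fields found in `text` into a dict (raw values)."""
    info = {}
    match = ADDR_RE.search(text)
    if match:
        info["address"] = int(match.group(1), 16)
    match = CLIENT_RE.search(text)
    if match:
        info["client"] = match.group(1)
    for name, value in FIELD_RE.findall(text):
        info[name.lower()] = int(value, 16)
    return info


if __name__ == "__main__":
    for key, value in parse_fault(SAMPLE).items():
        print(key, hex(value) if isinstance(value, int) else value)
```

In practice the input would come from the kernel log (for example the output of dmesg saved to a file) rather than from the embedded sample.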
17:22 kchibisov: Could the options to limit vm memory work in daily use, or are they solely for testing?
17:23 kchibisov: like vramlimit ?
17:24 agd5f: which options specifically?
17:24 kchibisov: vramlimit and vis_vramlimit.
17:25 agd5f: they don't limit the vm per se, they limit the actual amount of vram available to the driver.
17:25 kchibisov: I know that vramlimit won't boot at all for me.
17:26 kchibisov: Like amdgpu.vramlimit=4096 results in crash in amdgpu init.
17:29 agd5f: they should generally work. I'd need to see the backtrace
17:31 kchibisov: I'd assume you'd need a backtrace with lines and such, since I don't run debug kernel.
17:32 agd5f: most kernels should provide it regardless of debug
17:35 kchibisov: That's a trace https://paste.rs/LyO . Died somewhere in WARNING: CPU: 1 PID: 271 at drivers/gpu/drm/ttm/ttm_bo.c:353 ttm_bo_release+0x330/0x350 [ttm]
17:40 agd5f: kchibisov, I see the problem. It's not able to reserve a special buffer at the top of vram when you limit vram on your board.
17:42 agd5f: The buffer needs to be at a fixed location because it's shared with vbios/fw
17:43 kchibisov: And the limit logic doesn't account for that and "tries from start", which doesn't cover the buffer?
17:47 agd5f: the limit just replaces the size of vram we read back from the vbios. However, certain buffers on certain boards need to be at fixed offsets at the end of vram. The option would need to be reworked substantially to handle these cases since we'd internally need to track the real vram size, but only expose the vram limit for most things
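
The failure agd5f describes can be illustrated with a toy Python model: the limit replaces the VRAM size the driver works with, while a firmware-shared buffer stays pinned to the end of the real VRAM, so it falls outside the managed range. This is not driver code. The function name fixed_buffer_fits, the 8 GiB board size, and the 16 MiB buffer size are assumptions chosen for illustration; only the 4096 MiB limit comes from the log.

```python
MIB = 1024 * 1024


def fixed_buffer_fits(real_vram, buffer_size, vram_limit=None):
    """Return True if a buffer pinned to the top of real VRAM lies
    entirely inside the range the driver believes it manages."""
    # The limit simply replaces the size read back from the vbios.
    managed_size = real_vram if vram_limit is None else min(real_vram, vram_limit)
    # The buffer occupies the last `buffer_size` bytes of the real VRAM.
    buffer_end = real_vram
    buffer_start = buffer_end - buffer_size
    return 0 <= buffer_start and buffer_end <= managed_size


if __name__ == "__main__":
    real = 8192 * MIB      # hypothetical board size
    buf = 16 * MIB         # hypothetical fixed buffer size
    print(fixed_buffer_fits(real, buf))                        # no limit
    print(fixed_buffer_fits(real, buf, vram_limit=4096 * MIB)) # amdgpu.vramlimit=4096
```

The rework agd5f mentions would amount to tracking both values internally: the real size for placing fixed buffers, and the limited size for everything else.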
17:50 kchibisov: I have a feeling that my crash (page fault) is specific to the board model, so I wanted to test this theory somehow by limiting the memory.
17:50 kchibisov: Because I'm not the only one having the same crashes with exactly the same board model.
18:11 kchibisov: Should amdgpu.async_gfx_ring=0 generally work? (For me it null-derefs on boot).
18:19 agd5f: kchibisov, probably not. I think that is mostly leftover from bring up
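
When experimenting with the module parameters mentioned in this log (vm_fault_stop, vm_debug, vramlimit, vis_vramlimit, async_gfx_ring), it can help to confirm which values the loaded driver actually picked up. A minimal Python sketch, assuming the parameters are exposed under /sys/module/amdgpu/parameters as is usual for kernel module parameters; the helper name read_module_params is hypothetical, and some entries may be readable only by root or absent on a given kernel, in which case None is reported.

```python
from pathlib import Path

PARAMS = ["vm_fault_stop", "vm_debug", "vramlimit", "vis_vramlimit", "async_gfx_ring"]


def read_module_params(names, base="/sys/module/amdgpu/parameters"):
    """Return {name: value string, or None if missing or unreadable}."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Covers a missing file as well as insufficient permissions.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_module_params(PARAMS).items():
        print(f"{name} = {value}")
```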