13:16 karolherbst: arsenm: do you know if running compute shaders in CU mode could lead to GPU hangs for 16x64 byte stores? Radeonsi disabled WGP mode in mesa recently and it's causing weird GPU hangs only for a handful of OpenCL tests and it's not apparent what's going wrong here
13:20 karolherbst: ehh.. 16x64 bit stores I mean
13:20 arsenm: no idea but I'd kind of doubt it?
13:21 karolherbst: weird...
13:21 karolherbst: maybe the shader gets compiled differently, I see LLVM having some optimizations depending on the mode used
13:21 karolherbst: and even workarounds
13:22 karolherbst: e.g. llvm/test/CodeGen/AMDGPU/GlobalISel/lds-misaligned-bug.ll
13:23 arsenm: I have the vaguest memory of that bug
13:24 karolherbst: let me check if the shader changes first
13:26 karolherbst: mhh nothing
13:26 karolherbst: that's the kernel btw: nir, llvm, asm https://gist.github.com/karolherbst/914873135f0637c70149d054322aca7e
13:27 karolherbst: running the same test with only vec8 doesn't hit the problem.. it's kinda weird
13:28 karolherbst: vec8: https://gist.github.com/karolherbst/c20ae241599a9eef8e5833ef3328ee9b
13:30 arsenm: since when does opencl use amdgpu_cs
13:30 arsenm: also the crazy thing here is using dynamically sized LDS
13:30 arsenm: is it even used? it's declared?
13:30 karolherbst: there shouldn't be any LDS
13:31 karolherbst: also, what's the problem with amdgpu_cs? In mesa we go through the same path for CL and GL compute shaders
13:31 karolherbst: guess it's just a calling convention?
13:33 arsenm: yes, opencl always used amdgpu_kernel
13:34 arsenm: to me the wrong values look like the computation is just wrong for some reason, doesn't look like corruption
13:35 karolherbst: it's wrong because the GPU gets reset
13:36 karolherbst: "[ 153.869516] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=12, emitted seq=14"
13:36 arsenm: oh
13:36 karolherbst: yeah...
13:36 arsenm: that's not really the failure mode I would expect
13:36 karolherbst: same
13:36 karolherbst: it kinda appears super random
13:36 karolherbst: but imad and fmad trigger that when testing long16
13:36 karolherbst: most other tests are fine
13:37 karolherbst: maybe something with the imad/fmad lowering? or 64 bit lowering in general? no idea
13:37 karolherbst: and no idea why CU vs WGP mode would even matter here
13:37 arsenm: those super huge vector tests sometimes find spill and calling convention bugs but you have neither here
13:37 karolherbst: it's not like those tests even rely on synchronization...
13:42 arsenm: I'm not fully up to speed on everything post gfx9. You're running wave64 which is weird. I also do not see the feature for wgp or cu (don't even remember which direction is specified / the default).
13:43 arsenm: i think all the bugs were in the matrix of wavesize and WGP/CU
13:44 arsenm: i'd start trying just use wave32 always for compute
13:44 karolherbst: why though?
13:45 arsenm: wave32 was done for compute and wave64 is hanging on because some graphics loads apparently are faster for some reason. narrower waves makes divergence less bad
13:46 karolherbst: mhh but yeah.. forcing wave32 makes it not reset...
13:46 arsenm: wave32 is the default on all wave32 targets
13:46 karolherbst: any idea what's the state with GL compute on wave32 vs wave64?
13:47 karolherbst: but I've also heard that some compute workloads are faster with wave64, maybe I do some benchmarking myself here to see how it impacts perf
13:55 arsenm: no idea
13:55 arsenm: I just know people wanting to mix them is a source of endless pain and suffering
13:56 karolherbst: yeah... I mean for HIP it makes total sense to go with 32 as this will make it easier to port applications, and without proper extensions it's also a pain to do it in CL properly... I mostly just go with whatever radeonsi in mesa is doing with those things.
13:57 arsenm: some of the edge case cross lane features don't even work in wave64
13:58 karolherbst: mhhh... we don't have much of the subgroup stuff in mesa for now, so I guess I'll probably hit those issues once I add more of it
13:59 karolherbst: or maybe radeonsi detects that and just falls back to 32.. dunno
14:05 karolherbst: arsenm: seems like wave64 is consistently a tad (~0.5%) faster than wave32 with luxmark-3.1 luxball here
14:06 karolherbst: but others also say that on RDNA3 wave64 should be faster across the board
14:31 bnieuwenhuizen: others being me
14:31 bnieuwenhuizen: wave64 can do a lot of float ops with dual issue, but on wave32 the compiler has to essentially make a VLIW2 instruction with a ton of register constraints
14:34 karolherbst: yeah... my personal opinion here is: I trust whatever the radeonsi devs think is best. I just want to understand why that issue I'm seeing exists in the first place and see what the actual proper fix here should be
14:36 karolherbst: and I kinda don't want to have a "force wave32 for CL, because bug we can't explain happens" there forever
14:47 arsenm: does it definitely start to execute? there's nothing in there that can really hang
14:48 karolherbst: I have no idea
14:48 karolherbst: I mean.. the same code works when WGP mode is enabled
14:49 karolherbst: it's literally the only difference I can see, so I would be surprised if it's anything outside or before running the code
14:50 karolherbst: but the sequence numbers are also printed, maybe I can figure out what work they are attached to.... bnieuwenhuizen: is there a way to figure this out easily from the mesa side?
14:50 bnieuwenhuizen: figure what out?
14:50 karolherbst: "signaled seq=49, emitted seq=51" -> translate that to whatever commands were sent out
14:51 bnieuwenhuizen: oh hmm, that is mostly on the level of whole commandbuffers
14:51 karolherbst: right
14:51 karolherbst: but I submit very short ones because of.. reasons
14:51 bnieuwenhuizen: IIRC there is a debug option to dump that hanging commandbuffer
14:51 karolherbst: ahh
14:52 bnieuwenhuizen: not sure offhand what it was, did a whole bunch of things like also enable gallium debug dumping etc.
14:53 bnieuwenhuizen: if it isn't documented you can probably trace it back from whatever calls into src/amd/common/ac_debug.c
14:54 karolherbst: there is AMD_DEBUG=ib
15:00 karolherbst: bnieuwenhuizen: mhh yeah.. it's not really clear with the ib command as it's printing everything.. is there an option to synchronize submitting them and wait for the result?
15:06 bnieuwenhuizen: karolherbst: IIRC the hang stuff was related to GALLIUM_DDEBUG
15:07 karolherbst: huh.. weird
18:07 karolherbst: arsenm: so mareko's theory is that LLVM allocates too many VGPRs in CU mode and should have capped it to 128
18:08 mareko: LLVM is fine, it's a Mesa bug
18:33 DottorLeo: Hi! I'm curious about the Vulkan GPL feature on RADV, is it working on all the cards supported by it? Even the oldest like SI and CIK?
18:38 arsenm: I think there was also a firmware bug where it would incorrectly reject certain combinations of wavesize and register count, not sure if that fix ever went out