03:14 mareko: any idea why RADV would run out of memory with Mesa 23.x but not 22.2?
03:21 mareko: I'm hearing UE5 Lyra runs out of memory with RADV from Mesa 23.x
06:11 HdkR: 4
07:07 daniels: karolherbst: maybe I’m just not really awake enough yet, but … there are no crashes in c15.r1.log … ?
07:24 mupuf: DavidHeidelberg[m]: yeepee! You should be able to use a b2c release kernel, it has all you need (including built-in amdgpu firmware)
08:40 karolherbst: daniels: that's exactly the point :)
08:41 karolherbst: I have _no_ idea what happened there, but it's clearly something funky
08:41 karolherbst: just wanted to share it before I ignore it and move on :)
08:41 karolherbst: (in case we have more such false negatives people just ignore and move on)
09:52 karolherbst: airlied: mhh.. seems like that WGP mode only causes problems on some tests when running vec16... maybe it's just some compiler bug somewhere in the end
12:44 AndrewR: karolherbst, Finally updated llvm so mesa git compiles fully again (but I think it demands a newer bindgen than 0.60 .. 0.65 from Slackware current works ..)
12:44 karolherbst: mhhh, yeah, might be plausible. If your rustc toolchain uses your system LLVM then I can see why it needs some updates
13:00 AndrewR: karolherbst, yeah, in Slackware rust is linked dynamically to llvm-libs .. so updating it basically means installing a second version of llvm alongside, so rustc will not die yet...
13:00 AndrewR: karolherbst, I plan to re-compile rust one of those days ...
13:04 AndrewR: ..also, russian-speaking user surfaced on our big black website, so I discovered my patches for x265 on aarch64 were incorrect, and I updated them (as part of cinelerra-gg bundled libs) ...
13:04 AndrewR: https://www.linux.org.ru/gallery/workplaces/17252389 (not much to see but ... they exist! (asahi linux users) )
13:40 AndrewR: ... I also was reading psychtoolbox-3 code just for comments. One of few applications making use of 30bpc mode on amd gpus ...
13:55 karolherbst: airlied, mareko: mhh, seems like forcing wave32 for compute fixes it... I honestly have no idea about all the implications here but arsenm suggested to always use 32 for compute
14:04 bnieuwenhuizen: on RDNA3 using wave32 has some important perf disadvantages though
14:04 karolherbst: for compute?
14:04 bnieuwenhuizen: should be yeah
14:04 bnieuwenhuizen: the dual issue fp32 gets way more difficult
14:05 karolherbst: I don't know if it's important, but luxmark-luxball gives me ~0.5% higher scores with wave64
14:05 bnieuwenhuizen: that isn't RDNA3 though is it?
14:06 karolherbst: uhhh.... right, that's RDNA2
14:06 karolherbst: I think
14:06 karolherbst: navi22
14:06 bnieuwenhuizen: yeah
14:06 karolherbst: anyway... my kernel gets nuked in CU mode :'(
14:07 karolherbst: no idea why
14:07 karolherbst: at least some
14:07 karolherbst: I have no idea what's up
14:07 karolherbst: but it seems to be working with wave32
14:09 karolherbst: we can potentially also just do things differently for SHADER_KERNEL, but I also kinda want to know why that happens
17:49 mareko: karolherbst: what does it mean "nuked"?
17:50 karolherbst: ehhh... GPU reset
17:50 mareko: what is the resource usage of the kernel?
17:51 karolherbst: good question
17:51 mareko: i.e. thread count, shared memory, VGPRs
17:51 karolherbst: https://gist.github.com/karolherbst/914873135f0637c70149d054322aca7e
17:51 karolherbst: at the bottom
17:52 mareko: what's the block size?
17:53 karolherbst: I didn't check.. let me boot the machine up again
17:54 mareko: instead of having to use MESA_SHADER_KERNEL and getting broken because of that, it would be better to use MESA_SHADER_COMPUTE and have "is_kernel" in the shader info
17:55 karolherbst: mhhh.. maybe that would have been the better approach here
17:56 karolherbst: anyway, the block is 512x1x1
18:02 mareko: I think the problem is there is not enough VGPRs
18:02 karolherbst: and that matters with CU vs WGP mode?
18:02 karolherbst: the exact same thing works just fine in WGP mode
18:03 karolherbst: but it's plausible it's caused by launching too many threads.. I could check that
18:03 mareko: 144 VGPRs means 3 waves per SIMD
18:03 mareko: the CU has 2 SIMDs, so 6 waves at most, but 512x1x1 is 8 waves
18:03 karolherbst: yep
18:03 karolherbst: works fine with 256 threads
18:03 mareko: the WGP has 4 SIMDs, so 12 waves at most, so it fits
18:03 karolherbst: I guess we'll have to fix si_get_compute_state_info then
18:04 mareko: I think LLVM miscompiled that shader
18:04 karolherbst: ahh
18:04 karolherbst: ohhh...
18:04 karolherbst: it should use more SGPRs instead?
18:05 mareko: it should have capped VGPR usage to 128 and spill the rest
18:05 mareko: so that it would fit on the CU
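mareko's arithmetic above can be sketched as a quick occupancy check. This is a back-of-envelope sketch, not driver code: it assumes a 512-entry wave64 VGPR budget per SIMD, which is the figure implied by "144 VGPRs means 3 waves per SIMD".

```python
# Occupancy sketch for the numbers discussed above (illustrative only).
# Assumed: 512 allocatable wave64 VGPRs per SIMD, wave size 64,
# CU = 2 SIMDs, WGP = 4 SIMDs.
WAVE_SIZE = 64
VGPRS_PER_SIMD = 512  # assumed wave64 VGPR budget per SIMD

def waves_that_fit(vgprs_per_wave: int, simds: int) -> int:
    """Max concurrently resident waves given per-wave VGPR usage."""
    return (VGPRS_PER_SIMD // vgprs_per_wave) * simds

def block_fits(block_threads: int, vgprs_per_wave: int, simds: int) -> bool:
    """Can one workgroup be resident at once on this SIMD group?"""
    waves_needed = -(-block_threads // WAVE_SIZE)  # ceiling division
    return waves_needed <= waves_that_fit(vgprs_per_wave, simds)

# A 512x1x1 block at 144 VGPRs needs 8 waves:
print(block_fits(512, 144, simds=2))  # CU mode: 6 waves max  -> False
print(block_fits(512, 144, simds=4))  # WGP mode: 12 waves max -> True
print(block_fits(512, 128, simds=2))  # capped to 128 VGPRs    -> True
```

With the assumed budget, capping VGPR usage at 128 and spilling the rest is exactly what lets the 8-wave workgroup fit in CU mode.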
18:06 karolherbst: the header is passed to LLVM, right?
18:06 karolherbst: or is there maybe a flag radeonsi would have to set to tell LLVM to assume CU mode or something?
18:07 karolherbst: maybe we should move this to #radeon as arsenm is there
18:07 mareko: no
18:08 mareko: this is another case of MESA_SHADER_KERNEL not being handled
18:08 karolherbst: ahh
18:09 karolherbst: mhhh... annoying
18:09 karolherbst: maybe we really should get rid of KERNEL and move it into shader_info
18:10 karolherbst: shader_info.cs.is_kernel
18:10 mareko: yes that could fix a lot of things
18:11 karolherbst: yeah.. let's do that
18:32 mareko: if it spills and you don't want spilling, we can implement and enable the WGP mode properly
18:34 karolherbst: would it spill into SGPRs first?
18:34 karolherbst: not sure how much of a perf difference it would make
18:38 mareko: no
18:38 mareko: SGPRs are the scarcest resource
18:39 mareko: SGPRs are spilled to VGPRs, and VGPRs are spilled to memory
18:39 karolherbst: ahhh
18:39 karolherbst: guess I got it backwards then
18:41 mareko: VGPRs are 2048-bit vector registers in wave64, while SGPRs are always 32-bit
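The widths quoted here are easy to sanity-check: a VGPR holds one 32-bit value per lane, so its total size scales with the wave size, while an SGPR is a single 32-bit value shared by the whole wave. A quick illustration:

```python
# Sanity check of the register widths quoted above: one 32-bit value
# per lane for VGPRs, a single 32-bit value per wave for SGPRs.
LANE_BITS = 32

def vgpr_bits(wave_size: int) -> int:
    """Total width of one vector register for the given wave size."""
    return wave_size * LANE_BITS

print(vgpr_bits(64))  # wave64 VGPR: 2048 bits (256 bytes)
print(vgpr_bits(32))  # wave32 VGPR: 1024 bits
SGPR_BITS = 32        # always 32-bit, independent of wave size
```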
18:42 HdkR: Map SGPRs to NVIDIA's Uniform Registers, and VGPRS to NVIDIA's "registers", close enough approximation that it works out :P
18:42 karolherbst: yeah I know, I just thought scalar registers were per-thread :D
18:44 karolherbst: uhhh.. anv already makes use of the KERNEL stage in very annoying ways
19:25 karolherbst: :') 52 files changed, 145 insertions(+), 157 deletions(-)
19:32 karolherbst: mareko: still getting 140
19:36 karolherbst: what's the thing in radeonsi which should make sure LLVM spills VGPR? Maybe it's indeed some bug somewhere besides the shader stage
19:47 pendingchaos: I thought it was LLVM which determines the VGPR limit
19:47 pendingchaos: maybe https://pastebin.com/raw/mT2gj4wX ?
19:51 karolherbst: pendingchaos: yep, that fixes it
19:52 karolherbst: I really should start doing hardware based CI with rusticl as I keep hitting bugs which aren't CL specific 🙃
20:05 karolherbst: pendingchaos: will you write a proper patch and create the MR or should I do it?
20:22 pendingchaos: you can do it
20:27 karolherbst: okay
21:07 karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23569
21:14 DemiMarie: Why do GPUs use ioctl() and not io_uring?
21:15 karolherbst: because it won't change anything really
21:15 karolherbst: though maybe it has some advantages inside pointless microbenchmarks...
21:16 karolherbst: but anyway, the real answer is: when mesa started there was just ioctl, and moving to io_uring doesn't really pay off.. probably
21:20 psykose: it's also usually painful to have to actually maintain 2 ways of doing something at once just for one to be 2% faster
21:20 DemiMarie: karolherbst: how can a GPU reset be anything but a kernel or firmware bug?
21:21 psykose: unless all old kernel support is dropped you can't remove the non-io-uring one
21:21 karolherbst: DemiMarie: ask AMD
21:21 karolherbst: on AMD it's either full reset or broken GPU
21:21 karolherbst: and if your shader hangs, for whatever reason, it's a full reset
21:21 karolherbst: but anyway
21:21 karolherbst: you can't validate GPU commands
21:21 karolherbst: sooo.. userspace has the power of causing GPU resets
21:22 DemiMarie: 1. Why can’t you validate GPU commands?
21:22 DemiMarie: 2. Shouldn’t the GPU be robust against invalid commands?
21:22 karolherbst: halting problem
21:23 mareko: the GPU only has single-tasking for gfx and any gfx app can use an infinite loop, so the kernel has to stop it with a reset, it's a feature
21:23 karolherbst: I don't know if I'd go so far as calling it a feature, but there is nothing else you can do, soo....
21:23 DemiMarie: Are Intel and Nvidia better when it comes to GPU error recovery?
21:23 karolherbst: yes
21:23 mareko: freeze or reset, your choice
21:23 DemiMarie: mareko: to me this is a hardware bug
21:23 karolherbst: it is
21:24 karolherbst: the hardware is incapable of recovery
21:24 mareko: DemiMarie: halting problem is a hw bug?
21:24 karolherbst: but new gens are supposed to implement some sort of recovery
21:24 karolherbst: mareko: no, having to reset the entire GPU is
21:24 karolherbst: at least the AMD way
21:24 mareko: no we don't
21:25 karolherbst: VRAM content is lost, no?
21:25 karolherbst: or rather.. why does my display state get fully reset on GPU resets 🙃
21:25 karolherbst: it doesn't have to be this way
21:25 mareko: there are levels of resets
21:25 karolherbst: other GPUs can just kill a GPU context and move on
21:26 DemiMarie: A properly-designed GPU can support two or more mutually distrusting users, such that one user cannot interfere with the other user’s use of the GPU.
21:26 karolherbst: yeah.. I agree
21:26 mareko: if a shader doesn't finish, all shaders of that process are killed
21:27 krushia: ever run into webgl malware (or poorly coded shaders on a web site)? it isn't fun
21:27 karolherbst: I generally see more stuff getting killed
21:27 DemiMarie: My understanding is that Intel, Nvidia, and Apple GPUs all meet this criterion (modulo implementation bugs).
21:27 mareko: karolherbst: that's the next level
21:27 karolherbst: yeah.. which I run into today with the wgp issue :)
21:27 DemiMarie: mareko: that is arguably a blocker for virtGPU native contexts.
21:28 DemiMarie: A VM must not be able to crash the host.
21:28 karolherbst: AMD is improving the situation
21:28 mareko: DemiMarie: it's also a blocker for virgl
21:28 mareko: DemiMarie: if you want to argue that way
21:28 karolherbst: but yeah.. it's kinda messy
21:28 karolherbst: well.. other vendors are better at it
21:28 DemiMarie: mareko: could virgl add explicit preemption checks during compilation?
21:29 karolherbst: mhh.. should be possible
21:30 karolherbst: but it's not only sahder code
21:30 karolherbst: like .. my bug today was the shader just using too many registers causing GPU resets...
21:30 mareko: I'll just say that these assumptions and questions make no sense with how things work, virgl can't do anything and virgl is also a security hole for the whole VM
21:30 karolherbst: there is a lot of things the kernel/host side would have to deal with
21:31 mareko: security and isolation is not something virgl does well
21:31 karolherbst: the point is rather, that a guest could constantly DOS the entire host
21:31 karolherbst: but yeah...
21:31 karolherbst: GPU isolation in general is a practically impossible problem
21:31 karolherbst: though nvidia does support partitioning the GPU on the hardware level
21:31 mareko: the native context does isolation properly, venus also does it properly, virgl doesn't
21:32 karolherbst: and you can assign isolated parts of the GPU to guests which do not interfere with each other
21:32 DemiMarie: mareko: are virtGPU native contexts better at this?
21:32 mareko: yes
21:32 DemiMarie: Intel also supports SR-IOV on recent iGPUs (Alder Lake, IIRC)
21:32 karolherbst: yeah.. SR-IOV is probably the only isolation mode which actually works
21:33 DemiMarie: karolherbst: why do virtGPU native contexts not work?
21:33 mareko: but the gfx queue can only execute 1 job, so if a job takes 2 seconds, it will block everybody for 2 seconds even if it doesn't trigger the reset
21:33 Lynne: I wish amd's gpus didn't fall over for something as simple as doing unchecked out of bounds access
21:34 karolherbst: DemiMarie: because you don't properly isolate, but it could be good enough depending on the hardware/driver
21:34 DemiMarie: karolherbst: define “properly isolate”
21:34 DemiMarie: In a way that does not lead to circular reasoning
21:35 karolherbst: hw level isolation of tasks and resources
21:35 karolherbst: but stalls is something which is probably impossible to fix, but also not really problematic
21:35 karolherbst: well.. depends
21:36 karolherbst: don't want to do RT sensitive things on shared GPUs
21:36 karolherbst: but I've also seen nvidia not being able to recover properly 🙃 sometimes the GPU just disconnects itself from the PCIe bus
21:37 mareko: I love air conditioning
21:38 karolherbst: I wished I had some
21:38 zmike: air conditioning++
21:38 mareko: karolherbst: actually compute and RT have shader-instruction-level preemption on AMD hw, not sure if enabled though
21:39 mareko: outside of ROCm
21:40 karolherbst: probably not for graphics
21:40 karolherbst: but I was considering enabling it for some drivers
21:40 karolherbst: it's not nice if compute can stall the gfx pipelines
21:41 mareko: surprisingly compute shader-level preemption shouldn't need any userspace changes AFAIK, but you need to use a compute queue and whatever else ROCm is doing
21:42 karolherbst: yeah...
21:42 karolherbst: I have to figure this stuff out also for intel
21:42 karolherbst: i915 is reaping jobs taking more than 0.5 seconds or something
21:42 karolherbst: and some workloads run into this already
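One common way around such a watchdog is for the runtime to split a long dispatch into several smaller submissions. A hedged sketch: the budget and per-item cost below are made-up numbers for illustration, not i915's actual policy.

```python
# Illustrative workaround for a GPU job watchdog (the ~0.5 s i915
# figure mentioned above; all numbers here are assumptions): split one
# long dispatch into chunks that each finish inside the budget.
WATCHDOG_BUDGET_US = 500_000   # assumed per-job budget, microseconds
COST_PER_ITEM_US = 10          # assumed/measured cost per work item

def chunked_dispatch(total_items: int, launch) -> int:
    """Submit `total_items` of work in watchdog-sized chunks;
    returns how many separate submissions were made."""
    max_items = max(1, WATCHDOG_BUDGET_US // COST_PER_ITEM_US)
    submissions, offset = 0, 0
    while offset < total_items:
        count = min(max_items, total_items - offset)
        launch(offset, count)  # one launch covering [offset, offset + count)
        submissions += 1
        offset += count
    return submissions

# 200_000 items at 10 µs each would be one ~2 s job; chunked, it
# becomes four jobs of 50_000 items (~0.5 s each):
print(chunked_dispatch(200_000, lambda off, n: None))  # -> 4
```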
21:42 mareko: the kernel implements suspending CU waves to memory
21:43 karolherbst: yeah.. makes sense
21:43 karolherbst: we really should add a flag on screen creation for this stuff
21:44 mareko: CUs have multitasking for running waves, but they can't release resources without suspending waves to memory
21:44 mareko: you can run 32 different shaders on a single CU at the same time if they all fit wrt VGPRs and shared mem
21:45 karolherbst: yeah.. I think it's similar to nvidia. And by default you can only context switch between shader invocations
21:45 karolherbst: and move them into VRAM
21:45 karolherbst: (the contexts that is)
22:39 DemiMarie: karolherbst: what do you mean by “hw level isolation of tasks and resources”?
22:39 karolherbst: well.. work on the GPU takes resources and you can split them up, be it either memory or execution units
22:39 DemiMarie: mareko: why not enable preemption for graphics on AMD GPUs?
22:39 DemiMarie: yes, one can
22:40 karolherbst: performance
22:40 DemiMarie: that’s what page tables and scheduling are for IIUC
22:40 karolherbst: nah
22:40 karolherbst: I meant proper isolation
22:40 DemiMarie: how big is the hit?
22:40 DemiMarie: define “proper”
22:40 karolherbst: splitting physical memory
22:40 DemiMarie: hmm?
22:40 karolherbst: and assigning execution units to partitions
22:41 DemiMarie: why is that stronger than what page tables can provide?
22:41 karolherbst: so you eliminate even the chance of reading back leftover state from other guests
22:41 DemiMarie: I thought that is just a matter of not having vulnerable drivers
22:41 karolherbst: because you don't even have to bother about tracking old use of physical memory
22:41 karolherbst: you get a private block of X physical memory assigned
22:42 karolherbst: that's the level of partitioning you can do with SR-IOV level virtualization
22:42 DemiMarie: Is that how it is implemented on e.g. Intel iGPUs?
22:43 karolherbst: I don't know
22:43 karolherbst: but nvidia hardware can do this
22:43 karolherbst: well.. recent one at least
22:43 karolherbst: or rather those which support SR-IOV
22:43 DemiMarie: which are unobtanium IIUC
22:44 karolherbst: I think ampere can do it already as well
22:44 DemiMarie: In my world, if it isn’t something the average user can afford, it doesn’t exist
22:44 karolherbst: and generally consumer GPUs
22:44 karolherbst: I think
22:44 DemiMarie: Are you referring to https://libvf.io?
22:44 karolherbst: mhh?
22:44 DemiMarie: okay
22:45 karolherbst: but anyway, you just have to set up the device partitions somewhere and then you have cleanly isolated partitions of the GPU
22:46 karolherbst: each with their own page tables and everything
22:46 DemiMarie: One question I always have about hardware partitioning is, “How much is actually partitioned, and how much of this is just firmware trickery?”.
22:46 DemiMarie: Because the on-device firmware is a significant attack surface.
22:46 karolherbst: on nvidia it's almost all in hardware
22:47 karolherbst: you still have the firmware for context switching which is shared, but the actual resources used by actual workloads is all split
22:47 karolherbst: (in hardware)
22:47 karolherbst: there are fields to assign SMs to partitions, can do it even finer grained afaik
22:47 karolherbst: same for VRAM
22:48 DemiMarie: Even on consumer GPUs?
22:48 karolherbst: yeah
22:48 DemiMarie: What about the RM API?
22:48 karolherbst: dunno if nvidia enables it though in their driver
22:48 karolherbst: well.. I think GSP still would run shared as you only have one GSP processor on the GPU still
22:48 DemiMarie: Yeah
22:49 karolherbst: but that's mostly doing hardware level things
22:49 karolherbst: really doesn't matter much
22:49 DemiMarie: I thought much of the RM API was implemented in firmware.
22:49 DemiMarie: Stuff like memory allocation.
22:49 karolherbst: nah, that stuff exists outside afaik
22:50 karolherbst: that's performance sensitive stuff
22:51 karolherbst: but allocating memory is also very boring
23:33 DavidHeidelberg[m]: mupuf amdgpu:codename:NAVI21 also down? So I can disable also VALVE farm?