03:14mareko: any idea why RADV would run out of memory with Mesa 23.x but not 22.2?
03:21mareko: I'm hearing UE5 Lyra runs out of memory with RADV from Mesa 23.x
06:11HdkR: 4
07:07daniels: karolherbst: maybe I’m just not really awake enough yet, but … there are no crashes in c15.r1.log … ?
07:24mupuf: DavidHeidelberg[m]: yeepee! You should be able to use a b2c release kernel, it has all you need (including built-in amdgpu firmware)
08:40karolherbst: daniels: that's exactly the point :)
08:41karolherbst: I have _no_ idea what happened there, but it's clearly something funky
08:41karolherbst: just wanted to share it before I ignore it and move on :)
08:41karolherbst: (in case we have more such false negatives that people just ignore and move on)
09:52karolherbst: airlied: mhh.. seems like that WGP mode only causes problems on some tests when running vec16... maybe it's just some compiler bug somewhere in the end
12:44AndrewR: karolherbst, Finally updated llvm so mesa git compiles fully again (but I think it demands a newer bindgen than 0.60 .. the 0.65 from Slackware current works ..)
12:44karolherbst: mhhh, yeah, might be plausible. If your rustc toolchain uses your system LLVM then I can see why it needs some updates
13:00AndrewR: karolherbst, yeah, in Slackware rust is linked dynamically to llvm-libs .. so updating it basically means installing a second version of llvm alongside so rustc will not die yet...
13:00AndrewR: karolherbst, I plan to re-compile rust one of those days ...
13:04AndrewR: ..also, a Russian-speaking user surfaced on our big black website, so I discovered my patches for x265 on aarch64 were incorrect, and I updated them (as part of the cinelerra-gg bundled libs) ...
13:04AndrewR: https://www.linux.org.ru/gallery/workplaces/17252389 (not much to see but ... they exist! (asahi linux users) )
13:40AndrewR: ... I was also reading psychtoolbox-3 code just for the comments. One of the few applications making use of 30bpc mode on amd gpus ...
13:55karolherbst: airlied, mareko: mhh, seems like forcing wave32 for compute fixes it... I honestly have no idea about all the implications here but arsenm suggested to always use 32 for compute
14:04bnieuwenhuizen: on RDNA3 using wave32 has some important perf disadvantages though
14:04karolherbst: for compute?
14:04bnieuwenhuizen: should be yeah
14:04bnieuwenhuizen: the dual issue fp32 gets way more difficult
14:05karolherbst: I don't know if it's important, but luxmark-luxball gives me ~0.5% higher scores with wave64
14:05bnieuwenhuizen: that isn't RDNA3 though is it?
14:06karolherbst: uhhh.... right, that's RDNA2
14:06karolherbst: I think
14:06karolherbst: navi22
14:06bnieuwenhuizen: yeah
14:06karolherbst: anyway... my kernel gets nuked in CU mode :'(
14:07karolherbst: no idea why
14:07karolherbst: at least some
14:07karolherbst: I have no idea what's up
14:07karolherbst: but it seems to be working with wave32
14:09karolherbst: we can potentially also just do things differently for SHADER_KERNEL, but I also kinda want to know why that happens
17:49mareko: karolherbst: what does it mean "nuked"?
17:50karolherbst: ehhh... GPU reset
17:50mareko: what is the resource usage of the kernel?
17:51karolherbst: good question
17:51mareko: i.e. thread count, shared memory, VGPRs
17:51karolherbst: https://gist.github.com/karolherbst/914873135f0637c70149d054322aca7e
17:51karolherbst: at the bottom
17:52mareko: what's the block size?
17:53karolherbst: I didn't check.. let me boot the machine up again
17:54mareko: instead of having to use MESA_SHADER_KERNEL and getting broken because of that, it would be better to use MESA_SHADER_COMPUTE and have "is_kernel" in the shader info
17:55karolherbst: mhhh.. maybe that would have been the better approach here
17:56karolherbst: anyway, the block is 512x1x1
18:02mareko: I think the problem is there are not enough VGPRs
18:02karolherbst: and that matters with CU vs WGP mode?
18:02karolherbst: the exact same thing works just fine in WGP mode
18:03karolherbst: but it's plausible it's caused by launching too many threads.. I could check that
18:03mareko: 144 VGPRs means 3 waves per SIMD
18:03mareko: the CU has 2 SIMDs, so 6 waves at most, but 512x1x1 is 8 waves
18:03karolherbst: yep
18:03karolherbst: works fine with 256 threads
18:03mareko: the WGP has 4 SIMDs, so 12 waves at most, so it fits
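[ed: a minimal sketch of the occupancy arithmetic mareko walks through above, in C; the 512-slot wave64 VGPR budget per SIMD is an assumption for RDNA2-class hardware, not a figure quoted in the discussion]

    #include <stdio.h>

    int main(void)
    {
       const unsigned vgpr_budget = 512;   /* assumed wave64 VGPR slots per SIMD */
       const unsigned vgprs = 144;         /* VGPRs the kernel uses */
       const unsigned block_threads = 512; /* the 512x1x1 block */
       const unsigned wave_size = 64;

       unsigned waves_per_simd = vgpr_budget / vgprs;     /* 512 / 144 = 3 */
       unsigned waves_needed = block_threads / wave_size; /* 512 / 64  = 8 */

       /* CU mode: 2 SIMDs -> 6 waves max; WGP mode: 4 SIMDs -> 12 waves max */
       printf("CU:  %u waves max, need %u -> %s\n", 2 * waves_per_simd,
              waves_needed, 2 * waves_per_simd >= waves_needed ? "fits" : "does not fit");
       printf("WGP: %u waves max, need %u -> %s\n", 4 * waves_per_simd,
              waves_needed, 4 * waves_per_simd >= waves_needed ? "fits" : "does not fit");
       return 0;
    }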
18:03karolherbst: I guess we'll have to fix si_get_compute_state_info then
18:04mareko: I think LLVM miscompiled that shader
18:04karolherbst: ahh
18:04karolherbst: ohhh...
18:04karolherbst: it should use more SGPRs instead?
18:05mareko: it should have capped VGPR usage to 128 and spilled the rest
18:05mareko: so that it would fit on the CU
18:06karolherbst: the header is passed to LLVM, right?
18:06karolherbst: or is there maybe a flag radeonsi would have to set to tell LLVM to assume CU mode or something?
18:07karolherbst: maybe we should move this to #radeon as arsenm is there
18:07mareko: no
18:08mareko: this is another case of MESA_SHADER_KERNEL not being handled
18:08karolherbst: ahh
18:09karolherbst: mhhh... annoying
18:09karolherbst: maybe we really should get rid of KERNEL and move it into shader_info
18:10karolherbst: shader_info.cs.is_kernel
18:10mareko: yes that could fix a lot of things
18:11karolherbst: yeah.. let's do that
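[ed: a sketch of the refactor being agreed on here: drop the dedicated kernel stage and carry a flag in shader_info instead, so backend code paths keyed on the stage (wave size, CU/WGP mode, ...) treat kernels as ordinary compute shaders. The shader_info.cs.is_kernel field name comes from the discussion; setup_kernel_abi() is a hypothetical placeholder]

    /* Before: a separate stage every backend has to remember to handle. */
    if (nir->info.stage == MESA_SHADER_KERNEL)
       setup_kernel_abi(shader);

    /* After: kernels are compute shaders plus a flag, so all the
     * MESA_SHADER_COMPUTE paths apply automatically. */
    if (nir->info.stage == MESA_SHADER_COMPUTE && nir->info.cs.is_kernel)
       setup_kernel_abi(shader);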
18:32mareko: if it spills and you don't want spilling, we can implement and enable the WGP mode properly
18:34karolherbst: would it spill into SGPRs first?
18:34karolherbst: not sure how much of a perf difference it would make
18:38mareko: no
18:38mareko: SGPRs are the scarcest resource
18:39mareko: SGPRs are spilled to VGPRs, and VGPRs are spilled to memory
18:39karolherbst: ahhh
18:39karolherbst: guess I got it backwards then
18:41mareko: VGPRs are 2048-bit vector registers in wave64, while SGPRs are always 32-bit
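[ed: for concreteness: a wave64 VGPR holds one 32-bit value per lane, so 64 × 32 = 2048 bits (256 bytes) per register, and the 144 VGPRs above amount to 144 × 256 bytes = 36 KiB of register file per wave; an SGPR is a single 32-bit value shared by the whole wave]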
18:42HdkR: Map SGPRs to NVIDIA's Uniform Registers, and VGPRS to NVIDIA's "registers", close enough approximation that it works out :P
18:42karolherbst: yeah I know, I just thought scalar registers were per-thread :D
18:44karolherbst: uhhh.. anv already makes use of the KERNEL stage in very annoying ways
19:25karolherbst: :') 52 files changed, 145 insertions(+), 157 deletions(-)
19:32karolherbst: mareko: still getting 140
19:36karolherbst: what's the thing in radeonsi which should make sure LLVM spills VGPRs? Maybe it's indeed some bug somewhere besides the shader stage
19:47pendingchaos: I thought it was LLVM which determines the VGPR limit
19:47pendingchaos: maybe https://pastebin.com/raw/mT2gj4wX ?
19:51karolherbst: pendingchaos: yep, that fixes it
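[ed: for context, a sketch of the kind of fix under discussion: LLVM derives its VGPR cap from the declared work-group size, so the driver has to pass the real block size via the "amdgpu-flat-work-group-size" function attribute. The attribute and the LLVM-C call are real; the helper itself is hypothetical and not the actual pastebin patch]

    #include <stdio.h>
    #include <llvm-c/Core.h>

    /* Hypothetical helper: declare the true flat work-group size so LLVM
     * caps VGPR usage (spilling if necessary) instead of assuming a
     * smaller group and allocating more registers than the CU can hold. */
    static void set_flat_workgroup_size(LLVMValueRef fn, unsigned max_threads)
    {
       char buf[32];
       snprintf(buf, sizeof(buf), "1,%u", max_threads); /* "min,max" */
       LLVMAddTargetDependentFunctionAttr(fn, "amdgpu-flat-work-group-size", buf);
    }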
19:52karolherbst: I really should start doing hardware based CI with rusticl as I keep hitting bugs which aren't CL specific 🙃
20:05karolherbst: pendingchaos: will you write a proper patch and create the MR or should I do it?
20:22pendingchaos: you can do it
20:27karolherbst: okay
21:07karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23569
21:14DemiMarie: Why do GPUs use ioctl() and not io_uring?
21:15karolherbst: because it won't change anything really
21:15karolherbst: though maybe it has some advantages inside pointless microbenchmarks...
21:16karolherbst: but anyway, the real answer is: when mesa started there was just ioctl, and moving to io_uring doesn't really pay off.. probably
21:20psykose: it's also usually painful to have to actually maintain 2 ways of doing something at once just for one to be 2% faster
21:20DemiMarie: karolherbst: how can a GPU reset be anything but a kernel or firmware bug?
21:21psykose: unless all old kernel support is dropped you can't remove the non-io-uring one
21:21karolherbst: DemiMarie: ask AMD
21:21karolherbst: on AMD it's either full reset or broken GPU
21:21karolherbst: and if your shader hangs, for whatever reason, it's a full reset
21:21karolherbst: but anyway
21:21karolherbst: you can't validate GPU commands
21:21karolherbst: sooo.. userspace has the power of causing GPU resets
21:22DemiMarie: 1. Why can’t you validate GPU commands?
21:22DemiMarie: 2. Shouldn’t the GPU be robust against invalid commands?
21:22karolherbst: halting problem
21:23mareko: the GPU only has single-tasking for gfx and any gfx app can use an infinite loop, so the kernel has to stop it with a reset, it's a feature
21:23karolherbst: I don't know if I'd go so far as to call it a feature, but there is nothing else you can do, soo....
21:23DemiMarie: Are Intel and Nvidia better when it comes to GPU error recovery?
21:23karolherbst: yes
21:23mareko: freeze or reset, your choice
21:23DemiMarie: mareko: to me this is a hardware bug
21:23karolherbst: it is
21:24karolherbst: the hardware is incapable of recovery
21:24mareko: DemiMarie: halting problem is a hw bug?
21:24karolherbst: but new gens are supposed to implement some sort of recovery
21:24karolherbst: mareko: no, having to reset the entire GPU is
21:24karolherbst: at least the AMD way
21:24mareko: no we don't
21:25karolherbst: VRAM content is lost, no?
21:25karolherbst: or rather.. why does my display state get fully reset on GPU resets 🙃
21:25karolherbst: it doesn't have to be this way
21:25mareko: there are levels of resets
21:25karolherbst: other GPUs can just kill a GPU context and move on
21:26DemiMarie: A properly-designed GPU can support two or more mutually distrusting users, such that one user cannot interfere with the other user’s use of the GPU.
21:26karolherbst: yeah.. I agree
21:26mareko: if a shader doesn't finish, all shaders of that process are killed
21:27krushia: ever run into webgl malware (or poorly coded shaders on a web site)? it isn't fun
21:27karolherbst: I generally see more stuff getting killed
21:27DemiMarie: My understanding is that Intel, Nvidia, and Apple GPUs all meet this criterion (modulo implementation bugs).
21:27mareko: karolherbst: that's the next level
21:27karolherbst: yeah.. which I run into today with the wgp issue :)
21:27DemiMarie: mareko: that is arguably a blocker for virtGPU native contexts.
21:28DemiMarie: A VM must not be able to crash the host.
21:28karolherbst: AMD is improving the situation
21:28mareko: DemiMarie: it's also a blocker for virgl
21:28mareko: DemiMarie: if you want to argue that way
21:28karolherbst: but yeah.. it's kinda messy
21:28karolherbst: well.. other vendors are better at it
21:28DemiMarie: mareko: could virgl add explicit preemption checks during compilation?
21:29karolherbst: mhh.. should be possible
21:30karolherbst: but it's not only shader code
21:30karolherbst: like .. my bug today was the shader just using too many registers causing GPU resets...
21:30mareko: I'll just say that these assumptions and questions make no sense with how things work, virgl can't do anything and virgl is also a security hole for the whole VM
21:30karolherbst: there is a lot of things the kernel/host side would have to deal with
21:31mareko: security and isolation is not something virgl does well
21:31karolherbst: the point is rather, that a guest could constantly DOS the entire host
21:31karolherbst: but yeah...
21:31karolherbst: GPU isolation in general is a practically impossible problem
21:31karolherbst: though nvidia does support partitioning the GPU on the hardware level
21:31mareko: the native context does isolation properly, venus also does it properly, virgl doesn't
21:32karolherbst: and you can assign isolated parts of the GPU to guests which do not interfere with each other
21:32DemiMarie: mareko: are virtGPU native contexts better at this?
21:32mareko: yes
21:32DemiMarie: Intel also supports SR-IOV on recent iGPUs (Alder Lake, IIRC)
21:32karolherbst: yeah.. SR-IOV is probably the only isolation mode which actually works
21:33DemiMarie: karolherbst: why do virtGPU native contexts not work?
21:33mareko: but the gfx queue can only execute 1 job, so if a job takes 2 seconds, it will block everybody for 2 seconds even if it doesn't trigger the reset
21:33Lynne: I wish amd's gpus didn't fall over for something as simple as doing unchecked out of bounds access
21:34karolherbst: DemiMarie: because you don't properly isolate, but it could be good enough depending on the hardware/driver
21:34DemiMarie: karolherbst: define “properly isolate”
21:34DemiMarie: In a way that does not lead to circular reasoning
21:35karolherbst: hw level isolation of tasks and resources
21:35karolherbst: but stalls are something which is probably impossible to fix, but also not really problematic
21:35karolherbst: well.. depends
21:36karolherbst: don't want to do RT sensitive things on shared GPUs
21:36karolherbst: but I've also seen nvidia not being able to recover properly 🙃 sometimes the GPU just disconnects itself from the PCIe bus
21:37mareko: I love air conditioning
21:38karolherbst: I wished I had some
21:38zmike: air conditioning++
21:38mareko: karolherbst: actually compute and RT has shader-instruction-level preemption on AMD hw, not sure if enabled though
21:39mareko: outside of ROCm
21:40karolherbst: probably not for graphics
21:40karolherbst: but I was considering enabling it for some drivers
21:40karolherbst: it's not nice if compute can stall the gfx pipelines
21:41mareko: surprisingly compute shader-level preemption shouldn't need any userspace changes AFAIK, but you need to use a compute queue and whatever else ROCm is doing
21:42karolherbst: yeah...
21:42karolherbst: I have to figure this stuff out also for intel
21:42karolherbst: i915 is reaping jobs taking more than 0.5 seconds or something
21:42karolherbst: and some workloads run into this already
21:42mareko: the kernel implements suspending CU waves to memory
21:43karolherbst: yeah.. makes sense
21:43karolherbst: we really should add a flag on screen creation for this stuff
21:44mareko: CUs have multitasking for running waves, but they can't release resources without suspending waves to memory
21:44mareko: you can run 32 different shaders on a single CU at the same time if they all fit wrt VGPRs and shared mem
21:45karolherbst: yeah.. I think it's similar to nvidia. And by default you can only context switch between shader invocations
21:45karolherbst: and move them into VRAM
21:45karolherbst: (the contexts that is)
22:39DemiMarie: karolherbst: what do you mean by “hw level isolation of tasks and resources”?
22:39karolherbst: well.. work on the GPU takes resources and you can split them up, be it either memory or execution units
22:39DemiMarie: mareko: why not enable preemption for graphics on AMD GPUs?
22:39DemiMarie: yes, one can
22:40karolherbst: performance
22:40DemiMarie: that’s what page tables and scheduling are for IIUC
22:40karolherbst: nah
22:40karolherbst: I meant proper isolation
22:40DemiMarie: how big is the hit?
22:40DemiMarie: define “proper”
22:40karolherbst: splitting physical memory
22:40DemiMarie: hmm?
22:40karolherbst: and assigning execution units to partitions
22:41DemiMarie: why is that stronger than what page tables can provide?
22:41karolherbst: so you eliminate even the chance of reading back leftover state from other guests
22:41DemiMarie: I thought that is just a matter of not having vulnerable drivers
22:41karolherbst: because you don't even have to bother about tracking old use of physical memory
22:41karolherbst: you get a private block of X physical memory assigned
22:42karolherbst: that's the level of partitioning you can do with SR-IOV level virtualization
22:42DemiMarie: Is that how it is implemented on e.g. Intel iGPUs?
22:43karolherbst: I don't know
22:43karolherbst: but nvidia hardware can do this
22:43karolherbst: well.. recent one at least
22:43karolherbst: or rather those which support SR-IOV
22:43DemiMarie: which are unobtanium IIUC
22:44karolherbst: I think ampere can do it already as well
22:44DemiMarie: In my world, if it isn’t something the average user can afford, it doesn’t exist
22:44karolherbst: and generally consumer GPUs
22:44karolherbst: I think
22:44DemiMarie: Are you referring to https://libvf.io?
22:44karolherbst: mhh?
22:44DemiMarie: okay
22:45karolherbst: but anyway, you just have to set up the device partitions somewhere and then you have cleanly isolated partitions of the GPU
22:46karolherbst: each with their own page tables and everything
22:46DemiMarie: One question I always have about hardware partitioning is, “How much is actually partitioned, and how much of this is just firmware trickery?”.
22:46DemiMarie: Because the on-device firmware is a significant attack surface.
22:46karolherbst: on nvidia it's almost all in hardware
22:47karolherbst: you still have the firmware for context switching which is shared, but the actual resources used by actual workloads are all split
22:47karolherbst: (in hardware)
22:47karolherbst: there are fields to assign SMs to partitions, can do it even finer grained afaik
22:47karolherbst: same for VRAM
22:48DemiMarie: Even on consumer GPUs?
22:48karolherbst: yeah
22:48DemiMarie: What about the RM API?
22:48karolherbst: dunno if nvidia enables it though in their driver
22:48karolherbst: well.. I think GSP would still run shared as you only have one GSP processor on the GPU
22:48DemiMarie: Yeah
22:49karolherbst: but that's mostly doing hardware level things
22:49karolherbst: really doesn't matter much
22:49DemiMarie: I thought much of the RM API was implemented in firmware.
22:49DemiMarie: Stuff like memory allocation.
22:49karolherbst: nah, that stuff exists outside afaik
22:50karolherbst: that's performance sensitive stuff,
22:51karolherbst: but allocating memory is also very boring
23:33DavidHeidelberg[m]: mupuf: amdgpu:codename:NAVI21 also down? So I can also disable the VALVE farm?