07:33 MrCooper: pixelcluster: you're assuming that a process which uses GL cannot have any other purpose than using GL, which is false
07:33 pixelcluster: I'm not
07:34 pixelcluster: that particular case is of course pretty much the worst possible case
07:35 MrCooper: mareko: kindly point us to the spec language which says program termination is an allowed response to a GPU reset
07:37 MrCooper: pixelcluster: you are, you say "GL process" when you really mean "GL context"
07:38 MrCooper: a process may be using other GL contexts and doing other things entirely unrelated to GL
07:38 pixelcluster: MrCooper: yes I know, but there is no way to stop the context from running unless you stop the process from running (i.e. kill it)
07:39 MrCooper: the GL implementation has to be able to just ignore GL API calls for that context, otherwise it can't support robustness anyway
07:39 pixelcluster: why can't it?
07:39 pixelcluster: as for spec language see https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
07:39 MrCooper: because that's what it has to do until the context is destroyed
07:40 MrCooper: the robust context which had a reset
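    (Illustration - a minimal sketch of the "ignore API calls until the context is destroyed" behavior MrCooper describes; the function and the reset_status field are hypothetical, not actual radeonsi internals:)

        /* Hypothetical driver entry point; names are illustrative only. */
        void
        driver_draw_arrays(struct gl_context *ctx, GLenum mode,
                           GLint first, GLsizei count)
        {
           /* After a GPU reset, the robust context is dead: ignore all
            * work for it until the app queries the reset status and
            * destroys it. */
           if (ctx->reset_status != GL_NO_ERROR)
              return;

           /* ... normal draw path ... */
        }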
07:40 pixelcluster: ARB_robustness spec says "If robust buffer access is enabled via the OpenGL API, such indexing must not result in abnormal program termination." which implies that if it's not enabled, it may result
07:41 pixelcluster: MrCooper: eh I guess I see what you mean, but silently no-oping or carrying on is always bad imo
07:41 MrCooper: that's about out-of-bounds buffer access, not about GPU resets?
07:41 pixelcluster: my opinion on this is just that reset shouldn't kill non-offending processes so that the whole problem doesn't exist in the first place
07:42 MrCooper: radeonsi shouldn't ever kill the process in response to a GPU reset, because it makes things worse in some cases as reported by kode54
07:43 MrCooper: and it's not justified by the spec
07:43 daniels: mareko: the difference is that the program getting killed hasn't done any OOB access, it's just happened to be killed as a result of collateral damage from a reset
07:45 pixelcluster: > radeonsi shouldn't ever kill the process
07:45 pixelcluster: 100% agree :)
07:46 pixelcluster: I guess the disagreement comes down to how exactly that is supposed to happen, and in the case of responding to fatal resets it's pretty much "damned if you do, damned if you don't"
07:47 pixelcluster: perhaps yes, killing a process isn't exactly conforming with the spec, but neither is silently no-oping any commands issued by the app
07:47 MrCooper: why not the latter?
07:48 pixelcluster: simplest example is a shader storing data to some buffer that the app later expects to read
07:48 pixelcluster: or perhaps queries, or ...
07:49 pixelcluster: if you don't do anything you can't produce the results the specification says you should
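    (Illustration - a concrete version of pixelcluster's example using standard GL calls; shader setup is omitted and compute_prog is assumed to write to SSBO binding 0. If the implementation silently drops the dispatch after a reset, the readback below returns stale data rather than the values the spec requires:)

        GLuint ssbo;
        glGenBuffers(1, &ssbo);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
        glBufferData(GL_SHADER_STORAGE_BUFFER, 4096, NULL, GL_DYNAMIC_COPY);

        glUseProgram(compute_prog);                      /* writes binding 0 */
        glDispatchCompute(1, 1, 1);
        glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);   /* before readback */

        uint32_t results[1024];
        glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0,
                           sizeof(results), results);
        /* If the driver no-oped the dispatch, 'results' never holds
         * what the spec says it must. */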
07:49 MrCooper: The spec says: "In this case it is recommended that implementations should not allow loss of context state no matter what events occur. However, this is only a recommendation, and cannot be relied upon by applications."
07:49 MrCooper: i.e. it's best effort, doing nothing is acceptable
07:50 pixelcluster: no, that is not what that footnote means
07:50 pixelcluster: it simply means that whatever the app is doing it ideally shouldn't be able to cause any reset to happen
07:50 pixelcluster: but you cannot rely on not causing any resets
07:52 MrCooper: k, makes sense
07:52 pixelcluster: if you don't enable robustness (which is what the whole discussion is about), the GL spec doesn't give you any room and technically this "no loss of context state, ever" is mandated by the spec
07:55 pixelcluster: that's why I think per-queue reset/trying as hard as possible not to kill non-offending contexts in the first place is the right direction - it's the only approach where we can comply with that
07:55 MrCooper: that footnote says it's best-effort though?
07:56 pixelcluster: MrCooper: yeah, the way I read the spec that footnote relaxes this mandate from the core spec to a mere recommendation
07:59 MrCooper: spec issue 8 says "After a reset event, apps should not use a context for any purpose other than determining its reset status, and then destroying it"; can't find any other spec language offhand about what should happen if the app does anything else with the context, seems obvious though that apps won't always react immediately, so the GL implementation has to be able to ignore GL API calls for such a context
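    (Illustration - the conformant app-side flow the spec issue describes, sketched in C; context creation via GLX_ARB_create_context_robustness is abbreviated, and recreate_context_and_resources() is a hypothetical app-specific helper:)

        /* Request a robust context with reset notification (GLX shown;
         * EGL is analogous). */
        int attribs[] = {
           GLX_CONTEXT_FLAGS_ARB, GLX_CONTEXT_ROBUST_ACCESS_BIT_ARB,
           GLX_CONTEXT_RESET_NOTIFICATION_STRATEGY_ARB,
           GLX_LOSE_CONTEXT_ON_RESET_ARB,
           None
        };
        GLXContext ctx = glXCreateContextAttribsARB(dpy, fbconfig, NULL,
                                                    True, attribs);

        /* Per spec issue 8: after a reset, only determine the status,
         * then destroy the context and start over. */
        GLenum status = glGetGraphicsResetStatusARB();
        if (status != GL_NO_ERROR) {
           /* GL_GUILTY_CONTEXT_RESET_ARB, GL_INNOCENT_CONTEXT_RESET_ARB,
            * or GL_UNKNOWN_CONTEXT_RESET_ARB */
           glXDestroyContext(dpy, ctx);
           ctx = recreate_context_and_resources();  /* app-specific */
        }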
13:44 Venemo: agd5f: which queues currently support the new per-queue reset? all of them or is it just graphics and compute?
13:45 agd5f: Venemo, just graphics and compute. Although we are working on engine reset for sdma and vcn, but it's not quite as straight forward
13:45 agd5f: or as fine grained
13:47 Venemo: right, I guess the vast majority of submissions are to gfx and compute, though sdma and vcn hangs are very annoying (especially when they bring down your system)
13:48 Venemo: we were wondering about a hypothetical situation: if the kernel supported sdma per-queue reset, would that help gracefully resolve hangs caused by the kernel's own usage of sdma?
13:48 Venemo: we were also wondering how per-queue reset would work in the context of user queues, and what the current status of user queues is - the last time we tried using them, they didn't work at all
13:49 agd5f: user queues are actually the reason per queue reset exists.
13:49 Venemo: can you elaborate please?
13:50 agd5f: The idea is that with user queues each user has their own queue, so if you reset it, you only lose that user's jobs. While with kernel queues, you lose everything in the queue
13:50 Venemo: makes sense
13:50 Venemo: but do user queues really work now?
13:54 agd5f: we've been working on them for a while now, but there were a lot of problems to solve with respect to memory management and implicit sync. It's working pretty well on gfx11: https://gitlab.freedesktop.org/contactshashanksharma/userq-integration/-/commits/integration-staging?ref_type=heads
13:54 Ristovski: agd5f: Btw, I assume the Steam Deck's APU has two SMUs? It appears you _can_ get iGPU power usage on there
13:54 Ristovski: quite interesting
13:54 agd5f: Ristovski, no, just one
13:55 Ristovski: Oh? How come that functionality is exclusive to it then?
13:55 agd5f: platform specific feature maybe?
13:56 Ristovski: Hmm, I assume it's more of a hardware thing rather than SMU firmware thing?
13:57 agd5f: not sure off hand
14:26 Venemo: agd5f: can we expect them to eventually work on older gpus or is it meant to be a gfx11+ only feature?
14:27 agd5f: Venemo, only gfx11 and newer support them for gfx. We could enable them for compute on older parts, but probably not worth it.
14:28 Venemo: right, makes sense
14:29 Venemo: well, once we start using user queues, sounds like the gpu hang bringing down the entire system will be a thing of the past, at least on gfx11+ then
14:30 MrCooper: being nitpicky, even now they should just take down the GUI session, not the whole system, unless there's a bug in the reset handling (which could still happen with user queues)?
14:36 Ristovski: it does indeed bring down only the GUI session, not the whole system
14:37 Ristovski: (from the complete GPU reset)
14:37 Ristovski: the only bug I have seen is that on gfx90c a suspend+resume is needed to get the iGPU back into a normal power state, otherwise it's stuck sipping tens of watts and nvtop shows 100% usage
14:57 Venemo: MrCooper: I mean, you are technically correct, but what else is there other than the GUI session?
14:58 MrCooper: sounds a bit like "what else is there in a process than the GL context which had a GPU reset?" :/
14:58 MrCooper: there's a world outside of what we're working on, you know
14:59 Venemo: I mean, on a typical day I have 10 apps running inside a GUI session, taking down the GUI session means all my 10 apps disappear
15:00 MrCooper: yep, and you can log in again without restarting the system
15:00 Venemo: technically, there is no difference, the data from my 10 apps are lost either way
15:00 Venemo: sure, you are correct that somebody else may have a different experience, but this is the main complaint I'm reading from the vast majority of the users who experience problems like this
15:01 MrCooper: technically, there can be any number of processes running outside of your session (including other GUI sessions)
15:01 Venemo: I am not arguing that
15:02 MrCooper: don't assume everybody else is the same as you
15:02 Venemo: like I said, this is the main complaint I've been reading from the vast majority of users
15:03 Venemo: I am not talking about me, neither am I talking about everybody. I am simply talking about the complaints I've read, that's it
15:03 Venemo: why are we arguing?
15:04 mareko: the real problem is that the window system doesn't enable robustness for all its components, Windows seems to handle this properly
15:06 mareko: 967869
15:13 mareko: wave32 helps when 1 half of wave64 has empty exec while the other half is busy for too long
15:17 MrCooper: Venemo: using imprecise language inevitably results in confusion
15:19 MrCooper: mareko: I agree everything using robustness is the ideal solution, but it's not happening overnight, and meanwhile radeonsi killing processes after GPU reset is making things worse in some cases
15:22 MrCooper: it's also generating lots of useless bug reports
15:26 Venemo: MrCooper: sometimes I can't tell if you're joking or trolling, but for the sake of peaceful conduct I'll assume you're just joking
15:26 MrCooper: about what?
15:27 MrCooper: not that I ever intentionally troll
15:28 Venemo: the handling of GPU resets is a huge problem for many users and I think user queues + per-queue resets, once we get them working well, will be decent progress for users who can benefit from them
15:29 Venemo: I also agree on Marek's point regarding robustness
15:30 MrCooper: as I said, I agree on the end goal, we can't just ignore reality in the meantime though
15:31 Venemo: feel free to elaborate
15:32 Venemo: It's not clear to me what you are suggesting to do
15:54 MrCooper: revert radeonsi killing processes with a non-robust GL context after a GPU reset
15:54 Venemo: I don't see how that would help
15:57 MrCooper: it should solve kode54's issue, and it would stop the flow of useless downstream bug reports about processes killed by radeonsi
16:05 pixelcluster: downstream = what here?
16:05 pixelcluster: I wouldn't say it solves it, it merely works around the symptoms of the issue
16:10 bnieuwenhuizen: agd5f: would it be possible to have some short note/doc on how memory management works UAPI-wise with user mode queues? We have no BO list anymore, so how do we figure out what memory needs to be resident (especially for stuff where we can't use VM_BO_ALWAYS_VALID, which would otherwise seem like the obvious solution)?
16:10 Venemo: MrCooper: can you elaborate on what issue is being fixed exactly, in your opinion, by not killing processes after a gpu reset?
16:11 Venemo: I think I am missing some context here
16:12 bnieuwenhuizen: agd5f: IIRC I have some test stuff lying around from when the user mode queue patches first came out, would love to turn that into something for radv, but I was assuming the memory stuff was going to get new UAPI too
16:16 agd5f: memory allocation still works the same, other than being able to request doorbell memory
16:21 pixelcluster: agd5f: how does eviction work though? what if you ring the doorbell while some of your buffers are evicted to TTM_PL_SYSTEM or similar?
16:22 agd5f: pixelcluster, similar to rocm. We can preempt user queues, so if there is memory pressure, we preempt the queues and move the memory then resume them.
16:23 pixelcluster: I suppose that works as long as you don't need to move something to a location where it's inaccessible
16:23 pixelcluster: (e.g. when running out of both VRAM and GTT)
16:23 agd5f: yes, everything needs to be where it needs to be before the queues are resumed
16:26 mareko: MrCooper: possible solutions: 1) apps should be robust (worthwhile goal), 2) GPU resets shouldn't happen (worthwhile goal), 3) doing some other random thing when the device and display are lost (arguably not worthwhile)
16:27 MrCooper: radeonsi could just go back to what it was doing for years, which was less troublesome overall
16:27 mareko: it was less troublesome for some, and more troublesome for others
16:30 mareko: we could kill the process if we detect 2 hangs instead of 1, but GL won't be conformant after the first hang due to data corruption
16:35 bnieuwenhuizen: agd5f: so for imported memory we just don't have to list it anywhere after mapping it into the VA?
16:35 bnieuwenhuizen: sounds easy enough, thanks
17:13 mareko: bnieuwenhuizen: imported/exported memory must call the userq_signal ioctl for every IB to fence BOs in the kernel, and waiting code needs to call the userq_wait ioctl to get BO fences from the kernel
17:19 mareko: the userq_wait ioctl is technically not relevant to user queues
17:19 mareko: actually no
17:20 bnieuwenhuizen: in traditional residence management we might not even have fences in the BO yet, no?
17:20 mareko: let me correct what I said: you can still use the existing BO wait ioctl, but if you want to wait in the command buffer, the userq_wait ioctl returns the VM address and value for WAIT_REG_MEM
17:21 bnieuwenhuizen: mareko: is that also for memory management or only for implicit fencing?
17:21 mareko: so the usage for CPU wait is: userq_signal (fence BOs) --> gem_wait_idle
17:22 bnieuwenhuizen: implicit sync*
17:22 mareko: the usage for GPU wait is: userq_signal (fence BOs) -> userq_wait (to get input for WAIT_REG_MEM)
17:25 mareko: bnieuwenhuizen: I think it's just implicit sync, but the BO fence is a dma_fence, so it can be used in the kernel for other stuff
17:25 bnieuwenhuizen: thanks
17:25 bnieuwenhuizen: will try it out
17:30 mareko: bnieuwenhuizen: the gem_va ioctl will also have input fences and an output fence because user queues will have to insert waits for VM updates manually (the output fence), and VM updates will have to wait for any user queue work (input fences)
17:33 mareko: bnieuwenhuizen: the kernel frees memory when all BO fences are signaled, which is typically just the output fence of the gem_va unmap/clear operation
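    (Illustration - the GPU-wait half of the flow mareko describes, sketched after Mesa's PM4 helpers (PKT3/radeon_emit as in sid.h and the radeonsi/radv command-stream code); 'va' and 'value' stand for the address/value pair assumed to come back from the userq_wait ioctl, whose exact struct layout is not shown here:)

        /* Earlier: the userq_signal ioctl attached fences to the
         * shared BOs; userq_wait returned (va, value) to wait on.
         * For a CPU wait, gem_wait_idle is used instead. */
        radeon_emit(cs, PKT3(PKT3_WAIT_REG_MEM, 5, 0));
        radeon_emit(cs, WAIT_REG_MEM_GREATER_OR_EQUAL |
                        WAIT_REG_MEM_MEM_SPACE(1));  /* poll memory */
        radeon_emit(cs, va);                         /* addr lo */
        radeon_emit(cs, va >> 32);                   /* addr hi */
        radeon_emit(cs, value);                      /* reference value */
        radeon_emit(cs, 0xffffffff);                 /* mask */
        radeon_emit(cs, 4);                          /* poll interval */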
17:37 MrCooper: mareko: unless you can point to the spec language supporting it, killing the process after a GPU reset isn't conformant either
17:42 mareko: it's conformant in the sense that it prevents non-conformant behavior, but I think the spec should explicitly say that program termination is disallowed (not the other way around), which it doesn't do
17:43 mareko: something like this would have to be added into the GL spec: "program termination due to a non-robust lost context is disallowed even if not terminating could result in non-conformant behavior"
17:45 mareko: and non-conformant behavior includes terminating, so we basically end up where we are today
17:48 mareko: it's not possible to say that termination is disallowed and non-conformant behavior is allowed in the same sentence without nullifying the former
17:50 mareko: the same language could be identically rephrased to "program termination is disallowed but undefined behavior can follow, including program termination", which can be simplified to "everything is undefined"
18:25 mareko: so not only is program termination not disallowed, it's also impossible to add text to the spec that disallows program termination without making that text superfluous and logically without effect
18:30 Venemo: MrCooper: you keep repeating that you believe not terminating would be better, but you haven't yet explained what problem that would solve, nor your reasoning about why
18:37 mareko: not terminating would allow the system to continue ~50% of the time as if nothing happened, and end up in an infinite hang/reset loop the other ~50% of the time