IRC Logs of #dri-devel on irc.freenode.net for 2024-03-13

00:00 robclark: DemiMarie: so one thing that is currently being enabled is setting up a fixed mapping.. say take an 8GB window, setup an unbacked anon r/o mmap in vmm and map that to guest.. and then on host side dynamically map GEM buffers (backed by whatever) into that window and when they are unmapped overwrite the vmm mmap w/ anon r/o mmap... couldn't you do something like that for vram?
00:04 DemiMarie: robclark: Xen does not (yet) support unmapping emulated BARs via MMU notifier, I think.
00:05 DemiMarie: So the kernel driver can’t unmap anything.
00:06 robclark: tbh, I'm not super familiar with xen.. but from a hw standpoint, or if there are x86 vs arm differences in how this works... but shouldn't it be two independent stages of address translation, (va -> ipa -> pa)?
00:07 Lynne: bcheng: correct, only on navi3x, have tried to replicate on navi2x, but haven't seen it happen there
00:37 DemiMarie: robclark: yes, but the problem is that Linux doesn’t have control of the IPA -> PA page tables. That’s Xen’s job.
00:50 robclark: oh, hmm.. because the "host" is in a vm too.. idk, it mostly seems like a sw/xen problem but it seems like amd folks are interested in xen so are better connected to the problem
00:51 Lynne: airlied: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28143
01:31 DemiMarie: robclark: my concern is that the fix will be AMD-specific 😢.
01:38 robclark: I don't think amd's needs are really any different than other dgpu, but I guess that will be a topic to examine when patches are posted
01:50 DemiMarie: I just hope they don’t fix it with a hack in their driver.
08:26 pq: zamundaaa[m], https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-hot-unplug etc. should give you enough justification to demand fixes re: gpu reset during page flip. Either the reset works, or the opened DRM device is dead. It cannot be half-broken.
08:27 pq: zamundaaa[m], or maybe the opened DRM device is dead? EBUSY is just the wrong error code for it.
08:39 jani: tzimmermann: can I get that drm-next -> drm-misc-fixes backmerge you promised? I didn't think I'd need it, but apparently I do. for this: https://lore.kernel.org/r/cover.1709913674.git.jani.nikula@intel.com
08:40 tzimmermann: jani, sure. give me a minute
08:40 jani: tzimmermann: hmm the above came out as if I was impatiently waiting for it, and you'd failed to deliver. not at all what I intended! sorry
08:41 tzimmermann: no offence taken
08:41 tzimmermann: but i can only backmerge from drm-fixes into drm-misc-fixes
08:41 jani: right
08:41 jani: not sure if it has everything, let me check
08:42 tzimmermann: if a patch does not apply to -misc-fixes, you can still add it to -misc-next-fixes
08:42 tzimmermann: but let me backmerge first
08:42 jani: it doesn't, actually
08:43 jani:has distinct feeling of -ENOCOFFEE
08:44 jani: okay, it all applies to -misc-next-fixes, shall I shove them there together?
08:45 tzimmermann: let me do a test build of the fixes backmerge and get back to you in a few minutes
08:45 jani: tzimmermann: thanks
08:45 jani: hmm, except exynos, they maintain their own tree don't they? but no replies :(
08:46 tzimmermann: they do, but we occasionally have exynos patches in drm-misc. if it's just the fix, it should be ok.
09:10 jani: tzimmermann: roger
09:11 kusma: Not sure what's up with the a630 runners, but they seem to fail everything ATM: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/56224901#L95
09:12 kusma: Looks a bit like something went wrong earlier in the lava process, and some artifact hasn't actually been built, or something?
09:15 kusma: Actually, it seems that artifact *used* to be available, but is no longer?
09:39 tzimmermann: jani, i pushed the backmerge to drm-misc-fixes. it's now at -rc7
09:51 jani: tzimmermann: ack
09:59 tintou: eric_engestrom: FYI deqp-runner isn't supervised by Marge https://gitlab.freedesktop.org/mesa/deqp-runner/-/merge_requests/69 😉
10:00 eric_engestrom: tintou: thanks! I would've noticed eventually xD
10:04 eric_engestrom: kusma: yeah, the bucket was configured to auto-delete the kernels after a couple of months, and apparently it's really hard to configure them better because they've been working on it for many months now 😅
10:05 eric_engestrom: the workaround so far has been for someone who has access to the gfx-ci/linux repo to re-run the jobs to re-compile & upload the kernels
10:05 kusma: OK, uh, can someone please do that so I can merge again? :D
10:06 eric_engestrom: to find the pipeline to re-run, you take the version in the url that's failing and stick it in this search: https://gitlab.freedesktop.org/gfx-ci/linux/-/pipelines?ref=v6.4.12-for-mesa-ci-f6b4ad45f48d
10:06 eric_engestrom: (that's the one for your case for instance)
10:06 eric_engestrom: anyone who has access can re-run the jobs in the latest pipeline on that page
10:07 kusma: Looks like something is running right now?
10:07 kusma: Or was, now it's pending, I guess?
10:07 eric_engestrom: yup, looks like DavidHeidelberg just re-ran them
10:07 eric_engestrom: "Created just now by David Heidelberg"
10:07 kusma: Awesome :)
10:08 eric_engestrom: yeah, last re-run was feb 11, so I think it's exactly every 2 months that someone has to go around and re-run everything
10:08 DavidHeidelberg: I have MR for dropping them, just didn't manage to stress-test it enough, since I know some freedreno problems popped a bit randomly, so hopefully 1. phase soon these will be dropped and then we move new jobs to 4ever-lasting s3 bucket of Images
10:08 eric_engestrom: would be nice if the bucket was no longer configured to delete everything 🙃
10:09 DavidHeidelberg: y, before Daniel left, he threw at us nice prepared bucket for that :D
10:10 eric_engestrom: please ping me on the MR when you merge it, I will need to make sure it's backported to the stable branch(es)
10:56 kusma: Do we not have any windows runners ATM? https://gitlab.freedesktop.org/mesa/mesa/-/jobs/56228076
11:48 zamundaaa[m]: pq: if I can restart the compositor and it works again, then the drm device can't be dead. I think it's just the logic around pageflip timeouts not being built to allow the compositor to recover
11:49 pq: zamundaaa[m], I made a difference between "DRM device" and "opened DRM device".
11:50 pq: maybe you need to close the device and open it again, but what would tell you to do that...
11:50 zamundaaa[m]: Good idea, that might be a usable workaround for when pageflip timeouts happen
11:51 pq: if you really are supposed to re-open the device, then it would be similar to the case of "hot-unplug + hotplug", which a severe device reset might look like.
11:52 pq: I still think the error must be something else than EBUSY for that case.
11:52 zamundaaa[m]: I don't think compositors are supposed to do that, the kernel is just broken. Hotunplug does have the same problem too, pending pageflips never arrive, which tripped up KWin's logic before
11:53 zamundaaa[m]: With hotunplug of course, it can be easily worked around because you don't need the device anymore. Just not waiting for the pageflip to arrive is fine there and avoids all the issues
11:53 pq: you mean compositors are not supposed to re-open a device?
11:54 zamundaaa[m]: I don't think that should be needed
11:54 pq: it's not a literal re-open though, it's more like the DRM device disappeared, and another DRM device appeared, and by luck it might be re-using the DRM device node name or not. So it's really no different from hot-unplug.
11:55 zamundaaa[m]: If the kernel were to actually signal device removal + re-adding it for GPU resets, that would break compositors though
11:55 pq: (maybe the DRM docs said something about re-using device node names, I forget)
11:55 zamundaaa[m]: As in, currently KWin just quits when the compositing GPU gets removed
11:56 pq: Compositors have to be able to handle device removal and additions anyway, because eGPU.
11:56 pq: currently they don't, I think, but they should
11:57 pq: a GPU reset could be harsh enough that it really looks like device removal, e.g. when all VRAM contents are lost.
11:58 zamundaaa[m]: Yes, that would be nice and is planned for KWin, but doesn't change anything for the kernel regression policy. If the primary GPU gets unplugged, the compositor goes down right now, so the kernel can't do that for GPU resets that otherwise work
11:58 pq: but, if the GPU reset is not that harsh, then the kernel really does need to make sure that pending flips are eventually completed.
11:58 pq: what's the regression?
11:59 zamundaaa[m]: KWin quits when it could recover just fine if the GPU reset was just treated like a normal GPU reset
11:59 zamundaaa[m]: From an EGL perspective, the EGLDisplay stays valid during a GPU reset, so actually removing the device would break all the apps too
12:00 pq: all that depends on what the reset does, right?
12:01 pq: I'm saying, there are only two possible reasonable outcomes: either the page flip completes eventually, or the device gets effectively removed, depending on the kind of reset.
12:01 zamundaaa[m]: Unless the GPU actually gets hotunplugged, the EGLDisplay always stays valid. Even then it stays valid, you just can't create new contexts with it
12:01 pq: and I think you can use the existing DRM docs to make your case on that for a bug report
12:02 pq: I didn't say anything about EGLDisplay yet.
12:03 pq: I do say, that GPU does not need to get physically unplugged for the DRM device to be removed. A harsh enough reset or failure can do that too.
12:04 zamundaaa[m]: It doesn't get removed even with a full reset or failure. In the latter case it maybe should, but that's not super relevant for me - either way the screens don't get updated anymore
12:05 zamundaaa[m]: But if we were to add new drm uAPI to signal "you might wanna reopen the drm node" then that would be an acceptable solution. Only the actual udev remove event would break stuff
12:05 pq: I'm actually puzzled why restarting kwin works.
12:06 pq: if a page flip is stuck, it should still remain stuck, even if the KMS client goes away, and the new KMS client should still get EBUSY.
12:06 pq: maybe the driver does have a timeout, but it's longer than kwin's timeout?
12:07 pq: or just takes a little bit longer than you wait to recover
12:07 pq: but you said kwin won't recover even if it waits indefinitely?
12:08 zamundaaa[m]: It didn't recover even after waiting for a while
12:08 pq: unless maybe a switch to fbcon in the mean time hammers more resets in, like a full modeset
12:09 pq: quite strange
12:12 zamundaaa[m]: Yes. Either way, whatever is going on, we need a better way to deal with pageflip timeouts. Or any way to deal with them really
12:18 pq: zamundaaa[m], IMO a page flip timing out is one of two things: a driver bug, or the device disappeared (which means you don't get EBUSY if you try to flip again, you get a different error).
12:20 pq: Device disappearance is communicated in two ways: udev device removal event, and UAPI returning errors. The errors might not be unique to device disappearance, but I think EBUSY is not it.
12:23 zamundaaa[m]: I agree, but currently it's a driver bug that isn't being taken care of. If the driver can detect the pageflip timing out (all of them do), then it could also do something about it in many cases
12:24 pq: Yes. You have the grounds to complain about a driver bug, and the DRM docs can back you up. :-)
12:25 zamundaaa[m]: I'll make a thread on the mailing list about it
12:25 pq: cool!
13:59 MrCooper: pq zamundaaa[m]: maybe closing the DRM file description happens to clean up whatever was lingering and causing EBUSY
14:02 pq: that would be a bug, IMO
14:03 pq: but of course it could
14:08 MrCooper: indeed it would, just a possible explanation why restarting the compositor worked
15:23 randevouz: I am incapable of reading w3c too much material, i have not entirely given up, the way i see things my own is scarily similar, march 5 dri-devel irc under babylonian nick, i rambled with transition to smaller value, in fact if every alu bank transitions it's also possible to upconvert per alu then transition then branch, which is what i would call a triplet, and that can be done with batched queries, technically based of tests
15:23 randevouz: and logic or common sense that would also have to work, as long as predicate eliminates the odd or even branch correctly. I would have to elaborate that it's definitely possible, but my failure to provide anything useful that straight functions is already pissing off everyone.
15:34 Lynne: tchar: airlied: how is film grain supposed to be signalled as not supported on intel?
15:34 Lynne: the return codes for vkGetPhysicalDeviceVideoCapabilitiesKHR are strictly specified and mention no cases where film grain may not be supported
15:36 dj-death: airlied: since you're getting poked on video stuff, any thought on https://gitlab.freedesktop.org/mesa/mesa/-/issues/10738 ?
15:55 randevouz: I do not want to write new language or duplicate stuff, but i am really in trouble with filtering all the specs, feels like time would be better served if doing from scratch. https://github.com/SmartDataAnalytics/minds , if i put all the links i work/digest on the channel gets full of my text only. I can not process so much info it seems alone and to find a pin from the haystack, is more difficult than to code it my own, which
15:55 randevouz: also takes months to get anywhere, damn it seems like most links are click bates possibly , words are invitingly intriguing but code does not seem right , no compilation or testing done, i think the code looks not correct so far among those projects i have investigated.
16:30 tchar: Lynne: that looks like an oversight in the spec, we never had an implementation to test with that didn't support filmgrain at all, so this likely never came up :(
16:32 tchar:though I understand Intel does support filmgrain, albeit with older generations following a "non-standard" algorithm.
16:34 Lynne: nah
16:34 Lynne: it simply did a shader filmgrain
16:34 Lynne: which vulkan can't do
16:34 Lynne: don't fall for their marketing scams
16:37 tchar: heh, but does it matter how they do the filmgrain in the driver? anything passes as a conformant filmgrain process it would appear
16:38 bcheng: I thought the film grain support reporting is in the video profile
16:38 tchar: bcheng: it is there, but it seems we missed adding an error code for cases where the implementation can't support it when you signal it in the profile
16:40 tchar: maybe we could rely on VK_ERROR_VIDEO_PROFILE_CODEC_NOT_SUPPORTED_KHR, for the AV1 specific codec bits in theory, but it's a bit opaque
16:44 mareko: DemiMarie: you are overestimating the resources that we have - if you want Xen to work perfectly on a specific chip and configuration, you can sign a contract with AMD
17:02 bcheng: tchar: oh, i see...
17:12 DemiMarie: mareko: and that would be far beyond the resources *we* have. We need Xen support to work on past, current, and future chips, too. The only way I know to do that is to make Xen completely transparent to the driver, so that code that works without Xen automatically works with it.
17:32 alyssa: ugh glmark
17:33 alyssa: don't use SRC_ALPHA blending and then just write a constant 1.0 in the FS >.<
18:06 airlied: Lynne: i think you are meant to fail early if user specifies filmgrain and you dont support it
18:07 airlied: i dislike that as i dont think its discoverable or scalable
18:07 Lynne: yeah, when you submit the profile to test, but the error code is what I'm wondering about
18:08 airlied: dj-death: yeah no good ideas there, except use huc fw i think
18:08 Lynne: tchar: their driver may lie to us and do film grain by itself, but mesa wouldn't do that, would it?
18:14 DemiMarie: mareko: there might be a better option, though.
18:30 alyssa: Lynne: "you wouldn't film grain in mesa" meme template here
18:33 mareko: alyssa: st/mesa could optimize that
18:36 alyssa: mareko: It could... not sure if anything other than this goofy benchmark hits it tho
18:41 tchar: Lynne: i hope not! In case you really don't want the driver to do film grain, you can forcibly flag apply_grain off in the std headers and handle it in the application
18:57 tjaalton: gfxstrand: hi, after enabling nvk it seems to be trying to find syn using pkg-config, which seems wrong?
19:04 dj-death: airlied: yeah :(
19:05 dj-death: airlied: or do the nasty kind of tricks we're doing with the compute queue
19:05 dj-death: airlied: when some operation isn't supported, call in another engine to do work for you
19:06 dj-death: airlied: in that case could do a compute shader generating commands for you
19:06 dj-death: airlied: mentaaaaal
20:18 alyssa: dj-death: intel-clc thanks you
20:18 gfxstrand: tjaalton: Yeah, you need to set MESON_PACKAGE_CACHE_DIR to tell it to look at the debian packages
20:29 dj-death: alyssa: it's super slow though
20:57 Calandracas: does the llvmpipe rusticl backend behave meaningfully different from the other backends? radeonsi, iris, and nouveau are all giving me the expected (correct) results, but llvmpipe is outputting pure garbage
20:59 Calandracas: ^ could obviously be bad code too, am trying to rule out my code being the problem
21:06 alyssa: dj-death: aww
21:13 mort_: wow panthor is merged, I didn't realize that; this is extremely exciting
21:47 alyssa: =D
21:55 dejavuaround: Mar 13 18:23:19 <randevouz> in theory it's simpler to do the proposed triplets, cause otherwise the compiler analysis goes more difficult, you first upconvert/uptranslate a value in first alu then downtranslate it to smaller, in other words alu has some permutes, and it selects one, but from the value one digit is missing, it kind of clips onto that gap, and the subtraction result exposes the upconvert which needs a predicate that has
21:55 dejavuaround: to be smaller and
21:55 dejavuaround: Mar 13 18:23:19 <randevouz> downconvert, if every alu transitions this way no compiler analysis needs to be done it's space complexity tradeoff, you trade slightly more space by complexity, otherwise compiler has to keep tables of alus that go smaller and bigger those are splitting the instructions into two half's, so it only upconverts when it has to. but the space needed to upconvert is very tiny compared to alu permute banks, so the
21:55 dejavuaround: analysis is not worth
21:55 dejavuaround: Mar 13 18:23:19 <randevouz> it.
21:55 dejavuaround: Mar 13 18:48:25 <randevouz> in theory that is not so complex , since operands are odd and even do not need predicates, branches need predicates, and on the fork maximum alu amount predicates mandate the predicate usage, so if you take shorter branch , you always predicate if you take longer branch you could predicate only if shorter one has already graduated, and also every alu must transition to smaller value again, cause this is for
21:55 dejavuaround: book keeping and so to
21:55 dejavuaround: Mar 13 18:48:25 <randevouz> speak scaffold or garbage elimination, otherwise you immediately get stale or wrong content, and under this assumption that hashes are produced so that every alu transitions, it can always be done in parallel, in theory offers huge throughput and low latency, massive fast computation.
21:55 dejavuaround: Mar 13 19:07:19 <randevouz> and my phone is being hammered by all sorts of personal subjects, with click bates as i am the hero, well my definition is not matching about heroism, some immune system strengths i have , but in reality i would had considered myself a hero if i saved myself in the past from huge trouble i get into, by having been just smarter, and my relations with love of my life had long since finished, she finished it,
21:55 dejavuaround: and i confirm i no
21:55 dejavuaround: Mar 13 19:07:19 <randevouz> longer have plans with that lady, she treated me very bad , and cheated only, the last which is another subject of this flood i get on my phone. Sure she is sorry but she never liked me in real time and i care not about it anymore.
21:55 dejavuaround: Mar 13 19:52:15 <randevouz> Xen programmers are the only ones that fully support vidpn from ms apis, it is Microsoft's Randr, i wanted to do it too, but looked that they managed it pretty good. And of course i was very insulted by humiliations of sexual kind and i do no longer deal with any of the trash people who did it daily basis, and likely they get charged/sanctioned/penalized , but i am not disappointed about technology, i see all
21:55 dejavuaround: being done except my
21:55 dejavuaround: Mar 13 19:52:15 <randevouz> last project, which is upsetting one, in a way that none would want to share such code it puts one on the highlight immediately too, where many do not want to be at. all except the super trooper engines actually have been implemented and it was much harder work, then the engines which would upset the world and cause a lot of trouble.
21:56 dejavuaround: Mar 13 19:53:09 <randevouz> *than
21:56 dejavuaround: Mar 13 20:06:57 <randevouz> so it's indeed possible but i leave now, someone might have it, lot of institutions go down if something alike gets enough popularity and usage, puts a lot of good people at risk, and assaults are granted against the publisher too. Chaotic results likely indeed. So i leave from that research now, and do not share much either for safety.
21:56 dejavuaround: For fuck sakes you do things wrong, does not matter what grain it is, you could do whatever you want if you had brain.
21:56 dejavuaround: So my resignation is given you are all retards. They message me when you get handled all.
21:56 dejavuaround: so they took my voice, fuck you assholes.
22:19 Lynne: airlied: getting decoding issues with https://files.lynne.ee/av1_test_ext.ivf after the first 30 frames or so