IRC Logs of #dri-devel on irc.freenode.net for 2024-01-09

01:47 mareko: alyssa: you can still make it parallel within a single compute workgroup, but it can only be 1 workgroup
01:50 DavidHeidelberg: eric_engestrom: "Please split the AMD common revert into its own MR, so that it gets tested on all the farms 😉" can we assume it's still working the same way as a few commits before?
01:50 DavidHeidelberg: while I'm the person who was yelling "never mix farm enable/disable", this saves one pipeline without side-effects
01:53 mareko: alyssa: specifically, a loop within a compute shader that runs as 1 workgroup processing the whole index buffer, executing e.g. 1024 invocations for that workgroup
01:56 mareko: a single workgroup likely has 128 execution SIMD lanes, that's a lot of parallelism outside the reduce op you need for primitive restart
01:58 mareko: or prefix sum
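mareko's suggestion above (one workgroup looping over the whole index buffer, with a reduce or prefix sum across its lanes) can be sketched roughly as follows. This is a CPU-side Python simulation only, not shader code and not taken from any driver: `WG_SIZE`, `RESTART`, `workgroup_inclusive_scan` and `strip_ids` are hypothetical names, and the restart value 0xFFFF for 16-bit indices is an assumption.

```python
RESTART = 0xFFFF  # assumed primitive-restart value for 16-bit indices
WG_SIZE = 4       # stand-in for the e.g. 1024 invocations of the one workgroup


def workgroup_inclusive_scan(values):
    """Hillis-Steele inclusive scan over the lanes of one workgroup.

    Each pass models all lanes reading the value `step` lanes to their
    left in parallel; on a GPU a barrier would separate the passes.
    """
    vals = list(values)
    step = 1
    while step < len(vals):
        vals = [v + (vals[i - step] if i >= step else 0)
                for i, v in enumerate(vals)]
        step *= 2
    return vals


def strip_ids(indices):
    """Label every index with the number of restarts seen up to and including it."""
    out = []
    carry = 0  # running total carried between loop iterations of the workgroup
    for base in range(0, len(indices), WG_SIZE):
        chunk = indices[base:base + WG_SIZE]
        flags = [1 if idx == RESTART else 0 for idx in chunk]
        scanned = workgroup_inclusive_scan(flags)
        out.extend(carry + s for s in scanned)
        carry += scanned[-1]
    return out
```

The outer loop is the only serial part: each iteration handles `WG_SIZE` indices at once, and only the `carry` value links one chunk to the next.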
02:08 DavidHeidelberg: eric_engestrom: nvm, split... aaahgh
02:47 alyssa: mareko: sure :)
08:14 eric_engestrom: DavidHeidelberg: yeah, it's a revert of a revert commit, and it was originally merged in its own MR, so the chances of reverting that commit causing issues are low, but I think it's better to test it anyway
08:14 eric_engestrom: sorry for being annoying 🙃
08:14 eric_engestrom: *revert of a recent comit
08:15 eric_engestrom: *commit (/me is clearly not awake enough to type yet)
08:30 MrCooper: tango_ karolherbst: yeah that's not a kernel issue, GPU virtual addresses are assigned by user space
08:31 MrCooper: tango_: BTW, isn't ROCm OpenCL open source as well? Thought so, not sure though
08:39 any1: Has anyone else run into this weird issue? https://mastodon.social/deck/@andriyngvason/111725065042427990
08:43 any1: It can't be a software issue unless the HDMI signal actually depends on scheduling in the kernel, and that's not the case, right?
08:43 emersion: any1: anything in dmesg?
08:43 any1: emersion: nope
08:44 emersion: in particular, underrun errors?
08:44 any1: none
08:52 any1: Another, perhaps interesting, data point is that if I force enable the hdmi connector (so it thinks it's always connected), yanking and plugging the cable back in is much likelier to result in blank display.
09:57 dolphin: airlied, sima: Any ETA for pulling the last drm-intel-gt-next PR, we could then move towards drm-intel-next-fixes?
10:09 airlied: oh did I miss that? I thought I'd caught them all
10:09 airlied: I'll pull it tomorrow
10:10 airlied: dolphin: https://patchwork.freedesktop.org/patch/572237/ that one
10:10 airlied: ?
10:11 dolphin: yes
10:12 airlied: okay sorry, not sure how I missed it, will pick it up tomorrow
10:12 dolphin: sounds good
10:24 jlawryno: Hi, can someone help me out with dma-buf cache coherency?
10:24 jlawryno: there is this line in drm_gem_shmem_helper: `shmem->map_wc = false; /* dma-buf mappings use always writecombine */`
10:25 jlawryno: that seems a little confusing
10:25 jlawryno: are dma-bufs really always WC?
10:25 jlawryno: looks like dma-buf system heap is cache coherent
10:58 tango_: MrCooper: I know some parts are open source, but don't know how much of it. the big question remains if I should report the issue to Mesa or not given that it seems to arise from a conflict with ROCm
10:58 tango_: my last report about issues with Clover + ROCm wasn't exactly enthusiastically received
12:21 karolherbst: tango_: it sounds like a kernel bug to me honestly
12:22 karolherbst: and if it used to work, it's even a proper regression
12:22 karolherbst: so you could either git bisect it to pinpoint the commit breaking it or simply report that something broke it
12:40 pepp: tango_: the "bo conflict" issue sounds like a recent kernel regression
12:41 pepp: tango_: see https://gitlab.freedesktop.org/mesa/mesa/-/issues/10303 - there's a kernel patch linked, can you try it?
13:56 Lynne: Venemo mareko: have you had time to look at the mesh shader issue yet?
13:56 Venemo: Lynne: which one?
14:03 javierm: sima: I just noticed that msm basically does the same as what I wanted to do in my alternative patch and they even have a comment mentioning to avoid having the firmware in the initrd
14:03 javierm: sima: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/msm/msm_drv.c#L354
14:04 sima: ugh
14:04 sima: at least some locking
14:04 sima: but still ... ugh
14:05 sima: javierm, the thing is, what do you do if the fw is still not there?
14:05 sima: I think msm goes with "can I offer you an Oops in these trying times" ...
14:05 sima: robclark, ^^
14:06 javierm: sima: you mean with probe deferral or open? Failing to open is what msm does IIUC
14:06 sima: I don't see where it bails out
14:06 sima: hence the snark :-)
14:07 javierm: sima: ah, sorry you are right, load_gpu() has a void return value
14:07 sima: yeah and context_init doesn't check for priv->gpu
14:07 javierm: my patch for powervr makes the open fail
14:07 sima: msm_postclose does check for lack of ->gpu, but at that point you've already gone boom
14:07 javierm: sima: yeah
14:07 sima: so I mean it ... works
14:08 sima: dri1 also worked in its fashion
14:08 sima: it's just a pretty gross violation of the linux device model, or at least feels like that to me
14:08 sima: and so I think should be fixed there somewhere
14:08 linusw: My dim update is now failing like this: "Fetching drm-xe (local remote kernel)... git@gitlab.freedesktop.org: Permission denied (publickey)." hmmm?
14:09 sima: rodrigovivi, thellstrom, demarchi ^^
14:10 javierm: sima: I see. Agree with you, I'll explore other alternatives then. Your suggestion of listing the firmwares used by built-in drivers makes a lot of sense indeed
14:11 sima: javierm, I think the officiated probe_defer is also ok, if gregkh acks it
14:11 sima: it looks like that's at least what cros wants/needs
14:11 javierm: or just built as a module and forget this issue :)
14:11 sima: or well, implemented, maybe it was just an accident they picked this one
14:12 thellstrom: linusw: You need to either register an ssh key with gitlab.freedesktop.org, or use the https: URL for the drm-xe repo, (try cd src; git config -e to replace the URL).
14:12 sima: thellstrom, the nightly.conf should list the https: remotes too to make this work automatically
14:12 sima: or more automatically
14:13 sima: hm it's there ...
14:13 thellstrom: sima: It does, but how do you make dim select the https url when creating a remote?
14:13 sima: thellstrom, should we have the https first so it's the one you get set up with automatically?
14:13 linusw: thellstrom: aha this one git is at gitlab instead of freedesktop.org I get it
14:14 Lynne: Venemo: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10360
14:14 sima: thellstrom, at least until more drm repos are on gitlab and so most committers are set up with gitlab ssh keys ...
14:14 linusw: hm it should probably also suggest "drm-xe" as the name for the git remote instead of "kernel"...
14:14 sima: otherwise I think we'll have a lot of fun right now
14:15 linusw: I already have a gitlab setup so I try to fix some proper ssh key.
14:15 sima: linusw, yeah that's an oops, I thought dim goes with the abstract dim repo name "drm-xe" from nightly.conf ...
14:15 thellstrom: sima: Ok, I can change that.
14:15 thellstrom: linusw: Yeah "kernel" isn't a good suggestion, but that requires dim script changes. We're slowly getting to those....
14:16 sima: thellstrom, hm maybe sprinkling pick_protocol_url over a few places in dim might help for this?
14:16 linusw: thellstrom: thanks, I just renamed the remote and it seems to work fine.
14:16 sima: (I haven't caught up on all the dim discussions in the xe thread, just marked those as read)
14:21 thellstrom: sima: We're basically trying to land the xe documentation changes first, and then to the dim script changes.
14:23 sima: thellstrom, yeah sounds good
14:23 sima: the nightly.conf url switch as stop-gap is probably enough for now
14:24 sima: tursulin, ack on calling it drm_fdinfo_gem_bo_is_shared or something like that
14:24 sima: makes sense to limit where it should be used ...
14:25 sima: or whatever you feel is a good name that makes this clear
14:25 linusw: thellstrom: after renaming the remote to drm-xe and adding an SSH key everything works like a charm.
14:27 thellstrom: linusw: Great. Was the renaming required to make it work?
14:32 linusw: thellstrom: I don't know I was too scared to test it without giving it a proper name :D
14:32 thellstrom: ;)
14:33 demarchi: thellstrom: it seems "kernel" is the default suggested by dim since it just uses the url for that
14:33 demarchi: we'd need to pass the repo down to url_to_remote for it to use the name from the manifest
14:34 thellstrom: demarchi: Yes. There are a couple of dim script changes to make things work properly with xe.
14:34 demarchi: thellstrom: are you going to work on that?
14:34 sima: demarchi, yeah it's a bit more surgery to have the repo name there :-/
14:34 thellstrom: sima: There appears to be an unresolved drm-tip conflict between drm-fixes and drm-next
14:34 tursulin: sima: Fdinfo part is fine (although why not stay in the _memory_stats_ namespace?) but no other api currently has "gem_bo" as part of the name. On top of ideas I already came up with in the email maybe drm_fdinfo_gem_object_is_shared works for both?
14:35 sima: tursulin, either is fine imo
14:35 sima: gem_bo was just because the usual gem_buffer_object was more typing :-)
14:36 sima: ah it's just gem_object for functions
14:36 tursulin: sure, but afaics there are only ttm_buffer_objects, the gem ones are just gem_object
14:36 thellstrom: demarchi: I haven't started on that, just noted down a couple of changes so if you want to take a look that'd be great. IIRC you are pretty fluent with bash scripting.
14:37 sima: tursulin, yeah I'm honestly not sure why my brain picked up the buffer_object one out of a dusty corner ...
14:37 sima: demarchi, pick_protocol_url in the right places might also help, if you look into this
14:39 demarchi: sima: it also failed to detect the ssh repo as I had it as git@... instead of ssh://git@
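The two URL problems discussed above (the suggested remote name "kernel", and a `git@...` URL not being recognised as an ssh remote) can be illustrated with a small sketch. dim itself is a bash script and this is not its code: `normalize_ssh_url` and `name_from_url` are hypothetical helpers, and the `.git` suffix handling is an assumption.

```python
import re


def normalize_ssh_url(url):
    """Rewrite scp-like 'user@host:path' as 'ssh://user@host/path'.

    URLs that already carry a scheme (https://, ssh://, ...) pass through.
    """
    if "://" in url:
        return url
    m = re.match(r"^([^@/\s]+@[^:/\s]+):(.+)$", url)
    if m:
        return f"ssh://{m.group(1)}/{m.group(2)}"
    return url


def name_from_url(url):
    """Derive a remote name from the last path component of the URL."""
    last = normalize_ssh_url(url).rstrip("/").rsplit("/", 1)[-1]
    return last[:-4] if last.endswith(".git") else last
```

With a scheme like this, any URL ending in `drm/xe/kernel` yields the name `kernel`, which is why the repo name from the manifest would have to be passed in separately to get `drm-xe`.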
14:41 Venemo: Lynne: it doesn't seem to be a shader issue to be honest, so I don't know what we can do about it
14:41 thellstrom: demarchi: Other things: "list-branches" doesn't pick up xe nor the topic branches on drm (only drm-intel and drm-misc are listed), and then there are the cherry-pick-fixes that hardcodes drm-intel.
14:48 Lynne: Venemo: yeah, true, it does sound like a fw issue
14:49 robclark: sima: msm_submitqueue_create() returns an error if you don't have gpu yet.. didn't read whole scrollback but the deferred fw load has been letting us get by with suboptimal initrd's for a long time now
14:51 sima: robclark, but that's already after open has finished and not properly set things up?
14:51 sima: since it's in an ioctl
14:52 sima: robclark, I think cleanest would be to delay drm_dev_register (but maybe that upsets your userspace too much, which I think gregkh won't like)
14:52 sima: but latest open() I'd expect to fail, not some ioctl after that
14:52 robclark: open succeeds, just anything needing a gpu will fail.. it defn used to work and I don't think we've regressed it
14:52 robclark: open() fails, then plymouth fails
14:52 sima: there's a bunch of checks
14:52 sima: ugh
14:53 robclark: and defer loading means no display after lateinit kills "unused" clks/gdsc/etc
14:53 sima: robclark, oh so kms works, it's just gpu command submission that doesn't?
14:53 robclark: right
14:53 robclark: if you don't have combined gpu+kms you don't need to do this.. but if you do, you do..
14:53 sima: ok, then it makes some sense
14:53 sima: still freaks me out you have the context_init in open() which might or might not do much
14:53 robclark: I defn don't _like_ it.. but there isn't really a better option
14:54 sima: robclark, well my only bikeshed now is that I'd do stuff like priv->aspace setup only for the first context
14:54 sima: instead of maybe in open, maybe later
14:54 robclark: I'll double check context_init when I get to my desk, it wasn't there in the very early days, but has been around long enough that I don't _think_ it is broken
14:55 robclark: the submitqueue_create ioctl is optional for old userspace
14:55 sima: robclark, I think context_init is just a bunch of structs
14:55 robclark: we setup the fallback ctx in open()
14:55 robclark: brb
14:55 sima: but ctx->aspace = msm_gpu_create_private_address_space(priv->gpu, current); is the one that has !gpu checks in that function
14:56 sima: and I didn't immediately figure out how you handle the case where ctx->aspace ends up NULL after open()
14:56 sima: oh msm_submitqueue_init also has a !gpu check
14:57 sima: robclark, i915 context does something with proto ctx, where we do an rcu deref for the fastpath
14:57 sima: and if it's not there, then grab mutex + slowpath lazy ctx init
14:57 sima: would feel more comfy with that pattern for priv->ctx I guess, since it means same path no matter whether deferred fw load or not
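The fast-path/slow-path pattern sima describes can be sketched in general terms like this. It is only an illustration in Python, not i915 or msm code: `FilePriv`, `get_ctx` and `gpu_ready` are hypothetical names, and a plain attribute read stands in for the RCU dereference a kernel driver would use.

```python
import threading


class FilePriv:
    """Per-open state whose context is created lazily on first use."""

    def __init__(self, gpu_ready):
        self._gpu_ready = gpu_ready  # callable; stand-in for "firmware is loaded"
        self._ctx = None
        self._lock = threading.Lock()
        self.inits = 0               # counts slow-path initialisations

    def get_ctx(self):
        ctx = self._ctx              # fast path: no lock once the ctx exists
        if ctx is not None:
            return ctx
        with self._lock:             # slow path: serialise the lazy init
            if self._ctx is None:    # re-check, another thread may have won
                if not self._gpu_ready():
                    raise RuntimeError("GPU not available yet")
                self.inits += 1
                self._ctx = {"aspace": object()}
            return self._ctx
```

Every user goes through `get_ctx()`, so the same code path runs whether the GPU was ready at open time or only became ready later.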
14:58 sima: but the main one really was that I missed that it's not needed for kms
14:58 sima: so consider this a bikeshed (assuming you don't find a hole that has crept in over the years)
14:59 sima: javierm, ^^ so I guess img would be different since it means the entire driver doesn't really work without fw
15:00 javierm: sima: yeah, also powervr is DRIVER_RENDER only (in my platform tidss is the DRM/KMS driver that is DRIVER_MODESET)
15:01 javierm: sima: while msm is both DRIVER_MODESET and DRIVER_RENDER, that's why the probe deferral trick won't help there
15:01 sima: yeah, so also not much need for early boot output since tidss takes care of that
15:01 javierm: sima: exactly
15:01 javierm: it doesn't affect having display early on boot
15:01 sima: iirc i915-gem does something similar to msm, for the same reasons
15:13 thellstrom: agd5f: There's an amdgpu conflict rebuilding drm-tip between drm-next and drm-fixes. Which one is the correct resolution for drm-tip?
15:15 sima: thellstrom, might need to manually update drm-rerere and hard-reset anything else away
15:16 sima: airlied pushed a wrong resolution and I fixed it on monday
15:16 sima: if it's still the same conflict
15:16 sima: javierm, you replied to me only in private, was that intentional?
15:17 sima: thellstrom, although dim rebuild-tip should do that for you ...
15:17 javierm: sima: gah, it was not
15:17 javierm: sima: sorry, that's what I get for trying to answer emails while in a meeting
15:19 agd5f: thellstrom, if it's the mst_state->pbn_div line, the correct line should be mst_state->pbn_div.full = dfixed_const(dm_mst_get_pbn_divider(aconnector->mst_root->dc_link));
15:20 thellstrom: agd5f:, sima: Thanks, I'll try to fix it.
15:20 sima: thellstrom, https://lore.kernel.org/dri-devel/20240105095559.1136737-1-imre.deak@intel.com/ for a nice writeup of that one
15:22 javierm: sima: thanks for pointing that out and sorry for the additional spam :)
15:27 sima: javierm, you need the fw list at build time, not run time
15:28 sima: since without that you can't boot that kernel :-)
15:28 sima: the fw list is not static ...
15:30 sima: javierm, kinda like we have modules.builtin already, which is installed into /lib/modules/$kernel_version
15:30 sima: so probably needs a modules.builtin.firmware file
15:32 javierm: sima: ah, I thought you meant to keep in a list exposed by the kernel for user-space to consume but a modules.builtin.firmware makes more sense indeed
15:33 robclark: sima: for extra fun, the zap fw (on platforms that use that, which is basically everything but chromebooks) tends to be signed with device vendor key, so only thing that really knows what fw needs to be in initrd is the dtb.. haven't come up with a reasonable way to deal with that but we need something better than MODULE_FIRMWARE()
15:35 javierm: sima: let's see what gregkh says, I think that a request_firmware_defer() is reasonable or my patch as is. If not, I guess I will just let this thing go even when my mind refuses to do such a thing
15:35 sima: yeah request_firmware_defer() sounds reasonable too
15:35 sima: but your timeout patch clearly shows that this is all a massive minefield
15:36 javierm: yeah
15:36 sima: robclark, ... :-O
15:36 javierm: sima: the timeout for probe deferral is really something that should have been refused IMO
15:36 sima: yeah, and I think I voiced that back then in the thread :-/
15:37 sima: or at least while ranting here
15:37 javierm: it makes the whole "probe deferral is a simple, reliable and consistent" argument not true anymore
15:37 javierm: sima: yeah I think you made that clear as a response to my v1
15:38 javierm: *consitent mechanism
15:50 thellstrom: sima: Simplest solution seemed to be reverting your merge resolution and adding a new one. Not sure why it stopped working. But it seems rebuild-tip works now and the xe URL order has been changed.
15:51 sima: thellstrom, did the context move around?
15:51 sima: or do you have a different git version than me?
15:51 sima: 2.43 here
15:51 thellstrom: Same version here. The resolution worked this morning when I added the xe repo...
15:52 sima: hm I'm confused, but yeah sometimes rerere falls over in peculiar ways
15:52 sima: thellstrom, thanks for sorting it out
15:53 thellstrom: np. Let me know if I somehow broke something.
18:08 demarchi: sima: need some help with gitlab roles. It seems I can't change the role for some members even if I was given owner role in the kernel repo. Do you know why? Most of them seem to be when they were added by another owner, but there are some odd ones
18:09 demarchi: (this is to apply the committers rule to drm-xe as agreed upon @ https://gitlab.freedesktop.org/drm/maintainer-tools/-/merge_requests/24)
18:16 daniels: demarchi: give me a user and repo as an example
18:38 demarchi: daniels: https://gitlab.freedesktop.org/drm/xe/kernel/-/project_members?sort=name_asc&page=3
18:38 demarchi: daniels: user Luís Felipe Strano Moraes @lfelipe
18:39 alyssa: robclark: my MR couldn't possibly affect these tests.. any idea what happened here? https://gitlab.freedesktop.org/mesa/mesa/-/jobs/53533103
18:40 alyssa: (restarted the job so hopefully the pipeline merges. won't show up in the accounting for failed merges, but..)
18:41 daniels: retries do get tracked
18:42 alyssa: good to know :+1:
18:44 daniels: demarchi: ok, that’s because the access is inherited, e.g. lfelipe has maintainer on drm/xe, so then gets the same role on drm/xe/kernel
18:44 daniels: if you change it there then it’ll get reflected; if you remove it then you can re-add at a different level
18:44 demarchi: daniels: ok, thanks. Let me try
18:47 robclark: daniels, alyssa: looks like OOM killer wrecked things, and then the test was retried but didn't complete before timeout?
18:48 alyssa: robclark: Maybe? looked like deqp-runner-level fails
18:48 alyssa: looks like a big chunk of KHR-GL46.* is flaky on a630, maybe?
18:49 jenatali: I saw a hang recovery in the logs
18:49 robclark: it looks like things aren't failing until oom killer killed deqp
18:51 demarchi: daniels: that worked, thanks
18:51 alyssa: robclark: ah, fair
18:53 robclark: looks like there is one of the deqp processes that uses a lot of memory.. but maybe this could be flakey if there is some variation in which tests get grouped together by deqp-runner or something like that?
19:14 robclark: hmm, mesa_logw_once() is still pretty chatty with piglit
19:25 alyssa: robclark: yeah, very possible
19:25 alyssa: not sure if that's possible to solve in general :/
19:27 robclark: on the fence about just removing the mesa_logw_once() vs backporting the missing kernel bit (which would anyways fix some other memobj tests)
19:32 daniels: demarchi: np :)
21:09 tango_: pepp: oh that sounds interesting, might be relevant to https://gitlab.freedesktop.org/drm/amd/-/issues/2960 too
21:10 tango_: pepp: I won't be able to test it before next month though (too busy with this machine)