IRC Logs of #dri-devel on irc.freenode.net for 2023-11-10

00:05 HdkR: kiroma: It covers a lot of things, so it needs to be large. But running subtests can help reduce the workload when you're poking around
01:20 kiroma: All sanity tests were skipped bar one which failed because `Test requires OpenGL compat Shader Language version 1.1, but only 0.0 is available`
01:20 kiroma: I... haven't even run devenv yet? I'm not sure why this is failing.
01:39 kiroma: Oh wow I was missing waffle-utils, and somehow python was succeeding in executing something that didn't exist
08:41 jfalempe: tzimmermann, there was a mistake in my latency patch for mgag200, I will send a v2 soon. (the vmap parameter used to be a void *, and is now an iosys_map).
08:41 jfalempe: for the conditional, should I keep it under CONFIG_PREEMPT_RT, add a module parameter, or just do it unconditionally, since the performance impact is minor (less than 1% in my testing)?
08:44 MrCooper: zamundaaa[m]: FYI, the over-synchronization issue we talked about in A Coruña turned out to be real: https://gitlab.gnome.org/GNOME/mutter/-/issues/3148
09:26 tzimmermann: jfalempe, sorry. i've not been around much in the last weeks
09:26 tzimmermann: i've just read your reply to my review of that patch.
09:29 jfalempe: tzimmermann, ok np. There was not much reaction from rt-kernel either.
09:29 jani: hey all, I'm trying to move some intel docs from drm/intel wiki to a sphinx project. any objections to having the repo under drm/intel-docs in gitlab?
09:31 jfalempe: tzimmermann, but we have some users that are stuck on older kernels because of this, and I have to do something about it.
09:31 jani: the draft is currently in a personal repo at https://gitlab.freedesktop.org/jani/intel-docs-rfc and the docs deployed at https://jani.pages.freedesktop.org/intel-docs-rfc/
09:31 jani: airlied: sima: ^
09:32 sima: jani, want me to create something and make you maintainer? or all the usual intel suspects?
09:32 sima: we unfortunately can't import access lists from another repo in gitlab, only from another group :-/
09:32 sima: jani, plan B would be to create the intel group, put the docs in there and use that intel group to mass-add people everywhere ...
09:34 jani: sima: idk, might want to go with separate permissions for docs anyway
09:34 sima: jani, aye
09:34 sima: so want me to create drm/intel-docs and hand it to you?
09:34 tzimmermann: jfalempe, but i think you slightly missed the point of my reply. it's not a question whether the matrox is slow. that affected server is IMHO mis-designed for an RT system. even on cold caches, the system should work. one usually gets the jitter out by assuming worst-case execution times, or doing tricks like cache coloring. apparently neither is the case here. so if we paper over the matrox issue today, that problem
09:34 tzimmermann: later comes back within the file system or the memory allocator, or whatever else requires pages. i'm not entirely opposed to the patch, but i'd really want to hear some experienced RT dev's opinion on that problem first.
09:35 jani: sima: yeah, I'd like that if it's okay with everyone *waves hand across the channel*
09:35 sima: jani, intel-doc or intel-docs?
09:35 tzimmermann: and how do they deal with system management mode?
09:35 sima: or something else
09:35 jani: sima: docs I guess?
09:37 sima: jani, https://gitlab.freedesktop.org/drm/intel-docs enjoy!
09:37 jfalempe: tzimmermann, those servers have been running for years with RT loads without issue. Only upgrading the Matrox driver causes them problems.
09:37 sima: also looks like you can be owner of a repo now too, not just a group?
09:37 sima: that feels new ...
09:37 jani: *bows*
09:37 jani: thanks!
09:37 jfalempe: tzimmermann, for the system management mode, I'm not sure what it is, and how it can impact RT tasks.
09:38 sima: jani, ah looks like the various delete permissions (for issues, mr and project itself) were put into the owner role and hence one can also be owner for projects now directly
09:38 sima: I guess that's useful for code of conduct enforcement ...
09:39 kode54: please put back the space bar heating
09:39 tzimmermann: jfalempe, every few milliseconds, or so, your x86 cpu will put aside all work and run some internal tasks for house keeping. that's system management mode. the os cannot predict or avoid that. it makes RT on x86 sort of complicated.
09:39 tzimmermann: https://en.wikipedia.org/wiki/System_Management_Mode
09:41 sima: jani, dolphin, rodrigovivi, tursulin upgraded you all to owner for drm/intel so you can delete stuff
09:41 jfalempe: tzimmermann, I think they use isolated CPUs, so they have no other IRQ/task running on those CPU cores. Maybe they can make sure the SMM will not run on those cores too?
09:41 sima: agd5f, hwentlan__ upgraded you + christian for drm/amd to owner so you can delete stuff (like issues violating CoC)
09:42 tzimmermann: jfalempe, it's an NMI. the point of SMM is that you cannot avoid it from the OS. maybe the architecture devs know more details
09:43 sima: ivyl, Adrinael done the same for igt for all maintainers (no idea about the irc nicks of all the newer people)
09:44 sima: robclark, seanpaul_ abhinav__ you're also upgraded to owner for drm/msm so you can delete issues/mr if necessary
09:44 javierm: tzimmermann: I don't know how they deal with SMM but until recently, RT completely disabled EFI runtime services, see d9f283ae71af ("efi: Disable runtime services on RT")
09:44 javierm: and a031651ff214 ("efi: Allow to enable EFI runtime services by default on RT")
09:45 sima: karolherbst, same for drm/nouveau (do we need that or can we just make drm/misc happen ...)
09:45 javierm: tzimmermann: since jfalempe's patch is restricted to RT, I don't see why it couldn't be merged if it fixes a real issue on that platform
09:45 jfalempe: tzimmermann, I will ask, and see how they managed that.
09:45 tzimmermann: on the argument of "it has worked for years": a correct RT system guarantees to make its deadlines for the RT tasks; no matter what the best-effort tasks do. it's not a kernel issue. it's an issue of system design.
09:46 sima: jfalempe, the real fix is the printk locking rework rt people are working on
09:46 sima: John Ogness is the main contact, #linux-rt here for status
09:47 tzimmermann: sima, thank you
09:47 sima: atm printk is a gigantic lock and it's absolute suck
09:47 sima: unless it's not printk being slow, but the hw being really funny
09:47 tzimmermann: right, that's what i heard
09:48 karolherbst: sima: good question...
09:48 jfalempe: tzimmermann, yes I think it's an issue with the way the Matrox is connected to the system. But I don't have a clear answer. it's the only component able to break the RT tasks it seems.
09:48 karolherbst: I think if drm/misc accepts MRs we'd be fine moving to drm/misc entirely
09:49 sima: jfalempe, could you try without that patch, fbcon fully disabled but fbdev enabled and use some userspace fbdev program to write into the framebuffer?
09:49 jfalempe: tzimmermann, but even if it's a hardware bug that needs a software workaround, I think our drivers are full of those.
09:49 sima: that should side-step any fbcon locking issues and allow us to purely observe anything funny by the hw
09:49 sima: fbtest or something should be able to write stuff to the fbdev mmap
09:50 tzimmermann: jfalempe, your patch does not touch the hardware. AFAICT you're flushing the pages of the GEM buffer in system memory
09:50 javierm: jfalempe, sima, tzimmermann: maybe a dmi match table to do the flush?
09:50 javierm: it's a workaround yes, but at least will be constrained to a single platform
09:51 sima: javierm, imo first make sure we don't paper over a fundamental sw issue somewhere else with this
09:51 javierm: sima: fair
09:51 jfalempe: tzimmermann, yes, that's the weird thing about it.
09:51 sima: because this looks extremely funny at best :-/
09:52 sima: jfalempe, does a program which has a multiple of the cpu cache size allocated and just thrashes that in a loop also break things?
09:52 tzimmermann: sima, that's exactly my point. it doesn't look like a kernel bug. it looks like a system design issue that just manifests in the mgag200 driver
09:52 sima: at appropriately low priority ofc
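A minimal sketch of the cache-thrashing experiment sima describes above, written in Python for brevity. A real test would more likely be a small C program, since the interpreter overhead here dilutes the memory traffic. The 64-byte cache line, the 32 MiB last-level cache size and the name thrash_cache are assumptions for illustration, not values from the discussion.

```python
CACHE_LINE = 64              # assumed cache line size in bytes
LLC_SIZE = 32 * 1024 ** 2    # assumed last-level cache size; adjust per machine


def thrash_cache(buf_size, passes=1, line=CACHE_LINE):
    """Touch one byte per cache line across a buffer, `passes` times.

    With buf_size set to a multiple of the cache size, each pass evicts
    most of what other tasks had cached. Returns the number of writes.
    """
    buf = bytearray(buf_size)
    touched = 0
    for _ in range(passes):
        for off in range(0, buf_size, line):
            buf[off] = (buf[off] + 1) & 0xFF  # read-modify-write dirties the line
            touched += 1
    return touched
```

Calling thrash_cache(4 * LLC_SIZE, passes=1000) from a process started with `nice -n 19` on a non-isolated core would approximate "at appropriately low priority"; whether the RT task then misses deadlines is the thing to observe.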
09:52 jfalempe: sima, flooding fbcon with the patch, is working great.
09:52 sima: tzimmermann, yeah smells a bit like cpu cache thrashing
09:53 jfalempe: like doing cat /dev/urandom | base64 in the fbcon terminal.
09:53 tzimmermann: sima, this. i think the system draws to that buffer and flushing the cache simply cleans up the dirty cachelines
09:53 sima: and as long as the damage helper gets run often enough so that only ever a small part of the shadow fb is loaded into cpu cache the hack works
09:53 sima: but if you do a full screen clear then the cpu cache is busted again and flushing it all out won't help
09:53 tzimmermann: and as GEM BOs are large, it's easy to thrash the whole cache
09:53 sima: yeah
09:54 sima: if that's the case then disabling fbcon would be the fix
09:54 sima: because that's the only way to make sure fbcon doesn't thrash the cpu cache at a bad point
09:54 jfalempe: sima, flooding fbcon does a full redraw, and a full framebuffer flush.
09:54 sima: jfalempe, yeah but does that break your w/a?
09:54 tzimmermann: hence my comment that a well-designed RT system should not behave like that
09:54 jfalempe: sima, no it's working in this case.
09:55 sima: huh
09:55 sima: this just went straight to wtf territory ...
09:56 tzimmermann: interesting
09:56 sima: I think even more reasons to test with userspace mmap and see whether that makes any difference
09:56 sima: and also whether just thrashing cpu caches in general is the issue or not
09:57 tzimmermann: jfalempe, can you test full-screen updates with a random-access pattern?
09:57 jfalempe: sima, ok, I will ask for more tests, since I don't have access to these servers.
09:58 tzimmermann: such as filling pixels in random locations on the screen
09:58 tzimmermann: jfalempe, doing linear access might trigger HW-internal optimizations
09:58 jfalempe: what they have done is filling the terminal with " cat /dev/urandom | base64", that should be close to random pixels ?
09:59 tzimmermann: jfalempe, no you're still writing the framebuffer memory from top to bottom
09:59 jfalempe: hum, ok you want random damage in the framebuffer ?
09:59 tzimmermann: what i mean is to really access random pixels
09:59 tzimmermann: or at least random characters
10:00 sima: jfalempe, oh was that just on the console? I thought the issue was printk
10:00 tzimmermann: yes, to avoid linear access.
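A rough sketch of the random-access pattern tzimmermann is asking for, as opposed to the top-to-bottom writes produced by `cat /dev/urandom | base64`. It writes single pixels at random positions through any seekable binary file object, which could be /dev/fb0. The function name and all geometry values are hypothetical; real values would have to come from the device, for example from /sys/class/graphics/fb0/virtual_size, bits_per_pixel and stride.

```python
import random


def write_random_pixels(fb, width, height, bytes_per_pixel, stride, count, seed=None):
    """Write `count` random-coloured pixels at random (x, y) positions.

    `fb` is a seekable binary file object (e.g. an opened /dev/fb0).
    `stride` is the length of one scanline in bytes.
    Returns the byte offsets written, in order.
    """
    rng = random.Random(seed)
    offsets = []
    for _ in range(count):
        x = rng.randrange(width)
        y = rng.randrange(height)
        offset = y * stride + x * bytes_per_pixel
        fb.seek(offset)
        fb.write(bytes(rng.randrange(256) for _ in range(bytes_per_pixel)))
        offsets.append(offset)
    return offsets


# Hypothetical usage on a 1024x768, 32 bpp framebuffer:
#   with open("/dev/fb0", "r+b", buffering=0) as fb:
#       write_random_pixels(fb, 1024, 768, 4, 4096, 10000)
```

Restricting x and y to a small rectangle would give something closer to the blinking-cursor access pattern mentioned later in the discussion.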
10:00 jfalempe: sima, yes using the console also leads to this problem.
10:00 sima: hm ...
10:01 sima: otoh you can still run into console_lock contention
10:01 jfalempe: even the blinking cursor is enough to make the RT tasks fail (even if that's a very small amount of pixels).
10:01 sima: jfalempe, direct fbdev mmap with fbtest or similar would still be interesting, since that bypasses console_lock
10:01 sima: jfalempe, huh
10:03 sima: jfalempe, might be good to jump over to #linux-rt and ask for debug ideas/tools there too
10:03 sima: maybe after we've figured out whether it's related to console_lock in any way or not
10:03 sima: since a few cachelines for redrawing the cursor really shouldn't make anything else miss a deadline
10:03 sima: unless the deadline is way too close already
10:04 jfalempe: the thing is the RT task is running on a dedicated CPU core, there is almost no linux kernel code running on it. But for some reason other CPUs are affected by the framebuffer draw.
10:04 tzimmermann: jfalempe, as you've noted yesterday. we're doing quite a bit of vmap/vunmap in the kernel address space. IDK maybe that has an impact on RT as well
10:04 sima: tzimmermann, yeah but it doesn't seem to be the lack of vmap/vunmap, just the flushing that makes a difference ...
10:05 jfalempe: tzimmermann, surprisingly that was beneficial for the RT tasks,
10:05 tzimmermann: jfalempe, but it's not the RT task that does the print, right?
10:05 sima: jfalempe, small userspace tool which simulates the cursor drawing access pattern would also be interesting ...
10:05 jfalempe: tzimmermann, yes it's on another core, the print can't run on the RT core.
10:05 sima: yeah if the rt task does any printk it's game over with the current console locking
10:07 jfalempe: I don't think it's an issue with locking, it's mostly the cache or external bus, that can affect other cores like this.
10:07 tzimmermann: jfalempe, is that a NUMA system? do some of the CPU cores share some of the memory bus or L2/L3 caches?
10:07 tzimmermann: that could be a cause for interference
10:08 sima: jfalempe, the cursor is 2 cachelines redrawn once per second ...
10:08 sima: or maybe 4 if it's crossing over
10:08 sima: so 256 bytes
10:09 sima: if that's enough, then something very funny is going on ...
10:09 sima: that's like a few % at most of a modern cpu's L1
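sima's back-of-the-envelope estimate can be checked with a few lines of arithmetic. The 64-byte cache line matches the "4 cachelines, so 256 bytes" figure above; the 32 KiB L1 data cache is an assumed, typical size and not something stated in the discussion.

```python
CACHE_LINE_BYTES = 64      # implied by "4 cachelines ... 256 bytes"
L1D_BYTES = 32 * 1024      # assumed L1 data cache size


def cursor_footprint(lines_touched, line_bytes=CACHE_LINE_BYTES, l1_bytes=L1D_BYTES):
    """Return (bytes touched, share of the L1 data cache in percent)."""
    touched = lines_touched * line_bytes
    return touched, 100.0 * touched / l1_bytes


print(cursor_footprint(2))  # (128, 0.390625)
print(cursor_footprint(4))  # (256, 0.78125)
```

Under these assumptions even the worst case stays below 1% of L1, which is why a once-per-second cursor redraw breaking RT deadlines looks so implausible as a pure cache-capacity effect.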
10:12 sima: jfalempe, btw if you don't have fbtest or similar handy on that server then just writing directly into the fbdev /dev node should work too
10:12 jfalempe: tzimmermann, I didn't find which server they are using in the logs. I think it's some standard one.
10:12 sima: don't need to mmap
10:13 jfalempe: sima, I think at some point I asked them to write directly to /dev/fb0
10:13 sima: so 1. completely disable fbcon in .config and 2. write stuff to /dev/fb0 to simulate what fbcon would do
10:14 sima: jfalempe, yeah but need to make sure fbcon is completely out of the picture, otherwise it's not very interesting experiment
10:14 jfalempe: sima, ok, I'm trying to summarize that, and that will take a few days before having the answer.
10:15 jfalempe: Thanks tzimmermann and sima, I hope this will shed some light on this issue.
10:29 tzimmermann: jfalempe, by NUMA, i mean the memory topology: https://en.wikipedia.org/wiki/Non-uniform_memory_access
10:32 jfalempe: tzimmermann, using isolated CPU seems to require NUMA https://access.redhat.com/documentation/fr-fr/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/isolating_cpus_using_tuned-profiles-realtime
10:34 javierm: jfalempe: git://git.kernel.org/pub/scm/linux/kernel/git/geert/fbtest.git has tests for random access AFAIK
10:36 jfalempe: javierm, thanks, I will see if we can make use of it.
11:08 zamundaaa[m]: MrCooper: interesting. Is there anything that can be done to avoid that with KMS though, short of fixing the kernel? It's not like we can disable implicit sync for atomic commits, right?
11:14 emersion: does IN_FENCE_FD disable implicit sync?
11:14 emersion: GL has exts to disable implicit sync but not supported by Mesa sadly
11:31 jani: sima: I'm positively surprised that having set up the gitlab CI in my personal repo, just pushing it to the new one made it all work
11:31 jani: sima: even though most of it was just cargo culted :D
11:39 MrCooper: zamundaaa[m]: yes, IN_FENCE_FD
11:40 MrCooper: the issue reporter successfully tested a proof-of-concept patch for that
11:41 emersion: nice
11:41 zamundaaa[m]: Cool. I'll hook that up in KWin too then
11:46 swick[m]: emersion: btw, thanks for pushing dma-buf heaps. I really think they are how we can solve the generic allocation stuff in the long term...
12:00 javierm: emersion: thanks a lot for your r-b. I wasn't sure if I got all the terminology correct :)
12:01 javierm: emersion, swick[m]: did you see https://lore.kernel.org/lkml/20230911023038.30649-4-yong.wu@mediatek.com/ ? Mediatek is also interested in exporting some of the dma-buf heap symbols
12:28 sima: jani, sometimes technology is amazing :-)
12:55 jani: sima: yeah, emphasis on sometimes :)
12:55 sima: very
12:56 javierm: sima, jani: non-tech savvy folks get surprised when I tell them that I'm surprised when things do work :)
12:57 sima: yeah if you know how the sausage is made ...
12:57 javierm: sima: exactly
13:02 jani: :)
13:11 emersion: swick[m]: glad to hear that!
13:12 emersion: javierm: yup, was linked from the thread
13:13 javierm: emersion: ah sorry, missed that
13:13 emersion: they added the stuff to remove a heap as well
13:14 javierm: emersion: yeah, noticed. I also wondered if you would need dma_heap_get_name() if you need more than one heap, and so can't hardcode to "vc4"
13:14 emersion: we don't need to get back the name from an existing heap
13:15 emersion: not for my purposes at least
13:15 javierm: emersion: I see
13:21 sima: mlankhorst, drm-misc-fixes stuck at 6.6-rc1 is a bit bad ...
13:21 sima: would be good to at least occasionally roll the thing forward
13:23 javierm: emersion: and I agree with swick[m], also believe that your approach is the correct way to handle this
13:23 emersion: ♥
13:59 agd5f: anyone here going to LPC next week?
14:45 MrCooper: sima: mutter will need some kind of feedback from the kernel to decide by when it needs to call the atomic commit ioctl, see e.g. https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3373#note_1913780
14:46 enunes: emersion: one thought I have in mind with the dma_heap solution is, we might still need to carry the current solution/workaround in mesa for a while even if we land that right? otherwise the driver will stop working in kernel versions before the one which has that
14:46 emersion: yes, we will
14:51 enunes: too bad we still won't be able to get rid of it
15:02 jfalempe: tzimmermann, here is the documentation on how to measure and mitigate the latency introduced by SMM: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/9/html-single/optimizing_rhel_9_for_real_time_for_low_latency_operation/index#assembly_running-and-interpreting-hardware-and-firmware-latency-tests_optimizing-RHEL9-for-real-time-for-low-latency-operation
15:04 jfalempe: tzimmermann, and for the NUMA question, they have only 1 NUMA node on the server, so no NUMA involved in this case.
15:26 tzimmermann: ok
20:29 Hazematman: Could someone with the right permissions add `llvmpipe` & `lavapipe` labels to my MR? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26153