IRC Logs of #dri-devel on irc.freenode.net for 2024-03-01

03:48 Lynne: what are the build instructions for nvk these days? only used in in pre-rust days
03:49 Lynne: meson errors out because syn is not installed, but it's not a binary package, so it cannot be installed manually
04:07 psykose: there's a .wrap for it and a meson.build in subprojects/packagefiles/syn/meson.build
04:07 psykose: same for the other three deps
04:08 psykose: without meson fetching it you have to fetch them yourself into subprojects/
04:08 psykose: syn, quote, proc-macro2, unicode-ident
04:08 psykose: see the .wrap files for the version/url
04:10 psykose: if you don't have something like --wrap-mode nofallback/nodownload then meson does it automatically, with nodownload you have to download it first, with nofallback i have no idea how you'd make it work (never tried)
04:11 Lynne: subprojects where? it's not in mesa, and I don't see it in meson's wrap list
04:12 psykose: subprojects/ the folder, toplevel
04:12 Lynne: there isn't any such in mesa
04:12 psykose: which has the .wrap files in it
04:12 psykose: hmm
04:12 psykose: but it is https://gitlab.freedesktop.org/mesa/mesa/-/tree/main/subprojects?ref_type=heads
04:13 Lynne: deleted it while testing
04:15 airlied: yeah it should just happen at meson time unless you turn if otff
04:22 Lynne: have to say, build systems are generally the worst pieces of software ever written
04:22 psykose: meson is pretty good
04:22 Lynne: I'm sure there were a lot of bad options, and the least bad was to have some unholy amalgamation of meson and cargo
04:23 psykose: you don't need cargo for this pretty sure
04:31 tjaalton: gfxstrand-web: yes?
05:57 airlied: mripard: CC [M] drivers/gpu/drm/msm/msm_debugfs.o
05:57 airlied: CC [M] drivers/gpu/drm/msm/dp/dp_debug.o
05:57 airlied: /home/airlied/devel/kernel/dim/src/drivers/gpu/drm/sun4i/sun4i_hdmi_enc.c: In function ‘sun4i_hdmi_connector_atomic_check’:
05:57 airlied: /home/airlied/devel/kernel/dim/src/drivers/gpu/drm/sun4i/sun4i_hdmi_enc.c:191:17: error: implicit declaration of function ‘drm_atomic_get_new_connector_state’; did you mean ‘drm_atomic_helper_connector_reset’? [-Werror=implicit-function-declaration]
05:57 airlied: 191 | drm_atomic_get_new_connector_state(state, connector);
05:57 airlied: | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
05:57 airlied: | drm_atomic_helper_connector_reset
05:57 airlied: /home/airlied/devel/kernel/dim/src/drivers/gpu/drm/sun4i/sun4i_hdmi_enc.c:191:17: warning: initialization of ‘struct drm_connector_state *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
05:57 airlied: cc1: some warnings being treated as errors
05:57 airlied: seeing that after merging drm-next MR
05:58 airlied: I'm guessing a missing include
05:58 gfxstrand: tjaalton: Just wanted to make sure you saw we dropped the -experimental from NVK so you should plan to turn it on as soon as you pick up Mesa 24.1.
06:38 tjaalton: gfxstrand: yep, noted
07:34 dolphin: airlied, sima: duh, I was living wrong day of the week yesterday, will send the drm-intel-fixes PR
07:43 mripard: airlied: which config are you using? I don't see it with drm-misc-arm
07:48 airlied: mripard: seeing on my x86 builds
07:48 airlied: https://paste.centos.org/view/2d426c73 that config
07:51 mripard: thanks
08:05 mripard: airlied: you have a mail, I assume you will merge it in drm/next directly?
09:07 airlied: mripard: yes I'll grab it
09:11 jfalempe: sima: regarding https://paste.debian.net/hidden/9be7656c/, would it be better to have the lock in the plane struct instead ? so if your card have multiple output, they will have separate locks, and won't slows down each other ?
09:12 sima: jfalempe, yeah that would be an option and might indeed be cleaner
09:12 jfalempe: Also I find it simpler to have the get_scanout_buffer() function in struct drm_plane_funcs instead of the device modeconfig.
09:12 jfalempe: that allows to handle multiple output cleanly.
09:12 sima: since iterating over planes or whatever we need to look at in struct drm_device is safe if we register/unregister the panic notifier in drm_dev_register/unregister
09:13 sima: so you don't need the spinlock for those parts
09:13 sima: jfalempe, yeah agreed the hooks are best in the plane functions
09:13 sima: jfalempe, I'm typing some more words in the commit message for john ogness and then I'll send it out as an rfc, does that sound good?
09:14 jfalempe: sima: yes that's sound good.
09:15 jfalempe: also for the debugfs test, since using the notifier is a bit clumsy, another way to do it would be to loop through all drm devices, and all planes with a get_scanout_buffer() function ?
09:20 sima: jfalempe, my personal feel is that the notifier feels better, but we should move this testing infrastructure into common panic.c code
09:21 sima: maybe behind a Kconfig
09:21 sima: I've also been annoying john ogness whether he can create something like that for the panic flow
09:21 sima: since with his new console_lock replacement you can limit to only the safe panic lock takeovers, and so we could safely run the entire panic code as part of ci
09:22 sima: jfalempe, so maybe split out that patch standalone and submit it as an rfc for how to best do that?
09:22 sima: since it's not really drm specific at all right now and could be put into kernel/panic.c
09:22 sima: and if that holds it up for too long I think we could create a per drm_device trigger in debugfs, which would be entirely drm specific
09:23 sima: as a stop-gap solution
09:23 jfalempe: sima: yes, I can split it from the rest, but also I'm not sure what the other panic notifier are doing, so it may not make sense for them to be called from debugfs.
09:23 sima: hm yeah ...
09:23 sima: otoh testing is good, and there's a real effort to make the panic code not so fragile
09:24 sima: so my suggestion would be to split the panic notifier testing patch out as standalone and move the code to kernel/panic.c with a Kconfig or so
09:24 jfalempe: but, if there is a solution to test the panic in ci, that would be very good.
09:24 sima: and have the discussion with experts how we best simulate panics for testing
09:24 sima: and in parallel we do a debugfs on drm_device in the drm debugfs directories?
09:25 sima: and that drm trigger just walks over all planes and does the panic handling on each of them
09:25 jfalempe: sima: so we can trigger the panic code for each device independantly ?
09:25 sima: jfalempe, yeah
09:25 sima: jfalempe, the rfc with core folks would also be to have the discussion what we need to wrap that call with to simulate panic contexts the best
09:25 jfalempe: yes, that sounds good
09:26 sima: like we definitely want to disable hardirq handling
09:26 sima: ideally even get into nmi context, since that's the worst panic context
09:26 sima: to make sure our panic code _really_ works in the worst case panic situation
09:26 sima: if we just run it from the debugfs write function in full process context there's a lot of issues we won't catch
09:26 sima: like sleep, or taking locks accidentally and all that
09:27 sima: we want to make sure any sleep or even mutex_trylock blows up in test when kernel debugging is enabled
09:27 jfalempe: sima: yes that would be the best way to test it reliably
09:27 sima: so even if the core panic debugfs doesn't go anywhere, we need the rfc to have that discussion
09:28 sima: jfalempe, ^^ can you include that open question in the patch commit message to get this started?
09:28 jfalempe: sima: yes let me start a thread about this.
09:28 sima: well two opens: is a core panic test infra a good idea? and what is the best way to simulate panic context (ideally nmi) without actually panicking the system, so that it can be used in ci?
09:28 sima: jfalempe, thanks a lot!
09:30 vsyrjala: is there some idea how to make the panic stuff work if the hardware is in the middle of a commit when the panic occurs?
09:31 sima: vsyrjala, probably not
09:31 sima: but I think it can be made to work with my rfc patch
09:32 sima: if the driver opts to protect the mmio writes to the scanout registers with the raw spinlock
09:32 vsyrjala: at least on i915 the mmio writes will not even be done by the cpu in the future
09:32 sima: and then reads back the actual register state to figure out where the current fb is that the hw actually scans out
09:33 sima: ofc, if you can't trylock the spinlock then you're screwed and it's best to not do anything, since the hw is in a ill-defined state
09:33 sima: and the commit work might be running in parallel, wreaking havoc
09:33 sima: vsyrjala, that case is easier I think, since the fw/gpu won't die in panic()
09:33 sima: so you can limit yourself to looking at sw state (with a minimal race window protected by the raw spinlock)
09:34 sima: safe in the knowledge that maybe the display doesn't show the new buffer yet, but once the fw has done it's job, it will
09:34 sima: even when the kernel is long dead at that point
09:34 sima: but yeah fundamentally there's a race, and my rfc has a fairly big window, but you can make it much smaller with driver code
09:35 sima: but it's never going to be zero
09:35 jfalempe: yes the panic handling is a best effort approach, we can't guarantee the panic screen will be displayed 100% of the time.
09:36 airlied: as long as some shitty mga or ast driver doesn't stop my serial port from getting it :-P
09:36 vsyrjala: i think the only safe way would be to have the panic handler wait for the hardware to finish its commit. othereise you could get all kinds of funny mmio faults and whatnot when the two register updates fight each other
09:36 vsyrjala: *iommu faults
09:38 sima: airlied, yeah that's really the primary goal, and why I think the standard design should lean _extremely_ heavily towards safety
09:38 sima: vsyrjala, pls no, hw can hang
09:38 sima: no, absolutely no waiting or spinning in panic context
09:39 sima: because on any reasonable system there's a bunch of other ways to dump out panics, so really good chances you just make it much, much worse
09:39 sima: for real console there's two steps actually for this reason: 1. only do the absolute safe stuff, over all panic outputs
09:39 sima: 2. try harder and pray
09:40 sima: unfortunately panic notifiers aren't that great yet, but could be added easily
09:40 vsyrjala: i guess we just don't use this then
09:40 sima: but 2 must be done only after all of 1 has finished
09:41 sima: vsyrjala, seems a bit drastic take when I just typed out what it'd take to make it happen like you want ...
09:42 vsyrjala: don't see how to make it work when the hardware is busy writing registers in parallel. we'd potentially just create more explosions. we could do it when we know the thing is idle though
09:42 sima: vsyrjala, panic code by default doesn't touch any display state at all
09:43 sima: we just overwrite whatever is currently being scanned out
09:43 sima: exactly because touching display state is pretty much impossible
09:43 sima: so the new panic code has code to write into yuv and could also write into tiled buffers
09:43 sima: so that you don't have to touch any fifo state or anything really tricky like that
09:44 sima: and the only hard part is making sure you pick the right buffer
09:44 vsyrjala: and actually having cpu acccess to said buffer
09:44 sima: amd has peek/poke registers
09:44 sima: shit hw is shit hw, can't help that
09:45 sima: ofc writing a few mb with peek/poke is going to be extremely slow, but that doesn't matter
09:45 sima: if you have a gart, I guess you could reserve one pte
09:46 sima: and probably need to protect the tlb flush with the panic spinlock to avoid lolz
09:46 vsyrjala: hmm. yeah, i suppose that could work
09:46 sima: if you have nothing, well it just sucks then
09:47 vsyrjala: going to be a slight pita to write all the manual tiling stuff though
09:47 sima: yeah
09:47 sima: and ccs clearing
09:47 sima: but we've tried the other approach of trying to reprogram hw state to be easier, and that defo doesn't work well enough beyond tech demo
09:48 sima: the entire thing being real pita for a ccs tiled buffer is also why I really want the debugfs interface
09:48 vsyrjala: yeah. i thought it was still just some kind of 'let's just update just the scanout address' approach
09:48 sima: vsyrjala, you could do a bit a mix, like clear the tiling bits
09:49 sima: I think amdgpu folks want to do that
09:49 sima: but no-no when the gpu fw pushes out the flips ofc :-(
09:49 vsyrjala: yeah
09:49 sima: but anything more my gut feeling is that it's just too easy to kill the hw because you programmed terrible watermarks
09:51 sima: vsyrjala, also like I said, if we improve the panic notifiers to have the same feature set as john ogness is adding for full blown consoles
09:52 sima: then you get a lot of nifty tools to take over from the driver and a 2nd attempt where you can go risky
09:52 vsyrjala: psr/fbc/etc. might also be a pain. but i think we should have sufficient ways to kick those somewhat safely
09:52 sima: ofc the complexity should still be as close to taking over an uart, because this code runs in the absolute worst context
09:52 sima: yeah
09:53 sima: vsyrjala, the biggest with all of these is that beyond the panic raw spinlock that common code will trylock for you
09:53 sima: you cannot take any locks
09:53 sima: even spin_trylock is no-go because of -rt and nmi context
09:53 sima: jfalempe, btw just realized that per-plane spinlock might not be a good idea
09:54 sima: for hw with global resources like the peek/poke register, where the spinlock needs to be for the entire device
09:54 sima: so I'm leaning towards spinlock per drm_device again more
09:55 jfalempe: sima: the lock is to protect access to the state framebuffer, device should have its own lock for its resources ?
09:56 sima: jfalempe, they can't
09:56 sima: panic you get one, and only one raw spinlock, that you trylock
09:57 sima: otherwise we deviate too much from the new console_lock design, and I think we don't want that because there's a bunch of good reasons to make panic notifiers more like panic-only consoles
09:57 sima: see the entire discussion above, plus what I've just added to my rfc
09:58 jfalempe: but that means the driver will need to take the panic_lock each time it programs the hw, wouldn't that be too much lock contention ?
09:59 sima: jfalempe, the example I have is for protecting the peek/poke registers that e.g. amd has
09:59 sima: which is strictly for debugging only, and an _extremely_ slow way to access vram
09:59 sima: so adding a raw spinlock doesn't matter
10:00 sima: jfalempe, anther example would be protecting the go bit, or the scanout address register
10:00 sima: which should just be one mmio write per display flip
10:00 jfalempe: if it's used only by the panic code, there shouldn't be a race condition for peek/poke.
10:00 sima: so again entirely ok, the mmio will be much slower than the raw spinlock/unlock anyway
10:00 sima: jfalempe, it's for debug in general
10:01 sima: iirc they expose it through debugfs too as a debug tool
10:01 sima: it's good to figure out issues when you don't trust your gpu pagetables
10:06 jfalempe: sima: if the panic occurs when you are already messing up with gpu vram from userspace, that's kind of a corner case, I'm not sure we can support anyway.
10:08 sima: jfalempe, sure, but the drm_panic_lock around that will make sure it won't blow up badly
10:09 sima: I really want to make sure that this new panic design is really safe, so we need to have an idea how to make these things work too
10:09 sima: ofc if you're race really badly, then no panic output for you
10:09 sima: but also the console takeoverlock would help with these
10:09 jfalempe: sima: may it crash the machine, or it may just corrupt the output ?
10:09 sima: jfalempe, crash, no
10:10 sima: because then an output later on that would work doesn't
10:10 sima: and that's the case we absolutely need to avoid
10:10 sima: failing to print is ok, corrupted display screen is ok, crashing or killing the hw, not ok at all
10:10 jfalempe: sima: that's what I think too.
10:11 sima: that's why we need the drm_panic_lock so that drivers can protect this additional pieces they might need in their panic code
10:11 sima: like when there's no way to write into the buffer reliably because unmapped vram
10:11 sima: except with these peek/poke registers
10:12 jfalempe: ok and it won't be practical to trylock all planes panic_lock in this case.
10:12 sima: yeah
10:13 sima: or would just add more potential failure paths and issues
10:13 sima: or people trying to use spin_trylock because hey it works in hardirq context (but not in nmi)
10:14 jfalempe: sima: ok so I will leave the panic_lock at device level.
11:54 pq: Is drm_fixp2int_round() really ok? What is it supposed to do?
11:54 pq: in kernel
11:56 pq: it's certainly not rounding the way I understand rounding
12:27 tnt: pq: DRM_FIXED_POINT_HALF looks weird to me. Should be DRM_FIXED_POINT AFAICT.
12:28 pq: that would make more sense
12:37 dolphin: mripard: I think your MUA might have some problem (or mine), you seem to have replied to a mail that I initially never got and it appears as a reply to HDMI connector thread
12:51 mripard: dolphin: yeah, I screwed up on wednesday
12:51 mripard: it shouldn't be a problem anymore, but all the mails I've sent then have the same msg-id
12:52 dolphin: right, that explains the mayhem in 'alot' view
13:00 mripard: jani: thank you so much for the b4 integration in dim :)
14:43 sima: jfalempe, I'm not sure we need any panic KCONFIG
14:44 sima: like even if people enable both fbcon and drm_panic the mess should be limited
14:45 sima: for one, fbcon might not run on all drm drivers, or it's disabled because a compositor is running and so wont show anything
14:45 sima: on the other side, drivers need to write explicit support for drm-panic anyway
14:45 jfalempe: sima: ok that's fine. just asking because in an ideal world if you don't enable drm panic, you don't want to lock/unlock when updating the plane state.
14:45 sima: so in practice I don't expect much conflicts
14:45 sima: and even if you get them, panic goes through all of them in a loop
14:46 sima: so in the end, either drm-panic of fbcon wins, even if they both write into the same fb
14:46 sima: I think at least ...
14:46 sima: jfalempe, I don't think you can measure that, and we shouldn't add complexity with no benefit
14:46 sima: and every kconfig we add has a cost
14:46 jfalempe: I have a small workaround to disable fbcon when drm_panic runs, to avoid the graphic mess.
14:47 jfalempe: but I prefer to have a clean drm_panic merged first.
14:49 sima: jfalempe, graphical session should be enough to disable fbcon, or not?
14:50 jfalempe: sima: I didn't have issue with graphical session, but only tested with matrox and simpledrm
14:52 jfalempe: sima: I tested the conflict between fbcon and drm_panic on my arm device, with imx driver. (and this one don't have a graphical session).
15:14 DemiMarie: sima: When it comes to panic outputs, I think your expectations for “reasonable” don’t match with reality.
15:16 sima: ... what are my expectations? thus far I only locked at the locking, not once what it actually shows
15:18 DemiMarie: That there will be other ways to get the data out.
15:18 sima: uh that's not my expectation
15:19 sima: but we absolutely need to make sure that if there are other options, we don't go crash&burn in the drm panic handler and prevent those others from even having a chance
15:19 DemiMarie: Fair
15:19 sima: this is why the new console panic code is two stage, first it does everything which is safe
15:19 sima: and then it goes for the last ditch options
15:19 sima: and then it goes to fbcon to burn it all down :-)
15:20 DemiMarie: So on a typical client system you have two options: EFI pstore and DRM panic.
15:20 sima: only issue is that the panic notifiers drm-panic will use currently only have the first stage, but that can be fixed
15:20 sima: DemiMarie, netcon for desktops is pretty good too
15:20 DemiMarie: sima: not in my world, where the host is running completely offline
15:20 sima: or if it's network on a thunberbolt extension box for laptops
15:20 sima: DemiMarie, I think _that_ is unusual :-)
15:21 sima: plus you can just airgap a second machine like a rpi just to record netcon
15:21 sima: been there, done that
15:21 DemiMarie: sima: Unusual? Sadly, yes. Unreasonable? No. And we have both network and USB assigned to guests.
15:21 sima: also uart over usb cables works pretty well I've heard
15:22 DemiMarie: USB is assigned to guests too
15:22 sima: yeah that one needs special setup
15:22 sima: and special cable
15:22 sima: and an uart dongle to another machine
15:23 sima: but if all else fails, it tends to work very well, since it's just dead slow mmio writes
15:23 DemiMarie: So what I am saying is that your code may well be the only way to get messages off.
15:24 sima: yeah, but also: I've seen way to many pstore dumps that just show drm fbcon dying in panic
15:24 sima: so we really have to avoid that as the first goal
15:26 sima: DemiMarie, but otherwise I'm fully on board with you, which is why the new drm panic should be able to get stuff out even when you watch a video with yuv scanout
15:26 sima: the old one was just crash&burn in that case
15:30 DemiMarie: sima: A crash kernel would be awesome, *if* it could be made to work with LUKS-encrypted storage.
15:31 sima: DemiMarie, hm didn't mjg59 do some very fancy demo with tpm secrete shuffling to make that work?
15:31 sima: but extremely far away from where I have clue
15:39 DemiMarie: sima: could there be special handling for cases where the panic was in process context?
15:39 DemiMarie: I’m thinking of stuff like, “Attempted to kill init!”, which is a userspace bug.
15:41 DemiMarie: vsyrjala: why will the firmware be doing the writes?
15:41 vsyrjala: which firmware?
15:41 sima: jfalempe, oh btw, why are you not using kms_dump_register? that looks a lot more like the thing we want ...
15:42 DemiMarie: vsyrjala: whatever does the MMIO writes
15:43 sima: jfalempe, I guess I forgot why we're picking the panic notifier and not kmsg_dumper? since the latter is what pstore also uses ...
15:43 sima: DemiMarie, not sure what you'd gain in process context, the scheduler refuses service anyway?
15:43 sima: you can trylock more locks, but that's about it I think
15:44 DemiMarie: sima: maybe that particular panic (and others that are not actual kernel bugs) should do some stuff first.
15:44 DemiMarie: But that is getting off-topic
15:44 sima: yeah, maybe it could oops first and then panic
15:45 vsyrjala: DemiMarie: oh that. it's just a small dma engine thingy. it doesn't have firmware. though eventually there will probably be firmware pain also added to the whole mix
15:45 sima: and with kmsg_dump we could have a knob so that we also print stuff on oops
15:48 sima: jfalempe, oh I've found your mail, I think we should switch over to kmsg_dumper
15:48 sima: I totally forgot about that again
15:49 sima: noralf's og drm-panic also used kmsg_dumper, so you should be able to steal code from there
15:50 sima: jfalempe, the other reason for kmsg_dumper is that at that point the panic output isn't even complete, so we definitely have to use that and not the notifier
15:51 DemiMarie: How does Windows display its BSODs?
15:58 sima: tbh no idea, but the kernels are fairly fundamentally different in so many ways I don't think the design would translate at all
16:13 tleydxdy: does anyone know why occasionally (10s-1min) drm would stop putting any jobs onto the HW for ~20ms? from gpuvis I still see that ioctls are coming in but no jobs are being scheduled and no dma_fence is coming back
16:25 tleydxdy: just tried, this can be easily reproduced by tracing vkcube with VK_PRESENT_MODE_IMMEDIATE_KHR
16:42 jenatali: gfxstrand: Is there a generic pass that removes out-of-bounds loads/stores? vars_to_ssa does it for locals, and loop unrolling attempts to do it in some cases
16:42 jenatali: And if I was going to write one, should that be backend-specific? Seems like it should be general
16:44 gfxstrand: No, there's nothing general.
16:44 gfxstrand: What are you thinking?
16:45 gfxstrand: For NVK, we need something that gives us some sort of bounds checking behavior for indirect scratch access because the GPU just faults and kills your context and lots of apps seem to be hitting that.
16:46 jenatali: gfxstrand: For function_temp and shared, our backend wants them as derefs, since DXIL uses LLVM GEPs which are basically the same thing
16:47 jenatali: But if you have a GEP with a literal out-of-bounds index, that fails to validate, even if it's in code that never executes, so I need to remove those
16:47 gfxstrand: ah
16:47 gfxstrand: Yeah, that pass doesn't exist
16:47 jenatali: Worth being general for pre-io-lowering?
16:48 gfxstrand: IDK
16:48 jenatali: That sounds like a no. Makes my life easier. We can always move it later if someone else wants it
16:49 gfxstrand: Sounds good
17:29 alyssa: jenatali: going with no... statically invalid but unreachable code is very much a layered driver specific problem
17:29 alyssa: so unless zink wants it, probably doesn't matter
17:30 jenatali: Yep, fair enough
17:55 jfalempe: sima: with kmsg_dumper, you don't even have the panic reason. so you need to parse the ksmg to retrieve some useful output which is not great at all.
17:56 jfalempe: also I target drm_panic for average user of Linux distribution, so I want to have only one or two lines of text. All debug info can then go in a qr_code, so you can open a bug and have some info directly there.
17:57 jfalempe: I find it better than a blurry picture of an fbcon output.