IRC Logs of #dri-devel on irc.freenode.net for 2023-12-07

08:52 mripard: lumag: you had a r-b already?
10:31 lumag: mripard, but doesn't this need an ack from the core team? I should go, reread the drm-misc docs.
10:32 mripard: lumag: a-b then
10:32 lumag: already pushed :-(
12:00 DavidHeidelberg: lumag: quick check, can your Mesa patch backfire for anything outside of rusticl usage?
12:01 DavidHeidelberg: s/Mesa/Mesa rusticl/
12:03 lumag: DavidHeidelberg, which one? fdo rusticl? It might as I have been changing ir3. But it is far from being complete.
12:03 lumag: Or do you mean the BO patchset?
12:04 DavidHeidelberg: it's not about OpenCL functionality, if it's about interference with other loads
12:04 lumag: Yes. OpenCL jobs might take a while, so we might be hitting the hangcheck.
12:07 DavidHeidelberg: lumag: when Marge will be free, I'll re-run pipeline on A***; I was testing your MR + my fix on Alpine gitlab and they kinda quickly merged it. The CL is guarded by RUSTICL_ENABLE, so that's safe, but if it alter stuff outside rusticl that would be tricky (and I would send rather revert then)
12:08 lumag: DavidHeidelberg, !25840 it is a draft on purpose, it should not be merged
12:09 DavidHeidelberg: kk, I'll send revert asap :)
12:10 lumag: There are generic fixes there, but it should be more complete. I never tested it against full GL/VK testsuite
12:10 lumag: So those fixes might break something else.
12:30 tzimmermann: if a PCI device is disabled, it's child devices are implicitly disabled as well. right?
12:30 tzimmermann: 'its'
13:07 mripard: jani: that discussion seems to be going around in circles. We agreed here on what the solution was, it's what you were arguing for the whole time, but somehow it's not enough?
14:11 jani: mripard: we agreed? I certainly didn't agree on BUG/BUG_ON, and the direction in kernel is to remove all of them
14:13 mripard: https://oftc.irclog.whitequark.org/dri-devel/2023-12-05#32721575;
14:14 mripard: so if you don't want to do proper error checking, or warning, or explicit panic
14:14 mripard: what do you want to do?
14:16 emersion: jani: i think the BUG/BUG_ON issue is not about the concept but about this particular implementation?
14:16 emersion: like, WARN/WARN_ON is fine?
14:17 emersion: hm maybe i'm misremembering
14:17 emersion: do you have a link explaining the BUG/BUG_ON shortcomings?
14:20 emersion: this? https://yarchive.net/comp/linux/BUG.html
14:21 emersion: the tl;dr is indeed that WARN is preferred over BUG, it seems
14:21 mripard: emersion: Linus' PoV is that the kernel should never really panic for debugging stuff
14:23 jani: mripard: my preference is not to do anything, but if you insist on adding something, then if (WARN_ON(foo)) return -EINVAL;
14:24 jani: the reason for this opinion is not to be lax about error checking
14:24 jani: it's because people really do cargo cult and copy-paste stuff all over the place
14:25 jani: they see this pattern, and soon you'll have a bunch of functions doing this. it's just how it is
14:25 jani: you really do want to be careful about the examples you set
14:26 jani: imo not every function needs to check the input *iff* that input does not come from the user or and can't be impacted by the user
14:27 emersion: i've seen the WARN_ON pattern a lot in core DRM
14:27 vsyrjala: i found quite a lot of spurious null check cargo-cult when i was playing around with the __attribute__((nonnull))
14:28 vsyrjala: maybe i should make proper patches to nuke some of it
14:28 jani: emersion: a lot of places you can't get to with NULL drm_device anyway
14:28 jani: emersion: so why do you need to check it?
14:29 emersion: i don't think it makes sense to go and add NULL checks for drm_device everywhere
14:29 jani: that's pretty much the point I'm trying to make here
14:29 emersion: but e.g. when the driver provides something, maybe it makes sense to check
14:31 emersion: like for instance, when the driver creates the IN_FORMATS blob
14:31 emersion: we have a WARN_ON for zero formats
14:31 emersion: and this makes a lot of sense to me
14:34 jani: also drm_connector_init() checks if (drm_WARN_ON(dev, !(funcs && funcs->destroy))) and I also think that makes sense. it can lead to hard to debug errors down the line
14:36 mripard: jani: and the point I'm trying to make is that doing that implicitly, inconsistently, and without it being documented is bad.
14:36 mripard: I've suggested the warn_on approach too, but vsyrjala was against it
14:37 mripard: so here I am, trying to fix something that looks super consensual to me, but with everyone pulling in different directions
14:39 jani: wrt consistency, the question remains: where do you draw the line?
14:40 jani: say, strcmp will happily null deref its arguments
14:40 mripard: I don't care at this point, you tell me. But it should be the same everywhere.
14:41 jani: my argument has been that if it can actually be NULL, check it. if it's a programming error to pass NULL to the function, just NULL deref it
14:44 mripard: can funcs be null or not?
14:46 pq: maybe "the same everywhere" is the problem - not everything is the same
14:46 emersion: i think it's pretty hard to come up with a generic rule that works everywhere
14:46 jani: mripard: no
14:46 mripard: jani: ok, so the code you pasted above that was making sense to you is wrong then?
14:47 emersion: in the end it's subjective, and IMHO boils down to "is the driver likely to get this wrong? if it gets it wrong, will this silently succeed and then blow up later on?"
14:47 jani: mripard: you could drop the funcs != NULL part out
14:47 mripard: emersion: funcs are mandatory for planes, CRTC and connectors, not for encoders
14:48 jani: mripard: point was about managed having destroy callback
14:48 mripard: can a driver author get it wrong?
14:49 jani: if it oopses on first try, no :p
14:50 mripard: an oops is effectively a panic on all the systems I have, it's equivalent to BUG_ON you were against
14:50 jani: one of the reasons I'm a bit sensitive to this is the GEM_BUG_ON macro in i915 gem. it's a kind of assertion that behaves different depending on a kconfig. people aren't sure about stuff, so they throw in a check. and another. and another. and now we have 1.2k instances of GEM_BUG_ON in the driver. you don't want to go there.
14:51 mripard: anyway, I'm dropping all those patches
14:51 mripard: I'll just ignore the issue going forward
14:51 mripard: and hope for the entire thing to be rewritten in Rust so we can stop discussing it to no end
14:52 emersion: you still get the same issues with Rust :o
14:52 emersion: well, maybe less issues, but still some
14:52 mripard: no, because you don't get to have an opinion over whether a pointer is valid or not
14:52 mripard: and whether you should check your values or not
14:52 jani: mripard: I'm sorry I got us into this loop
14:52 emersion: but you get the zero format listy length issue still, for instance
14:53 mripard: sure
14:53 mripard: I'm not sure that one was controversial though
14:56 jani: mripard: idk I've been told by akpm et al that there's no point in BUG_ON checks if it hits a null deref anyway
14:56 emersion: yeah, i would agree
14:57 emersion: except if the deref is delayed, for instance
14:57 emersion: then failing earlier is better
14:57 mripard: again, we would disagree on that one, but it doesn't matter
14:58 emersion: otherwise all functions grow a preamble with null checks for each arg and it's messy with no real gain
14:58 mripard: it's super user-hostile to chase down what you did wrong
14:58 mripard: especially when it's like 5 levels down
14:59 mripard: when it could have been an error message and an error returned
14:59 mripard: from a functional PoV, it might be better, but it's terrible from an ergonomics one
15:54 mripard: sima: could you have a look at https://lore.kernel.org/all/20231204121707.3647961-4-mripard@kernel.org/ (and the other related patches) ?
16:03 sima: uh yeah this needs an update
16:06 pq: mripard, I'm curious; if you have an arbitrary function returning an error somewhere in the kernel, how do you figure out where the error came from? I mean for things that are unrelated to userspace; I've seen DRM debug prints telling what userspace did wrong.
16:07 emersion: in general, logging
16:08 emersion: often times it's missing though
16:08 pq: yeah, wouldn't that also be totally flooding the kernel logs if it was done everywhere?
16:09 emersion: there are scopes
16:09 jani: I think dmesg logging of invalid user input is frowned on
16:09 jani: DoS?
16:09 emersion: jani: oh no it's not
16:09 vsyrjala: tracking down a silent 'return -EINVAL' can be super painful imo. null deref is trivial to see from the oops
16:09 emersion: it's just disabled by default
16:09 calebccff: lumag: added prepare_prev_first flag to the oneplus 6 panel driver, as of 6.7-rc3 the fbdev console always fails (after like 5 tries). We have a framebuffer app called unl0kr for handling full disk encryption and it seems there's a <50% chance that when it starts it kicks the display into life
16:10 emersion: drm_dbg_atomic(), for instance
16:10 lumag: calebccff, anything in the logs?
16:11 lumag: also please use #freedreno
16:11 calebccff: lumag: when i was debugging before I noticed that the IRQ (frame done? or maybe vsync?) would not be firing, so it isnt' pushing data to the panel at all
16:11 calebccff: ah will move there
16:12 pq: As a userspace dev, I can well imagine a function returning an error, and then realise there are 3 different calls in it that could fail, and it branches out 5 levels deep. That's painful to debug even with gdb. I guess rr could make it easy though. But if it segfaults 15 levels deep in the call stack, I can immediately see the stack of what lead to the failure.
18:44 mandelstays: the procedure without bit shifting or multiply as well as divide and without over nor underflow and not even signed arithmetic nor control flow, only with sub/add and lookup tables is already pretty complex variation of the presented..you add two times index and values if that overflows its maximum number as result, hence the indexes need to be taken for the first 4bits of 32bit value , as there are 16 combinations in 4 bit, the indexes need to be
18:44 mandelstays: taken from bits 5 to 21 so twice the size of those need to be added to the original value, now if it overflows to one, situation is clear all indexes were either all out and all other bits were in, or the low order bits were X and all higher bits were in, so we could hence safely work on bits 23 to 27 cause 28 to 32 are too big , the former 23-27 are always unchanged of the indexes , so we add sum of 6to22 and remove 5 to 22, anyways now you
18:44 mandelstays: understand where i am aiming already, 5 to 21 is index and is subtract from 23-27 as value, after the bits of 23-27 is found out its possible to find out low order bits, and highest bits are all revealed after this and inverted and added to compressed format and reverted back
18:51 mandelstays: i am working on this, but i think it's not innovative, it is known stuff in cryptography though it appears. so reinvented, but i can see a way to handle overflow and underflow and all the flags like this too
21:04 gfxstrand: rodrigovivi: When you do the Xe PR, are you planning to maintain any development history or squash everything?
21:05 rodrigovivi: gfxstrand: I'm planing to send a regular PR of what we have on drm-xe-next... unless sima or airlied tells me the extra squash is better...
21:06 gfxstrand: rodrigovivi: Where does that branch live? I'm not seeing it in drm-intel
21:06 rodrigovivi: there's the mega initial squash on the bottom from when we went public and the history on top... that we maintained so far with rebases and fixup patches
21:06 rodrigovivi: gfxstrand: https://gitlab.freedesktop.org/drm/xe/kernel/-/tree/drm-xe-next?ref_type=heads
21:07 gfxstrand: Silly me! I was looking in cgit.
21:07 gfxstrand: I'm incredibly pleased to see that it's NOT in cgit. :)
21:08 rodrigovivi: :) yeap, but you were right.. we initially put there when we went public.. but we immediately fix that mistake and opened up the gitlab
21:08 rodrigovivi: gfxstrand: but please notice that this is not in the final shape yet... we have to adjust the patches to absorb this: https://patchwork.freedesktop.org/series/127496/
21:09 gfxstrand: Yeah, that's fine.
21:09 rodrigovivi: and also to get rid of 2 ugly i915/display FIXME patches...
21:09 gfxstrand: I'm not planning a full review or anything. Just really excited to see it finally land.
21:09 rodrigovivi: oh, and 2 uapi changes coming... 1 around vm_bind and another around wait_user_fence...
21:09 gfxstrand: It's been a looooong time coming. :)
21:09 rodrigovivi: yeap, a long time indeed
21:11 rodrigovivi: I'm also glad that everything is aligning right now... it is getting harder and harder to work with all the drm stuff like drm-scheduler, gpuvm, drm-exec and be out of the tree rebasing and doing fixup patches
21:11 gfxstrand: Yeah
21:15 airlied: in hindsight we probably should have merged a non-functional driver earlier :-P
21:15 airlied: but then we'd just have distros trying to enable it all the time
21:15 gfxstrand: airlied: Oh, like the Fedora people are trying to do with NVK?
21:16 airlied: yes, but at least there I can smack ppl
21:16 airlied: considering merging non-functional rust driver bits early so we can grow something in-tree
21:22 rodrigovivi: yeap, but then we would have a bigger discussion around that STAGING vs drm again
21:23 daniels: airlied: haters will say we have at least three non-functional DRM drivers in tree
21:35 airlied: daniels: three just mean they haven't searched enough :-P
21:35 emersion: the better question probably is, how many functional drivers do we have in-tree?
21:36 airlied: if Intel had a track record of working on making code common I'd probably have been a bit more flexible, but after realising how much common code we've managed to pull out since xe started, it was probably all godo
21:37 airlied: would be nice if we could get hostptr support more common, and I'm sure there are other areas
21:37 kchibisov: what is the irc channel for intel support stuff?
21:37 emersion: #intel-gfx
21:45 DemiMarie: mareko: I did not know that Android proprietary drivers supported virtio-GPU native contexts. If so, that is great (though obviously open drivers would still be better).
21:48 DemiMarie: airlied: Rust for GPU drivers would be a nice win for Qubes
21:54 HdkR: Oh, is xe-drm getting close to merging?
21:56 HdkR: I see, two uapi changes still incoming. I'll run it through my struct layout verifier when that occurs
22:21 Company: gfxstrand: what's the ETA on nvk in Fedora you think? A year?
22:22 Company: I need to somehow ensure RH's desktop team has at least some people with nvidia when that happens - we have a somewhat serious problem with everyone having Intel and not testing on anything else
22:24 Company: there's sooo much experimental code currently that falls over with the AMD 256 stride already and I'd like to avoid similar issues when the 3rd big driver lands
22:26 Company: ("desktop team" meaning the application developers, could also call it "Gnome")
23:04 airlied: Company: rawhide in January
23:04 airlied: maybe February if things slip
23:05 airlied: so probably f40, maybe with f39 backports if things line up
23:16 Company: that's earlier than I thought
23:16 Company: should start on nudging people towards getting nvidia gpus then
23:53 DavidHeidelberg: CI migrated from Linux 6.4.x to 6.6.4, it's pretty tested, but id you see some suspicious fail somewhere feel free to throw link at me
23:53 gfxstrand: Company: I would normally say that airlied is being optimistic and signing me up for too much but I really think his prediction is about accurate.
23:54 gfxstrand: I don't know that we're going to get Vulkan 1.3 conformance for Christmas but things are moving FAST now that the new compiler has landed.
23:54 Company: that's nice
23:54 gfxstrand: Like, 120+ patches in the last week fast.
23:55 gfxstrand: (Some of those are because I make mistakes and have no code review and then have to fix them.)
23:55 Company: how's performance coming along?
23:56 gfxstrand: Lots of stuff is very playable. Older games are often 60 FPS+.
23:57 gfxstrand: There's still a couple of major performance improvements that need to happen. Then it's a lot of analysis.
23:57 gfxstrand: I have no idea where we're at compared to the blob right now. I'm kind-of hoping Michael decides to benchmark again soon.
23:57 gfxstrand: I'm a bit surprised he hasn't been benchmarking monthly, TBH.
23:57 Company: hehe
23:57 Company: nvk landing should be a good enough reason for him I guess
23:58 gfxstrand: He did run a bunch right after we landed but he didn't have GSP so it was shit perf.
23:58 gfxstrand: Like < 1% of blob in a bunch of cases
23:58 Company: yeah, I enjoyed that post a lot
23:58 gfxstrand: But between GSP and making UBOs not entirely stupid, we're in okay(ish) shape right now.
23:58 gfxstrand: It's still early days but I don't think I look like a total clown.
23:59 Company: I'm mostly trying to convince enough Gnome and RH desktop developers to get GPUs that aren't Intel
23:59 gfxstrand: Yeah....
23:59 gfxstrand: I really need to get a new laptop this next year.
23:59 Company: so that now that we use GPUs more than just blitting cairo-drawn stuff together, we actually have a somewhat wider test coverage